> ps> This is a recommendation I would give even when you purchase
> ps> non-cheap battery backed hardware RAID controllers (I won't
> ps> mention any names or details to avoid bashing as I'm sure it's
> ps> not specific to the particular vendor I had problems with most
> ps> recently).
>
> This again? If you're sure the device is broken, then I think others
> would like to know it, even if all devices are broken.

The problem is that I even had help from the vendor in question, and
it was not for me personally but for a company, and I don't want to
use information obtained that way to do any public bashing. But I have
no particular indication that there is any problem with the vendor in
general; it was a combination of choices made by Linux kernel
developers and the behavior of the RAID controller. My interpretation
was that no one was looking at the big picture, and the end result was
that if you followed the instructions specifically given by the
vendor, you would have a setup whereby you would lose correctness
whenever the BBU was overheated/broken/disabled. The alternative was
to get completely piss-poor performance by not being able to take
advantage of the battery backed nature of the cache at all (which
defeats most of the purpose of having the controller, if you use it in
any kind of transactional database environment or similar).

> but, fine. Anyway, how did you determine the device was broken?

By performing timing tests as mentioned in the other post that you
answered separately and, after detecting the problem, confirming the
status with respect to caching at the different levels as claimed by
the administrative tool for the controller. While timing tests cannot
conclusively prove correct behavior, they can definitely prove
incorrect behavior in cases where your timings are simply
theoretically impossible given the physical nature of the underlying
drives.

> At
> least you can tell us that much without fear of retaliation (whether
> baseless or founded), and maybe others can use the same test to
> independently discover what you did which would be both fair and safe
> for you.

The test was trivial; in my case a ~10 line Python script or something
along those lines. Perhaps I should just go ahead and release
something which non-programmers can easily run and draw conclusions
from.
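
A minimal sketch of that kind of timing test follows. It is not the
original script, only an illustration under stated assumptions: the
function names, the 7200 RPM default and the 2x slack factor are all
made up for the example. The idea is that rewriting the same block and
calling fsync() after each write should, on a single rotating drive
with no write-back cache in the path, complete at most roughly once
per platter rotation. A rate far above that suggests some layer is
acknowledging syncs early. A healthy battery backed cache will also
produce high rates, so the result is mainly interesting when the cache
is supposed to be disabled, and the bound does not apply to SSDs.

```python
import os
import tempfile
import time


def measure_fsync_rate(path, iterations=200):
    """Return completed write+fsync cycles per second on `path`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same block each time
            os.write(fd, b"x" * 512)
            os.fsync(fd)                  # request a flush to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return iterations / elapsed


def max_plausible_rate(rpm):
    """Rough ceiling (assumption): one same-block commit per rotation."""
    return rpm / 60.0


def looks_impossible(rate, rpm, slack=2.0):
    """True if `rate` exceeds the rotational ceiling by the slack factor."""
    return rate > slack * max_plausible_rate(rpm)


def run_check(directory, rpm=7200, iterations=200):
    """Time fsync on a scratch file in `directory`; return (rate, suspicious)."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        rate = measure_fsync_rate(path, iterations)
    finally:
        os.remove(path)
    return rate, looks_impossible(rate, rpm)
```

To use it, call run_check() with a directory that lives on the
filesystem under test and the RPM of the underlying drive. A
"suspicious" result is only a hint to go and check what the
controller's administrative tool says about caching at each level; a
non-suspicious result proves nothing about correctness.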

> This is the real problem as I see it---a bunch of FUD, without any
> actual resolution beyond ``it's working, I _think_, and in any case
> the random beatings have stopped so D'OH-NT TOUCH *ANY*THING! THAR BE
> DEMONZ IN THE BOWELS O DIS DISK SHELF!''

I'd love to go on a public rant, because I think the whole situation
was a perfect example of a case where a single competent person who
actually cares about correctness could have pinpointed this problem
trivially. But instead you have different camps doing their own stuff
and not considering the big picture.

> If anyone asks questions, they get no actual information, but a huge
> amount of blame heaped on the sysadmin. Your post is a great example
> of the typical way this problem is handled because it does both: deny
> information and blame the sysadmin. Though I'm really picking on you
> way too much here. Hopefully everyone's starting to agree, though, we
> do need a real way out of this mess!

I'm not quite sure what you're referring to here. I'm not blaming any
sysadmin. I was trying to point out *TO* sysadmins, to help them, that
I recommend being paranoid about correctness.

If you mean the original poster in the thread having issues, I am not
blaming him *at all* in the post you responded to. It was strictly
meant as a comment in response to the poster who noted that he
discovered, to his surprise, the problems with VirtualBox. I wanted to
make the point that while I completely understand his surprise, I have
come to expect that these things are broken by default (regardless of
whether you're using VirtualBox or not, or vendor X or Y etc), and
that care should be taken if you do want to have correctness when it
comes to write barriers and/or honoring fsync().

However, that said, as I stated in another post I wouldn't be
surprised if it turns out the USB device was ignoring sync commands.
But I have no idea what the case was for the original poster, nor have
I even followed the thread in detail enough to know if that would even
be a possible explanation for his problems.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schul...@infidyne.com>'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss