>     ps> This is a recommendation I would give even when you purchase
>     ps> non-cheap battery backed hardware RAID controllers (I won't
>     ps> mention any names or details to avoid bashing as I'm sure it's
>     ps> not specific to the particular vendor I had problems with most
>     ps> recently).
> 
> This again?  If you're sure the device is broken, then I think others
> would like to know it, even if all devices are broken.  

The problem is that I even had help from the vendor in question, and
it was not for me personally but for a company, and I don't want to
use information obtained that way to do any public bashing.

But I have no particular indication that there is any problem with the
vendor in general; it was a combination of choices made by Linux
kernel developers and the behavior of the RAID controller. My
interpretation was that no one was there looking at the big picture,
and the end result was that if you followed the instructions
specifically given by the vendor, you would have a setup whereby you
would lose correctness whenever the BBU was
overheated/broken/disabled.

The alternative was to get completely piss-poor performance by not
being able to take advantage of the battery backed nature of the cache
at all (which defeats most of the purpose of having the controller, if
you use it in any kind of transactional database environment or
similar).

> but, fine.  Anyway, how did you determine the device was broken? 

By performing timing tests as mentioned in the other post that you
answered separately, and after detecting the problem confirming the
status with respect to caching at the different levels as claimed by
the administrative tool for the controller.

While timing tests cannot conclusively prove correct behavior, they can
definitely prove incorrect behavior in cases where your timings are
simply theoretically impossible given the physical nature of the
underlying drives.
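
To make "theoretically impossible" concrete, here is a rough sketch of
the kind of bound meant. It assumes a single rotating drive with no
cache absorbing the write, so that each committed rewrite of the same
block has to wait roughly one platter revolution. The function names,
the slack factor and the example RPM are illustrative assumptions, not
taken from any actual test.

```python
def max_sync_writes_per_second(rpm):
    """Rough upper bound on durable rewrites of one block per second.

    Assumption: a single rotating drive with nothing caching the write,
    so each committed rewrite of the same sector waits about one full
    revolution of the platter.
    """
    return rpm / 60.0


def is_physically_implausible(observed_per_second, rpm, slack=2.0):
    """True if the observed rate exceeds the bound by a clear margin.

    `slack` is an arbitrary safety factor (hypothetical default) so that
    measurement noise alone does not raise a false alarm.
    """
    return observed_per_second > slack * max_sync_writes_per_second(rpm)
```

For a 7200 RPM drive the bound is about 120 committed rewrites per
second, so a measured rate in the thousands means something between
the application and the platter is acknowledging writes early. Whether
that is a problem depends on whether that something is battery backed.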

> At
> least you can tell us that much without fear of retaliation (whether
> baseless or founded), and maybe others can use the same test to
> independently discover what you did which would be both fair and safe
> for you.

The test was trivial; in my case a ~10 line Python script or something
along those lines. Perhaps I should just go ahead and release
something which non-programmers can easily run and draw conclusions
from.
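
A minimal sketch of what such a script could look like follows. This
is not the original script; the function name, block size and
iteration count are assumptions chosen for illustration. It rewrites
the same small block at the start of a file and calls fsync() after
every write, then reports how many of those cycles completed per
second.

```python
import os
import sys
import time


def fsync_rate(path, count=200, size=512):
    """Return completed write+fsync cycles per second on `path`."""
    block = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            # Always rewrite the same block at offset 0.
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, block)
            # Ask the OS to push the data to stable storage.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py /path/on/the/volume/under/test
    print("%.0f fsync()s per second" % fsync_rate(sys.argv[1]))
```

Run it against a file on the volume under test and compare the result
with what the drives could physically deliver. A plausible number does
not prove anything, but an implausibly high one with all caching
supposedly disabled is a strong hint that syncs are not being honored.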

> This is the real problem as I see it---a bunch of FUD, without any
> actual resolution beyond ``it's working, I _think_, and in any case
> the random beatings have stopped so D'OH-NT TOUCH *ANY*THING!  THAR BE
> DEMONZ IN THE BOWELS O DIS DISK SHELF!''

I'd love to go on a public rant, because I think the whole situation
was a perfect example of a case where a single competent person who
actually cares about correctness could have pinpointed this problem
trivially. But instead you have different camps doing their own stuff
and not considering the big picture.

> If anyone asks questions, they get no actual information, but a huge
> amount of blame heaped on the sysadmin.  Your post is a great example
> of the typical way this problem is handled because it does both: deny
> information and blame the sysadmin.  Though I'm really picking on you
> way too much here.  Hopefully everyone's starting to agree, though, we
> do need a real way out of this mess!

I'm not quite sure what you're referring to here. I'm not blaming any
sysadmin. I was trying to point out *TO* sysadmins, to help them, that
I recommend being paranoid about correctness.

If you mean the original poster in the thread having issues, I am not
blaming him *at all* in the post you responded to. It was strictly
meant as a comment in response to the poster who noted that he
discovered, to his surprise, the problems with VirtualBox. I wanted to
make the point that while I completely understand his surprise, I have
come to expect that these things are broken by default (regardless of
whether you're using VirtualBox or not, or vendor X or Y etc), and
that care should be taken if you do want to have correctness when it
comes to write barriers and/or honoring fsync().

That said, as I stated in another post, I wouldn't be
surprised if it turns out the USB device was ignoring sync
commands. But I have no idea what the case was for the original
poster, nor have I even followed the thread in detail enough to know
if that would even be a possible explanation for his problems.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schul...@infidyne.com>'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
