On 2018-10-24 20:05, Chris Murphy wrote:
I think the best we can expect in the short term is that Btrfs
goes read-only before the file system becomes corrupted in a way it
can't recover from with a normal mount. And I'm not certain it has
reached that state of development for all cases yet. And I'd say the
same thing about other file systems as well.

Running Btrfs on USB devices is fine, so long as they're well behaved.
I have such a setup with USB 3.0 devices. Perhaps I got a bit lucky,
because there are a lot of known bugs with USB controllers, USB bridge
chipsets, and USB hubs.

Having user-definable switches for when to go read-only is, I think,
misleading to the user, and very likely will mislead the file system.
The file system needs to go read-only when it gets confused, period.
It doesn't matter what the error rate is.

In general I agree. I just wonder why it couldn't happen quicker. For
example, from the log I originally attached one can see that btrfs
made 1867 attempts to read (perhaps the same) block from both devices
in the RAID1 volume, without success:

BTRFS error (device sdf): bdev /dev/sdh errs: wr 0, rd 1867, flush 0, corrupt 0, gen 0
BTRFS error (device sdf): bdev /dev/sdg errs: wr 0, rd 1867, flush 0, corrupt 0, gen 0

The attempts lasted for 29 minutes.
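
For reference, the same per-device counters from those log lines can be
read back at any time with "btrfs device stats" (the mount point here is
a placeholder for my volume, and the output format is from memory):

btrfs device stats /mnt/raid1
[/dev/sdh].write_io_errs     0
[/dev/sdh].read_io_errs      1867
[/dev/sdh].flush_io_errs     0
[/dev/sdh].corruption_errs   0
[/dev/sdh].generation_errs   0
...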

The workaround is really to do the hard work of making the devices
stable, not asking Btrfs to paper over known-unstable hardware.

In my case, I started out with rare disconnects and resets with
directly attached drives. This was a couple of years ago. It was a Btrfs
raid1 setup, and the drives would not go missing at the same time, but
both would just drop off from time to time. Btrfs would complain of
dropped writes; I vaguely remember it going read-only. But normal
mounts worked, sometimes with scary errors but always finding a good
copy on the other drive and doing passive fixups. Scrub would always
fix up the rest. I'm still using those same file systems on those
devices, but now they go through a Dyconn USB 3.0 hub with a decently
good power supply. I originally thought the drop-offs were power
related, so I explicitly looked for a USB hub that could supply at
least 2 A, and this one is 12 VDC @ 2500 mA. A laptop drive will draw
nearly 1 A on spin up, but beyond that point power is just P = I * V
(watts = amps * volts). Laptop drives during read/write use 1.5 W to
2.5 W @ 5 VDC.

1.5-2.5 W = I * 5 V
Therefore I = 0.3-0.5 A

And for 4 drives at possibly 0.5 A each (although my drives are all at
the 1.6 W read/write figure), that's 2 A @ 5 V, which is easily within
what the hub's power supply can sustain (by my calculation it could do
6 A @ 5 V, not accounting for any resistance).
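
Spelled out with the numbers above (and ignoring losses in the hub's
12 V to 5 V conversion):

12 V * 2.5 A  = 30 W  (hub supply)
30 W / 5 V    =  6 A  (available at 5 V)
4 * 0.5 A     =  2 A  (worst-case load of the drives)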

Anyway, as it turns out I don't think it was power related, as the
Intel NUC in question probably had just enough amps per port. What it
really was, was an incompatibility between the Intel controller and
the bridge chipset in the USB-SATA cases. And a USB hub is similar to
an ethernet hub in that it actually reads the USB stream and rewrites
it out. So hubs are actually pretty complicated little things, and
having a good one matters.

Thanks for this information. I have a situation similar to yours, with
the only important difference being that my drives are put into a USB
dock with independent power and cooling, like this one:

https://www.ebay.com/itm/Mediasonic-ProBox-4-Bay-3-5-Hard-Drive-Enclosure-USB-3-0-eSATA-Sata-3-6-0Gbps/273161164246

so I don't think I need to worry about amps. This dock is connected
directly to a USB port on the motherboard.

However, there could indeed be bugs both on the dock side and in the
south bridge. Moreover, I could imagine a USB reset happening due to
another USB device, like a wave started in one place turning into a
tsunami for the whole USB subsystem.

There are pending patches for something similar that you can find in
the archives. I think the reason they haven't been merged yet is that
there hasn't been enough comment and feedback (?). I think Anand Jain
is the author of those patches, so you might dig around for his name
in the archives.
In a way you have an ideal setup for testing them out. Just make sure
you have backups...

Thanks for the reference. Should I look for those patches here:

https://patchwork.kernel.org/project/linux-btrfs/list/?submitter=34632&order=-date

or were they only floating around on this mailing list?

'btrfs check' without the --repair flag is safe and read-only, but it
takes a long time because it'll read all metadata. The fastest safe
way is to mount the volume read-only, read from a directory that was
recently written to, and see if there are any kernel errors. You could
recursively copy files from a directory to /dev/null and then check
kernel messages for any errors. So long as metadata is DUP, there is a
good chance a bad copy of metadata can be automatically fixed up with
the good copy. If there's only a single copy of metadata, or both
copies get corrupted, then it's difficult. Usually recovery of data is
possible, but depending on what's damaged, repair might not be
possible.
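
A minimal sketch of that check (the device node, mount point, and
directory name are all placeholders):

# mount the volume read-only
mount -o ro /dev/sdX /mnt/check

# force a full read of a recently written directory, discarding the data
find /mnt/check/recently-written -type f -exec cat {} + > /dev/null

# then look for fresh Btrfs complaints in the kernel log
dmesg | grep -i btrfs | tail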

I think "btrfs check" would be too heavy. Monitoring kernel errors is
something I was thinking about as well.

I didn't observe any errors while doing "btrfs check" on this volume
after several such resets, because that volume is mostly used for
reading, and the chance that a USB reset happens during a write is very
low.
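
For the monitoring, something like this is what I have in mind, watching
for the same "BTRFS error" lines as in the log above:

# with systemd: follow kernel messages only
journalctl -k -f | grep --line-buffered 'BTRFS error'

# or straight from the kernel ring buffer
dmesg -w | grep --line-buffered 'BTRFS error'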

--
With best regards,
Dmitry
