On 2017-10-16 12:57, Zoltan wrote:
Hi,

On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:

you will need to scrub regularly to avoid data corruption

Is there any indication that a scrub is needed? Before actually doing
a scrub, is btrfs already aware that one of the devices did not
receive all data due to being unavailable for a brief time? If so,
which command shows this info in its output?
In an ideal situation, scrubbing should not be an 'only if needed' thing, even for a regular array that isn't dealing with USB issues. From a practical perspective, there's no way to know for certain whether a scrub is needed short of reading every single file in the filesystem in its entirety, at which point you're better off just running a scrub (because if you _do_ need to scrub, you'll end up reading everything twice).

If you insist on spot-checking things, you can check the output of `btrfs device stats` for the filesystem. If any numbers there are non-zero, then some file that you've accessed _since the last time you reset the counters_ has corruption. If you go this route, make sure to reset the counters with `btrfs device stats -z` _immediately_ after you run a scrub, or in some way track their values externally to compare against.
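
As a rough sketch of that spot check, the helper below filters the output of `btrfs device stats` down to the counters that are non-zero. The function name `nonzero_stats` and the mount point `/mnt/data` are placeholders of mine, and it assumes the usual two-column `[device].counter  value` output format.

```shell
# Read `btrfs device stats` output on stdin and print only the counters
# whose value is non-zero. Prints nothing if every counter is zero.
# (nonzero_stats is a hypothetical helper name, not part of btrfs-progs.)
nonzero_stats() {
    # Each line looks like: [/dev/sdb].corruption_errs  0
    awk 'NF == 2 && $2 + 0 != 0 { print $1, $2 }'
}
```

You would use it as `btrfs device stats /mnt/data | nonzero_stats`. Any output means at least one error counter has been incremented since the counters were last reset.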

Additionally, how does btrfs scrub compare to btrfs balance
-dconvert=raid1,soft -mconvert=raid1,soft in this scenario? I would
suppose that if btrfs is aware that some data does not have a
replication count of 2, then a convert could fix that without a scrub
reading through the whole disk. On the other hand, while I would
expect btrfs scrub to find data with bad checksum, I would not expect
it to do a balance as well in order to achieve the desired replication
count of 2 for all data. So do I need to run both a scrub and a
convert, or is a scrub enough?
It kind of depends.  There are three things to deal with here:
1. Latent data corruption caused either by bit rot, or by a half-write (that is, one copy got written successfully, then the other device disappeared _before_ the other copy got written).
2. Single chunks generated when the array is degraded.
3. Half-raid1 chunks generated by newer kernels when the array is degraded.

Scrub will fix problem 1 because that's what it's designed to fix. It will also fix problem 3, since that behaves just like problem 1 from a higher-level perspective. It won't fix problem 2 though, as it doesn't look at chunk types (only at whether the data in the chunk has the correct number of valid copies).

In contrast, the balance command you quoted won't fix issue 1 (because it doesn't validate checksums or check that data has the right number of copies), or issue 3 (because it's been told to only operate on non-raid1 chunks), but it will fix issue 2.

In comparison to both of the above, a full balance without filters will fix all three issues, although it will do so less efficiently (in terms of both time and disk usage) than running a soft-conversion balance followed by a scrub.
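
To make the two options concrete, here is a minimal sketch of both as shell functions. The function names and the mount point argument are placeholders of mine, and `--full-balance` assumes a reasonably recent btrfs-progs (older versions run an unfiltered balance with plain `btrfs balance start <path>`).

```shell
# Targeted repair: soft-conversion balance followed by a scrub.
# $1 is the mount point of the filesystem (placeholder, e.g. /mnt/data).
repair_targeted() {
    # 'soft' skips chunks that are already raid1, so this only rewrites
    # single chunks left over from degraded operation (issue 2).
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$1" &&
    # -B keeps the scrub in the foreground so the function returns only
    # once it has finished (covers issues 1 and 3).
    btrfs scrub start -B "$1"
}

# Full rewrite: an unfiltered balance. Covers all three issues, but takes
# longer and needs more free space while it runs.
repair_full() {
    btrfs balance start --full-balance "$1"
}
```

Both need to be run as root, for example `repair_targeted /mnt/data`.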

In the case of normal usage, device disconnects are rare, so you should generally be more worried about latent data corruption. As a result, for most normal users, I would suggest running the balance command you gave daily (it will usually finish instantly, so there's no point in not running it frequently to help ensure data safety) and a scrub daily or weekly (this is the one that matters more here, since you need to worry more about latent data corruption).
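
One possible way to wire up that schedule is a single function called once a day (from cron or a systemd timer, whichever you prefer). This is only a sketch: `run_maintenance` is a made-up name, and running the scrub on Sundays is an arbitrary choice.

```shell
# run_maintenance MOUNTPOINT WEEKDAY
# WEEKDAY is the output of `date +%u` (1 = Monday ... 7 = Sunday).
run_maintenance() {
    mnt=$1
    weekday=$2
    # Daily: convert any chunks that are not raid1. Usually a no-op.
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$mnt" || return 1
    # Weekly (Sunday here): scrub in the foreground to catch latent corruption.
    if [ "$weekday" -eq 7 ]; then
        btrfs scrub start -B "$mnt"
    fi
}
```

The daily job would then call something like `run_maintenance /mnt/data "$(date +%u)"`. For a daily scrub instead, drop the weekday check.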

For your use case though, I would instead suggest setting something up to monitor the kernel log to watch for device disconnects, remount the filesystem when the device reconnects, and then run the balance command followed by a scrub. With most hardware I've seen, USB disconnects tend to be relatively frequent unless you're using very high quality cabling and peripheral devices. If, however, they happen less than once a day most of the time, just set up the log monitor to remount, and set the balance and scrub commands on the schedule I suggested above for normal usage.
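
A very rough sketch of the log-watching part is below. It only covers detecting the disconnect. The function name, the handler, and the "USB disconnect" match string are assumptions on my part, so check what your kernel actually logs for your hardware before relying on it.

```shell
# on_usb_disconnect HANDLER [ARGS...]
# Reads kernel log lines on stdin and runs HANDLER once for every line that
# contains "USB disconnect", passing the matching line as the last argument.
on_usb_disconnect() {
    while IFS= read -r line; do
        case $line in
            *"USB disconnect"*)
                # Keep the handler from consuming the log stream on stdin.
                "$@" "$line" </dev/null
                ;;
        esac
    done
}
```

It could be fed with `dmesg --follow | on_usb_disconnect recover_fs /mnt/data`, where `recover_fs` is a hypothetical function you would write yourself to wait for the device to come back, remount the filesystem, and then run the balance followed by the scrub.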