On Tue, May 24, 2022 at 08:45:02PM +0100, Nikolaus Rath wrote:
> > However, you are worried that a third possibility occurs:
> >
> > T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> > for the unaligned head (it only needs 0x00-0x03, but we have to read
> > a block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
> >
> > T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> > blocking it.
> >
> > T2 now writes 0x00-0x0f using the contents of its buffer, resulting
> > in:
> >
> >       0   0   0   0   1   1   1   1
> >       0...4...8...c...0...4...8...c...
> > end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
> >
> > which does NOT reflect either of the possibilities where T1 and T2
> > write atomically.  Basically, we have become the victim of sharding.
>
> Yes, this is the scenario that I am worried about.
>
> I think this is a data corruption problem no matter whether we assume
> that writes should be atomic or not.
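To restate the failure mode in code: the current write path looks
roughly like the toy model below.  This is a simplified sketch, not
the filter's actual code; a flat 'disk' array stands in for calls
into the underlying plugin, and all names are illustrative.

  #include <pthread.h>
  #include <stdint.h>
  #include <string.h>

  #define BLOCKSIZE 0x10

  static uint8_t disk[0x20];
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void
  toy_pwrite (const uint8_t *buf, size_t count, size_t offset)
  {
    /* Unaligned head: RMW, serialized only against other RMW. */
    if (offset % BLOCKSIZE) {
      size_t drop = offset % BLOCKSIZE;
      size_t keep = count < BLOCKSIZE - drop ? count : BLOCKSIZE - drop;
      uint8_t bounce[BLOCKSIZE];

      pthread_mutex_lock (&lock);
      memcpy (bounce, disk + offset - drop, BLOCKSIZE);  /* read */
      memcpy (bounce + drop, buf, keep);                 /* modify */
      /* <-- T1's lockless aligned write of this same block can land
       * here, and the next line clobbers it with stale data. */
      memcpy (disk + offset - drop, bounce, BLOCKSIZE);  /* write */
      pthread_mutex_unlock (&lock);

      buf += keep; offset += keep; count -= keep;
    }

    /* Aligned body: no lock taken at all, so nothing orders it
     * against a concurrent RMW of an overlapping block. */
    size_t aligned = count - count % BLOCKSIZE;
    if (aligned) {
      memcpy (disk + offset, buf, aligned);
      buf += aligned; offset += aligned; count -= aligned;
    }

    /* Unaligned tail: same RMW pattern as the head (omitted). */
  }

The aligned path never touches the lock, which is exactly the window
T1 exploits above.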
Writes to a single block should be atomic.  Writes larger than a block
need not be atomic.  But when we advertise a particular block size,
clients should be safe in assuming that anything smaller than that
block size is inadvisable (either it will fail as too small, or it
will incur an RMW penalty), while a write at exactly the block size
should not need client-side protection as long as no other parallel
request overlaps that block.  And that is where our blocksize filter
is currently failing.

> In this scenario, the client has issued exactly one request to write
> (among other things) bytes 0-4.  This request was executed
> successfully, so bytes 0-4 should have the new contents.  There was
> no other write that affected this byte range, so whether the write
> was done atomically or not does not matter.

Basically, because the blocksize filter advertised a block size of 1,
the client had no reason to avoid parallel non-overlapping writes of
anything larger than a 1-byte block.  So yes, the more I think about
this, the more I see it as a bug in the blocksize filter.

> > You are correct that it is annoying that this third possibility
> > (where T1 appears to have never run) is possible with the blocksize
> > filter.  And we should probably consider it a data-corruption bug.
> > Your blocksize example of 10 (or 0x10) bytes is unlikely, but we are
> > more likely to hit scenarios where an older guest assumes it is
> > writing to 512-byte aligned hardware, while using the blocksize
> > filter to try and guarantee atomic RMW access to modern 4k hardware.
> > The older client will be unaware that it must avoid parallel writes
> > that are 512-aligned but land in the same 4k page, so it seems like
> > the blocksize filter should be doing that avoidance on its behalf.
>
> Yes, I was just picking a very small number to illustrate the
> problem.  I have seen this happen in practice with much larger block
> sizes (32 kB+).
>
> > You have just demonstrated that our current approach (grabbing a
> > single semaphore, only around the unaligned portions) does not do
> > what we hoped.  So what WOULD protect us, while still allowing as
> > much parallelism as possible?
>
> How about a per-block lock as implemented for the S3 plugin in
> https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?

That feels pretty heavyweight.  It may allow more parallelism, but at
the expense of more resources.  I'm hoping that a single rwlock will
do (a rough sketch of the shape I have in mind is appended after my
signature), although it may be dominated by starvation time when
toggling between unaligned actions (serialized) and aligned actions
(parallel).

> It might be a bit harder to do in plain C because of the absence of
> set datatypes, but I think it should work for the blocksize filter
> as well.

Just because the language does not have a builtin set type doesn't
mean we can't code one; but you're right that it would be more code.
I'm still hoping to post a first draft of an rwlock fix for the
blocksize filter later today (the hardest part is coming up with a
testsuite that will reliably demonstrate the race without the patch).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
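P.S. The rwlock shape I have in mind, as a rough sketch only: fully
aligned requests share the lock, anything needing RMW takes it
exclusively.  Here do_write() stands in for the filter's existing
head/body/tail logic and is assumed, not shown.

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define BLOCKSIZE 0x10

  static pthread_rwlock_t rmw_lock = PTHREAD_RWLOCK_INITIALIZER;

  /* Stand-in for the filter's existing write logic (head RMW,
   * aligned body, tail RMW). */
  extern int do_write (const void *buf, uint32_t count, uint64_t offset);

  static int
  locked_pwrite (const void *buf, uint32_t count, uint64_t offset)
  {
    bool aligned = offset % BLOCKSIZE == 0 && count % BLOCKSIZE == 0;
    int r;

    /* Aligned requests take the lock shared, so they still run in
     * parallel with one another; any request that needs RMW takes it
     * exclusive, so an aligned write can no longer land between the
     * read and the write-back of an overlapping RMW. */
    if (aligned)
      pthread_rwlock_rdlock (&rmw_lock);
    else
      pthread_rwlock_wrlock (&rmw_lock);

    r = do_write (buf, count, offset);

    pthread_rwlock_unlock (&rmw_lock);
    return r;
  }

The cost is the starvation mentioned above: an unaligned request has
to drain every in-flight aligned request before it gets the lock
exclusively, and vice versa.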