On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> > <john.nay...@enterprisedb.com> wrote:
> >
> > I think that we need to prevent concurrent updates (RT_SET() and
> > RT_DELETE()) during the iteration to get a consistent result through
> > the whole iteration operation. Unlike other operations such as
> > RT_SET(), we cannot expect that a job doing something for each
> > key-value pair in the radix tree completes in a short time, so we
> > cannot keep holding the radix tree lock until the end of the
> > iteration.
>
> This sounds like a performance concern, rather than a correctness concern, is
> that right? If so, I don't think we should worry too much about optimizing
> simple locking, because it will *never* be fast enough for highly-concurrent
> read-write workloads anyway, and anyone interested in those workloads will
> have to completely replace the locking scheme, possibly using one of the
> ideas in the last ART paper you mentioned.
>
> The first implementation should be simple, easy to test/verify, easy to
> understand, and easy to replace. As much as possible anyway.

Yes, but if a concurrent writer waits for another process to finish the
iteration, it ends up waiting on an LWLock, which is not interruptible.

> > So the idea is that we set iter_active to true (with the
> > lock in exclusive mode), and prevent concurrent updates when the flag
> > is true.
>
> ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer
> that to waiting.

Right. If we want to wait rather than raise an ERROR, the waiter should
wait in an interruptible way, for example on a condition variable. I
took a simpler approach in the v22 patch.

...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding an LWLock on the partition, so my assumption might be
wrong.
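To be concrete, the interruptible wait could look something like the
sketch below. Note that this is not what v22 does; the iter_cv field,
the function names, and the wait event are all made up for
illustration, and a real implementation would also have to re-check
iter_active under the tree's lock before proceeding:

/*
 * Hypothetical sketch: a writer sleeps interruptibly on a condition
 * variable until a concurrent iteration finishes, instead of throwing
 * elog(ERROR).
 */
static void
RT_WAIT_FOR_ITERATION(RT_RADIX_TREE *tree)
{
	ConditionVariablePrepareToSleep(&tree->ctl->iter_cv);
	while (tree->ctl->iter_active)
		ConditionVariableSleep(&tree->ctl->iter_cv,
							   WAIT_EVENT_RADIX_TREE_ITERATION);
	ConditionVariableCancelSleep();
}

/* ...and the iterator wakes up any waiters when it finishes: */
static void
RT_ITERATION_DONE(RT_RADIX_TREE *tree)
{
	tree->ctl->iter_active = false;
	ConditionVariableBroadcast(&tree->ctl->iter_cv);
}

Unlike a backend sleeping on an LWLock, a writer waiting this way can
respond to query cancellation.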
> > > Since there were calls to LWLockAcquire/Release in the last version, I'm
> > > a bit confused by this. Perhaps for the next patch, the email should
> > > contain a few sentences describing how locking is intended to work,
> > > including for iteration.
> >
> > The lock I'm thinking of adding is a simple readers-writer lock. This
> > lock is used for concurrent radix tree operations except for the
> > iteration. For operations concurrent to the iteration, I used a flag
> > for the reason I mentioned above.
>
> This doesn't tell me anything -- we already agreed on "simple reader-writer
> lock", months ago I believe. And I only have a vague idea about the tradeoffs
> made regarding iteration.
>
> + * WIP: describe about how locking works.
>
> A first draft of what is intended for this WIP would be a good start. This
> WIP is from v23-0016, which contains no comments and a one-line commit
> message. I'd rather not try closely studying that patch (or how it works with
> 0011) until I have a clearer understanding of what requirements are assumed,
> what trade-offs are considered, and how it should be tested.
>
> [thinks some more...] Is there an API-level assumption that hasn't been
> spelled out? Would it help to have a parameter for whether the iteration
> function wants to reserve the privilege to perform writes? It could take the
> appropriate lock at the start, and there could then be multiple read-only
> iterators, but only one read/write iterator. Note, I'm just guessing here,
> and I don't want to make things more difficult for future improvements.

That seems like a good idea. Given the use case of parallel heap
vacuum, supporting multiple read-only iterators would be useful. The
iteration in v22 is read-only, so to support a read-write iterator we
would also need a function that modifies the current key-value pair
returned by the iteration.
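To make sure I understand the suggestion, a rough sketch of such an
API is below. The "exclusive" parameter and the field names are my
assumptions, not something in the current patches, and holding the
lock for the whole iteration is exactly the trade-off discussed above:

/*
 * Sketch: the caller declares up front whether it needs to modify
 * entries during the iteration.
 */
RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree, bool exclusive)
{
	RT_ITER    *iter = palloc0(sizeof(RT_ITER));

	iter->tree = tree;
	iter->exclusive = exclusive;

	/*
	 * A shared lock allows any number of concurrent read-only
	 * iterators; an exclusive lock allows a single read-write iterator
	 * and blocks all other iterators and writers.
	 */
	LWLockAcquire(&tree->ctl->lock, exclusive ? LW_EXCLUSIVE : LW_SHARED);

	return iter;
}

void
RT_END_ITERATE(RT_ITER *iter)
{
	LWLockRelease(&iter->tree->ctl->lock);
	pfree(iter);
}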
> > > Hmm, I wonder if we need to use the isolation tester. It's both a
> > > blessing and a curse that the first client of this data structure is tid
> > > lookup. It's a blessing because it doesn't present a highly-concurrent
> > > workload mixing reads and writes and so simple locking is adequate. It's
> > > a curse because to test locking and have any chance of finding bugs, we
> > > can't rely on vacuum to tell us that because (as you've said) it might
> > > very well work fine with no locking at all. So we must come up with test
> > > cases ourselves.
> >
> > Using the isolation tester to test locking seems like a good idea. We
> > can include it in test_radixtree. But given that the locking in the
> > radix tree is very simple, the test cases would also be very simple,
> > and it may be controversial whether they are worth adding a new test
> > module for.
>
> I mean that the isolation tester (or something else) would contain test
> cases. I didn't mean to imply redundant testing.

Okay, understood.

> > I think the user (e.g., vacuumlazy.c) can pass the maximum offset
> > number to the parallel vacuum.
>
> Okay, sounds good.
>
> Most of v23's cleanups/fixes in the radix template look good to me, although
> I didn't read the debugging code very closely. There is one exception:
>
> 0006 - I've never heard of memset'ing a variable to avoid "variable unused"
> compiler warnings, and it seems strange. It turns out we don't actually need
> this variable in the first place. The attached .txt patch removes the local
> variable and just writes to the passed pointer. This required callers to
> initialize a couple of their own variables, but only child pointers, at least
> on gcc 12.

Agreed with the attached patch.

> And I will work later on making "value" in the public API a pointer.

Thanks!

> 0017 - I haven't taken a close look at the new changes, but I did notice this
> some time ago:
>
> + if (TidStoreIsShared(ts))
> +     return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> + else
> +     return sizeof(TidStore) + sizeof(TidStore) +
> +         local_rt_memory_usage(ts->tree.local);
>
> There is repetition in the else branch.

Agreed, will remove the duplicated sizeof(TidStore) so that the else
branch returns sizeof(TidStore) + local_rt_memory_usage(ts->tree.local).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com