On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> > <john.nay...@enterprisedb.com> wrote:
>
> > I think that we need to prevent concurrent updates (RT_SET() and
> > RT_DELETE()) during the iteration to get the consistent result through
> > the whole iteration operation. Unlike other operations such as
> > RT_SET(), we cannot expect that a job doing something for each
> > key-value pair in the radix tree completes in a short time, so we
> > cannot keep holding the radix tree lock until the end of the
> > iteration.
>
> This sounds like a performance concern, rather than a correctness concern, is 
> that right? If so, I don't think we should worry too much about optimizing 
> simple locking, because it will *never* be fast enough for highly-concurrent 
> read-write workloads anyway, and anyone interested in those workloads will 
> have to completely replace the locking scheme, possibly using one of the 
> ideas in the last ART paper you mentioned.
>
> The first implementation should be simple, easy to test/verify, easy to 
> understand, and easy to replace. As much as possible anyway.

Yes, but if a concurrent writer has to wait for another process to finish
the iteration, it ends up waiting on an LWLock, which is not
interruptible.

>
> > So the idea is that we set iter_active to true (with the
> > lock in exclusive mode), and prevent concurrent updates when the flag
> > is true.
>
> ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer 
> that to waiting.

Right. I think that if we want to wait rather than raise an ERROR, the
waiter should wait in an interruptible way, for example on a condition
variable. I took the simpler approach in the v22 patch.

...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding an LWLock on the partition, so my assumption might be wrong.
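
In any case, an interruptible wait could look roughly like the sketch
below. This is not code from the patch; the function name and the struct
members (lock, iter_active, iter_done_cv) are made up for illustration,
and the iterator side would be expected to clear iter_active and
broadcast on the condition variable when it finishes.

#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/lwlock.h"
#include "utils/wait_event.h"

typedef struct RT_RADIX_TREE
{
    LWLock      lock;           /* protects the tree and iter_active */
    bool        iter_active;    /* true while an iteration is in progress */
    ConditionVariable iter_done_cv; /* broadcast when an iteration ends */
    /* ... tree contents ... */
} RT_RADIX_TREE;

/*
 * Called by RT_SET()/RT_DELETE() before modifying the tree.  Instead of
 * raising an ERROR when iter_active is set (as v22 does), wait on a
 * condition variable so the wait is interruptible.
 */
static void
rt_begin_write(RT_RADIX_TREE *tree)
{
    LWLockAcquire(&tree->lock, LW_EXCLUSIVE);

    while (tree->iter_active)
    {
        /* don't sleep while holding the LWLock */
        LWLockRelease(&tree->lock);
        ConditionVariableSleep(&tree->iter_done_cv, PG_WAIT_EXTENSION);
        LWLockAcquire(&tree->lock, LW_EXCLUSIVE);
    }
    ConditionVariableCancelSleep();

    /* the caller modifies the tree, then releases tree->lock */
}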

>
> > > Since there were calls to LWLockAcquire/Release in the last version, I'm 
> > > a bit confused by this. Perhaps for the next patch, the email should 
> > > contain a few sentences describing how locking is intended to work, 
> > > including for iteration.
> >
> > The lock I'm thinking of adding is a simple readers-writer lock. This
> > lock is used for concurrent radix tree operations except for the
> > iteration. For operations concurrent to the iteration, I used a flag
> > for the reason I mentioned above.
>
> This doesn't tell me anything -- we already agreed on "simple reader-writer 
> lock", months ago I believe. And I only have a vague idea about the tradeoffs 
> made regarding iteration.
>
> + * WIP: describe about how locking works.
>
> A first draft of what is intended for this WIP would be a good start. This 
> WIP is from v23-0016, which contains no comments and a one-line commit 
> message. I'd rather not try closely studying that patch (or how it works with 
> 0011) until I have a clearer understanding of what requirements are assumed, 
> what trade-offs are considered, and how it should be tested.
>
> [thinks some more...] Is there an API-level assumption that hasn't been 
> spelled out? Would it help to have a parameter for whether the iteration 
> function wants to reserve the privilege to perform writes? It could take the 
> appropriate lock at the start, and there could then be multiple read-only 
> iterators, but only one read/write iterator. Note, I'm just guessing here, 
> and I don't want to make things more difficult for future improvements.

That seems like a good idea. Given the use case of parallel heap vacuum,
it would be good to support multiple read-only iterators. The iteration
in v22 is read-only, so if we want to support a read-write iterator, we
would also need a function that modifies the current key-value pair
returned by the iteration.
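
Something like the following is roughly what I have in mind for the API,
building on the made-up RT_RADIX_TREE sketch above; the for_update
parameter and the RT_ITER layout are assumptions, not what the template
currently has.

typedef struct RT_ITER
{
    RT_RADIX_TREE *tree;
    bool        exclusive;      /* did the caller ask to write? */
    /* ... iteration state ... */
} RT_ITER;

/*
 * Read-only iterators take the tree lock in shared mode, so any number
 * of them can run concurrently; a read-write iterator takes it in
 * exclusive mode, so there can be only one at a time.
 */
RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree, bool for_update)
{
    RT_ITER    *iter = palloc0(sizeof(RT_ITER));

    iter->tree = tree;
    iter->exclusive = for_update;
    LWLockAcquire(&tree->lock, for_update ? LW_EXCLUSIVE : LW_SHARED);

    return iter;
}

void
RT_END_ITERATE(RT_ITER *iter)
{
    LWLockRelease(&iter->tree->lock);
    pfree(iter);
}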

>
> > > Hmm, I wonder if we need to use the isolation tester. It's both a 
> > > blessing and a curse that the first client of this data structure is tid 
> > > lookup. It's a blessing because it doesn't present a highly-concurrent 
> > > workload mixing reads and writes and so simple locking is adequate. It's 
> > > a curse because to test locking and have any chance of finding bugs, we 
> > > can't rely on vacuum to tell us that because (as you've said) it might 
> > > very well work fine with no locking at all. So we must come up with test 
> > > cases ourselves.
> >
> > Using the isolation tester to test locking seems like a good idea. We
> > can include it in test_radixtree. But given that the locking in the
> > radix tree is very simple, the test case would be very simple. It may
> > be controversial whether it's worth adding such testing by adding both
> > the new test module and test cases.
>
> I mean that the isolation tester (or something else) would contain test 
> cases. I didn't mean to imply redundant testing.

Okay, understood.

>
> > I think the user (e.g, vacuumlazy.c) can pass the maximum offset
> > number to the parallel vacuum.
>
> Okay, sounds good.
>
> Most of v23's cleanups/fixes in the radix template look good to me, although 
> I didn't read the debugging code very closely. There is one exception:
>
> 0006 - I've never heard of memset'ing a variable to avoid "variable unused" 
> compiler warnings, and it seems strange. It turns out we don't actually need 
> this variable in the first place. The attached .txt patch removes the local 
> variable and just writes to the passed pointer. This required callers to 
> initialize a couple of their own variables, but only child pointers, at least 
> on gcc 12.

Agreed; the attached patch looks good to me.
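
To spell out the pattern with made-up names (these are not the actual
hunks from your patch): the search helper writes its result through the
caller-supplied pointer and simply leaves it untouched on a miss, which
is why a couple of callers now have to initialize their own child
variable.

#include "postgres.h"

typedef uint64 RT_PTR_ALLOC;        /* stand-in for the template's type */
#define RT_INVALID_PTR_ALLOC 0

typedef struct RT_NODE_4
{
    uint8       count;
    uint8       chunks[4];
    RT_PTR_ALLOC children[4];
} RT_NODE_4;

/* No local variable, hence nothing to memset to silence the warning. */
static bool
rt_node_4_search(RT_NODE_4 *node, uint8 chunk, RT_PTR_ALLOC *child_p)
{
    for (int i = 0; i < node->count; i++)
    {
        if (node->chunks[i] == chunk)
        {
            *child_p = node->children[i];   /* write the caller's variable */
            return true;
        }
    }
    return false;               /* *child_p deliberately left alone */
}

/* A caller initializes its own variable before the call. */
static RT_PTR_ALLOC
rt_find_child(RT_NODE_4 *node, uint8 chunk)
{
    RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;

    (void) rt_node_4_search(node, chunk, &child);
    return child;
}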

>  And I will work later on making "value" in the public API a pointer.

Thanks!

>
> 0017 - I haven't taken a close look at the new changes, but I did notice this 
> some time ago:
>
> + if (TidStoreIsShared(ts))
> + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> + else
> + return sizeof(TidStore) + sizeof(TidStore) +
> + local_rt_memory_usage(ts->tree.local);
>
> There is repetition in the else branch.

Agreed, will remove.
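
For the archives, the corrected branch would be just (names as in the
snippet you quoted):

    if (TidStoreIsShared(ts))
        return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
    else
        return sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);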

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

