On 2023-03-13 11:30, Ondřej Surý wrote:
Hi Mathieu,

I spent some more time with userspace-rcu on Friday and over the weekend, and I am now in a much better place.

On 13. 3. 2023, at 15:29, Mathieu Desnoyers <mathieu.desnoy...@efficios.com> 
wrote:

On 2023-03-11 01:04, Ondřej Surý via lttng-dev wrote:
Hey,
so, we are integrating userspace-rcu into BIND 9 (yay!) and, as an experiment, I am rewriting the internal address database (it keeps the infrastructure information about names and addresses).

That's indeed very interesting!

Thanks for userspace-rcu! It saves a lot of time - while my colleague Tony Finch already wrote our internal QSBR implementation from scratch, it would be a waste of time to try to reimplement the CDS part of the library.

This is part of a larger work to replace the internal BIND 9 database, which is currently implemented as an rwlocked RBT, with a qp-trie; if you are interested, Tony has a good summary here: https://dotat.at/@/2023-02-28-qp-bind.html

Speaking of tries, I implemented RCU Judy arrays in liburcu feature branches a while back. Those never made it to the liburcu master branch because I have had no real-life use for them so far, and I did not want to expose a public API that would bitrot without real-life user feedback.

The lookups and ordered traversals (next/prev) are entirely RCU, and updates are either single-threaded, or use a strategy where locking is distributed within the trie so that updates to spatially discontinuous data do not contend with each other.

My original implementation supported integer keys as well as variable-length string keys.

The advantage of Judy arrays is that they minimize the number of cache lines touched on lookup traversal. Let me know if this would be useful for your use cases, and if so I can provide links to prototype branches.

[...]


So this is the part with the hashtable lookup, which seems to work well:
         rcu_read_lock();
         struct cds_lfht_iter iter;
         struct cds_lfht_node *ht_node;
         cds_lfht_lookup(adb->names_ht, hashval, names_match, &key, &iter);
         ht_node = cds_lfht_iter_get_node(&iter);
         bool unlink = false;
         if (ht_node == NULL) {
                 /* Allocate a new name and add it to the hash table. */
                 adbname = new_adbname(adb, name, start_at_zone);
                 ht_node = cds_lfht_add_unique(adb->names_ht, hashval,
                                               names_match, &key,
                                               &adbname->ht_node);
                 if (ht_node != &adbname->ht_node) {
                         /* ISC_R_EXISTS */
                         destroy_adbname(adbname);
                         adbname = NULL;
                 }
         }
         if (adbname == NULL) {
                 INSIST(ht_node != NULL);
                 adbname = caa_container_of(ht_node, dns_adbname_t, ht_node);
                 unlink = true;
         }
         dns_adbname_ref(adbname);

What is this dns_adbname_ref() supposed to do? And is there a reference to adbname that is still used after rcu_read_unlock()? What guarantees the existence of the adbname after rcu_read_unlock()?

This is part of the internal reference counting - there's a macro that expects an `isc_refcount_t references;` member on the struct, and it creates _ref, _unref, _attach and _detach functions for each struct.

The last _detach/_unref calls a destroy function.
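
Roughly, the generated functions behave like the sketch below (purely illustrative - the real macro in our tree has extra tracing and assertions, and isc_refcount_increment/isc_refcount_decrement here just stand in for whatever atomics it actually uses):

    /* Hypothetical sketch of what such a macro could generate for
     * dns_adbname_t; not the actual BIND 9 macro expansion. */
    static inline void
    dns_adbname_ref(dns_adbname_t *name) {
            isc_refcount_increment(&name->references);
    }

    static inline void
    dns_adbname_unref(dns_adbname_t *name) {
            /* Assumes the decrement returns the previous value. */
            if (isc_refcount_decrement(&name->references) == 1) {
                    /* Last reference dropped: tear the object down. */
                    destroy_adbname(name);
            }
    }

    static inline void
    dns_adbname_attach(dns_adbname_t *source, dns_adbname_t **targetp) {
            dns_adbname_ref(source);
            *targetp = source;
    }

    static inline void
    dns_adbname_detach(dns_adbname_t **namep) {
            dns_adbname_t *name = *namep;
            *namep = NULL;
            dns_adbname_unref(name);
    }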

         rcu_read_unlock();
and here's the part where the LRU gets updated:
         LOCK(&adbname->lock); /* Must be unlocked by the caller */

I suspect you use a scheme where you hold the RCU read-side lock to perform the lookup, and then you use the object with an internal lock held. But expecting the object to still exist after rcu_read_unlock() is incorrect, unless some other reference counting scheme is used.

Yeah, I was trying to minimize the sections where we hold the rcu_read locks, but I gave up and now the rcu_read lock is held for longer periods of time.

We've used that kind of scheme in LTTng's lttng-relayd, where we use RCU for a short-term existence guarantee, and reference counting for a longer-term existence guarantee. An example can be found here:

https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-relayd/viewer-stream.cpp

viewer_stream_get_by_id() attempts a lookup from the hash table, and re-validates that the object still exists with viewer_stream_get(), which checks whether the refcount is already 0 as it tries to increment it with urcu_ref_get_unless_zero(). If it is zero, it acts as if the object was not found. I recommend this kind of scheme if you intend to use both RCU and reference counting.
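
Transposed to your adbname lookup, the shape would be roughly the sketch below (just an illustration: adbname_get(), the `ref` member and the key type are made up here; only the liburcu calls are real):

    #include <urcu.h>
    #include <urcu/ref.h>

    /* Sketch: RCU protects the lookup itself, the refcount protects the
     * object after rcu_read_unlock(). */
    static dns_adbname_t *
    adbname_get(dns_adb_t *adb, unsigned long hashval, const void *key) {
            struct cds_lfht_iter iter;
            struct cds_lfht_node *ht_node;
            dns_adbname_t *adbname = NULL;

            rcu_read_lock();
            cds_lfht_lookup(adb->names_ht, hashval, names_match, key, &iter);
            ht_node = cds_lfht_iter_get_node(&iter);
            if (ht_node != NULL) {
                    adbname = caa_container_of(ht_node, dns_adbname_t, ht_node);
                    /* Re-validate: if the refcount already hit zero, the object
                     * is being torn down; act as if it was not found. */
                    if (!urcu_ref_get_unless_zero(&adbname->ref)) {
                            adbname = NULL;
                    }
            }
            rcu_read_unlock();
            /* The reference taken above keeps adbname valid past this point. */
            return adbname;
    }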

Then you can place a mutex within the object, and use that mutex to provide mutual exclusion between concurrent accesses to the object that need to be serialized.

In the destroy handler (called when the reference count reaches 0), you will typically want to unlink your object from the various data structures holding references to it (hash tables, lists), and then use call_rcu() to invoke reclaim of the object after a grace period.
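
As a rough sketch (the rcu_head member, the adb back-pointer and the function names are made up; only the liburcu calls are real):

    /* Sketch of a refcount release handler: unlink first, then defer the
     * actual free with call_rcu() so concurrent RCU readers that still
     * hold a pointer to the object remain safe. */
    static void
    adbname_free_rcu(struct rcu_head *head) {
            dns_adbname_t *adbname = caa_container_of(head, dns_adbname_t, rcu_head);

            /* A grace period has elapsed: no reader can see the object. */
            destroy_adbname(adbname);
    }

    static void
    adbname_release(struct urcu_ref *ref) {
            dns_adbname_t *adbname = caa_container_of(ref, dns_adbname_t, ref);
            dns_adb_t *adb = adbname->adb;  /* assumed back-pointer */

            /* Unlink from all containers holding the object. */
            rcu_read_lock();
            (void)cds_lfht_del(adb->names_ht, &adbname->ht_node);
            rcu_read_unlock();

            LOCK(&adb->names_lru_lock);
            cds_list_del_rcu(&adbname->list_node);
            UNLOCK(&adb->names_lru_lock);

            /* Reclaim only after a grace period. */
            call_rcu(&adbname->rcu_head, adbname_free_rcu);
    }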


         if (NAME_DEAD(adbname)) {
                 UNLOCK(&adbname->lock);
                 dns_adbname_detach(&adbname);
                 goto again;
         }
         if (adbname->last_used + ADB_CACHE_MINIMUM <= last_update) {
                 adbname->last_used = now;
                 LOCK(&adb->names_lru_lock);
                 if (unlink) {
                         cds_list_del_rcu(&adbname->list_node);
                 }

This looks odd. I don't see the code implementing traversal of this list, but I would expect a grace period between unlink of the node from a list and insertion into another list; otherwise, if there are RCU readers traversing the list concurrently, they can observe an inconsistent state.

That's probably the part I am getting wrong. This is all fairly new to me and my brain is still adjusting to the new paradigms.

One way would be to invoke synchronize_rcu() between del_rcu and add_tail_rcu,
but it would probably be too costly performance-wise.

Another way that would be a better fit to the RCU mindset would be to create a
copy of the adbname object, add _that_ object to the tail, and use call_rcu
to free the old object that has been removed. There is of course extra overhead
associated with this copy.
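
To make the first option concrete (shown only for the ordering; as said, a synchronize_rcu() on every LRU promotion is likely too expensive, and it must not be done where readers could block on the LRU lock inside their RCU read-side critical sections):

    /* Sketch: readers traversing the LRU under rcu_read_lock() never see
     * the node concurrently unlinked and re-inserted. */
    LOCK(&adb->names_lru_lock);
    cds_list_del_rcu(&adbname->list_node);
    UNLOCK(&adb->names_lru_lock);

    synchronize_rcu();      /* wait until no reader can still see the node */

    LOCK(&adb->names_lru_lock);
    cds_list_add_tail_rcu(&adbname->list_node, &adb->names_lru);
    UNLOCK(&adb->names_lru_lock);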


                 cds_list_add_tail_rcu(&adbname->list_node, &adb->names_lru);
                 UNLOCK(&adb->names_lru_lock);
         }
The NAME_DEAD gets updated under the adbname->lock in expire_name():
         if (!NAME_DEAD(adbname)) {
                 adbname->flags |= NAME_IS_DEAD;
                 /* Remove the adbname from the hashtable... */
                 (void)cds_lfht_del(adb->names_ht, &adbname->ht_node);

I don't have the full context here, but AFAIR cds_lfht_del() allows two removals of the same ht_node to be done concurrently, and only one will succeed (which is probably what happens here). cds_list_del_rcu(), however, does not allow concurrent removals of a list_node. So if you somehow get two RCU lookups to find the same node in expire_name, one will likely do an extra, unexpected cds_list_del_rcu().

Ah, that might explain the behaviour. However, the current branch doesn't manifest this anymore.

You can theoretically test the success/failure result of cds_lfht_del() to prevent doing the problematic cds_list_del_rcu(). But beware of double-free or use-after-free if you end up relying on the success of cds_lfht_del() to prevent tearing down the same object twice.
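
Something along these lines (a sketch; cds_lfht_del() returns 0 only for the one caller that actually removed the node, and a negative value otherwise):

    /* Assuming this runs under rcu_read_lock(), as cds_lfht_del() requires. */
    if (!NAME_DEAD(adbname)) {
            adbname->flags |= NAME_IS_DEAD;
            /* Only the winner of the cds_lfht_del() race may unlink the
             * node from the LRU and trigger teardown; the loser must not
             * touch it again. */
            if (cds_lfht_del(adb->names_ht, &adbname->ht_node) == 0) {
                    LOCK(&adb->names_lru_lock);
                    cds_list_del_rcu(&adbname->list_node);
                    UNLOCK(&adb->names_lru_lock);
            }
    }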


                 /* ... and LRU list */
                 LOCK(&adb->names_lru_lock);
                 cds_list_del_rcu(&adbname->list_node);
                 UNLOCK(&adb->names_lru_lock);
         }
So, now the problem is that sometimes I get a crash under load:
(gdb) bt
#0  0x00007fae87a34c96 in cds_list_del_rcu (elem=0x7fae37e78880) at /usr/include/x86_64-linux-gnu/urcu/rculist.h:71
#1  get_attached_and_locked_name (adb=adb@entry=0x7fae830142a0, name=name@entry=0x7fae804fc9b0, start_at_zone=true, now=<optimized out>) at adb.c:1446
#2  0x00007fae87a392bf in dns_adb_createfind (adb=0x7fae830142a0, loop=0x7fae842c3a20, cb=cb@entry=0x7fae87b28d9f <fctx_finddone>, cbarg=0x7fae7c679000, name=name@entry=0x7fae804fc9b0, qname=0x7fae7c679010, qtype=1, options=63, now=<optimized out>, target=0x0, port=53, depth=1, qc=0x7fae7c651060, findp=0x7fae804fc698) at adb.c:2149
(gdb) frame 0
#0  0x00007fae87a34c96 in cds_list_del_rcu (elem=0x7fae37e78880) at /usr/include/x86_64-linux-gnu/urcu/rculist.h:71
71 elem->next->prev = elem->prev;
(gdb) print elem->next
$1 = (struct cds_list_head *) 0x0
(gdb) print elem
$2 = (struct cds_list_head *) 0x7fae37e78880
So, I suspect I am doing something wrong when updating the position of the name in the LRU list.
There are a couple of places where we iterate through the LRU list (overmem cleaning can kick in, user-initiated cleaning can start, shutdown can be happening...).

It makes me wonder whether you really need RCU for the LRU list. Are those lookups very frequent? And do they typically end up needing to grab a lock to protect against concurrent list modifications?

These are lookups that happen frequently when the cache is cold - it keeps the delegation mapping (e.g. **name** is served by **nameserver** (adbname), which has these **IP addresses** (adbentries)).

OK.

So the characteristics of this LRU, AFAIU, are:

- frequent enqueue (most recent),
- frequent "pick" item within the list to re-enqueue it as most recent,
- frequent dequeue of the last N oldest items.

However, I don't see that iteration over the list is inherently needed, at least not frequently, except as a means to implement the operations above; am I correct?
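
If that's the case, a plain mutex-protected list may be all you need for the LRU. A sketch of what I mean (hypothetical helper name; uses the non-RCU list primitives from urcu/list.h, and assumes nothing ever walks the LRU outside the lock):

    #include <urcu/list.h>

    /* Sketch: no RCU involvement for the LRU itself. Since every access
     * takes the lock, moving a node needs no grace period. */
    static void
    adbname_lru_touch(dns_adb_t *adb, dns_adbname_t *adbname) {
            LOCK(&adb->names_lru_lock);
            cds_list_del(&adbname->list_node);        /* pick the item ... */
            cds_list_add_tail(&adbname->list_node,
                              &adb->names_lru);       /* ... re-enqueue as most recent */
            UNLOCK(&adb->names_lru_lock);
    }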


There are two more places in the resolver hot path that are using hashtables now, which I intend to replace with rculfhash:

1. "finds" (this is for coalescing client requests, e.g. if 100 clients ask for google.com, there should be only one outgoing query)

2. resolver counters - these are used as a band-aid protection against random-subdomain attacks, but the logic is similar - make a lookup for the parent domain and if the counter exceeds <limit>, drop the extra query.

These four are in the cold-cache resolver hot path and are/were originally protected by an RWLOCK (using a non-glibc RWLOCK helped a bit, but RCU is better).

Yes, rwlock is known to have difficulty scaling compared to RCU. :)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
