On 21/05/16 05:48, Emilio G. Cota wrote:
> On Sat, May 21, 2016 at 01:13:20 +0300, Sergey Fedorov wrote:
>> Although the API is mostly intuitive some kernel-doc-style comments
>> wouldn’t hurt, I think. ;-)
> The nit that bothered me is the "external lock needed" bit, but it's
> removed by the subsequent patch (which once it gets reviewed should be merged
> onto this patch); I think the interface is simple enough that comments
> would just add noise and maintenance burden. Plus, there are tests under
> tests/.

The interface is simple enough, but e.g. the return value convention for
some of the functions may not be clear at first glance. Regarding the
maintenance burden: once we have a good, stable API, keeping the comments
up to date shouldn't be painful.
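For instance, a comment along these lines for the lookup function would
already answer that question (just a sketch from me: the wording is a
guess, and the prototype of qht_lookup() is inferred from qht_do_lookup()
quoted below, so it may not match the patch exactly):

    /**
     * qht_lookup - look up a pointer in a QHT
     * @ht: QHT to be looked up
     * @func: function to compare existing pointers against @userp
     * @userp: pointer to pass to @func
     * @hash: hash of the pointer to be looked up
     *
     * (the locking/RCU rules for callers would be spelled out here)
     *
     * Returns the corresponding pointer when a match is found.
     * Returns NULL otherwise.
     */
    void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
                     const void *userp, uint32_t hash);

The same could be done for the other calls whose return value is not
obvious.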
> (snip)
>>> +/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
>>> +#if HOST_LONG_BITS == 32
>>> +#define QHT_BUCKET_ENTRIES 6
>>> +#else /* 64-bit */
>>> +#define QHT_BUCKET_ENTRIES 4
>>> +#endif
>>> +
>>> +struct qht_bucket {
>>> +    QemuSpin lock;
>>> +    QemuSeqLock sequence;
>>> +    uint32_t hashes[QHT_BUCKET_ENTRIES];
>>> +    void *pointers[QHT_BUCKET_ENTRIES];
>>> +    struct qht_bucket *next;
>>> +} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
>>> +
>>> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
>> Have you considered using separate structures for head buckets and
>> non-head buckets, e.g. "struct qht_head_bucket" and "struct
>> qht_added_bucket"? This would give us a little more entries per cache-line.
> I considered it. Note however that the gain would only apply to
> 32-bit hosts, since on 64-bit we'd only save 8 bytes but we'd
> need 12 to store hash+pointer. (lock+sequence=8, hashes=4*4=16,
> pointers=4*8=32, next=8, that is 8+16+32+8=32+32=64).
>
> On 32-bits with 6 entries we have 4 bytes of waste; we could squeeze in
> an extra entry. I'm reluctant to do this because (1) it would complicate
> code and (2) I don't think we should care too much about performance on
> 32-bit hosts.

Fair enough.

> (snip)
>>> +static inline
>>> +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
>>> +                    const void *userp, uint32_t hash)
>>> +{
>>> +    struct qht_bucket *b = head;
>>> +    int i;
>>> +
>>> +    do {
>>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>>> +            if (atomic_read(&b->hashes[i]) == hash) {
>>> +                void *p = atomic_read(&b->pointers[i]);
>> Why do we need this atomic_read() and other (looking a bit inconsistent)
>> atomic operations on 'b->pointers' and 'b->hash'? if we always have to
>> access them protected properly by a seqlock together with a spinlock?
> [ There should be consistency: read accesses use the atomic ops to read,
>   while write accesses have acquired the bucket lock so don't need them.
>   Well, they need care when they write, since there may be concurrent
>   readers. ]

Well, I see the consistency now =)

> I'm using atomic_read but what I really want is ACCESS_ONCE. That is:
> (1) Make sure that the accesses are done in a single instruction (even
>     though gcc doesn't explicitly guarantee it even to aligned addresses
>     anymore[1])
> (2) Make sure the pointer value is only read once, and never refetched.
> This is what comes right after the pointer is read:
>> +                if (likely(p) && likely(func(p, userp))) {
>> +                    return p;
>> +                }
> Refetching the pointer value might result in us passing a NULL p
> value to the comparison function (since there may be concurrent
> updaters!), with an immediate segfault. See [2] for a discussion on
> this (essentially the compiler assumes that there's only a single
> thread).
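Right, now I see why the pointer must be read exactly once. Just to check
that I read [2] correctly, this is the kind of transformation we are
guarding against (a standalone, compile-only sketch of mine, not the patch
code: the names 'slot', 'match_fn' and so on are made up, and the volatile
cast is only an approximation of what atomic_read() does here):

    #include <stdbool.h>
    #include <stddef.h>

    typedef bool (*match_fn)(const void *candidate, const void *userp);

    /*
     * Plain load: since the compiler assumes a single thread, it is free
     * to drop the local copy and re-read *slot for each use.  A concurrent
     * updater clearing the slot in between can then make us call f() with
     * a NULL pointer even though the NULL check passed.
     */
    static void *lookup_refetch_prone(void **slot, match_fn f,
                                      const void *userp)
    {
        void *p = *slot;                    /* may legally be re-fetched below */

        if (p && f(p, userp)) {
            return p;
        }
        return NULL;
    }

    /*
     * Forcing a single load (roughly what atomic_read()/ACCESS_ONCE gives
     * us here) pins the value: the NULL check and the call are guaranteed
     * to see the same pointer.
     */
    static void *lookup_read_once(void **slot, match_fn f,
                                  const void *userp)
    {
        void *p = *(void * volatile *)slot; /* read exactly once */

        if (p && f(p, userp)) {
            return p;
        }
        return NULL;
    }

Given that, keeping atomic_read() for the pointer while dropping it for
the hash, as you do below, looks right to me.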
>
> Given that even reading a garbled hash is OK (we don't really need (1),
> since the seqlock will make us retry anyway), I've changed the code to:
>
>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> -            if (atomic_read(&b->hashes[i]) == hash) {
> +            if (b->hashes[i] == hash) {
> +                /* make sure the pointer is read only once */
>                  void *p = atomic_read(&b->pointers[i]);
>
>                  if (likely(p) && likely(func(p, userp))) {
>
> Performance-wise this is the impact after 10 tries for:
>   $ taskset -c 0 tests/qht-bench \
>     -d 5 -n 1 -u 0 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
> on my Haswell machine I get, in Mops/s:
>   atomic_read() for all       40.389 +- 0.20888327415622
>   atomic_read(p) only         40.759 +- 0.212835356294224
>   no atomic_read(p) (unsafe)  40.559 +- 0.121422128680622
>
> Note that the unsafe version is slightly slower; I guess the CPU is trying
> to speculate too much and is gaining little from it.
>
> [1] "Linux-Kernel Memory Model" by Paul McKenney
>     http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4374.html
> [2] https://lwn.net/Articles/508991/

Okay.

> (snip)
>>> +/*
>>> + * Find the last valid entry in @head, and swap it with @orig[pos], which has
>>> + * just been invalidated.
>>> + */
>>> +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
>>> +{
>>> +    struct qht_bucket *b = orig;
>>> +    struct qht_bucket *prev = NULL;
>>> +    int i;
>>> +
>>> +    if (qht_entry_is_last(orig, pos)) {
>>> +        return;
>>> +    }
>>> +    do {
>>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> We could iterate in the opposite direction: from the last entry in a
>> qht_bucket to the first. It would allow us to fast-forward to the next
>> qht_bucket in a chain in case of non-NULL last entry and speed up the
>> search.
> But it would slow us down if--say--only the first entry is set. Also
> it would complicate the code a bit.
>
> Note that with the resizing threshold that we have, we're guaranteed to
> have only up to 1/8 of the head buckets full. We should therefore optimize
> for the case where the head bucket isn't full.

Okay.

Kind regards,
Sergey