On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> FWIW, so that I could get an idea of how the stats change as we improve the
> hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> attachment 2 attempting to fix the accounting for patches 9 and 10.
For qht, I dislike the approach of reporting "avg chain" per-element instead
of per-bucket. Performance for a bucket whose entries are all valid is
virtually the same as that of a bucket with only one valid entry; thus, with
per-bucket reporting, we'd say that the chain length is 1 in both cases,
i.e. "perfect". With per-element reporting, we'd report 4 (on a 64-bit host,
since that's the value of QHT_BUCKET_ENTRIES) when the bucket is full, which
IMO gives the wrong idea (users would think they're in trouble, when they're
not). There's a rough sketch of what I mean further down.

Using the avg-bucket-chain metric you can test how good the hashing is. For
instance, the metric is 1.01 for xxhash with phys_pc, pc and flags (i.e.
func5), and 1.21 if func5 takes only a valid phys_pc (with the other two
set to 0).

That said, I think reporting fully empty buckets, as well as the longest
chain (of buckets, for qht), in addition to this metric is a good idea.

> For booting an alpha kernel to login prompt:
>
> Before hashing changes (@5/11)
>
> TB count             175363/671088
> TB invalidate count  3996
> TB hash buckets      31731/32768
> TB hash avg chain    5.289 max=59
>
> After xxhash patch (@7/11)
>
> TB hash buckets      32582/32768
> TB hash avg chain    5.260 max=18
>
> So far so good!
>
> After qht patches (@11/11)
>
> TB hash buckets      94360/131072
> TB hash avg chain    1.774 max=8
>
> Do note that those last numbers are off: 1.774 avg * 94360 used buckets =
> 167394 total entries, which is far from 171367, the correct number of total
> entries.

If those numbers are off, then either this

  assert(hinfo.used_entries ==
         tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);

should trigger, or the accounting isn't right.

Another option is that "TB count - invalidate count" is different for each
test you ran. I think this is what's going on, since otherwise we couldn't
explain why the first report ("before 5/11") is also "wrong":

  5.289 * 31731 = 167825.259, which doesn't match 175363 - 3996 = 171367 either.

Only the second report ("after 7/11") adds up (allowing for the average being
reported with only three decimals):

  5.260 * 32582 = 171381.32 ~= 171367

which leads me to believe that you've used the TB and invalidate counts from
that test.

I just tested your patches (on an ARM bootup) and the assert doesn't trigger;
the stats are spot on for "after 11/11":

TB count             643610/2684354
TB hash buckets      369534/524288
TB hash avg chain    1.729 max=8
TB flush count       0
TB invalidate count  4718

1.729 * 369534 = 638924.286, which is ~= 643610 - 4718 = 638892.

> I'm tempted to pull over gcc's non-chaining hash table implementation
> (libiberty/hashtab.c, still gplv2+) and compare...

You can try, but I think performance wouldn't be great, because the comparison
function would be called way too often due to the ht using open addressing.
The problem is not just the comparisons themselves, but all the cache lines
that have to be brought in to read the fields being compared.

I haven't tested libiberty's htable, but I did test the htable in
concurrencykit[1], which also uses open addressing. With ck's ht, performance
was not good when booting ARM: IIRC ~30% of runtime was spent on tb_cmp(). I
also added the full hash to each TB so that it would be compared first, but it
didn't make a difference, since the delay was due to loading the cache line (I
saw this with perf(1)'s annotated code, which showed that ~80% of the time
spent in tb_cmp() was in performing the first load of the TB's fields).
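To make the per-bucket accounting and the cache-line point concrete, here's a
rough sketch (illustrative only -- the names and layout below are made up for
this mail and are not the actual qht.c code):

/*
 * Rough sketch only -- not the actual qht.c code; all names here are
 * made up for illustration.  The idea: each bucket packs a few
 * hash/pointer pairs plus an overflow pointer into one cache line, and
 * the "avg chain" stat is counted in buckets, not in elements.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BUCKET_ENTRIES 4  /* like QHT_BUCKET_ENTRIES on a 64-bit host */

struct sketch_bucket {
    uint32_t hashes[SKETCH_BUCKET_ENTRIES];  /* compared before dereferencing */
    void *pointers[SKETCH_BUCKET_ENTRIES];   /* NULL == empty slot */
    struct sketch_bucket *next;              /* overflow chain; rarely taken */
} __attribute__((aligned(64)));              /* head bucket fills a cache line */

struct sketch_stats {
    size_t used_head_buckets;   /* head buckets with at least one entry */
    size_t used_entries;        /* total valid entries */
    size_t chain_buckets;       /* non-empty buckets, head + overflow */
    size_t max_chain;           /* longest chain, in buckets */
};

static void sketch_stats_fill(const struct sketch_bucket *heads, size_t n,
                              struct sketch_stats *st)
{
    memset(st, 0, sizeof(*st));
    for (size_t i = 0; i < n; i++) {
        size_t chain = 0;

        for (const struct sketch_bucket *b = &heads[i]; b; b = b->next) {
            size_t valid = 0;

            for (int j = 0; j < SKETCH_BUCKET_ENTRIES; j++) {
                valid += b->pointers[j] != NULL;
            }
            st->used_entries += valid;
            chain += valid != 0;
        }
        if (chain) {
            st->used_head_buckets++;
            st->chain_buckets += chain;
            if (chain > st->max_chain) {
                st->max_chain = chain;
            }
        }
    }
}

/*
 * A full head bucket and a head bucket with a single entry both count
 * as a chain of 1, since a lookup touches one cache line either way;
 * per-element accounting would instead report 4 vs. 1 for the same cost.
 */
static double sketch_avg_bucket_chain(const struct sketch_stats *st)
{
    return st->used_head_buckets ?
           (double)st->chain_buckets / st->used_head_buckets : 0.0;
}

With this kind of accounting, the avg-chain numbers above roughly tell you how
many cache lines a successful lookup has to walk on average.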
This led me to a design that has buckets with a small set of hash & pointer
pairs, all in the same cache line as the head (then I discovered somebody else
had already thought of this, and that's why there's a link to the CLHT paper
in qht.c).

BTW, I tested ck's htable also because of a requirement we have for MTTCG,
which is to support lock-free concurrent lookups. AFAICT libiberty's ht
doesn't support this, so it might be a bit faster than ck's.

Thanks,

		Emilio

[1] http://concurrencykit.org/
    More info on their htable implementation here:
    http://backtrace.io/blog/blog/2015/03/13/workload-specialization/