On 4/16/20 7:42 AM, Stephen Long wrote:
> +static inline uint8_t do_histseg_cnt(uint8_t n, uint64_t m0, uint64_t m1)
> +{
> +    int esz = 0;

Clearer to use MO_8.

> +    int bits = 8 << esz;
> +    uint64_t ones = dup_const(esz, 1);
> +    uint64_t signs = ones << (bits - 1);
> +    uint64_t cmp0, cmp1;
> +
> +    cmp1 = dup_const(1, n);

Error in the esz argument here.

> +    cmp0 = cmp1 ^ m0;
> +    cmp1 = cmp1 ^ m1;
> +    cmp0 = (cmp0 - ones) & ~cmp0;
> +    cmp1 = (cmp1 - ones) & ~cmp1;
> +    return ctpop64((cmp0 | cmp1) & signs);
> +}

Ah, well, I may have been too brief in my suggestion before.  I encourage
you to have a look at the bithacks patch and understand the algorithm
here -- it's quite clever.

We cannot simply OR the two halves together, since 8 | 8 == 8 loses one
from the count of bits.  So:

    cmp0 = (cmp0 - ones) & ~cmp0 & signs;
    cmp1 = (cmp1 - ones) & ~cmp1 & signs;

    /*
     * Combine the two compares in a way that the bits do
     * not overlap, and so preserves the count of set bits.
     * If the host has an efficient instruction for ctpop,
     * then ctpop(x) + ctpop(y) has the same number of
     * operations as ctpop(x | (y >> 1)).  If the host does
     * not have an efficient ctpop, then we only want to
     * use it once.
     */
    return ctpop64(cmp0 | (cmp1 >> 1));

> +    for (j = 0; j < 64; j += 8) {
> +        uint8_t count0 = do_histseg_cnt(n0 >> j, m0, m1);
> +        out0 |= count0 << j;
> +
> +        uint8_t count1 = do_histseg_cnt(n1 >> j, m0, m1);
> +        out1 |= count1 << j;
> +    }

Wrong type for count0/count1 for shifting by e.g. 56.  You might as well
just use uint64_t as the return value from do_histseg_cnt(), so that we
don't get unnecessary zero-extensions from the compiler.

r~