[Bug libstdc++/98781] New: Poor std::hash for bitset and vector

redi at gcc dot gnu.org via Gcc-bugs Thu, 21 Jan 2021 03:23:55 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98781


            Bug ID: 98781
           Summary: Poor std::hash for bitset and vector<bool>
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: ABI
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redi at gcc dot gnu.org
  Target Milestone: ---

The std::hash specializations for std::bitset and std::vector<bool> do not
differentiate between sequences that have different number of zero bits at the
end, if they don't spill over into a new byte.

e.g. the following all have the same representation, a single byte with the low
bit set:

std::bitset<1> b1(1);
std::bitset<2> b2(1);
std::bitset<8> b8(1);

As a result, they all hash to the same value. This is less problematic for
std::bitset where they have different types anyway, but it's more noticeable
for std::vector<bool>:

std::vector<bool> b{true};
std::hash<std::vector<bool>> h;
for (int i = 0; i < 10; ++i)
{
  std::cout << b.size() << '\t' << h(b) << '\n';
  b.push_back(false);
}

Each iteration produces different values (which don't compare equal) but the
hash value only changes when the size increases by a multiple of CHAR_BIT:

1       4334672815104069193
2       4334672815104069193
3       4334672815104069193
4       4334672815104069193
5       4334672815104069193
6       4334672815104069193
7       4334672815104069193
8       4334672815104069193
9       9994027557101822035
10      9994027557101822035


The hash function could combine the result of hashing the value bytes with the
hash of the sequence length. That would produce a distinct hash for each value
above. That would be an ABI change though, as different translation units could
give different hash results for the same inputs.

FWIW MSVC has the same problem, but the hash only changes for multiples of 32,
and libc++ never changes the result for any number of trailing zeros (it
appears to hash vector<bool> as a simple XOR of the integer words).

[Bug libstdc++/98781] New: Poor std::hash for bitset and vector

Reply via email to