On Sat, Dec 17, 2016 at 5:12 PM, Steve D'Aprano <steve+pyt...@pearwood.info> wrote: > On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote: > >> More than that: the lists are searched in linear time, the binary >> seach runs in logarithmic time, but the set lookup is constant time. >> Doesn't matter how big your set is, you can test for membership with >> one hash lookup. > > To be pedantic: it is constant time on average.
True. And technically, you could say that a linear or binary search has a best-case of constant time (if you find it right at the beginning or middle of the list). I was talking about the average. > But outside of contrived or hostile examples, where elements of the set are > designed to collide, typically you would expect hash collisions to be rare, > and long chains of colliding elements even rarer. For random elements, > most will require only a single hash and a single equality test, a small > number might require two tests, three would be even rarer, and so forth. Yep. Normal case, it's going to be one or two tests - and that doesn't depend on the length of the list. The degenerate case does, but the normal case doesn't. > So strictly speaking, and I realise I'm being exceedingly pedantic here, a > *sufficiently large* dict or set MUST have colliding elements. How large > is "sufficiently large"? Its at least 2**31, more than two billion, so > while you are right that *in practice* set/dict lookups require only a > single hash + equality, in principle (and sometimes in practice too) > collisions can be significant. That assumes the hash has a finite size. If you have enough memory to store that many unique objects, you can probably afford to use a hashing algorithm that allows enough room. However, the chances of not having collisions depend on the capacity. And the chances of having no collisions at all are pretty low... as a rule of thumb, I estimate on a 50-50 chance of a collision when capacity is the square of usage. So if you have a thousand entries but capacity for a million, you have a 50% chance of having at least one collision. (The numbers aren't quite right, but it's a good rule of thumb.) > Nevertheless, I mention this only for completeness. In practice, you almost > never need to care about hash collisions except for the most pathological > cases. Indeed. That's why hash tables are so prevalent, despite their appalling worst-case. ChrisA -- https://mail.python.org/mailman/listinfo/python-list