On Tue, Sep 2, 2014 at 4:41 PM, Peter Geoghegan <p...@heroku.com> wrote:
> HyperLogLog isn't sample-based - it's useful for streaming a set and
> accurately tracking its cardinality with fixed overhead.

OK.

>> Is it the right decision to suppress the abbreviated-key optimization
>> unconditionally on 32-bit systems and on Darwin? There's certainly
>> more danger, on those platforms, that the optimization could fail to
>> pay off. But it could also win big, if in fact the first character or
>> two of the string is enough to distinguish most rows, or if Darwin
>> improves their implementation in the future. If the other defenses
>> against pathological cases in the patch are adequate, I would think
>> it'd be OK to remove the hard-coded checks here and let those cases
>> use the optimization or not according to its merits in particular
>> cases. We'd want to look at what the impact of that is, of course,
>> but if it's bad, maybe those other defenses aren't adequate anyway.
>
> I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
> using Macs to run real database servers. Apple haven't had a server
> product in years, and typically people only use Postgres on their Macs
> for development. We might as well have coverage of the new code for
> the benefit of Postgres hackers that favor Apple machines. Or, to look
> at it another way, the optimization is so beneficial that it's
> probably worth the risk, even for more marginal cases.
>
> 8 primary weights (the leading 8 bytes, frequently isomorphic to the
> first 8 Latin characters, regardless of whether or not they have
> accents/diacritics, or punctuation/whitespace) is twice as many as 4.
> But every time you add a byte of space to the abbreviated
> representation that can resolve a comparison, the number of
> unresolvable-without-tiebreak comparisons (in general) is, I imagine,
> reduced considerably. Basically, 8 bytes is way better than twice as
> good as 4 bytes in terms of its effect on the proportion of
> comparisons that are resolved only with abbreviated keys. Even so,
> I suspect it's still worth it to apply the optimization with only 4.
>
> You've seen plenty of suggestions on assessing the applicability of
> the optimization from me. Perhaps you have a few of your own.

My suggestion is to remove the special cases for Darwin and 32-bit
systems and see how it goes.

> That wouldn't be harmless - it would probably result in incorrect
> answers in practice, and would certainly be unspecified. However, I'm
> not reading uninitialized bytes. I call memset() so that in the event
> of the final strxfrm() blob being less than 8 bytes (which can happen
> even on glibc with en_US.UTF-8), the remaining bytes are zeroed. It
> cannot be harmful to memcmp() every Datum byte if the remaining bytes
> are always initialized to NUL.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
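
For anyone following the memset()/memcmp() point, here is a minimal sketch in C of the zero-padding idea. It is not the code from the patch: AbbrevKey, make_abbrev_key(), abbrev_cmp() and ABBREV_LEN are hypothetical names, and the 8-byte width is an assumption standing in for a 64-bit Datum. The key is zeroed before the leading strxfrm() bytes are copied in, so a blob shorter than 8 bytes leaves NUL padding rather than uninitialized bytes, and memcmp() over the full width is well defined. Equal keys fall back to strcoll() as the tie-break.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ABBREV_LEN 8            /* hypothetical width: one 64-bit Datum */

/* Hypothetical fixed-width abbreviated key. */
typedef struct
{
    unsigned char bytes[ABBREV_LEN];
} AbbrevKey;

/* Build a key from the leading bytes of the strxfrm() blob for s. */
static AbbrevKey
make_abbrev_key(const char *s)
{
    AbbrevKey   key;
    size_t      need;
    char       *blob;

    assert(s != NULL);

    /* Zero every byte first, so a short blob leaves NUL padding. */
    memset(key.bytes, 0, ABBREV_LEN);

    /* With n == 0 the destination may be NULL; returns the blob length. */
    need = strxfrm(NULL, s, 0);
    blob = malloc(need + 1);
    if (blob == NULL)
        abort();
    strxfrm(blob, s, need + 1);

    /* Keep at most ABBREV_LEN leading bytes of the blob. */
    memcpy(key.bytes, blob, need < ABBREV_LEN ? need : ABBREV_LEN);
    free(blob);
    return key;
}

/* Compare by abbreviated key; fall back to strcoll() when keys tie. */
static int
abbrev_cmp(const char *a, const char *b)
{
    AbbrevKey   ka = make_abbrev_key(a);
    AbbrevKey   kb = make_abbrev_key(b);
    int         r = memcmp(ka.bytes, kb.bytes, ABBREV_LEN);

    if (r != 0)
        return r;               /* resolved by the abbreviated keys alone */
    return strcoll(a, b);       /* unresolved: full comparison as tie-break */
}
```

Because a strxfrm() blob is a C string with no embedded NUL, a zero pad byte sorts below any real blob byte, which is the same answer strcmp() on the full blobs would give when one is a prefix of the other. Changing ABBREV_LEN to 4 gives the 32-bit case: nothing else changes, but more comparisons reach the strcoll() fallback.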