On Wed, Apr 11, 2012 at 01:53:06AM +0100, Peter Geoghegan wrote:
> On 11 April 2012 01:16, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Peter Geoghegan <pe...@2ndquadrant.com> writes:
> >> On 11 April 2012 00:35, Robert Haas <robertmh...@gmail.com> wrote:
> >>> If people need something like that, couldn't they create it by hashing
> >>> the normalized query text with an arbitrary algorithm?
> >
> >> That supposes that the normalised query text is perfectly stable. It
> >> may well not be, particularly for things like ad-hoc queries or
> >> queries generated by ORMs, across database clusters and over long
> >> periods of time -
> >
> > Indeed, but the hash value isn't stable either given those sorts of
> > assumptions, so I'm not convinced that there's any advantage there.
>
> Isn't it? The hash captures the true meaning of the query, while
> having the database server's platform as a usually irrelevant
> artefact. Another thing that I forgot to mention is client encoding -
> it may well be fairly inconvenient to have to use the same algorithm
> to hash the query string across applications. You also have to hash
> the query string yourself again and again, which is expensive to do
> from Python or something, and is often inconvenient - differences
> beyond track_activity_query_size bytes (default: 1024) are not
> recognised. Using an SQL code beautifier where a single byte varies
> now breaks everything, which developers don't expect at all (we've
> trained them not to), so in many ways you're back to the same
> limitations as classic pg_stat_statements if you attempt to aggregate
> queries over time and across machines, which is a very real use case.
>
> It's probably pretty annoying to have to get your Python app to use
> the same hash function as your Java app or whatever, unless you want
> to use something heavyweight like a cryptographic hash function. By
> doing it within Postgres, you avoid those headaches.
>
> I'm not asking you to very loudly proclaim that it should be used like
> this - just expose it, accurately document it, and I'm quite confident
> that it will be widely used and relied upon by those that are
> reasonably well informed, and understand its limitations, which are
> really quite straightforward.
>
> > What I think people would actually like to know, if they're in a
> > situation where distinct query texts are getting hashed to the same
> > thing, is *which* different texts got hashed to the same thing.
> > But there's no good way to expose that given the pg_stat_statements
> > infrastructure, and exposing the hash value doesn't help.
>
> Apart from detecting the case where we get a straightforward
> collision, I don't expect that that would be useful. The whole point
> is that the user doesn't care about the difference, and I think we've
> specified a practical, widely useful standard for when queries should
> be considered equivalent.
> --
> Peter Geoghegan http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training and Services
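Just to make the client-side alternative concrete: duplicating the
fingerprinting outside the server would look roughly like the sketch
below (the fingerprint() helper and its normalization rules are
illustrative assumptions on my part, not anything that ships with
PostgreSQL or pg_stat_statements), and every client language would
have to reproduce those rules byte-for-byte.

```python
# Rough sketch of client-side query fingerprinting.  The fingerprint()
# helper and its normalization rules are illustrative assumptions only.
import hashlib
import re

TRACK_ACTIVITY_QUERY_SIZE = 1024  # default length at which query texts are truncated

def fingerprint(query_text: str) -> str:
    # Crude normalization: collapse whitespace and fold case.  Every client,
    # in every language, has to reproduce these rules exactly, or the
    # fingerprints stop matching.
    normalized = re.sub(r"\s+", " ", query_text).strip().lower()
    # Differences beyond the truncation point are invisible to the client.
    normalized = normalized[:TRACK_ACTIVITY_QUERY_SIZE]
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

print(fingerprint("SELECT * FROM foo WHERE id = 1"))
print(fingerprint("select *  from foo\nwhere id = 1"))  # same fingerprint
print(fingerprint("SELECT * FROM foo WHERE id = 2"))    # different: constants are not normalized
```

A single reformatting beyond what that crude normalization catches, or
truncation at track_activity_query_size, silently changes the result,
which is exactly the fragility described above.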
By using all 64 bits of the hash that we currently calculate, instead of
only 32 of them as we do now, the collision probability becomes very low.

Regards,
Ken
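P.S. To put rough numbers on "very low", here is a back-of-the-envelope
birthday-bound estimate (the query counts below are made-up
illustrations, not measurements from any workload):

```python
# Birthday-bound estimate of hash collision probability for 32-bit vs
# 64-bit query hashes.  The query counts are illustrative only.
import math

def collision_probability(n_queries: int, hash_bits: int) -> float:
    # P(at least one collision among n uniformly hashed items) is
    # approximately 1 - exp(-n(n-1) / (2 * 2**bits)).
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_queries * (n_queries - 1) / (2.0 * space))

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} distinct queries: "
          f"32-bit ~ {collision_probability(n, 32):.1e}, "
          f"64-bit ~ {collision_probability(n, 64):.1e}")
```

Even at a hundred thousand distinct entries, the 64-bit figure stays
around 10^-10, whereas the 32-bit one is already roughly 70%.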