On Sep 22, 2015 8:58 PM, "Andrew Dunstan" <and...@dunslane.net> wrote: > > > > On 09/22/2015 12:16 PM, Shulgin, Oleksandr wrote: >> >> Hi Hackers, >> >> I've recently stumbled upon a problem with table bloat estimation in case there are columns of type JSON. >> >> The quick bloat estimation queries use sum over pg_statistic.stawidth of table's columns, but in case of JSON the corresponding entry is never created by the ANALYZE command due to equality comparison operator missing. I understand why there is no such operator defined for this particular type, but shouldn't we still try to produce meaningful average width estimation? >> >> In my case the actual bloat is around 40% as verified with pgstattuple, while the bloat reported by quick estimate can be between 75% and 95%(!) in three instances of this problem. We're talking about some hundreds of GB of miscalculation. >> >> Attached patch against master makes the std_typanalyze still try to compute the minimal stats even if there is no "=" operator. Makes sense? >> >> I could also find this report in archives that talks about similar problem, but due to all values being over the analyze threshold: >> >> http://www.postgresql.org/message-id/flat/12480.1389370...@sss.pgh.pa.us#12480.1389370...@sss.pgh.pa.us >> >> I think we could try harder, otherwise any estimate relying on average width can be way off in such cases. > > Yes, "/revenons/ à /nos moutons/." You can set up text based comparison ops fairly easily for json - you just need to be aware of the limitations. See https://gist.github.com/adunstan/32ad224d7499d2603708
Yes, I've already tried this approach and have found that analyze performance degrades an order of magnitude due to sql-level function overhead and casts to text. In my tests, from 200ms to 2000ms with btree ops on a default sample of 30,000 rows. Should have mentioned that. There is a very hacky way to substitute bttextcmp for the sort support function after defining the opclass by updating pg_amproc, buy I would rather avoid that. :-) -- Alex