Re: [HACKERS] proposal : cross-column stats

Tomas Vondra Mon, 13 Dec 2010 14:46:03 -0800

Dne 13.12.2010 22:50, Josh Berkus napsal(a):
> Tomas,
> 
>>   (a) find out what statistics do we need to collect and how to use it
>>   (b) implement a really stupid inefficient solution
>>   (c) optimize in iterations, i.e. making it faster, consuming less
>>       space etc.
> 
> I'll suggest again how to decide *which* columns to cross: whichever
> columns are combined in composite indexes.  In version 2, allow the DBA
> to specify combinations.
> 
> In the unlikely event that correlation could be reduced to a single
> float number, it would be conceivable for each column to have an array
> of correlation stats for every other column where correlation was
> non-random; on most tables (i.e. ones with less than 100 columns) we're
> not talking about that much storage space.
> 
> The main cost would be the time spent collecting that info ...


I think this is a bit early to discuss this, given the fact that we
don't have a working solution yet. But OK, let's discuss these options
anyway

1) collecting the automatically for composite indexes

   I don't think this is wise idea. The first versions definitely won't
   be very efficient, and collecting the data for each composite
   index means everyone will be hit by this inefficiency, even if he
   actually does not need that (e.g. the columns are independent so the
   current estimates are quite accurate or he's not using those columns
   very often in the same WHERE clause).

   Another reason against this is that many DBAs don't actually use
   composed indexes - they simply create indexes on each column and let
   the bitmap index scan to work it out. And this would not work for
   this case.

   And actually it's not very complicated to allow the DBA to do this,
   this can be a quite simple PL/pgSQL procedure.

2) collecting correlation for each pair of columns

   Again, you're effectively forcing everyone to pay the price even
   though he may not need the feature. Maybe we'll get there one day,
   but it's not a good idea to do that from the beginning.

   And the correlation itself has a very limited use in real life, as
   it's not possible to compute it for character columns and is not
   very useful in case of some numerical columns (e.g. ZIP codes).

regards
Tomas

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] proposal : cross-column stats

Reply via email to