> On Mon, Feb 07, 2005 at 13:28:04 -0500, > > What you are saying here is that if you want more accurate statistics, you > need to sample more rows. That is true. However, the size of the sample > is essentially only dependent on the accuracy you need and not the size > of the population, for large populations. > That's nonsense.
If your total data size is 100 elements in a set, then a sample size of 100 elements will cover 100% of your data. If your total data size is 10,000 elements in a set, the a sample size of 100 elements will cover 1% of your data. In the case of the TIGER database, the base of 100 samples is about .002% 0f the data is sampled. Think about that, that is an average of 1 sample about every 50,000 records. You could have substantial but irregular trends in the data that may never get detected, and this is EXACTLY what we see. If we increase the sample size (targrows), the statistics suddenly work better. For instance, look at the data below. The first analyze / select from pg_stats is with an analyze of 3000 samples. The zipl and zipr columns get calculated poorly and can cause the planner to use a table scan instead of an index scan. The second analyze / select from the pg_stats is with an analyse of 10000 samples. The zipl and zipr n_distinct values are still off by a factor of 10, but close enough for the planner to deal. If the premise is that samples size doesn't make a difference, I think we've proved that this is not true. tiger=# analyze verbose rt1; INFO: analyzing "public.rt1" INFO: "rt1": scanned 3000 of 1527360 pages, containing 90978 live rows and 0 dead rows; 3000 rows in sample, 46318719 estimated total rows ANALYZE tiger=# select * from pg_stats where tablename = 'rt1' and attname like 'zip%'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+------------- public | rt1 | zipl | 0.672 | 4 | 960 | {76240,52601,55746,71730,74604,92705,93117,95818} | {0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333} | {1085,16652,28206,33412,43147,49428,58801,68110,77515,91340,99006} | -0.119519 public | rt1 | zipr | 0.677 | 4 | 960 | {76240,52601,55746,71730,74604,78577,92705,93117,95818} | {0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333} | {962,15613,28572,33606,43545,49428,60423,68064,77040,91340,99006} | -0.104158 (2 rows) Now this: tiger=# analyze verbose rt1; INFO: analyzing "public.rt1" INFO: "rt1": scanned 10000 of 1527360 pages, containing 303419 live rows and 0 dead rows; 10000 rows in sample, 46343004 estimated total rows ANALYZE tiger=# select * from pg_stats where tablename = 'rt1' and attname like 'zip%'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------------+-------------------------------------------------------------------------+-------------------------------------------------------------------+------------- public | rt1 | zipl | 0.6807 | 4 | 2942 | {61832,13090,17404,30907,31204,45342,47714,63050,80918,93726} | {0.0008,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} | {654,15018,28208,33870,43006,49008,59741,68803,78640,92105,99687} | -0.137744 public | rt1 | zipr | 0.684 | 4 | 2921 | {13090,61832,30907,31204,45342,47714,63050,70122,80918,93726} | {0.0006,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} | {731,14824,27871,33324,42276,48895,58401,68338,78575,92105,99654} | -0.140663 (2 rows) ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]