> Maybe I am missing something - ISTM that you can increase your
> statistics target for those larger tables to obtain a larger (i.e.
> better) sample.
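(For anyone else reading along: the knob being suggested is the per-column statistics target. A rough sketch of bumping it from a script, with psycopg2 used only as a convenient driver, and the table, column, and target value invented purely for illustration:)

import psycopg2

# Raise the per-column statistics target, then re-gather statistics so
# the planner sees a larger sample.  Names and the value 1000 are
# placeholders, not a recommendation.
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("ALTER TABLE bigtable ALTER COLUMN lastname SET STATISTICS 1000")
cur.execute("ANALYZE bigtable (lastname)")
conn.commit()

# Check what the planner now believes about the column.
cur.execute("SELECT n_distinct FROM pg_stats"
            " WHERE tablename = 'bigtable' AND attname = 'lastname'")
print(cur.fetchone())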
No one is arguing that you can't manually do things, but I am not the first to notice this. I saw the query planner doing something completely stupid and set off to discover why. Think about the person using PostgreSQL for the first time. He or she does not know about this stuff. Even if they've read the FAQs and the manual cover to cover, it will take them some time to figure out how it all works together. PostgreSQL is a big system, and this is exactly why MySQL gets better marks from newbies.

In this case, the behavior observed could be changed by altering the sample size for the table. I submit that an arbitrary fixed sample size is not a good basis for the analyzer; the sample size should be based on the size of the table, or on some calculation of its deviation.

There is no reason why old stats can't be used to create more accurate stats. Using successive analyze operations, we could create better statistics for the planner: we can increase the sample size based on the table size, and we could, I suppose, also calculate some sort of deviation statistic so that "n_distinct" can be estimated better from a smaller sample set. (There is a rough sketch of this at the end of this message.)

The basic problem, though, is that PostgreSQL performed incorrectly on a simple query after indexes were created and analyze was run. Yes, it can be corrected; that is what led me to my conclusions. But shouldn't we try to devise a better system so that, in the future, PostgreSQL does not need this sort of tuning?

>
> regards
>
> Mark
>
> [EMAIL PROTECTED] wrote:
>>>[EMAIL PROTECTED] writes:
>>
>> Any and all random sampling assumes a degree of uniform distribution. This is the basis of the model. It assumes that chunks of the whole will be representative of the whole (to some degree). This works when normal variations are more or less distributed uniformly. As variations and trends become less uniformly distributed, more samples are required to characterize them.
>>
>> Douglas Adams had a great device called the "Total Perspective Vortex" which inferred the whole of the universe from a piece of fairy cake. It was a subtle play on the absurd notion that a very small sample could lead to an understanding of an infinitely larger whole.
>>
>> On a very basic level, why bother sampling the whole table at all? Why not check one block and infer all information from that? Because we know that isn't enough data. In a table of 4.6 million rows, can you say with any mathematical certainty that a sample of 100 points can be, in any way, representative?
>>
>> Another problem with random sampling is trend analysis. Often there are minor trends in data. Ron pointed out the lastname/firstname trend. Although there seems to be no correlation between firstnames in the table, there are clearly groups or clusters of ordered data, an ordering that is missed by too small a sample.
>>
>> I understand why you chose the Vitter algorithm, because it provides a basically sound methodology for sampling without knowledge of the size of the whole, but I think we can do better. I would suggest using the current algorithm the first time through, then adjusting the number of samples [n] based on the previous estimate of the size of the table [N]. Each successive ANALYZE will become more accurate. The Vitter algorithm is still useful, as [N] will always be an estimate.
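To make the suggestion quoted above a little more concrete, here is a rough sketch in Python (purely illustrative, not PostgreSQL code): choose the sample size [n] from the previous pass's row-count estimate, then collect the sample in one scan with the simplest form of reservoir sampling (Vitter's paper describes faster variants that skip ahead in the stream, but the idea is the same). The scaling rule, floor, and ceiling below are invented; the only point is that [n] grows with the estimate of [N], and each pass refines that estimate.

import random

def choose_sample_size(prev_row_estimate, floor=3000, ceiling=300000):
    # Grow the sample with the table.  The sqrt rule and the bounds are
    # placeholders, not a proposal for the actual formula.
    if prev_row_estimate is None:        # first ANALYZE: no estimate yet
        return floor
    return max(floor, min(ceiling, 10 * int(prev_row_estimate ** 0.5)))

def reservoir_sample(rows, n):
    # Algorithm R: a uniform sample of n rows from a stream whose length
    # is not known in advance.
    sample = []
    seen = 0
    for row in rows:
        seen += 1
        if len(sample) < n:
            sample.append(row)
        else:
            j = random.randrange(seen)   # uniform in [0, seen)
            if j < n:                    # keep row with probability n/seen
                sample[j] = row
    return sample, seen                  # 'seen' is the new estimate of N

def analyze(rows, prev_row_estimate=None):
    n = choose_sample_size(prev_row_estimate)
    sample, row_count = reservoir_sample(rows, n)
    # ... estimate n_distinct, histograms, etc. from 'sample' here ...
    return sample, row_count             # feed row_count into the next run

The first pass behaves like today's fixed-size sample; every pass after that feeds a better [N] (and, with some deviation statistic, perhaps a better "n_distinct") into the next.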