> [EMAIL PROTECTED] wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size
>> is not a good base for the analyzer, but that the sample size should
>> be based on the size of the table or some calculation of its deviation.
>>
>
> Mark,
>
> Do you have any evidence that the Sample Size had anything to do
> with the performance problem you're seeing?
I have evidence: if you look through some of the messages in this thread,
you'll see how a sample size of 10000 provides enough data points to create
stats the planner can use. (There's a note at the end of this message on how
to raise the sample size for a single column.)

>
> I also do a lot with the complete Census/TIGER database.

Cool, have any code for Mapserver?

>
> Every problem I have with the optimizer comes down to the
> fact that the data is loaded (and ordered on disk) by
> State/County FIPS codes, and then queried by zip-code
> or by city name. Like this:
>
> Alabama 36101 [hundreds of pages with zip's in 36***]
> Alaska 99686 [hundreds of pages with zip's in 9****]
> Arizona 85701 [hundreds of pages with zip's in 855**]
>
> Note that the zip codes are *NOT* sequential.
>
> The "correlation" statistic sees that the Zip codes are not
> sequential; so it makes the *HORRIBLE* assumption that they
> are scattered randomly across the disk.

It is my theory that this is because there are too few data points with
which to properly characterize the nature of the data.

>
> In reality, even though there's no total ordering of the
> zip codes; any given zip code only exists on a couple
> disk pages; so index scans would be the right choice.

I totally agree.

>
> But the single correlation parameter is not sufficient
> to let the optimizer know this.
>
> No matter how large a sample size you choose, ANALYZE
> will correctly see that Zip codes and State FIPS codes
> are non-correlated, and the optimizer will overestimate
> the # of pages an index scan will need.
>

I tried to create an analogy in another post, and TIGER is a perfect
example of it. Think of the difference between an oscilloscope and a
spectrum analyzer. The current sampling code works more like an
oscilloscope: it assumes a fairly normal distribution of data, and given
that, it works perfectly well. But when a scope is presented with an audio
signal, the trace looks like gibberish, showing almost no correlation. View
the same signal in the frequency domain, as with a spectrum analyzer, and
you can see clear patterns in it.

Now, fortunately, we don't need any sort of absolute visualization of the
data in TIGER; we only need to see that the data has many subtle local
trends rather than one fairly evenly distributed one. That's why more
samples work.

If I could do anything, I would add more statistics to the database: a
standard deviation and maybe a sliding-window deviation. The standard
deviation might be pretty high, whereas a sliding-window deviation would
come out much lower. Less localized deviation would favor index scans.
(A rough sketch of what I mean is at the end of this message.)

Anyway, like I said, I think the expectation that the data is fairly
normalized or evenly distributed works very well for data acquired over
time. It is data like TIGER, loaded in a multiple-field order (i.e. state,
zipr, zipl), that has complex patterns in the secondary sort fields which
can't be detected with too small a sample.

>
> PS: I pointed out workarounds in my earlier posting
> in this thread. Yes, I'm using the same TIGER data
> you are.

Cool.
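
For anyone following along who wants to try this without patching anything:
the per-column statistics target is what controls ANALYZE's sample size
(roughly 300 sampled rows per unit of the target, if I remember right). The
table and column names below are just placeholders for however you loaded
TIGER; something like:

    -- Raise the target for one column; ANALYZE then samples roughly
    -- 300 * target rows when building that column's statistics.
    ALTER TABLE tiger_lines ALTER COLUMN zipl SET STATISTICS 100;
    ANALYZE tiger_lines;

    -- Or raise the default for everything analyzed in this session:
    SET default_statistics_target = 100;
    ANALYZE tiger_lines;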
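
And to make the "sliding window deviation" idea a little more concrete,
here is a rough sketch of the measurement I have in mind, written as plain
SQL against the same hypothetical tiger_lines table. This is only an
illustration, not a proposed planner statistic, and the ctid-to-point cast
chain assumes a reasonably current server; it is just a convenient way to
get at the heap page number.

    -- Global spread of zip codes across the whole table: large, because
    -- zips are not loaded in zip order.
    SELECT stddev(zipl::int) AS global_stddev
    FROM tiger_lines;

    -- Local spread: group rows by the heap page they sit on (the first
    -- component of ctid) and average the per-page deviation.
    SELECT avg(page_stddev) AS avg_local_stddev
    FROM (
        SELECT (((ctid::text)::point)[0])::int AS heap_page,
               stddev(zipl::int)               AS page_stddev
        FROM tiger_lines
        GROUP BY 1
    ) AS per_page;

The first number comes out huge; the second comes out tiny, because any one
page only holds a narrow band of zip codes. That gap is exactly the property
the single correlation statistic can't see today.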