Jan Urbański <wulc...@wulczer.org> writes:
> We follow the algorithm as written; the trouble starts when we want to
> output the result.  The paper says which items from the D structure
> should be returned when the user asks for items that have frequencies
> higher than a threshold s.  What we want to put in the statistics
> table are items accompanied by their frequencies, so we need to do
> some reasoning before we can construct the result.
Well, the estimated frequency is still just f/N.  The point is that we
must filter out items with small f values because they're probably
inaccurate --- in particular, anything with f < eN is completely
untrustworthy, since the recorded f can undercount an item's true
frequency by as much as eN.  I agree that we currently aren't bothering
to determine a specific s value, but we probably need to do that in
order to have a clear understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more.  Then use that estimate as
the "s" value, set e = s/10 or so and w = 1/e, and continue as per the
paper.  If the eventual filtering yields a lot fewer than the target
number of MCV entries (because the input wasn't so Zipfian), we lose,
but at least we have accurate numbers for the entries we kept.

> I think we should definitely prune the table one last time in the
> very probable case that the loop ended partway between two regularly
> scheduled prunes.

The paper doesn't say that you need to do that.  I suspect that if you
work through the math, you'll find that the minimum-f filter already
skips anything that would have been pruned anyway.

			regards, tom lane
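A minimal sketch of how the parameter derivation described above could
look, assuming a Zipf exponent of 1 and a guessed number of distinct
values.  This is not PostgreSQL source; every identifier and constant
in it is a hypothetical placeholder.

#include <stdio.h>

/*
 * Estimated relative frequency of the k'th most common item, assuming
 * a Zipf(1) distribution over n distinct items: (1/k) / H(n), where
 * H(n) is the n'th harmonic number.
 */
static double
zipf_frequency(int k, int n)
{
    double  h = 0.0;
    int     i;

    for (i = 1; i <= n; i++)
        h += 1.0 / i;
    return 1.0 / (k * h);
}

int
main(void)
{
    int     num_mcv = 100;        /* target number of MCV entries */
    int     n_distinct = 100000;  /* a guess; we don't really know this */
    double  s, e, w;

    /* s: estimated frequency of the K'th element, K a bit past the target */
    s = zipf_frequency(num_mcv + 10, n_distinct);

    /* then e = s/10 and bucket width w = 1/e, as described above */
    e = s / 10.0;
    w = 1.0 / e;

    printf("s = %g, e = %g, bucket width w = %g\n", s, e, w);

    /*
     * After the lossy-counting pass over N input items, keep only the
     * entries with f >= e*N.  A final prune would discard exactly the
     * entries with f + delta <= ceil(N/w) ~= e*N, and since delta >= 0
     * those all have f <= e*N, so the minimum-f filter catches
     * essentially everything such a prune would remove.
     */
    return 0;
}

Running this with the guesses above prints the derived s, e, and bucket
width; the open question, of course, is what Zipf parameter and
distinct-value count it is reasonable to assume for a given column.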