On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost <sfr...@snowman.net> wrote:
> Qi,
>
> * Qi Huang (huangq...@hotmail.com) wrote:
>> > Doing it 'right' certainly isn't going to be simply taking what Neil did
>> > and updating it, and I understand Tom's concerns about having this be
>> > more than a hack on seqscan, so I'm a bit nervous that this would turn
>> > into something bigger than a GSoC project.
>>
>> As Christopher Browne mentioned, for this sampling method, it is not
>> possible without scanning the whole data set. It improves the sampling
>> quality but increases the sampling cost. I think it should also be using
>> only for some special sampling types, not for general. The general sampling
>> methods, as in the SQL standard, should have only SYSTEM and BERNOULLI
>> methods.
>
> I'm not sure what sampling method you're referring to here. I agree
> that we need to be looking at implementing the specific sampling methods
> listed in the SQL standard. How much information is provided in the
> standard about the requirements placed on these sampling methods? Does
> the SQL standard only define SYSTEM and BERNOULLI? What do the other
> databases support? What does SQL say the requirements are for 'SYSTEM'?

Well, there may be cases where the quality of the sample isn't terribly
important; it just needs to be "reasonable."

I browsed an article on the SYSTEM/BERNOULLI representations; they both
amount to simple picks of tuples.

- BERNOULLI implies picking tuples with a specified probability.

- SYSTEM implies picking pages with a specified probability. (I think
  we mess with this in ways that'll be fairly biased, given that tuples
  may not be of uniform size, particularly if Slightly Smaller strings
  stay in the main pages, whilst Slightly Larger strings get TOASTed...)

I get the feeling that this is a somewhat-magical feature (in that users
haven't much hope of understanding in what ways the results are
deterministic), sufficiently so that anyone serious about their result
sets is likely to be unhappy to use either SYSTEM or BERNOULLI.

Possibly the forms of sampling that people *actually* need, most of the
time, are more like Dollar Unit Sampling, which are pretty deterministic,
in ways that mandate that they be rather expensive (e.g., guaranteeing a
Seq Scan).
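
To make the difference between the two methods concrete, here is a
minimal sketch in Python. It is a toy model, not PostgreSQL code: the
table is assumed to be a plain list of pages, each page a list of tuples,
and the names bernoulli_sample, system_sample, and pages are hypothetical.
The page layout (4 narrow rows per page versus 2 wide rows per page) is
likewise made up, just to mimic rows of non-uniform size.

```python
import random


def bernoulli_sample(pages, p, rng):
    """Keep each tuple independently with probability p.

    Every tuple on every page is visited, so the cost is a full scan.
    """
    return [t for page in pages for t in page if rng.random() < p]


def system_sample(pages, p, rng):
    """Keep each page with probability p and return all tuples on kept pages.

    Only the chosen pages need to be read, but tuples that share a page
    are always included or excluded together.
    """
    sample = []
    for page in pages:
        if rng.random() < p:
            sample.extend(page)
    return sample


# Hypothetical table: 50 pages of narrow rows (4 per page) followed by
# 50 pages of wide rows (2 per page). Tuple ids are unique.
pages = [[("narrow", i * 4 + j) for j in range(4)] for i in range(50)]
pages += [[("wide", 200 + i * 2 + j) for j in range(2)] for i in range(50)]

rng = random.Random(0)  # fixed seed so a run is repeatable
by_tuple = bernoulli_sample(pages, 0.1, rng)
by_page = system_sample(pages, 0.1, rng)
print(len(by_tuple), len(by_page))
```

In this model the page-level sample always arrives in whole-page clumps,
which is one way it can differ from a per-tuple sample drawn at the same
probability.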