Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-30 Thread Pavel Stehule
2007/10/30, vincent <[EMAIL PROTECTED]>: > > 2007/10/26, Patrick TJ McPhee <[EMAIL PROTECTED]>: > >> In article <[EMAIL PROTECTED]>, cluster <[EMAIL PROTECTED]> > >> wrote: > >> % > How important is true randomness? > >> % > >> % The goal is an even distribution but currently I have not seen any w

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-30 Thread vincent
> 2007/10/26, Patrick TJ McPhee <[EMAIL PROTECTED]>: >> In article <[EMAIL PROTECTED]>, cluster <[EMAIL PROTECTED]> >> wrote: >> % > How important is true randomness? >> % >> % The goal is an even distribution but currently I have not seen any way >> % to produce any kind of random sampling effici

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-30 Thread Pavel Stehule
2007/10/26, Patrick TJ McPhee <[EMAIL PROTECTED]>: > In article <[EMAIL PROTECTED]>, cluster <[EMAIL PROTECTED]> wrote: > % > How important is true randomness? > % > % The goal is an even distribution but currently I have not seen any way > % to produce any kind of random sampling efficiently. Not

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-29 Thread Patrick TJ McPhee
In article <[EMAIL PROTECTED]>, cluster <[EMAIL PROTECTED]> wrote: % > How important is true randomness? % % The goal is an even distribution but currently I have not seen any way % to produce any kind of random sampling efficiently. Notice the word How about generating the ctid randomly? You

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-26 Thread cluster
All you're doing is picking random =subsequences= from the same permutation of the original data. You have some good points in your reply. I am very much aware of this non-random behavior you point out for the "static random-value column" approach but at least it is fast, which is a requiremen

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-25 Thread John D. Burger
As far as I can tell, all of the proposed solutions lack sample independence. Take the OP's suggested approach of doing something like this: SELECT * FROM mydata WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0) ORDER BY mydata.random_number ASC LIMIT 100 All you're doing is pi

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-24 Thread Scott Marlowe
Here's how I would do it. This assumes a static table that doesn't change a lot. 1: find the row count n of the table. 2: randomly assign 1 through n to each row randomly. How to do this is a whole not post. 3: create a sequence. If you always need 10 or 100 random rows, set the increment to t

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-24 Thread Paul Tillotson
cluster wrote: It has been suggested [1] that a good way to select K random rows efficiently from a table is to 1) add a new column, random_number, to the table and initialize it with random() 2) perform the following query: SELECT * FROM mydata WHERE random_number

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-24 Thread cluster
How important is true randomness? The goal is an even distribution but currently I have not seen any way to produce any kind of random sampling efficiently. Notice the word "efficiently". The naive way of taking a random sample of size K: (SELECT * FROM mydata ORDER BY random() LIMIT ) is

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-24 Thread Martijn van Oosterhout
On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote: > Another way to look at the problem is: How do I sample a subset of size > K efficiently? A query like > >SAMPLE 1000 OF >(SELECT * FROM mydata WHERE ) How important is true randomness? To get the best possible distribution most a

Re: [GENERAL] Selecting K random rows - efficiently!

2007-10-24 Thread cluster
Another way to look at the problem is: How do I sample a subset of size K efficiently? A query like SAMPLE 1000 OF (SELECT * FROM mydata WHERE ) should return 1000 random rows from the select statement so that two consecutive evaluations of the query would only with very little probabil

[GENERAL] Selecting K random rows - efficiently!

2007-10-24 Thread cluster
It has been suggested [1] that a good way to select K random rows efficiently from a table is to 1) add a new column, random_number, to the table and initialize it with random() 2) perform the following query: SELECT * FROM mydata WHERE random_number >= (SELECT RANDO