On Mon, 11 Apr 2005 [EMAIL PROTECTED] wrote:
>Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM:
>
>> On Mon, 11 Apr 2005, Dan Bolser wrote:
>>
>> >
>> >Requirement:
>> >
>> >Given two columns of a table (Column1 and Column2) of length x, return
>two
>> >columns (Column1_Scram and Column2_Scram) such that the distributions
>of
>> >values in Column1 and Column2 are preserved in Column1_Scram and
>> >Column2_Scram, but the pairs of values are randomized.
>> >
>> >
>> >Solution suggested by Shawn Green:
>> >
>> >Create a table with two columns, and populate this table with random
>pairs
>> >of primary keys picked from the original table. Additionally, allow no
>> >duplicate primary keys within either column. Select x rows from this
>> >table, and link both primary keys (the primary key pair) back to the
>> >original table to get the appropriate number of randomized pairs of
>> >Column1 and Column2.
>> >
>> >He suggests doing the above like this (more or less):
>> >
>> >OriginalTable
>> >PK A B
>> >1 a c
>> >2 a d
>> >3 b e
>> >...
>> >
>> >CREATE TEMPORARY TABLE IntermediateTable (
>> > PK1 INT NOT NULL,
>> > A CHAR(1) NOT NULL,
>> > PK2 INT NOT NULL,
>> > B CHAR(1) NOT NULL,
>> > #
>> > UNIQUE INDEX (PK1,A),
>> > UNIQUE INDEX (PK2,B)
>> >);
>> >
>> >INSERT IGNORE INTO IntermediateTable
>> >SELECT
>> > x.PK, x.A,
>> > y.PK, y.B
>> >FROM
>> > OriginalTable x,
>> > OriginalTable y
>> >ORDER BY
>> > RAND();
>> >
>> >SELECT
>> > x.A,
>> > y.B
>> >FROM
>> > IntermediateTable
>> >INNER JOIN
>> > OriginalTable x ON (PK1 = x.PK) INNER JOIN
>> > OriginalTable y ON (PK2 = y.PK)
>> >LIMIT
>> > the_length_of_OriginalTable;
>> >
>> >
>> >The problem with this solution:
>> >
>> >Its too slow on reasonable sized tables!
>>
>>
>> Their is also a problem with the way RAND() works...
>>
>> SELECT
>> x.PK, x.A,
>> y.PK, y.B
>> FROM
>> OriginalTable x,
>> OriginalTable y
>> ORDER BY
>> RAND()
>> LIMIT
>> 1;
>>
>> This takes soooooo long to pick a random row. Some cleaver 'LIMIT'
>> optimization could pick a results set almost instantly, instead of
>taking
>> in excess of half an hour with ~50,000 rows.
>>
>>
>>
>
>Let's try this. I will assume, because you used the PK hack, you have
>duplicate values in at least one of your sets. Let's cure the Rand() speed
>issue by adding a column to Original Table to hold a random number and
>eliminate the lookup problem. Since integer math is much faster than
>floating point math, we will set up this field as an integer field and
>fill it appropriately
>
>ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED;
>
>UPDATE OriginalTable SET RandomKey = RAND()*2000000;
>
>Let's also modify IntermediateTable like this:
>
>DROP TABLE IntermediateTable;
>
>CREATE TABLE FirstColumn
> id INT auto_increment
> , a char(1)
> , PRIMARY KEY (id)
>);
>
>CREATE TABLE SecondColumn
> id INT auto_increment
> , b char(1)
> , PRIMARY KEY (id)
>);
>
>And populate the new tables:
>INSERT FirstColumn (a)
>SELECT a
>FROM OriginalTable
>ORDER BY PK1;
>
>INSERT SecondColumn (b)
>SELECT b
>FROM OriginalTable
>ORDER BY RandomKey;
>
>Then get your randomized (A,B) pairs this way:
>
>SELECT x.A, y.B
>FROM FirstColumn x
>INNER JOIN SecondColumn y
> on x.id = y.id;
>
>This should be MUCH faster than 30 mins (I would guess on the order of 2
>or 3 at most). FirstColumn gets filled with data in original order,
>SecondColumn gets filled with data in random order (thanks to the random
>value). By creating new tables to cache those values we create two new
>contiguous auto_increment runs (this way you can analyze subsets of your
>original data and not need to worry about mismatching on the final INNER
>JOIN).
>
>On the next pass, Re-run the UPDATE to assign new RAND() values to your
>data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE
>SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the
>final query.
>
>HTH!!
I finally got round to trying this out - It works really well!
I did the following...
DROP TABLE IF EXISTS rand_a;
CREATE TABLE rand_a (
PK MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
A_id INT UNSIGNED NOT NULL,
#
INDEX (A_id)
);
DROP TABLE IF EXISTS rand_b;
CREATE TABLE rand_b (
PK MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
B_id INT UNSIGNED NOT NULL,
#
INDEX (b_id)
);
INSERT INTO rand_a (A_id)
#
SELECT A_id FROM x ORDER BY RAND();
INSERT INTO rand_b (B_id)
#
SELECT B_id FROM x ORDER BY RAND();
I then get the full list of random pairs (with the same marginal
distribution as before) with...
SELECT *
FROM rand_a
INNER JOIN rand_b
USING (PK);
I chekcked my marginals like this...
SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id)
FROM nrints x;
SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id)
FROM rand_a
INNER JOIN rand_b
USING (PK);
And they were both identical - which is what I wanted! The whole thing
takes a couple of seconds on the original 'pair' table which has 20,000
rows.
Thank you!
>
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]