> Hello all, I have a file of 3,210,008 CSV records. I need to take a
> random sample of this. I tried hacking something together a while
> ago, but it seemed to repeat 65,536 different records. When I need a
> 5mil sample, this creates a problem.
>
> Here is my old code: I know the logic allows dups, but what would
> incur the limit? I think with 500,000 samples there wouldn't be a
> problem getting more than 65536 diff records, but that number is too
> ironic for me to deal with.
Don't laugh too much, but if the data is still in the same order it was
extracted in, then this will probably suffice:

#!/usr/bin/perl -w
use strict;

open (FILE, "consumer.sample.sasdump.txt") or die "can't open input: $!";
open (NEW, ">consumer.new") or die "can't open output: $!";

# Fraction of records to keep: roughly 500,000 out of 3,210,008.
# Adjust to whatever fraction you need.
my $probability = 500_000 / 3_210_008;

while (<FILE>) {
    print NEW if $probability > rand;
}

close(FILE);
close(NEW);

__END__

Even if it doesn't give you exactly the size you asked for, it solves
the problem of duplicates. Then you can shuffle the selected records to
get your final data set. There must be a decent shuffle algorithm
someplace, since I haven't thought of one yet; splicing elements out of
the middle of an array just doesn't appeal. (See the sketch below my
signature for one.)

The limit at 65,536 is suspicious: that is 2**16, so the old code was
probably squeezing its random value through 16 bits somewhere. It is
most likely a bug, since the limit should be much higher than that :)
On the other hand, maybe they never thought someone would be using such
big files with that feature.

Anyway, take care,

Jonathan Paton
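P.S. Here is a minimal Fisher-Yates shuffle sketch, in case it helps.
I am assuming the selected records fit comfortably in memory, and the
output file name and the 500,000 sample size are just placeholders
taken from your message:

#!/usr/bin/perl -w
use strict;

# Read the records kept by the first pass.
open (NEW, "consumer.new") or die "can't open consumer.new: $!";
my @records = <NEW>;
close(NEW);

# Fisher-Yates shuffle: walk the array from the end, swapping each
# element with a randomly chosen element at or before it.
for (my $i = $#records; $i > 0; $i--) {
    my $j = int rand($i + 1);
    @records[$i, $j] = @records[$j, $i];
}

# Keep the first 500,000 shuffled records (or all of them, if the
# first pass produced fewer than that).
my $want = 500_000;
$want = scalar @records if @records < $want;

open (OUT, ">consumer.final") or die "can't open consumer.final: $!";
print OUT @records[0 .. $want - 1];
close(OUT);

__END__

If your Perl is new enough to ship List::Util, its shuffle() function
does the same job in one call, but the loop above works anywhere.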