> On 4/1/06, Frank Bax <[EMAIL PROTECTED]> wrote:
> > At 06:59 PM 3/31/06, Mr. Shawn H. Corey wrote:
> > > On Fri, 2006-31-03 at 15:45 -0800, Tom Phoenix wrote:
> > > > You should loop over the input, pushing each item on to an array. If
> > > > at any time you have 2000 items in the array, sort them and discard
> > > > any you don't want to keep.
> > > >
> > > >     $#data = 999 if $#data > 999;   # OBperl: one way to discard elements
> > > >
> > > > When you finish looping, sort-and-discard again. You'll never need
> > > > more than 2N items in memory at any given time. Does that algorithm
> > > > work for your needs?
> > >
> > > Sorting is not necessary. If he keeps an array of the best (that is,
> > > lowest) records, then all he has to do is compare every new entry with
> > > the highest. This is called "fail early": if an entry is going to fail,
> > > it should fail at the earliest opportunity. If it succeeds, it then
> > > searches down through the list to find its place. This is called
> > > "succeed early." Given that the procedure can flip between these two
> > > methods, it is faster than any sort.
> >
> > I'm not the OP, but I have a script with a similar problem. The script
> > has some logic that generates many (thousands of billions) of
> > combinations from a little bit of data, and only the best 100 combos are
> > output. For each combination produced by the script, a value is
> > calculated. I maintain an array of "best" results so far; if the size of
> > this array is < 100, a "new" result is automatically added. If there are
> > already 100 items, I find the "worst" entry in the array. If the "new"
> > entry is better than the "worst", the "new" one replaces the "worst".
> >
> > I am not sure how your references to "fail early" and "succeed early"
> > apply to this situation.
> >
> > At the moment, the array is left unsorted. If I use a sorted array, it
> > needs to be resorted every time a "worst" entry is replaced by a "new"
> > entry. Can I avoid sorting the array on every iteration? How else would
> > I find the "worst" entry? If I replace the "worst" with a "new" entry,
> > doesn't the array need to be resorted? How does the cost of searching an
> > unsorted array compare to resorting on every update and searching a
> > sorted array?
> >
> > Tom's idea about growing to 200, then chopping back to 100, also sounds
> > interesting.
>
> Sorting is expensive, but keeping a sorted array isn't always,
> especially if you know that the majority of your data will fall
> outside your window, preferably high. I'd probably do something like
> the following:
>
> #!/usr/bin/perl
>
> use warnings;
> use strict;
>
> my $keeprecs = 10;    # the number of records you want to keep
> my @data     = (10, 24, 5, 32, 4, 9, 8, 7, 6, 1, 2, 0, 3);
> my @array;            # kept sorted, smallest (best) first
>
> while (@data) {       # probably while (<>) in production
>     my $number = pop @data;
>     if (!@array or $number >= $array[-1]) {
>         # fail early if the number is high and we
>         # already have $keeprecs records saved
>         next if @array >= $keeprecs;
>         push @array, $number;
>     } elsif ($number < $array[0]) {
>         unshift @array, $number;
>     } else {
>         for (my $i = 1; $i < @array; $i++) {
>             if ($number < $array[$i] and $number >= $array[$i - 1]) {
>                 splice @array, $i, 0, $number;
>                 last;
>             }
>         }
>     }
>     pop @array if @array > $keeprecs;
> }
>
> print "@array\n";

I am going to give this a try, but the data is truly a random set of numbers; whether most of it falls high or low is something I will leave to the processing to discover. Originally it took a little over 3 hours to do the processing.
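For comparison, the grow-then-truncate idea Tom described at the top of the thread might look roughly like the sketch below. It is only a sketch: the window of 100 kept values, the 2x buffer, and the random numbers standing in for the real combination scores are placeholders, not the real data.

#!/usr/bin/perl
use strict;
use warnings;

my $keep = 100;          # how many "best" (lowest) values to keep
my @best;                # working buffer, allowed to grow to 2 * $keep

# stand-in for the real combination generator: a million random scores
for (1 .. 1_000_000) {
    my $value = rand(1000);
    push @best, $value;
    if (@best >= 2 * $keep) {
        # buffer is full: sort once and chop back to the best $keep
        @best = (sort { $a <=> $b } @best)[0 .. $keep - 1];
    }
}

# the buffer may still hold up to 2 * $keep - 1 values after the loop
@best = (sort { $a <=> $b } @best)[0 .. $keep - 1] if @best > $keep;

print scalar(@best), " values kept; worst kept value is $best[-1]\n";

The point is that the full sort runs only once per $keep new entries instead of once per entry, so memory never grows past 2 * $keep items and the sort cost stays small compared to generating the combinations.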
I have added three arrays and one hash:

a) array 1 holds the numbers being compared (sorted);
b) array 2 holds, for each number, a key into the hash of data needed later;
c) array 3 holds a set of 2000 keys I can rotate through as I get hits and use keys up. If a key's entry is dropped, I push the key back onto this array, so I always have twice the number needed and should never run out;
d) the hash data associated with a) ranges from 60 to 800 bytes per number.

Again, thanks to the list for the help. Shawn, I looked at your response but did not truly understand it, so depending on what happens here I may need to give yours a shot; I just was not following it. Sorry.

Wags ;)

> The efficiency here will depend on the distribution of the data. If
> every new piece of data is > $array[-2] and < $array[-1], then it's
> something on the order of O($keeprecs * N). In the best-case scenario,
> with all records falling above the current max, it's O(N). In the real
> world, who knows, but it should be reasonably fast.
>
> In any case, the key to a project like this is to figure out quickly
> whether you want to keep a particular piece of data, and spend as
> little time as possible on the out-of-bounds data, which, if you want
> 1000 records out of 1,000,000, should be the vast majority of it.
>
> HTH
>
> -- jay
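One more sketch, untested against the real data: because @array in the example above stays sorted, the linear scan in the middle branch can be swapped for a binary search of the insertion point. That also answers the earlier question about re-sorting: a sorted array never needs a full re-sort on an update, only a splice and a pop. The names $keeprecs, @data, and @array are reused from the example; everything else is made up for illustration.

#!/usr/bin/perl
use strict;
use warnings;

my $keeprecs = 10;
my @data     = (10, 24, 5, 32, 4, 9, 8, 7, 6, 1, 2, 0, 3);
my @array;               # kept sorted ascending, never longer than $keeprecs

for my $number (@data) {
    # fail early: window is full and this value is no better than the current worst
    next if @array >= $keeprecs and $number >= $array[-1];

    # binary search for the insertion point (first index whose value is >= $number)
    my ($lo, $hi) = (0, scalar @array);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($array[$mid] < $number) { $lo = $mid + 1 }
        else                        { $hi = $mid }
    }

    splice @array, $lo, 0, $number;      # insert, keeping @array sorted
    pop @array if @array > $keeprecs;    # drop the current worst
}

print "@array\n";    # 0 1 2 3 4 5 6 7 8 9

With only 100 or 1000 kept records the difference between the linear scan and the binary search is modest; most of the running time is saved by the fail-early test, which rejects each out-of-window value with a single comparison.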