On 4/1/06, Frank Bax <[EMAIL PROTECTED]> wrote:
> At 06:59 PM 3/31/06, Mr. Shawn H. Corey wrote:
>
> >On Fri, 2006-31-03 at 15:45 -0800, Tom Phoenix wrote:
> > > You should loop over the input, pushing each item on to an array. If
> > > at any time you have 2000 items in the array, sort them and discard
> > > any you don't want to keep.
> > >
> > >     $#data = 999 if $#data > 999;  # OBperl: one way to discard elements
> > >
> > > When you finish looping, sort-and-discard again. You'll never need
> > > more than 2N items in memory at any given time. Does that algorithm
> > > work for your needs?
> >
> >Sorting is not necessary. If he keeps an array of the best (that is,
> >lowest) records, then all he has to do is compare every new entry with
> >the highest. This is called "fail early": if an entry is going to fail,
> >it should fail at the earliest opportunity. If it succeeds, it then
> >searches down through the list to find its place. This is called
> >"succeed early." Given that the procedure can flip between these two
> >methods, it is faster than any sort.
>
>
> I'm not the OP, but I have a script with a similar problem.  The script has
> some logic that generates many (thousands of billions) of combinations from
> a little bit of data and only the best 100 combos are output. For each
> combination produced by script, a value is calculated.  I maintain an array
> of "best" results so far; if the size of this array is <100, then a "new"
> result is automatically added.  If there are already 100 items, I find the
> "worst" entry in the array.  If the "new" entry is better than the "worst",
> the "new" one replaces the "worst".
>
> I am not sure how your reference to "fail early" and "succeed early" apply
> to this situation.
>
> At the moment, the array is left unsorted.  If I use a sorted array, it
> needs to be resorted every time a "worst" entry is replaced by a "new"
> entry.  Can I avoid sorting the array every iteration?  How else can I
> find the "worst" entry?  If I replace "worst" with a "new" entry, doesn't
> the array need to be resorted?  How does the cost of searching an unsorted
> array compare to resorting on every update and searching a sorted array?
>
> Tom's idea about growing to 200, then chopping back to 100 also sounds
> interesting.
>

Sorting is expensive, but keeping an array sorted isn't always,
especially if you know that the majority of your data will fall
outside your window, preferably on the high side. I'd probably do
something like the following:

#!/usr/bin/perl

use warnings;
use strict;

my $keeprecs = 10; # the number of records you want to keep
my @data = (10,24,5,32,4,9,8,7,6,1,2,0,3);
my @array;         # kept sorted, smallest value first

while (@data) { # probably while (<>) in production
    my $number = pop @data;
    if (!@array or $number >= $array[-1]) {
        next if @array >= $keeprecs;
            # fail early if the number is high and we
            # already have $keeprecs records saved
        push @array, $number;
    } elsif ($number <= $array[0]) {
        unshift @array, $number;
    } else {
        # somewhere in the middle: scan for the first element
        # that is >= $number and insert just before it
        for my $i (1 .. $#array) {
            if ($number <= $array[$i]) {
                splice @array, $i, 0, $number;
                last;
            }
        }
    }
    pop @array if @array > $keeprecs;
}

print "@array\n";

The efficiency here will depend on the distribution of the data. If
every new piece of data falls between $array[-2] and $array[-1], it's
something on the order of O($keeprecs * N). In the best-case scenario,
with every record falling above the current maximum once the array is
full, it's O(N). In the real world, who knows, but it should be
reasonably fast.
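If that linear scan through the middle of the array ever becomes the
bottleneck, the splice point can be found with a binary search instead,
cutting that branch to O(log $keeprecs) comparisons. A minimal
standalone sketch (using a small hand-built @array rather than the
script's data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary search over a sorted array: returns the index of
# the first element >= $number (the insertion point).
sub insert_pos {
    my ($aref, $number) = @_;
    my ($lo, $hi) = (0, scalar @$aref);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($aref->[$mid] < $number) { $lo = $mid + 1 }
        else                         { $hi = $mid }
    }
    return $lo;
}

my @array = (0, 2, 3, 6, 9);
splice @array, insert_pos(\@array, 4), 0, 4;  # insert 4 in order
print "@array\n";   # prints "0 2 3 4 6 9"
```

For only 100 kept records the linear scan is probably fast enough, but
the binary search keeps the cost flat if $keeprecs ever grows.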

In any case, the key to a project like this is to figure out quickly
whether you want to keep a particular piece of data, and to spend as
little time as possible on the out-of-bounds data, which, if you want
1,000 records out of 1,000,000, should be the vast majority of it.
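Tom's grow-then-chop idea is also easy to sketch: let the buffer grow to
2*N, then sort and truncate back to N, so the full sort runs only once
per N inputs. A rough standalone version (the shuffled 1..10_000 range
is just a stand-in for the real combination stream):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

my $keep = 100;  # number of "best" (lowest) records to report
my @best;        # unsorted between chops, sorted after each chop

sub chop_best {
    @best = sort { $a <=> $b } @best;       # best (lowest) values first
    splice(@best, $keep) if @best > $keep;  # discard all past the first $keep
}

for my $value (shuffle 1 .. 10_000) {       # stand-in for the real stream
    push @best, $value;
    chop_best() if @best >= 2 * $keep;      # sort only when the buffer doubles
}
chop_best();                                # final pass for the leftovers

print "@best[0 .. 4]\n";   # prints "1 2 3 4 5"
```

This trades a little memory (2*N items) for doing N cheap pushes per
sort, rather than an ordered insert on every single input.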

HTH

-- jay
--------------------------------------------------
This email and attachment(s): [  ] blogable; [ x ] ask first; [  ]
private and confidential

daggerquill [at] gmail [dot] com
http://www.tuaw.com  http://www.dpguru.com  http://www.engatiki.org

values of β will give rise to dom!
