Hello,

I do not have access to the sort operation. All I have is a file that is "sorted" but I don't know exactly the mechanism by which it was sorted. What I am trying to do is write a comparison function--given any two lines in this file, return -1, 0, 1 as perl's cmp function does. I don't have to sort the whole file, I just need to be able to tell, given two lines, which should come before the other in the sorted ordering. (I need this to do a binary search over the file, which is quite large).

Anyhow, I'm posting the function I wrote to do this. It is quite long and to me is naive. It is also pretty slow. I have tested function in a larger script that I wrote. As a method of comparison, the script takes 23 sec to run over a 500k file using my function. When I substituted perl's cmp for my function, the same script ran in 12 sec. I'd really like to get this running more quickly as I have to run my script on files MUCH larger than 500k.

I'm thinking the way the file is sorted is something simple, yet something I don't recgonize! As a result, you'll see that my method is probably very over-complicated.

I have also posted some more data that is representative of what I have to work with. Both the comparison function I'm using and the sample data are attached if anyone wants to check it out. I appreicate any suggestions.

Dan


Rob Dixon wrote:
Dan Laflamme wrote:

I have a file that appears to be somewhat sorted, but is not sorted
according to the traditional unix sort. I'll give some examples, and if
anyone recgonizes the way in which the file is sorted, please let me
know. Also, since I may have to write a comparator function for use on
this "sorted" file, any tips on doing that in the most efficient way
possible would be helpful. Thanks.

Examples. I give some sets of strings to indicate which comes before the
other in the "sorted" file. The first one in each pair is less than the
second. Each set is independent of the others. set x doesnt necessarily
come before set y in the file if x < y.

Set 1:
0-CELEBRITY-0
0-CELEBRITY

Set 2:
0-0-7
0-0

Set 3:
000-AAA
000

Set 4:
000
0000

Set 5:
0908
09088122595
09088122595
0909-114


We could make guesses as to the sort algorithm, but there is very little data
here to go on. If you sort stringwise on the first fragment of the string up to
the dash then you will get exactly this order. But several strings have the same
first fragment (all of Set 1 and Set 2, for instance, start with the value '0').

Do you have access to the sort operation so that you can sort arbitrary data?
This can be decided properly only by seeing how contrived test data is sorted.

Rob




0-02-13.LCDAFAH-JDC.LAME-DELEGATION
0-13-02.LCDAFAH-JDC.LAME-DELEGATION
0.0-1
0.BLACKCAT.NS.CHIARK
0.NS.ORBIS
0.NS.USMAC
0.TACOMAWIRELESS
000.SERVIDORES
0000.SERVIDORES
000SGWBIN-01.SMARTGW
01010101.MENTALFLOSS
01NS01.AIR-SPEED
01NS02.AIR-SPEED
02ROBOX01.I
0803MLSS.SHIENHI
0X6E7331.0XEB
0X6E7332.0XEB
1.0-1
1.07
1.1C
1.AD4HOST
1.BLACKCAT.NS.CHIARK
1.NDB.9SRV
1.NS.NARKIEWICZ
1.NS.ORBIS
1.NS.USMAC
1.QUDSWAY
101Q-MA.MAILSCAN
105.MGOG
105.WEBWORLDS
10801NET.10801
10DEEP.PAPERKILL
10TH-PLANET.CELTIC-HOSTING
11016A.COBALTHOSTING
111.ICOMIS
12-USCOFC-DC-01.INWIRED
900DOM002.EMNW
91C
98TM
99.800GIFT
9KS.OHBO
A-93.AKADNS
A-NS.PRIMARI
A-NS.SCHALIT-WAY
A-NS.SECQR
A-S1.ANTWINES
A-S2.ANTWINES
A.BEK.JP.NS.TOKYOTOPLESS
A.BENCHTOPLABS
A.BEST-VIEW-ROOT-SERVERS
A.BINSOORE
A.BITLESS
A.BODOSOM
A.DATESEX
A.DNS-RDMEDIAS
A.DNS.HOSTWAY
A.DNS.TDS
A.E2E-TEST
A.EGPG.GEO.HOST.ZENSOFT
A.FREEMINDDESIGNS
AAI-NS1.INTRNET
AAI-NS2.INTRNET
AANDA01.AANDAELECTRIC
AANDA02.AANDAELECTRIC
AANDKCATERING4-BIZOPIA.NS1.AANDKCATERING
AANDKCATERING4-BIZOPIA.NS2.AANDKCATERING
AANS1.AAHOSTING
AARDVARK.ZOO
AARON.MATHFORGE
AARONCAKE
AAYT.NIS
AB.WIREMOUTH
ABA.ALKALAY
ABACI.RUTIL
ABACO.KENNETT
ABAGAIL.BCINET
ABBADON.COMSATVEN
ABBOTT.UPLINC
ABBY-W2K1.ABBYSENIOR
ABBY.WIREDBIZ
ABC-MARKET.ABC-MARKET
ABC-MARKET

sub compare {
    my $a = shift;
    my $b = shift;

    ##print "a: $a\n";
    ##print "b: $b\n";

    # get the indices of any delimiters
    my $adash = index($a, '-');
    my $bdash = index($b, '-');
    my $adot = index($a, '.');
    my $bdot = index($b, '.');

    ##print "adash: $adash adot: $adot\n";
    ##print "bdash: $bdash bdot: $bdot\n";

    my ($aDelim, $bDelim, $aDelimType, $bDelimType);

    if ($adash == -1 && $adot == -1) {
        # no delimiter in either string
        $aDelim = -1; $aDelimType = 4;
    } elsif ($adash == -1) {
        # a's first delimiter is a dot
        $aDelim = $adot; $aDelimType = 3;
    } elsif ($adot == -1) {
        # a's first delimiter is a dash
        $aDelim = $adash; $aDelimType = 2;
    } else {
        # smaller of the two
        if ($adash < $adot) {
            $aDelim = $adash;
            $aDelimType = 2;
        } else {
            $aDelim = $adot;
            $aDelimType = 3;
        }
    }


    if ($bdash == -1 && $bdot == -1) {
        # no delimiter in either string
        $bDelim = -1; $bDelimType = 4;
    } elsif ($bdash == -1) {
        # a's first delimiter is a dot
        $bDelim = $bdot; $bDelimType = 3;
    } elsif ($bdot == -1) {
        # a's first delimiter is a dash
        $bDelim = $bdash; $bDelimType = 2;
    } else {
        # smaller of the two
        if ($bdash < $bdot) {
            $bDelim = $bdash;
            $bDelimType = 2;
        } else {
            $bDelim = $bdot;
            $bDelimType = 3;
        }
    }


    ##print "aDelim: $aDelim\n";
    ##print "bDelim: $bDelim\n";

    if ( ($aDelim > 0) && ($bDelim > 0) ) {
        # both strings start with some text, then have a delimiter
        
        # get the text before the first delimiter
        my $asubstr = substr($a, 0, $aDelim);
        my $bsubstr = substr($b, 0, $bDelim);

        #print "asubstr: $asubstr\n";
        #print "bsubstr: $bsubstr\n";

        my $res = $asubstr cmp $bsubstr;

        if ($res == 0) {
            # if the strings are equal, we have to check which delimiter
            # comes next. If one has a dash and the other has a dot
            # we switch the result
            if (($aDelimType == 2) && ($bDelimType != 2)) {
                return -1;
            } elsif (($aDelimType != 2) && ($bDelimType == 2)) {
                return 1;
            } else {
                # recursively process the text after the delimiters
                my $aAfterDelim = substr($a, $aDelim+1);
                my $bAfterDelim = substr($b, $bDelim+1);
                return compare($aAfterDelim, $bAfterDelim);
            }
        } else {
            return $res;
        }
    }



    if (($aDelim == -1) && ($bDelim == -1)) {
        # no delimiters at all. just compare
        return $a cmp $b;
    }
    


    if ( ($aDelim == 0 ) || ($bDelim == 0)) {
        # $a, $b, or both have a delimiter at the first index

        my $a_zero = ($aDelim == 0);
        my $b_zero = ($bDelim == 0);

        if ($a_zero && $b_zero) {
            ##print "both zero\n";
            ##print "a: $a\n";
            ##print "b: $b\n";
            
            if (($aDelimType == 2) || ($bDelimType == 2)) {
                # we can have a string of "-" dash characters, but 
                # not a string of "." characters. Remove corresponding
                # characters from the string.
                while ( (index($a, '-') == 0) ) {
                    if ( (index($b, '-') == 0) ) {
                        substr($a, 0, 1, '');
                        substr($b, 0, 1, '');
                    } else {
                        last;
                    }
                 }
             }

            ##print "a: $a\n";
            ##print "b: $b\n";
            # recursively process the rest of string
            return compare($a, $b);

        } elsif ($a_zero) {
            return -1;
        } else {
            return 1;
        }
    }


    # get the string with the most number of delimiters left
    my $aDelimGreater = ($aDelim > $bDelim);

    if ( $aDelimGreater) {

        my $asubstr = substr($a, 0, $aDelim);
        ##print "asubstr: $asubstr\n";

        my $res = $asubstr cmp $b;
        if ($res == 0) {
            # if the strings are equal, we have to check which delimiter
            # comes next. If one has a dash and the other has a dot
            # we switch the result
            if (($aDelimType == 2) && ($bDelimType != 2)) {
                return -1;
            } elsif (($aDelimType != 2) && ($bDelimType == 2)) {
                return 1;
            } else {
                # the first words were equal. we have to look
                # past the delimiter
                my $aAfterDelim = substr($a, $aDelim+1);
                my $bAfterDelim = substr($b, $bDelim+1);
                
                # recursively process the rest of the string
                return compare($aAfterDelim, $bAfterDelim);
            }
        } else {
            return $res;
        }
        
    } else {

      
        my $bsubstr = substr($b, 0, $bDelim);
        ##print "bsubstr: $bsubstr\n";

        my $res = $a cmp $bsubstr;

        if ($res == 0) {
            # if the strings are equal, we have to check which delimiter
            # comes next. If one has a dash and the other has a dot
            # we switch the result
            if (($aDelimType == 2) && ($bDelimType != 2)) {
                return -1;
            } elsif (($aDelimType != 2) && ($bDelimType == 2)) {
                return 1;
            } else {
                # the first words were equal. we have to look
                # past the delimiter
                my $aAfterDelim = substr($a, $aDelim+1);
                my $bAfterDelim = substr($b, $bDelim+1);
                
                # recursively process the rest of the string 
                return compare($aAfterDelim, $bAfterDelim);
            }
        } else {
            return $res;
        }
    }

    print "a: $a\n";
    print "b: $b\n";
    die "error in compare function";
}

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Reply via email to