I do not have access to the sort operation. All I have is a file that is "sorted" but I don't know exactly the mechanism by which it was sorted. What I am trying to do is write a comparison function--given any two lines in this file, return -1, 0, 1 as perl's cmp function does. I don't have to sort the whole file, I just need to be able to tell, given two lines, which should come before the other in the sorted ordering. (I need this to do a binary search over the file, which is quite large).
Anyhow, I'm posting the function I wrote to do this. It is quite long and to me is naive. It is also pretty slow. I have tested function in a larger script that I wrote. As a method of comparison, the script takes 23 sec to run over a 500k file using my function. When I substituted perl's cmp for my function, the same script ran in 12 sec. I'd really like to get this running more quickly as I have to run my script on files MUCH larger than 500k.
I'm thinking the way the file is sorted is something simple, yet something I don't recgonize! As a result, you'll see that my method is probably very over-complicated.
I have also posted some more data that is representative of what I have to work with. Both the comparison function I'm using and the sample data are attached if anyone wants to check it out. I appreicate any suggestions.
Dan
Rob Dixon wrote:
Dan Laflamme wrote:
I have a file that appears to be somewhat sorted, but is not sorted according to the traditional unix sort. I'll give some examples, and if anyone recgonizes the way in which the file is sorted, please let me know. Also, since I may have to write a comparator function for use on this "sorted" file, any tips on doing that in the most efficient way possible would be helpful. Thanks.
Examples. I give some sets of strings to indicate which comes before the other in the "sorted" file. The first one in each pair is less than the second. Each set is independent of the others. set x doesnt necessarily come before set y in the file if x < y.
Set 1: 0-CELEBRITY-0 0-CELEBRITY
Set 2: 0-0-7 0-0
Set 3: 000-AAA 000
Set 4: 000 0000
Set 5: 0908 09088122595 09088122595 0909-114
We could make guesses as to the sort algorithm, but there is very little data here to go on. If you sort stringwise on the first fragment of the string up to the dash then you will get exactly this order. But several strings have the same first fragment (all of Set 1 and Set 2, for instance, start with the value '0').
Do you have access to the sort operation so that you can sort arbitrary data? This can be decided properly only by seeing how contrived test data is sorted.
Rob
0-02-13.LCDAFAH-JDC.LAME-DELEGATION 0-13-02.LCDAFAH-JDC.LAME-DELEGATION 0.0-1 0.BLACKCAT.NS.CHIARK 0.NS.ORBIS 0.NS.USMAC 0.TACOMAWIRELESS 000.SERVIDORES 0000.SERVIDORES 000SGWBIN-01.SMARTGW 01010101.MENTALFLOSS 01NS01.AIR-SPEED 01NS02.AIR-SPEED 02ROBOX01.I 0803MLSS.SHIENHI 0X6E7331.0XEB 0X6E7332.0XEB 1.0-1 1.07 1.1C 1.AD4HOST 1.BLACKCAT.NS.CHIARK 1.NDB.9SRV 1.NS.NARKIEWICZ 1.NS.ORBIS 1.NS.USMAC 1.QUDSWAY 101Q-MA.MAILSCAN 105.MGOG 105.WEBWORLDS 10801NET.10801 10DEEP.PAPERKILL 10TH-PLANET.CELTIC-HOSTING 11016A.COBALTHOSTING 111.ICOMIS 12-USCOFC-DC-01.INWIRED 900DOM002.EMNW 91C 98TM 99.800GIFT 9KS.OHBO A-93.AKADNS A-NS.PRIMARI A-NS.SCHALIT-WAY A-NS.SECQR A-S1.ANTWINES A-S2.ANTWINES A.BEK.JP.NS.TOKYOTOPLESS A.BENCHTOPLABS A.BEST-VIEW-ROOT-SERVERS A.BINSOORE A.BITLESS A.BODOSOM A.DATESEX A.DNS-RDMEDIAS A.DNS.HOSTWAY A.DNS.TDS A.E2E-TEST A.EGPG.GEO.HOST.ZENSOFT A.FREEMINDDESIGNS AAI-NS1.INTRNET AAI-NS2.INTRNET AANDA01.AANDAELECTRIC AANDA02.AANDAELECTRIC AANDKCATERING4-BIZOPIA.NS1.AANDKCATERING AANDKCATERING4-BIZOPIA.NS2.AANDKCATERING AANS1.AAHOSTING AARDVARK.ZOO AARON.MATHFORGE AARONCAKE AAYT.NIS AB.WIREMOUTH ABA.ALKALAY ABACI.RUTIL ABACO.KENNETT ABAGAIL.BCINET ABBADON.COMSATVEN ABBOTT.UPLINC ABBY-W2K1.ABBYSENIOR ABBY.WIREDBIZ ABC-MARKET.ABC-MARKET ABC-MARKET
sub compare { my $a = shift; my $b = shift; ##print "a: $a\n"; ##print "b: $b\n"; # get the indices of any delimiters my $adash = index($a, '-'); my $bdash = index($b, '-'); my $adot = index($a, '.'); my $bdot = index($b, '.'); ##print "adash: $adash adot: $adot\n"; ##print "bdash: $bdash bdot: $bdot\n"; my ($aDelim, $bDelim, $aDelimType, $bDelimType); if ($adash == -1 && $adot == -1) { # no delimiter in either string $aDelim = -1; $aDelimType = 4; } elsif ($adash == -1) { # a's first delimiter is a dot $aDelim = $adot; $aDelimType = 3; } elsif ($adot == -1) { # a's first delimiter is a dash $aDelim = $adash; $aDelimType = 2; } else { # smaller of the two if ($adash < $adot) { $aDelim = $adash; $aDelimType = 2; } else { $aDelim = $adot; $aDelimType = 3; } } if ($bdash == -1 && $bdot == -1) { # no delimiter in either string $bDelim = -1; $bDelimType = 4; } elsif ($bdash == -1) { # a's first delimiter is a dot $bDelim = $bdot; $bDelimType = 3; } elsif ($bdot == -1) { # a's first delimiter is a dash $bDelim = $bdash; $bDelimType = 2; } else { # smaller of the two if ($bdash < $bdot) { $bDelim = $bdash; $bDelimType = 2; } else { $bDelim = $bdot; $bDelimType = 3; } } ##print "aDelim: $aDelim\n"; ##print "bDelim: $bDelim\n"; if ( ($aDelim > 0) && ($bDelim > 0) ) { # both strings start with some text, then have a delimiter # get the text before the first delimiter my $asubstr = substr($a, 0, $aDelim); my $bsubstr = substr($b, 0, $bDelim); #print "asubstr: $asubstr\n"; #print "bsubstr: $bsubstr\n"; my $res = $asubstr cmp $bsubstr; if ($res == 0) { # if the strings are equal, we have to check which delimiter # comes next. If one has a dash and the other has a dot # we switch the result if (($aDelimType == 2) && ($bDelimType != 2)) { return -1; } elsif (($aDelimType != 2) && ($bDelimType == 2)) { return 1; } else { # recursively process the text after the delimiters my $aAfterDelim = substr($a, $aDelim+1); my $bAfterDelim = substr($b, $bDelim+1); return compare($aAfterDelim, $bAfterDelim); } } else { return $res; } } if (($aDelim == -1) && ($bDelim == -1)) { # no delimiters at all. just compare return $a cmp $b; } if ( ($aDelim == 0 ) || ($bDelim == 0)) { # $a, $b, or both have a delimiter at the first index my $a_zero = ($aDelim == 0); my $b_zero = ($bDelim == 0); if ($a_zero && $b_zero) { ##print "both zero\n"; ##print "a: $a\n"; ##print "b: $b\n"; if (($aDelimType == 2) || ($bDelimType == 2)) { # we can have a string of "-" dash characters, but # not a string of "." characters. Remove corresponding # characters from the string. while ( (index($a, '-') == 0) ) { if ( (index($b, '-') == 0) ) { substr($a, 0, 1, ''); substr($b, 0, 1, ''); } else { last; } } } ##print "a: $a\n"; ##print "b: $b\n"; # recursively process the rest of string return compare($a, $b); } elsif ($a_zero) { return -1; } else { return 1; } } # get the string with the most number of delimiters left my $aDelimGreater = ($aDelim > $bDelim); if ( $aDelimGreater) { my $asubstr = substr($a, 0, $aDelim); ##print "asubstr: $asubstr\n"; my $res = $asubstr cmp $b; if ($res == 0) { # if the strings are equal, we have to check which delimiter # comes next. If one has a dash and the other has a dot # we switch the result if (($aDelimType == 2) && ($bDelimType != 2)) { return -1; } elsif (($aDelimType != 2) && ($bDelimType == 2)) { return 1; } else { # the first words were equal. we have to look # past the delimiter my $aAfterDelim = substr($a, $aDelim+1); my $bAfterDelim = substr($b, $bDelim+1); # recursively process the rest of the string return compare($aAfterDelim, $bAfterDelim); } } else { return $res; } } else { my $bsubstr = substr($b, 0, $bDelim); ##print "bsubstr: $bsubstr\n"; my $res = $a cmp $bsubstr; if ($res == 0) { # if the strings are equal, we have to check which delimiter # comes next. If one has a dash and the other has a dot # we switch the result if (($aDelimType == 2) && ($bDelimType != 2)) { return -1; } elsif (($aDelimType != 2) && ($bDelimType == 2)) { return 1; } else { # the first words were equal. we have to look # past the delimiter my $aAfterDelim = substr($a, $aDelim+1); my $bAfterDelim = substr($b, $bDelim+1); # recursively process the rest of the string return compare($aAfterDelim, $bAfterDelim); } } else { return $res; } } print "a: $a\n"; print "b: $b\n"; die "error in compare function"; }
-- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>