grep is too slow...

Jim Magnuson Sun, 14 Jan 2007 08:40:21 -0800

Hi, I was able to get my Finnish corpus project off the ground this week with help from this group; thank you very much.

Now I've run into a small problem. After reading in the corpus of 470,000 words and breaking them into syllables, I have created a list of all possible "nonwords" (words missing from the corpus) that consist of 2 syllables, each consisting of a consonant and a vowel (CV for short). I need this list as part of a research project on language learning, where we will use possible words that don't already exist in the language.

I was able to generate all possible CV-CV strings and check them efficiently against entire words in the corpus using hashes. However, I also want to make sure that the strings do not occur as substrings of any words in the corpus. I thought grep would be perfect, but the problem is that there are 66,000 nonwords and 470,000 words; it takes about 4-5 secs to check each nonword (on my sad old mac laptop) with the code pasted in below. Can anyone suggest a more efficient method? (I know my code could be much more concise, but I am primarily interested in speed, of course.)


thank you very much,

jim

$at = 0;
while(<SYLFILE>){
  chomp;
  # on next line, we only care about $ortho, the orthographic string
  ($frq, $ortho[$at], $cv, $nsyl, $cvsyl, $orthosyl) = split;
  $at++;
}
close SYLFILE;
# scalar(@ortho) is the item count; "$#ortho + 1" inside a string is not evaluated
print STDERR "READ ", scalar(@ortho), " items from $sylfile\n";

while(<NWFILE>){
  s/-//g;
  # on next line, we only care about $nw, the nonword string
  ($nw, $s1frq, $s2frq, $ratio) = split;
  @matches = grep { /$nw/ } @ortho;
  unless($#matches >= 0){
    print;
  }
}
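
One possible way to avoid scanning all 470,000 words once per nonword is to invert the check: walk the corpus once, record every substring whose length matches some nonword, and then test each nonword with a hash lookup. The sketch below is only an illustration under stated assumptions. The function name unattested and the toy word lists are hypothetical and not part of the script above. It assumes the nonwords are plain letter strings (so a literal substring test is equivalent to the /$nw/ regex match) and that the input has been decoded so that letters such as ä and ö count as one character each.

```perl
use strict;
use warnings;

# Hypothetical helper: return the nonwords that never occur as a
# substring of any corpus word. Takes two array references.
sub unattested {
    my ($words, $nonwords) = @_;

    # One hash of seen substrings per distinct nonword length.
    my %by_len;
    $by_len{ length $_ } ||= {} for @$nonwords;

    # Single pass over the corpus: slide a window of each relevant
    # length across every word and record what is found.
    for my $word (@$words) {
        for my $len (keys %by_len) {
            # The range is empty when the word is shorter than $len.
            for my $i (0 .. length($word) - $len) {
                $by_len{$len}{ substr($word, $i, $len) } = 1;
            }
        }
    }

    # Keep only the nonwords that were never recorded.
    return grep { !$by_len{ length $_ }{$_} } @$nonwords;
}

# Toy data for illustration only, not taken from the real corpus.
my @ortho    = qw(talo katu sana takana);
my @nonwords = qw(taka kana lota sato);

print "$_\n" for unattested(\@ortho, \@nonwords);
```

In this toy run "taka" and "kana" are dropped because both occur inside "takana", while "lota" and "sato" are kept. In the script above, @ortho would supply the corpus words and each $nw (after the hyphens are stripped) would be looked up in the hash instead of being passed to grep. The cost is the memory for the substring hash, which should stay modest if all nonwords share one short length.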
