Hi, I was able to get my Finnish corpus project off the ground this week with help from this group; thank you very much.

Now I've run into a small problem. After reading in the corpus of 470,000 words and breaking them into syllables, I have created a list of all possible "nonwords" (words missing from the corpus) that consist of 2 syllables, each consisting of a consonant and a vowel -- CV for short. (I need this list as part of a research project on language learning, where we will use possible words that don't already exist in the language.)

I was able to generate all possible CV-CV strings and check them efficiently against entire words in the corpus using hashes. However, I also want to make sure that the strings do not occur as substrings of any word in the corpus. I thought grep would be perfect, but there are 66,000 nonwords and 470,000 words, and with the code pasted in below it takes about 4-5 seconds to check each nonword (on my sad old Mac laptop). At that rate the full run would take roughly three to four days. Can anyone suggest a more efficient method? (I know my code could be much more concise, but I am primarily interested in speed, of course.)

Thank you very much,

jim

$at = 0;
while(<SYLFILE>){
  chomp;
  # on next line, we only care about $ortho, the orthographic string
  ($frq, $ortho[$at], $cv, $nsyl, $cvsyl, $orthosyl) = split;
  $at++;
}
close SYLFILE;
# scalar(@ortho) is the item count; "$#ortho + 1" inside a string is not evaluated
print STDERR "READ ", scalar(@ortho), " items from $sylfile\n";

while(<NWFILE>){
  s/-//g;
  # on next line, we only care about $nw, the nonword string
  ($nw, $s1frq, $s2frq, $ratio) = split;
  @matches = grep { /$nw/ } @ortho;
  unless($#matches >= 0){
    print;
  }
}
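
One possible way to speed this up, given only as a rough sketch: instead of scanning all of @ortho once per nonword, walk the corpus a single time and store every substring of the relevant lengths in a hash. Each nonword check then becomes one hash lookup, the same trick already used for the whole-word check. The sample words and the names %lengths, %seen, @nonwords and @survivors are made up for illustration; only @ortho comes from the code above, and the real script would fill both lists from SYLFILE and NWFILE.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for @ortho and the nonword list.
my @ortho    = qw(talo kala katu sanoma);
my @nonwords = qw(kala lota noma tiku);

# Record which string lengths occur among the nonwords, so that only
# corpus substrings of those lengths need to be indexed.
my %lengths = map { length($_) => 1 } @nonwords;

# One pass over the corpus: remember every substring of a relevant length.
my %seen;
for my $word (@ortho) {
    for my $len (keys %lengths) {
        # The range is empty when the word is shorter than $len.
        for my $start (0 .. length($word) - $len) {
            $seen{ substr($word, $start, $len) } = 1;
        }
    }
}

# A nonword survives only if it never occurs inside any corpus word.
my @survivors = grep { !$seen{$_} } @nonwords;
print "$_\n" for @survivors;
```

With this sample data, "kala" is dropped because it is a word and "noma" because it occurs inside "sanoma". The cost is extra memory for %seen, but the work is a single pass over the corpus plus one lookup per nonword, rather than 66,000 scans of 470,000 words.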