Hi, I was able to get my Finnish corpus project off the ground this
week with help from this group; thank you very much.
Now I've run into a small problem. After reading in the corpus of
470,000 words and breaking them into syllables, I have created a list
of all possible "nonwords" (words missing from the corpus) that
consist of 2 syllables, each consisting of a consonant and vowel --
CV for short. (I need this list as part of a research project on
language learning, where we will use possible words that don't
already exist in the language).
I was able to generate all possible CV-CV strings and check them
efficiently against entire words in the corpus using hashes. However,
I also want to make sure that the strings do not occur as substrings
of any words in the corpus. I thought grep would be perfect, but the
problem is that there are 66,000 nonwords and 470,000 words; it takes
about 4-5 seconds to check each nonword (on my sad old Mac laptop)
with the code pasted in below, so the full run would need roughly
three to four days. Can anyone suggest a more efficient method? (I
know my code could be much more concise, but I am primarily
interested in speed, of course.)
thank you very much,
jim
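$at = 0;
while (<SYLFILE>) {
    chomp;
    # we only care about $ortho[$at], the orthographic string
    ($frq, $ortho[$at], $cv, $nsyl, $cvsyl, $orthosyl) = split;
    $at++;
}
close SYLFILE;
# scalar(@ortho) is the item count; "$#ortho + 1" is not evaluated inside a string
print STDERR "READ ", scalar(@ortho), " items from $sylfile\n";
while (<NWFILE>) {
    # strip hyphens; this changes $_, so the printed line has none either
    s/-//g;
    # we only care about $nw, the nonword string
    ($nw, $s1frq, $s2frq, $ratio) = split;
    # scan every corpus word for the nonword as a substring (the slow step)
    @matches = grep { /$nw/ } @ortho;
    # print the line only if no corpus word contains the nonword
    print unless @matches;
}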
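
One possible way around the cost of scanning all 470,000 words once
per nonword is to turn the loop around: put the nonwords in a hash,
walk through each corpus word once, and delete every nonword that
shows up as a substring window of that word. The following is only a
minimal sketch under assumptions, not something timed on the real
corpus. The sub name unattested and the tiny word lists are made up
for illustration, and the candidates are assumed to be plain strings
with the hyphens already removed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that never occur as a
# substring of any word. Both arguments are array references.
sub unattested {
    my ($candidates, $words) = @_;

    # Start by assuming every candidate is unattested.
    my %alive = map { $_ => 1 } @$candidates;

    # Distinct candidate lengths (a single value if all are the same size).
    my %len;
    $len{ length $_ } = 1 for @$candidates;
    my @lengths = keys %len;

    for my $word (@$words) {
        for my $n (@lengths) {
            # Slide a window of width $n across the word; the range is
            # empty when the word is shorter than $n.
            for my $i (0 .. length($word) - $n) {
                # Deleting a key that is not in the hash is harmless.
                delete $alive{ substr($word, $i, $n) };
            }
        }
    }

    # Keep the survivors in their original order.
    return grep { $alive{$_} } @$candidates;
}

# Made-up sample data, not taken from the real files.
my @words      = qw(talo katu kalastaja);
my @candidates = qw(kala lasi tako staj);
print "$_\n" for unattested(\@candidates, \@words);
```

With this layout the work grows with the total number of characters
in the corpus (times the number of distinct nonword lengths) instead
of with 66,000 x 470,000 regex matches. In the sample data, kala and
staj are dropped because both occur inside kalastaja.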