On Jun 22, 12:48 pm, [EMAIL PROTECTED] (Andrej Kastrin) wrote:
> I wrote a simple SQL query to count co-occurrences between words but it
> performs very very slow on large datasets. So, it's time to do it with
> Perl. I need just a short tip to start out: which structure to use to
> count all possible occurrences between letters (e.g. A, B and C) under
> the particular document number. My dataset looks like following:
>
> 1 A
> 1 B
> 1 C
> 1 B
> 2 A
> 2 A
> 2 B
> 2 C
> etc. till doc. number 100.000
>
> The result file should then be similar to:
> A B 4 ### 2 co-occurrences under doc. number 1 + 2 co-occurrences under doc. number 2
> A C 3 ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under doc. number 2
> B C 3 ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under doc. number 2

Maybe I'm just a little slow on the uptake, but I don't at all understand the correlation between your sample input and sample output. Where did "A B 4" come from, and what does "2 co-occurrences" under doc number 1 mean? What is a co-occurrence? I see one instance of "1 A" and two instances of "1 B". How does that translate to "2 co-occurrences" of "A B"?

Can you explain your desired goal a little better?

Paul Lalli
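
One reading that would reproduce the sample output: within each document, count how often each letter appears, then for every pair of distinct letters add the product of their two counts, summed over all documents. The following is only a minimal sketch under that assumption; the interpretation itself, the sub name count_cooccurrences, and the "doc letter" whitespace-separated line format are guesses for illustration, not something the original poster has confirmed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed reading: within one document, a pair (X, Y) co-occurs
# count(X) * count(Y) times; totals are summed over all documents.
sub count_cooccurrences {
    my @lines = @_;

    my %per_doc;    # $per_doc{$doc}{$word} = occurrences of $word in $doc
    for my $line (@lines) {
        my ($doc, $word) = split ' ', $line;
        next unless defined $word;    # skip blank or malformed lines
        $per_doc{$doc}{$word}++;
    }

    my %pairs;      # $pairs{"X Y"} = total co-occurrences, with X lt Y
    for my $counts (values %per_doc) {
        my @words = sort keys %$counts;
        for my $i (0 .. $#words - 1) {
            for my $j ($i + 1 .. $#words) {
                $pairs{"$words[$i] $words[$j]"}
                    += $counts->{ $words[$i] } * $counts->{ $words[$j] };
            }
        }
    }
    return \%pairs;
}

# Sample data from the original post
my @sample = ('1 A', '1 B', '1 C', '1 B', '2 A', '2 A', '2 B', '2 C');
my $pairs  = count_cooccurrences(@sample);
print "$_ $pairs->{$_}\n" for sort keys %$pairs;
```

On the eight sample lines this prints "A B 4", "A C 3" and "B C 3", which agrees with the numbers in the question. If a co-occurrence is meant to be something else (for example, each pair counted at most once per document), only the product in the inner loop would need to change.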