On Jun 22, 5:48 pm, [EMAIL PROTECTED] (Andrej Kastrin) wrote: > Dear all, > > I wrote a simple sql querry to count co-occurrences between words but it > performs very very slow on large datasets. So, it's time to do it with > Perl. I need just a short tip to start out: which structure to use to > count all possible occurrences between letters (e.g. A, B and C) under > the particular document number. My dataset looks like following: > > 1 A > 1 B > 1 C > 1 B > 2 A > 2 A > 2 B > 2 C > etc. till doc. number 100.000 > > The result file should than be similar to: > A B 4 ### 2 co-occurrences under doc. number 1 + 2 co-occurrences > under doc. number 2 > A C 3 ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under > doc. number 2 > B C 3 ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under > doc. number 2 > > Thanks in advance for any pointers. > > Best, Andrej
use strict; use warnings; my %pairs; { my ($prev_doc_id,%word_count,$doc_id); # I've written this inner-sub as anonymous even though in this # simple example script there's no outer sub. In the general case # there would be and outer sub and perl doesn't implement nested # nonymous subs. my $add_to_total = sub { for my $first ( keys %word_count ) { for my $second ( keys %word_count ) { unless ( $first eq $second ) { $pairs{"$first $second"} += $word_count{$first} * $word_count{$second}; } } } %word_count=(); $prev_doc_id = $doc_id; }; while(<DATA>) { ( $doc_id, my $word) = /^(\d+) (\w+)$/ or die; $add_to_total->() unless defined $prev_doc_id && $doc_id eq $prev_doc_id; $word_count{$word}++; } $add_to_total->(); } for ( sort keys %pairs ) { print "$_ $pairs{$_}\n"; } __DATA__ 1 A 1 B 1 C 1 B 2 A 2 A 2 B 2 C -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/