On Jun 22, 5:48 pm, [EMAIL PROTECTED] (Andrej Kastrin) wrote:
> Dear all,
>
> I wrote a simple sql querry to count co-occurrences between words but it
> performs very very slow on large datasets. So, it's time to do it with
> Perl. I need just a short tip to start out: which structure to use to
> count all possible occurrences between letters (e.g. A, B and C) under
> the particular document number. My dataset looks like following:
>
> 1 A
> 1 B
> 1 C
> 1 B
> 2 A
> 2 A
> 2 B
> 2 C
> etc. till doc. number 100.000
>
> The result file should than be similar to:
> A B 4   ### 2 co-occurrences under doc. number 1 + 2 co-occurrences
> under doc. number 2
> A C 3   ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under
> doc. number 2
> B C 3   ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under
> doc. number 2
>
> Thanks in advance for any pointers.
>
> Best, Andrej

use strict;
use warnings;
my %pairs;

{
    my ($prev_doc_id,%word_count,$doc_id);

    # I've written this inner-sub as anonymous even though in this
    # simple example script there's no outer sub. In the general case
    # there would be and outer sub and perl doesn't implement nested
    # nonymous subs.

    my $add_to_total = sub {
        for my $first ( keys %word_count ) {
            for my $second ( keys %word_count ) {
                unless ( $first eq $second ) {
                    $pairs{"$first $second"} +=
                        $word_count{$first} * $word_count{$second};
                }
            }
        }
        %word_count=();
        $prev_doc_id = $doc_id;
    };

    while(<DATA>) {
        ( $doc_id, my $word) = /^(\d+) (\w+)$/ or die;
        $add_to_total->()
            unless defined $prev_doc_id && $doc_id eq $prev_doc_id;
        $word_count{$word}++;
    }

    $add_to_total->();
}

for ( sort keys %pairs ) {
    print "$_ $pairs{$_}\n";
}
__DATA__
1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/


Reply via email to