Re: statistics of text

John Doe Wed, 02 Nov 2005 08:42:17 -0800

Ing. Branislav Gerzo wrote on Wednesday, 2 November 2005, 14:52:
> Hi all,
>
> I have quite interesting work. Example:
>
> In txt I have some words (up to 100.000) - words.txt (without line
> numbers):
> 1. foo
> 2. bar
> 3. foo bar
> 4. foo bar bar
> 5. bar foo bar
> 6. bar bar foo
> 7. foo foo bar
> 8. foo bar foo bar
> 9. foob bar
> 10. foo bars
>
> and so on...
>
> Now, I have to find all 2 words sentences with their sums in the list.
> For example for this list it could be (without reporting lines):
> "foo bar" - 5 times (lines: 3, 4, 5, 7, 8)
> "bar bar" - 2 times (lines: 4, 6)
> "bar foo" - 3 times (lines: 5, 6, 8)
> "foo foo" - 1 time (line: 7)
> "foob bar" - 1 time (line: 9)
> "foo bars" - 1 time (line: 10)
>
> I did this by hand... but does anyone know how to do this efficiently in Perl?
> I think I have to build a hash of all possible two-word sentences (only
> [0-9a-z ] is allowed in the input txt), keep the lines of the input txt
> in a list, and iterate every hash key over that array, storing the
> number of occurrences as the hash value ("foo bar" => 5)... hm?


Here is another variant. It:
- combines only adjacent words into pairs
- counts every occurrence of a pair, including repeats within one line
- reports only pairs that contain the keyword


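#!/usr/bin/perl

use strict;
use warnings;

my $prereq = qr/\bfoo\b/;   # a reported pair must match this keyword
my %found;                  # "word1 word2" => number of occurrences

while (<DATA>) {
  /$prereq/ or next;        # cheap pre-filter: skip lines without the keyword
  chomp;
  my @w = split ' ';        # split ' ' also ignores leading whitespace
  next if @w < 2;
  # count every adjacent pair, including pairs without the keyword
  $found{"$w[$_] $w[$_+1]"}++ for 0 .. $#w - 1;
}

# the pre-filter lets through pairs such as "bar bar", so filter again here
print map  { "$_: $found{$_}\n" }
      grep { /$prereq/ }
      keys %found;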

__END__
foo
bar
foo bar
foo bar bar
bar foo bar
bar bar foo
foo foo bar
foo bar foo bar
foob bar
foo bars

# prints (hash order, so the lines may appear in a different order)
foo foo: 1
foo bar: 6
bar foo: 3
foo bars: 1
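
Note that "foo bar" comes out as 6 here, not 5 as in the question, because
line 8 contains it twice. If each pair should count at most once per line,
and the line numbers are wanted too, something like the following might do.
It is only a sketch: the names @lines, %lines_for and %seen are mine, the
sample list is inlined instead of being read from words.txt, and there is
no keyword filter, so all pairs are reported.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines from the question; line numbers are implicit (1..10).
my @lines = (
    'foo', 'bar', 'foo bar', 'foo bar bar', 'bar foo bar',
    'bar bar foo', 'foo foo bar', 'foo bar foo bar', 'foob bar', 'foo bars',
);

my %lines_for;   # "word1 word2" => [ numbers of the lines containing it ]

for my $i (0 .. $#lines) {
    my @w = split ' ', $lines[$i];
    my %seen;                        # pairs already counted on this line
    for my $j (0 .. $#w - 1) {       # empty range for lines with < 2 words
        my $pair = "$w[$j] $w[$j+1]";
        next if $seen{$pair}++;      # count each pair only once per line
        push @{ $lines_for{$pair} }, $i + 1;
    }
}

# Report each pair with its count and the lines it was found on.
for my $pair (sort keys %lines_for) {
    my @n = @{ $lines_for{$pair} };
    printf "%s: %d (lines: %s)\n", $pair, scalar @n, join ', ', @n;
}
```

On the sample list this gives "foo bar" 5 times (lines 3, 4, 5, 7, 8), the
same as the hand count in the question.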
