Re: script to parse OCR output different between servers

Shlomi Fish Wed, 22 Jul 2015 08:39:06 -0700

Hi Gary,

some comments about your code.


On Wed, 22 Jul 2015 15:32:33 +0100
Gary Stainburn <gary.stainb...@ringways.co.uk> wrote:

> I've written the code below to parse a number of text page files generated by 
> Tesseract OCR software to look for a guess the most likely values for VIN, 
> Reg number and stock number for a vehicle.
> 
> On my development site it works fine and gives the values it should.
> However, on my live server it only returns '1' as the value.
> 
> The only thing I can think of is that my live server is much older than my 
> development server.  My perl versions are:
> 
> Live          v5.10.0
> Devel v5.18.4

version 5.10.0 of perl is really old and you should definitely upgrade. Even
5.18.4 is somewhat old and was end-of-lifed. That put aside.

> 
> Am I right, and what can I do to get the code to work on my live server
> (other than upgrage perl)
> 
> 
> #!/usr/bin/perl -w

Don't use the -w flag:

http://perl-begin.org/tutorials/bad-elements/#the-dash-w-flag

> 
> # searches a series of OCR generated text files - one per page
> # looks for sets of regex's for field contents and stores in arrays
> 
> use warnings;
> use strict;

That's good.

> use Data::Dumper;
> 
> my %searches=('stock'=>[qr/\b([NU][LD] *\d{5})\b/],
>               'regno'=>[qr/\b([A-Za-z]{2}\d{2}[A-Za-z]{3})\b/],
>               'vin'=>[qr/\b(WF[0O]XX[A-Z]{6}\d{5}\b)/i,qr/\b([A-Z]
> {6}\d{5}\b)/i]);

The regex here got broken into two, possibly by your mailer. I hope it's OK in
the original code.

> my %found;
> my %values;
> foreach my $file (<*.txt>) {

1. Don't call variables "file":
http://perl-begin.org/tutorials/bad-elements/#calling-variables-file

2. It would be safer to do:

my @fns = glob('*.txt');

foreach my $filename (@fns) {
>   print "file.....$file\n";
>   if (!open FH,$file ) {
>     print "file open failed: $!\n";
>     next;
>   }

1. Maybe it's just me but I feel an indentation step of 2 spaces is too hard to
read. I recommend increasing it.

2. See http://perl-begin.org/tutorials/bad-elements/#open-function-style

>   my $content = do { local $/; <FH> };
>   close(FH);

You really should extract it into its own slurp function or use a slurp module
such as https://metacpan.org/pod/File::Slurper .

> # print "*****$content*********\n";
> 

You shouldn't put the comments right at the start of the line - indent them
with the rest of the code.

>   foreach my $field (keys %searches) { # foreach field

What does the comment add?

> #####THe following line is the one with the problem
>     if (my @matches = $content =~ @{$searches{$field}}) {

What do you want to achieve when using an array at the right hand side of =~ ?
Please use any/all/map/grep/none/notall from either perlfunc or
https://metacpan.org/pod/List::Util / https://metacpan.org/pod/List::MoreUtils

>       foreach (@matches) {

Don't iterate using $_ - use a lexical iterator:
http://perl-begin.org/tutorials/bad-elements/#overuse_dollar_underscore

>         $_=~s/ //g;
>         print STDERR "match found - '$field': '$_'\n";
>         if ($found{$field}{$_}) {
>           $found{$field}{$_}++;
>         } else {
>           $found{$field}{$_}=1;
>         }

This can be shortened to « $found{$field}{$_}=1; » - without the if/else. See
autovivification.

>       }
>     }
>   }
>   
> } # foreach page
> 
> 
> foreach my $field (keys %found) { # foreach field
>   my $value='';
>   my $count=0;
>   foreach my $key (keys %{$found{$field}}) { # foreach field -> value
>     $value=$key if ($found{$field}{$key} > $count);
>   }

shouldn't you use "first" from List::Util here?

Regards,

        Shlomi Fish

-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Parody of "The Fountainhead" - http://shlom.in/towtf

SMG: It was 1997-1998ish, Buffy started airing. So one day a group of
yeshivah pupils arrived to the studios saying they have some numereological
insights from the Jewish bible, about what will happen in Sunnydale next.
    — http://www.shlomifish.org/humour/Summerschool-at-the-NSA/

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/

Re: script to parse OCR output different between servers

Reply via email to