Re: script to parse OCR output different between servers

Gary Stainburn Wed, 22 Jul 2015 09:14:14 -0700

On Wednesday 22 July 2015 16:10:18 Shlomi Fish wrote:
> Hi Gary,
>
> some comments about your code.
>
> On Wed, 22 Jul 2015 15:32:33 +0100
>
> Gary Stainburn <gary.stainb...@ringways.co.uk> wrote:
> > I've written the code below to parse a number of text page files
> > generated by Tesseract OCR software to look for a guess the most likely
> > values for VIN, Reg number and stock number for a vehicle.
> >
> > On my development site it works fine and gives the values it should.
> > However, on my live server it only returns '1' as the value.
> >
> > The only thing I can think of is that my live server is much older than
> > my development server.  My perl versions are:
> >
> > Live                v5.10.0
> > Devel       v5.18.4
>
> version 5.10.0 of perl is really old and you should definitely upgrade.
> Even 5.18.4 is somewhat old and was end-of-lifed. That put aside.
>


That is the most recent version of perl installed on my Fedora boxes using yum 
update, which I try to avoid messing with.

I will at some point be building a new server, but that will be long and 
painful as the version changes for Postgresql, Perl and PHP will all be 
significant and need extensive testing with my services.

> > Am I right, and what can I do to get the code to work on my live server
> > (other than upgrage perl)
> >
> >
> > #!/usr/bin/perl -w
>
> Don't use the -w flag:
>
> http://perl-begin.org/tutorials/bad-elements/#the-dash-w-flag

I've removed that. I'm surprised by this though, as when I used to write Perl 
regularly, I'm sure that was the norm.  As you may tell, I haven't written 
perl for a while.

>
> > # searches a series of OCR generated text files - one per page
> > # looks for sets of regex's for field contents and stores in arrays
> >
> > use warnings;
> > use strict;
>
> That's good.
>
> > use Data::Dumper;
> >
> > my %searches=('stock'=>[qr/\b([NU][LD] *\d{5})\b/],
> >               'regno'=>[qr/\b([A-Za-z]{2}\d{2}[A-Za-z]{3})\b/],
> >               'vin'=>[qr/\b(WF[0O]XX[A-Z]{6}\d{5}\b)/i,qr/\b([A-Z]
> > {6}\d{5}\b)/i]);
>
> The regex here got broken into two, possibly by your mailer. I hope it's OK
> in the original code.

Yes, this was a split by the email program. The original is all on one line

>
> > my %found;
> > my %values;
> > foreach my $file (<*.txt>) {
>
> 1. Don't call variables "file":
> http://perl-begin.org/tutorials/bad-elements/#calling-variables-file

Fair enough, but that isn't the problem

>
> 2. It would be safer to do:
>
> my @fns = glob('*.txt');

I can see how this is tidier code.

>
> foreach my $filename (@fns) {
>
> >   print "file.....$file\n";
> >   if (!open FH,$file ) {
> >     print "file open failed: $!\n";
> >     next;
> >   }
>
> 1. Maybe it's just me but I feel an indentation step of 2 spaces is too
> hard to read. I recommend increasing it.

It's just personal preference. I've found that using 2 spaces makes my code 
easier to read when you have 3 or more indents, loops, if's, etc.  Using Tab 
was a farse as the code ended up starting half way across the screen.

>
> 2. See http://perl-begin.org/tutorials/bad-elements/#open-function-style
>
> >   my $content = do { local $/; <FH> };
> >   close(FH);
>
> You really should extract it into its own slurp function or use a slurp
> module such as https://metacpan.org/pod/File::Slurper .

I made the choice to not rely on modules to make the code portable - or so I 
thought.
>
> > # print "*****$content*********\n";

placing a # in column 1 to remove debug code etc. makes it easier to search in 
gvim. Again, this isn't the issue

>
> You shouldn't put the comments right at the start of the line - indent them
> with the rest of the code.
>
> >   foreach my $field (keys %searches) { # foreach field
>
> What does the comment add?
>

The improvement in variable names made the comment redundant

> > #####THe following line is the one with the problem
> >     if (my @matches = $content =~ @{$searches{$field}}) {
>
> What do you want to achieve when using an array at the right hand side of
> =~ ? Please use any/all/map/grep/none/notall from either perlfunc or
> https://metacpan.org/pod/List::Util /
> https://metacpan.org/pod/List::MoreUtils
>

This statement was supposed to return an array of matches found by applying 
multiple regex's to the string $content. This is where the problems appear.  
On the devel box this works, whereas on the older box it returns '1' which 
I'm guessing is just 'true'

> >       foreach (@matches) {
>
> Don't iterate using $_ - use a lexical iterator:
> http://perl-begin.org/tutorials/bad-elements/#overuse_dollar_underscore
>
> >         $_=~s/ //g;
> >         print STDERR "match found - '$field': '$_'\n";
> >         if ($found{$field}{$_}) {
> >           $found{$field}{$_}++;
> >         } else {
> >           $found{$field}{$_}=1;
> >         }
>
> This can be shortened to « $found{$field}{$_}=1; » - without the if/else.
> See autovivification.

That would not give me a tally which is what I'm after.
I've changed it to just have:

$found{$field}{$_}++;

which works in perl. I wrote it long hand initialy, because in PHP which I use 
most, that throws an exception.

>
> >       }
> >     }
> >   }
> >
> > } # foreach page
> >
> >
> > foreach my $field (keys %found) { # foreach field
> >   my $value='';
> >   my $count=0;
> >   foreach my $key (keys %{$found{$field}}) { # foreach field -> value
> >     $value=$key if ($found{$field}{$key} > $count);
> >   }
>
> shouldn't you use "first" from List::Util here?

How would this select the key in %found{$field} with the highest tally?

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/

Re: script to parse OCR output different between servers

Reply via email to