Re: script to parse OCR output different between servers

Shlomi Fish Wed, 22 Jul 2015 11:33:08 -0700

Hi Gary,

see below for my comments.


On Wed, 22 Jul 2015 17:11:29 +0100
Gary Stainburn <gary.stainb...@ringways.co.uk> wrote:

> On Wednesday 22 July 2015 16:10:18 Shlomi Fish wrote:
> > Hi Gary,
> >
> > some comments about your code.
> >
> > On Wed, 22 Jul 2015 15:32:33 +0100
> >
> > Gary Stainburn <gary.stainb...@ringways.co.uk> wrote:
> > > I've written the code below to parse a number of text page files
> > > generated by Tesseract OCR software to look for a guess the most likely
> > > values for VIN, Reg number and stock number for a vehicle.
> > >
> > > On my development site it works fine and gives the values it should.
> > > However, on my live server it only returns '1' as the value.
> > >
> > > The only thing I can think of is that my live server is much older than
> > > my development server.  My perl versions are:
> > >
> > > Live              v5.10.0
> > > Devel     v5.18.4
> >
> > version 5.10.0 of perl is really old and you should definitely upgrade.
> > Even 5.18.4 is somewhat old and was end-of-lifed. That put aside.
> >
> 
> That is the most recent version of perl installed on my Fedora boxes using
> yum update, which I try to avoid messing with.
> 

You can try using perlbrew - http://perlbrew.pl/ .

> I will at some point be building a new server, but that will be long and 
> painful as the version changes for Postgresql, Perl and PHP will all be 
> significant and need extensive testing with my services.

Good luck!

> 
> > > Am I right, and what can I do to get the code to work on my live server
> > > (other than upgrage perl)
> > >
> > >
> > > #!/usr/bin/perl -w
> >
> > Don't use the -w flag:
> >
> > http://perl-begin.org/tutorials/bad-elements/#the-dash-w-flag
> 
> I've removed that. I'm surprised by this though, as when I used to write Perl 
> regularly, I'm sure that was the norm.  As you may tell, I haven't written 
> perl for a while.
> 

Well, the -w flag has been largely superseded by "use warnings;" for many years
now. The state-of-the-art can progress sometimes.

> >
> > > # searches a series of OCR generated text files - one per page
> > > # looks for sets of regex's for field contents and stores in arrays
> > >
> > > use warnings;
> > > use strict;
> >
> > That's good.
> >
> > > use Data::Dumper;
> > >
> > > my %searches=('stock'=>[qr/\b([NU][LD] *\d{5})\b/],
> > >               'regno'=>[qr/\b([A-Za-z]{2}\d{2}[A-Za-z]{3})\b/],
> > >               'vin'=>[qr/\b(WF[0O]XX[A-Z]{6}\d{5}\b)/i,qr/\b([A-Z]
> > > {6}\d{5}\b)/i]);
> >
> > The regex here got broken into two, possibly by your mailer. I hope it's OK
> > in the original code.
> 
> Yes, this was a split by the email program. The original is all on one line
> 

OK, good.

> >
> > > my %found;
> > > my %values;
> > > foreach my $file (<*.txt>) {
> >
> > 1. Don't call variables "file":
> > http://perl-begin.org/tutorials/bad-elements/#calling-variables-file
> 
> Fair enough, but that isn't the problem
> 
> >
> > 2. It would be safer to do:
> >
> > my @fns = glob('*.txt');
> 
> I can see how this is tidier code.
> 
> >
> > foreach my $filename (@fns) {
> >
> > >   print "file.....$file\n";
> > >   if (!open FH,$file ) {
> > >     print "file open failed: $!\n";
> > >     next;
> > >   }
> >
> > 1. Maybe it's just me but I feel an indentation step of 2 spaces is too
> > hard to read. I recommend increasing it.
> 
> It's just personal preference. I've found that using 2 spaces makes my code 
> easier to read when you have 3 or more indents, loops, if's, etc.  Using Tab 
> was a farse as the code ended up starting half way across the screen.
> 

I tend to prefer a 4 spaces indent step which is a good compromise. If you have
too many indent step, it is often an indication that you should extract a
function or a method.

> >
> > 2. See http://perl-begin.org/tutorials/bad-elements/#open-function-style
> >
> > >   my $content = do { local $/; <FH> };
> > >   close(FH);
> >
> > You really should extract it into its own slurp function or use a slurp
> > module such as https://metacpan.org/pod/File::Slurper .
> 
> I made the choice to not rely on modules to make the code portable - or so I 
> thought.

See http://perl-begin.org/topics/cpan/ - modules won't likely make your code
less portable - just somewhat harder to deploy , and that is only if you're not
using the right tool.

And you can always at least follow my first suggestion and write:

sub slurp
{
        .
        .
        .
}

my $content = slurp( ... );

> >
> > > # print "*****$content*********\n";
> 
> placing a # in column 1 to remove debug code etc. makes it easier to search
> in gvim. Again, this isn't the issue
> 
> >
> > You shouldn't put the comments right at the start of the line - indent them
> > with the rest of the code.
> >
> > >   foreach my $field (keys %searches) { # foreach field
> >
> > What does the comment add?
> >
> 
> The improvement in variable names made the comment redundant
> 

Ah.

> > > #####THe following line is the one with the problem
> > >     if (my @matches = $content =~ @{$searches{$field}}) {
> >
> > What do you want to achieve when using an array at the right hand side of
> > =~ ? Please use any/all/map/grep/none/notall from either perlfunc or
> > https://metacpan.org/pod/List::Util /
> > https://metacpan.org/pod/List::MoreUtils
> >
> 
> This statement was supposed to return an array of matches found by applying 
> multiple regex's to the string $content. This is where the problems appear.  
> On the devel box this works, whereas on the older box it returns '1' which 
> I'm guessing is just 'true'
> 

In that case, you should try using map: (untested):

if (my @matches = map { $content =~ $_ } @{$searches{$field}}) {

Do you also want a /$_/g here? Not sure.

> > >       foreach (@matches) {
> >
> > Don't iterate using $_ - use a lexical iterator:
> > http://perl-begin.org/tutorials/bad-elements/#overuse_dollar_underscore
> >
> > >         $_=~s/ //g;
> > >         print STDERR "match found - '$field': '$_'\n";
> > >         if ($found{$field}{$_}) {
> > >           $found{$field}{$_}++;
> > >         } else {
> > >           $found{$field}{$_}=1;
> > >         }
> >
> > This can be shortened to « $found{$field}{$_}=1; » - without the if/else.
> > See autovivification.
> 
> That would not give me a tally which is what I'm after.
> I've changed it to just have:
> 
> $found{$field}{$_}++;
> 

Yes, that's what I meant - sorry. It was a thinko on my part.

> which works in perl. I wrote it long hand initialy, because in PHP which I
> use most, that throws an exception.
> 

Ah. 

> >
> > >       }
> > >     }
> > >   }
> > >
> > > } # foreach page
> > >
> > >
> > > foreach my $field (keys %found) { # foreach field
> > >   my $value='';
> > >   my $count=0;
> > >   foreach my $key (keys %{$found{$field}}) { # foreach field -> value
> > >     $value=$key if ($found{$field}{$key} > $count);
> > >   }
> >
> > shouldn't you use "first" from List::Util here?
> 
> How would this select the key in %found{$field} with the highest tally?
> 

I don't see where you select the highest tally here and you can try using
max/max_by . $value is always overrided with $key if it's bigger than zero.

Finally note that you should do «my $ff = $found{$field};» and then use %$ff
because you're using it several times.

Regards,

        Shlomi Fish


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Original Riddles - http://www.shlomifish.org/puzzles/

In Soviet Russia, joke is tired of you.
    — apeiron, #perl

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/

Re: script to parse OCR output different between servers

Reply via email to