I have a question about using Tesseract to recover the source code from a printed listing. The listing most likely came off a line printer in the early 70s, was probably first copied on a photocopier, and was more recently scanned with a modern digital scanner.
I have two copies of the document: the original scan, and another that was recently made for me by the archive department of the university that houses the document. Unfortunately, each has different problems! Here are two sample images of the same content from the two documents:

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/ocr_work/output-111.png?ref_type=heads
https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/new_scan/output-107.png?ref_type=heads

Some things were in my favour. It is computer code, so I was able to work out much of it by human transcription: it is a limited subset of English (written in Fortran IV), and certain combinations are repeated over and over. The start of my human transcription is here:

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/fortran.code?ref_type=heads

What is proving to be more of a problem is the numerical content of the document. The code uses lots of GOTO statements that jump to numerical statement labels in the left-most column, and it uses a similar arrangement of numerical labels for the FORMAT references used by its input and output statements. To have any hope of recovering this code I really need a way to recover that numerical information, most particularly the numbers in the left-most column.

So I wonder what a good approach with Tesseract would be? I've tried to watch some tutorials and to read the docs. I am comfortable with Python, but I am not entirely sure whether Tesseract fits this use case. Months ago I started to build a project using Tesseract but got confused by the different versions available. This is what I thought I should do:

- First, build up an image set of as many different versions of the digits 0-9 as I can pick out of one of the documents, and put those into an image grid.
- Then use that image set (with a script I will write in Python) to generate as much sample image data as possible, with matching text transcriptions.
- Then... this is where I start to draw a bit of a blank :)

I'd be grateful for any suggestions as to what the best approach would be! Thanks.
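To make the question concrete, here is the rough first attempt I was sketching for reading just the statement-label column. This is only a minimal sketch, assuming pytesseract and Pillow are installed; the filename and the column width are guesses for my scans, and the digit whitelist / page-segmentation settings are just what I understood from the docs:

```python
def digit_ocr_config(psm=6):
    # Build a Tesseract config string that restricts recognition to the
    # digits 0-9. psm 6 means "assume a single uniform block of text".
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def read_label_column(path, column_width=60):
    """Crop the left-most column of a scanned listing page and OCR it
    as digits only. column_width is a placeholder value for my scans."""
    # Imports are kept local so digit_ocr_config() above works even
    # without Pillow/pytesseract installed.
    from PIL import Image
    import pytesseract

    page = Image.open(path).convert("L")  # greyscale
    column = page.crop((0, 0, column_width, page.height))
    return pytesseract.image_to_string(column, config=digit_ocr_config())


if __name__ == "__main__":
    # Hypothetical filename, one of my sample pages.
    print(read_label_column("output-111.png"))
```

Is restricting the character set like this a sensible way to attack the label column, or is there a better mechanism for that in current Tesseract versions?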
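And for the sample-generation step of my plan, this is roughly the generator I had in mind: paste cropped digit glyphs side by side into synthetic label images, each paired with a ground-truth text file. Again only a sketch; the per-digit glyph filenames (`0.png` ... `9.png`) and the image/`.gt.txt` pairing convention are my assumptions about how training data is usually laid out:

```python
import random


def random_statement_label(rng, max_digits=4):
    # Fortran statement labels are small positive integers; pick a
    # random one up to max_digits digits long.
    return str(rng.randint(1, 10 ** max_digits - 1))


def ground_truth_pair(stem):
    # Naming convention I am assuming: sample image plus a matching
    # .gt.txt file holding its transcription.
    return f"{stem}.png", f"{stem}.gt.txt"


def render_sample(label, glyph_dir, out_stem):
    """Compose a synthetic label image from cropped digit glyphs and
    write the image together with its ground-truth transcription."""
    # Local imports so the pure helpers above don't require Pillow.
    import os
    from PIL import Image

    # Hypothetical layout: one cropped glyph image per digit, e.g. "7.png".
    glyphs = [Image.open(os.path.join(glyph_dir, f"{d}.png")) for d in label]
    width = sum(g.width for g in glyphs)
    height = max(g.height for g in glyphs)
    line = Image.new("L", (width, height), 255)  # white background
    x = 0
    for g in glyphs:
        line.paste(g, (x, 0))
        x += g.width
    img_name, gt_name = ground_truth_pair(out_stem)
    line.save(img_name)
    with open(gt_name, "w") as f:
        f.write(label + "\n")
```

Does generating pairs like this feed usefully into Tesseract training, or am I heading down the wrong path with it?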