I have a question about using Tesseract to try to recover the source code 
from a printed listing that most likely came off a line printer in the 
early 70s, was copied by photocopier at some point, and was then more 
recently scanned with a modern digital scanner. 

I have two copies of the document. One is the original scan and the other 
was recently scanned for me by the archives department of the university 
that houses the document. Unfortunately both have different problems! 

Here are two sample images of the same content from the two documents: 

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/ocr_work/output-111.png?ref_type=heads

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/new_scan/output-107.png?ref_type=heads

Now, some things were in my favour. It is computer code, so much of it I 
was able to guess through human translation. It is a limited subset of the 
English language (the program is written in Fortran IV) and certain 
combinations are repeated over and over. 

The start of my human translation is here: 

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/fortran.code?ref_type=heads

What is proving to be more of a problem is the numerical content of the 
document. The code uses lots of GOTO statements that jump to numeric 
statement labels in the left-most column, and it uses a similar 
numeric-reference arrangement for the FORMAT statements behind its input 
and output. 

To have any hope of recovering this code I really need a way to recover 
the numerical information, most particularly the numbers in the left-most 
column. 
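
For that label column specifically, this is roughly what I was picturing 
trying in Python (just a rough sketch, assuming pytesseract and Pillow; the 
filename and crop coordinates are placeholders, and I am not sure how the 
digit whitelist behaves across the different Tesseract versions): 

from PIL import Image
import pytesseract

page = Image.open("output-111.png")  # placeholder filename

# Crop roughly where the left-most statement-label column sits
# (these x coordinates are guesses; I would measure the real page).
labels = page.crop((0, 0, 120, page.height))

# Restrict Tesseract to digits and treat the crop as one block of text.
config = "--psm 6 -c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(labels, config=config))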

So I wonder what a good approach with Tesseract would be? I've tried to 
watch some tutorials and to read the docs. I am comfortable using Python, 
but I am not entirely sure whether Tesseract fits this use case. 

Months ago I started to build a project using Tesseract but got confused by 
the different versions available. This is what I thought I should do: 

- First, build up an image set of as many different instances of the 
digits 0-9 as I can pick out of one of the documents. 

- Put those into an image grid, then use that image set (with a script I 
will write in Python) to generate as much sample image data, with matching 
ground-truth text, as possible. There is a rough sketch of what I mean 
after this list. 

- Then this is where I start to draw a bit of a blank ... :) 
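
For the generation step in the second bullet, this is the rough shape of 
the script I had in mind (only a sketch, assuming Pillow, a hypothetical 
digits/ folder of crops named after the digit they contain with at least 
one crop per digit, and that pairing each line image with a .gt.txt file is 
what the training tooling wants, which I may have wrong): 

import os
import random
from PIL import Image

DIGIT_DIR = "digits"      # hypothetical folder of crops, e.g. digits/3_a.png
OUT_DIR = "train_lines"   # where generated line images and ground truth go
os.makedirs(OUT_DIR, exist_ok=True)

# Group the crops by the digit they show, taken from the filename's first character.
glyphs = {str(d): [] for d in range(10)}
for name in os.listdir(DIGIT_DIR):
    if name[0] in glyphs:
        glyphs[name[0]].append(Image.open(os.path.join(DIGIT_DIR, name)))

for i in range(1000):
    # Make a random 1-5 digit label, like the statement numbers in the listing.
    label = "".join(random.choice("0123456789") for _ in range(random.randint(1, 5)))
    crops = [random.choice(glyphs[d]) for d in label]

    # Paste the chosen glyph variants side by side onto a white strip.
    height = max(c.height for c in crops)
    width = sum(c.width for c in crops)
    line = Image.new("L", (width, height), 255)
    x = 0
    for c in crops:
        line.paste(c.convert("L"), (x, 0))
        x += c.width

    line.save(os.path.join(OUT_DIR, f"line_{i:04d}.png"))
    with open(os.path.join(OUT_DIR, f"line_{i:04d}.gt.txt"), "w") as f:
        f.write(label + "\n")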

I'd be grateful for any suggestions as to what the best approach is ... ! 

Thanks
