I have a question about using Tesseract to recover the source code from a printed listing. The listing most likely came off a line printer in the early 70s, was probably first copied on a photocopier, and was more recently scanned with a modern digital scanner.
I have two copies of the document: the original scan, and another that was recently made for me by the archive department of the university that houses the document. Unfortunately, each has different problems! Here are two sample images of the same content from the two documents:

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/ocr_work/output-111.png?ref_type=heads
https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/new_scan/output-107.png?ref_type=heads

Some things were in my favour. It is computer code, so I was able to work out much of it by human transcription: it is a limited subset of English (written in Fortran IV), and certain combinations are repeated over and over. The start of my human transcription is here:

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/fortran.code?ref_type=heads

What is proving to be more of a problem is the numerical content of the document. The code uses lots of GOTO statements that jump to numerical statement labels in the left-most column, and it uses a similar arrangement of numerical labels for the FORMAT references used by its input and output statements. To have any hope of recovering this code I really need a way to recover that numerical information, most particularly the numbers in the left-most column.

So I wonder what a good approach with Tesseract would be? I've tried to watch some tutorials and to read the docs. I am comfortable with Python, but I am not entirely sure whether Tesseract fits this use case. Months ago I started to build a project using Tesseract but got confused by the different versions available. This is what I thought I should do:

- First, build up an image set of as many different versions of the digits 0-9 as I can pick out of one of the documents, and put those into an image grid.
- Then use that image set (with a script I will write in Python) to generate as much sample image data as possible, with matching text transcriptions.
- Then... this is where I start to draw a bit of a blank :)

I'd be grateful for any suggestions as to what the best approach would be! Thanks.
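To make the question concrete, here is the rough first attempt I was sketching for reading just the statement-label column. This is only a minimal sketch, assuming pytesseract and Pillow are installed; the filename and the column width are guesses for my scans, and the digit whitelist / page-segmentation settings are just what I understood from the docs:

```python
def digit_ocr_config(psm=6):
    # Build a Tesseract config string that restricts recognition to the
    # digits 0-9. psm 6 means "assume a single uniform block of text".
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def read_label_column(path, column_width=60):
    """Crop the left-most column of a scanned listing page and OCR it
    as digits only. column_width is a placeholder value for my scans."""
    # Imports are kept local so digit_ocr_config() above works even
    # without Pillow/pytesseract installed.
    from PIL import Image
    import pytesseract

    page = Image.open(path).convert("L")  # greyscale
    column = page.crop((0, 0, column_width, page.height))
    return pytesseract.image_to_string(column, config=digit_ocr_config())


if __name__ == "__main__":
    # Hypothetical filename, one of my sample pages.
    print(read_label_column("output-111.png"))
```

Is restricting the character set like this a sensible way to attack the label column, or is there a better mechanism for that in current Tesseract versions?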
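And for the sample-generation step of my plan, this is roughly the generator I had in mind: paste cropped digit glyphs side by side into synthetic label images, each paired with a ground-truth text file. Again only a sketch; the per-digit glyph filenames (`0.png` ... `9.png`) and the image/`.gt.txt` pairing convention are my assumptions about how training data is usually laid out:

```python
import random


def random_statement_label(rng, max_digits=4):
    # Fortran statement labels are small positive integers; pick a
    # random one up to max_digits digits long.
    return str(rng.randint(1, 10 ** max_digits - 1))


def ground_truth_pair(stem):
    # Naming convention I am assuming: sample image plus a matching
    # .gt.txt file holding its transcription.
    return f"{stem}.png", f"{stem}.gt.txt"


def render_sample(label, glyph_dir, out_stem):
    """Compose a synthetic label image from cropped digit glyphs and
    write the image together with its ground-truth transcription."""
    # Local imports so the pure helpers above don't require Pillow.
    import os
    from PIL import Image

    # Hypothetical layout: one cropped glyph image per digit, e.g. "7.png".
    glyphs = [Image.open(os.path.join(glyph_dir, f"{d}.png")) for d in label]
    width = sum(g.width for g in glyphs)
    height = max(g.height for g in glyphs)
    line = Image.new("L", (width, height), 255)  # white background
    x = 0
    for g in glyphs:
        line.paste(g, (x, 0))
        x += g.width
    img_name, gt_name = ground_truth_pair(out_stem)
    line.save(img_name)
    with open(gt_name, "w") as f:
        f.write(label + "\n")
```

Does generating pairs like this feed usefully into Tesseract training, or am I heading down the wrong path with it?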