Re: [tesseract-ocr] Using Tesseract on Fortran code from late 60's

Graham Toal Tue, 11 Feb 2025 17:09:31 -0800

On Tue, Feb 11, 2025 at 5:52 PM Mixotricha <connolly.dam...@gmail.com>
wrote:


> Thanks that is a really helpful link. Unfortunately I do not have much
> chance of getting better documents. The second scan came from a helpful
> archivist at an installation that requires a classification to enter.
> Otherwise I would literally get on a plane and go and look myself. I was
> gratified that they were as helpful as they were. Really the halting point
> in this translation is not the human words. It is the jump vectors ( the
> goto statements ) and so now I am back to seeing if I can figure out some
> sort of relationship in the jump vectors in the left hand column.
> Unfortunately they do not match the line numbers on the right hand side.
> But maybe I have just not figured out what that relationship might be.
> Basically back to searching for context. Some other things in my favour are
> that the thesis itself is an excellent piece of work really well explained
> and has what are basically unit tests included that are themselves quite
> legible. I feel getting this code back is right on the edge of possibility
> if I just think about it a bit more.
>

I sympathise on the access problem - we submitted a bunch of listings and
docs to our local museum for safe keeping and haven't seen them since.  I
guess they're being kept very safe :-/

But don't give up hope on getting better access.  I was quite impressed
that the folks working on restoring the Bloodhound at bmpg.org.uk were able
to get access to the original Coral 66 source code.  I myself managed to get
the MOD's Defence Procurement Agency to give me permission to post the
Coral 66 manual, just by asking via the contact page at the HMSO.  So you
never know... sometimes these people can be surprisingly reasonable.

So my fixed-pitch stuff isn't going to help you.  I have two other
suggestions: 1) classic re-keying by 2 or 3 independent people.  (If 2,
then someone has to go over the differences and explicitly make a
selection; if 3, use a 2-out-of-3 consensus to pick the preferred version.
Neither is foolproof, but both considerably lower the rate of errors.)  And
2) there's some experimental dewarping software worth trying, such as
https://mzucker.github.io/2016/08/15/page-dewarping.html which might be
better than the sort of software used in things like CZUR scanners, which
assume a very specific model of a V-shaped spine between pages of a book.
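
To make the 2-out-of-3 idea concrete, here is a rough sketch in Python.  It
is only an illustration, not a tool I have used on your listings.  The
function name `consensus` is made up, and it assumes the three typists
produced the same number of lines so the keyings can be compared line by
line; real keyings would probably need aligning first (e.g. with diff).

```python
from collections import Counter


def consensus(versions):
    """Merge independent keyings of the same listing by majority vote.

    `versions` is a list of transcriptions, each a list of lines.
    Assumption: every transcription has the same number of lines.
    Returns (merged_lines, disputed_line_numbers).
    """
    lengths = {len(v) for v in versions}
    if len(lengths) != 1:
        raise ValueError("transcriptions must have the same number of lines")

    merged, disputed = [], []
    for lineno, candidates in enumerate(zip(*versions), start=1):
        text, votes = Counter(candidates).most_common(1)[0]
        if votes >= 2:
            # At least two typists agree: accept their version.
            merged.append(text)
        else:
            # No agreement: keep the first keying as a placeholder
            # and flag the line for a human to resolve.
            merged.append(candidates[0])
            disputed.append(lineno)
    return merged, disputed
```

With only two keyings no line can be settled by a third vote, so every line
where they differ comes back in the disputed list, which is exactly the
"someone has to go over the differences" case.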

Looking at your hand-tidied source I would expect that a custom Fortran
parser could find a lot of corrections, simply by keeping a name and
frequency table of variables - to catch things like CCMREG vs COMREG, for
example, and automatically suggest the preferred version.  I found that a
hacked-up parser for Algol 60 was extremely helpful at that sort of
correction, leaving only a few minor errors to catch using a real compiler
once the sources were cleaned up enough to be compilable.
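
As a rough sketch of the name and frequency table idea (this is not my
Algol 60 parser, and it is a crude tokenizer rather than a real Fortran
parser): count every identifier, then for each name that occurs only once
look for a similar name that occurs more often.  The function name
`suggest_corrections`, the keyword list, the `rare` threshold and the 0.8
similarity cutoff are all assumptions picked for illustration.  It also
assumes that lines with C in column 1 are comments.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Partial keyword list, for illustration only.
KEYWORDS = {
    "GO", "TO", "GOTO", "IF", "DO", "CALL", "CONTINUE", "RETURN", "END",
    "DIMENSION", "COMMON", "SUBROUTINE", "FUNCTION", "FORMAT", "READ",
    "WRITE", "STOP",
}


def suggest_corrections(lines, rare=1, cutoff=0.8):
    """Map each rarely seen identifier to a similar, more frequent one."""
    counts = Counter(
        name
        for line in lines
        if line[:1].upper() != "C"  # assumed comment marker in column 1
        for name in re.findall(r"[A-Z][A-Z0-9]*", line.upper())
        if name not in KEYWORDS
    )
    # Names seen more than `rare` times are treated as probably correct.
    common = [name for name, n in counts.items() if n > rare]

    suggestions = {}
    for name, n in counts.items():
        if n <= rare:
            match = get_close_matches(name, common, n=1, cutoff=cutoff)
            if match:
                suggestions[name] = match[0]
    return suggestions


listing = [
    "C     TOY EXAMPLE, NOT FROM THE THESIS",
    "      COMREG = 1",
    "      X = COMREG + 2",
    "      Y = CCMREG",
    "      CALL SUB(COMREG)",
]
print(suggest_corrections(listing))
```

A person still has to confirm each suggestion, since two genuinely
different variables can have similar names.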

Good luck with your project.

G
