Re: [tesseract-ocr] Using Tesseract on Fortran code from late 60's

jesterjunk Tue, 11 Feb 2025 23:58:07 -0800

Howdy Mixotricha,


I just happened upon your post and thought that I would share this 
playlist, as it is a deep dive into a lot of the complexities of OCR.

Preprocessing is a major factor in getting optimal OCR results, which is why 
I put the title of that video in bold below.

OCR in Python 
https://www.youtube.com/playlist?list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez 

01  12:08  Introduction to OCR (OCR in Python Tutorials 01.01)
    <https://www.youtube.com/watch?v=tQGgGY8mTP0&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=1>
02  11:14  How to Install the Libraries (OCR in Python Tutorials 01.02)
    <https://www.youtube.com/watch?v=89m89vVh4wg&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=2>
03   7:46  How to Open an Image in Python with PIL (Pillow) (OCR in Python 02.01)
    <https://www.youtube.com/watch?v=UxYJxcdLrs0&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=3>
04  53:24  *How to Preprocess Images for Text OCR in Python* (OCR in Python Tutorials 02.02)
    <https://www.youtube.com/watch?v=ADV-AjAXHdc&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=4>
05   6:18  Introduction to PyTesseract (OCR in Python Tutorials 02.03)
    <https://www.youtube.com/watch?v=4uWp6dS6_G4&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=5>
06   5:37  How to OCR an Index in Python with PyTesseract (OCR in Python Tutorials 03.01)
    <https://www.youtube.com/watch?v=DXYPXZH2eGE&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=6>
07  18:27  How to use Bounding Boxes with OpenCV (OCR in Python Tutorials 03.02)
    <https://www.youtube.com/watch?v=9FCw1xo_s0I&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=7>
08  12:58  How to Create a List of Named Entities from an Index with OpenCV (OCR in Python Tutorials 03.03)
    <https://www.youtube.com/watch?v=y1iw8c2CEgw&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=8>
09  15:48  How to OCR a Text with Marginalia by Extracting the Body (OCR in Python Tutorials 04.01)
    <https://www.youtube.com/watch?v=DV5c9qHv0NQ&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=9>
10   7:14  How to Separate a Footnote from Body Text in Python with OpenCV
    <https://www.youtube.com/watch?v=ZeCRe9sNFwk&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=10>


Remember to Breathe,
jesterjunk

On Tuesday, February 11, 2025 at 7:09:40 PM UTC-6 Graham Toal wrote:

> On Tue, Feb 11, 2025 at 5:52 PM Mixotricha <connoll...@gmail.com> wrote:
>
>> Thanks that is a really helpful link. Unfortunately I do not have much 
>> chance of getting better documents. The second scan came from a helpful 
>> archivist at an installation that requires a classification to enter. 
>> Otherwise I would literally get on a plane and go and look myself. I was 
>> gratified that they were as helpful as they were. Really the halting point 
>> in this translation is not the human words. It is the jump vectors ( the 
>> goto statements ) and so now I am back to seeing if I can figure out some 
>> sort of relationship in the jump vectors in the left hand column. 
>> Unfortunately they do not match the line numbers on the right hand side. 
>> But maybe I have just not figured out what that relationship might be. 
>> Basically back to searching for context. Some other things in my favour are 
>> that the thesis itself is an excellent piece of work really well explained 
>> and has what are basically unit tests included that are themselves quite 
>> legible. I feel getting this code back is right on the edge of possibility 
>> if I just think about it a bit more. 
>>
>
> I sympathise on the access problem - we submitted a bunch of listings and 
> docs to our local museum for safe keeping and haven't seen them since.  I 
> guess they're being kept very safe :-/
>
> But don't give up hope on getting better access.  I was quite impressed 
> that the folks working on restoring the Bloodhound at bmpg.org.uk were 
> able to get access to the original Coral 66 source code.  I myself managed 
> to get the MOD's Defence Procurement Agency to give me permission to post 
> the Coral 66 manual, just by asking via the contact page at the HMSO.  So 
> you never know... sometimes these people can be surprisingly reasonable.
>
> So my fixed-pitch stuff isn't going to help you.  I have two other 
> suggestions: 1) classic re-keying by 2 or 3 independent people (if 2, 
> then someone has to go over the differences and explicitly make a 
> selection; if 3, use a 2 out of 3 consensus to pick the preferred version; 
> neither is foolproof, but both considerably lower the error rate); and 
> 2) there's some experimental dewarping software worth trying, such as 
> https://mzucker.github.io/2016/08/15/page-dewarping.html which might be 
> better than the sort of software used in things like CZUR scanners that 
> have a very specific model of a V-shaped spine between pages of a book.
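
The re-keying suggestion quoted above (2 or 3 independent typists, with a 2 out of 3 consensus) can be sketched in a few lines of Python. This is only an illustration, not anyone's actual tooling: the function name consensus_lines and the sample strings are made up, and it assumes the transcriptions are already line-aligned (a dropped or extra line in one copy would throw everything after it off, so a real tool would want a proper diff first).

```python
from collections import Counter
from itertools import zip_longest


def consensus_lines(transcriptions):
    """Merge independently keyed transcriptions line by line.

    Returns (merged, unresolved). merged holds the majority line, or None
    where no strict majority exists; unresolved holds the 1-based line
    numbers that a human still has to adjudicate.
    """
    split = [text.splitlines() for text in transcriptions]
    merged, unresolved = [], []
    for lineno, variants in enumerate(zip_longest(*split, fillvalue=""), start=1):
        # Ignore trailing whitespace only; leading columns are significant
        # in fixed-form Fortran, so they take part in the comparison.
        counts = Counter(v.rstrip() for v in variants)
        line, votes = counts.most_common(1)[0]
        if votes * 2 > len(variants):  # strict majority (2 of 2, 2 of 3, ...)
            merged.append(line)
        else:
            merged.append(None)
            unresolved.append(lineno)
    return merged, unresolved


# Hypothetical sample: three typists, two different single-character slips.
a = "      COMREG = 1\n      GO TO 120\n"
b = "      CCMREG = 1\n      GO TO 120\n"
c = "      COMREG = 1\n      GO TO 12O\n"
merged, unresolved = consensus_lines([a, b, c])
```

With only two transcriptions, any disagreement fails the strict-majority test and lands in unresolved, which matches the "someone has to go over the differences" case.
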
>
> Looking at your hand-tidied source I would expect that a custom Fortran 
> parser could find a lot of corrections, simply by keeping a name and 
> frequency table of variables - to catch things like CCMREG vs COMREG, for 
> example, and automatically suggest the preferred version.  I found that a 
> hacked-up parser for Algol 60 was extremely helpful at that sort of 
> correction, leaving only a few minor errors to catch using a real compiler 
> once the sources were cleaned up enough to be compilable.
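
A rough Python sketch of the name-and-frequency-table idea quoted above. It is not a Fortran parser, just a regex tokenizer plus difflib from the standard library, and everything in it is an assumption for illustration: the function name suggest_corrections, the deliberately incomplete keyword list, the thresholds, and the sample listing. It flags rarely seen identifiers that closely resemble a frequently seen one.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Identifiers: a letter followed by letters/digits, delimited by word boundaries.
IDENT = re.compile(r"\b[A-Z][A-Z0-9]*\b")

# Hypothetical and incomplete keyword list; extend it for the dialect at hand.
KEYWORDS = {
    "GO", "TO", "GOTO", "IF", "DO", "CALL", "CONTINUE", "RETURN", "END",
    "STOP", "SUBROUTINE", "FUNCTION", "DIMENSION", "COMMON", "FORMAT",
    "READ", "WRITE",
}


def suggest_corrections(source, max_rare=1, cutoff=0.8):
    """Map each rare identifier to the most similar frequent identifier."""
    counts = Counter()
    for line in source.upper().splitlines():
        if line[:1] in ("C", "*"):  # fixed-form comment line, skip it
            continue
        counts.update(n for n in IDENT.findall(line) if n not in KEYWORDS)

    frequent = [name for name, seen in counts.items() if seen > max_rare]
    suggestions = {}
    for name, seen in counts.items():
        if seen <= max_rare:
            match = get_close_matches(name, frequent, n=1, cutoff=cutoff)
            if match:
                suggestions[name] = match[0]
    return suggestions


# Hypothetical listing with one OCR-style slip (CCMREG for COMREG).
listing = """\
C     COMREG HOLDS THE RUNNING TOTAL
      COMREG = 0
      DO 10 I = 1, N
      COMREG = COMREG + I
   10 CONTINUE
      CCMREG = COMREG * 2
"""
suggestions = suggest_corrections(listing)
print(suggestions)  # {'CCMREG': 'COMREG'}
```

The output is a list of suggestions to review by hand rather than corrections to apply blindly: a genuinely distinct variable that is used once and happens to resemble a common name would be flagged too.
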
>
> Good luck with your project.
>
> G
>
>
