Howdy Mixotricha,
I just happened upon your post and thought I would share this playlist, as it is a deep dive into a lot of the complexities of OCR. Preprocessing matters a great deal for getting good OCR results, which is why I put that video's title in bold below.

OCR in Python
https://www.youtube.com/playlist?list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez

01  12:08  Introduction to OCR <https://www.youtube.com/watch?v=tQGgGY8mTP0&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=1> (OCR in Python Tutorials 01.01)
02  11:14  How to Install the Libraries <https://www.youtube.com/watch?v=89m89vVh4wg&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=2> (OCR in Python Tutorials 01.02)
03   7:46  How to Open an Image in Python with PIL (Pillow) <https://www.youtube.com/watch?v=UxYJxcdLrs0&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=3> (OCR in Python 02.01)
04  53:24  *How to Preprocess Images for Text OCR in Python* <https://www.youtube.com/watch?v=ADV-AjAXHdc&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=4> (OCR in Python Tutorials 02.02)
05   6:18  Introduction to PyTesseract <https://www.youtube.com/watch?v=4uWp6dS6_G4&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=5> (OCR in Python Tutorials 02.03)
06   5:37  How to OCR an Index in Python with PyTesseract <https://www.youtube.com/watch?v=DXYPXZH2eGE&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=6> (OCR in Python Tutorials 03.01)
07  18:27  How to use Bounding Boxes with OpenCV <https://www.youtube.com/watch?v=9FCw1xo_s0I&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=7> (OCR in Python Tutorials 03.02)
08  12:58  How to Create a List of Named Entities from an Index with OpenCV <https://www.youtube.com/watch?v=y1iw8c2CEgw&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=8> (OCR in Python Tutorials 03.03)
09  15:48  How to OCR a Text with Marginalia by Extracting the Body <https://www.youtube.com/watch?v=DV5c9qHv0NQ&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=9> (OCR in Python Tutorials 04.01)
10   7:14  How to Separate a Footnote from Body Text in Python with OpenCV <https://www.youtube.com/watch?v=ZeCRe9sNFwk&list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez&index=10>

Remember to Breathe,
jesterjunk

On Tuesday, February 11, 2025 at 7:09:40 PM UTC-6 Graham Toal wrote:

> On Tue, Feb 11, 2025 at 5:52 PM Mixotricha <connoll...@gmail.com> wrote:
>
>> Thanks, that is a really helpful link. Unfortunately I do not have much
>> chance of getting better documents. The second scan came from a helpful
>> archivist at an installation that requires a classification to enter.
>> Otherwise I would literally get on a plane and go and look myself. I was
>> gratified that they were as helpful as they were. Really the halting point
>> in this translation is not the human words. It is the jump vectors (the
>> goto statements) and so now I am back to seeing if I can figure out some
>> sort of relationship in the jump vectors in the left hand column.
>> Unfortunately they do not match the line numbers on the right hand side.
>> But maybe I have just not figured out what that relationship might be.
>> Basically back to searching for context. Some other things in my favour are
>> that the thesis itself is an excellent piece of work really well explained
>> and has what are basically unit tests included that are themselves quite
>> legible. I feel getting this code back is right on the edge of possibility
>> if I just think about it a bit more.
>>
>
> I sympathise on the access problem - we submitted a bunch of listings and
> docs to our local museum for safe keeping and haven't seen it since. I
> guess they're being kept very safe :-/
>
> But don't give up hope on getting better access. I was quite impressed
> that the folks working on restoring the Bloodhound at bmpg.org.uk were
> able to get access to the original Coral66 source code.
> I myself managed to get the MOD's Defence Procurement Agency to give me
> permission to post the Coral 66 manual, just by asking via the contact
> page at the HMSO. So you never know... sometimes these people can be
> surprisingly reasonable.
>
> So my fixed-pitch stuff isn't going to help you. I have two other
> suggestions: 1) classic re-keying by 2 or 3 independent people. (If 2,
> then someone has to go over the differences and explicitly make a
> selection; if 3, use a 2 out of 3 consensus to pick the preferred version.
> Neither is foolproof but both considerably lower the rate of errors.); and
> 2) there's some experimental dewarping software worth trying, such as
> https://mzucker.github.io/2016/08/15/page-dewarping.html which might be
> better than the sort of software used in things like CZUR scanners that
> have a very specific model of a V shaped spine between pages of a book.
>
> Looking at your hand-tidied source I would expect that a custom Fortran
> parser could find a lot of corrections, simply by keeping a name and
> frequency table of variables - to catch things like CCMREG vs COMREG for
> example and automatically suggest the preferred version. I found that a
> hacked-up parser for Algol 60 was extremely helpful at that sort of
> correction, leaving only a few minor errors to catch using a real compiler
> once the sources were cleaned up enough to be compilable.
>
> Good luck with your project.
>
> G
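
For what it's worth, the name-and-frequency-table idea Graham describes can be roughly sketched in a few lines of Python. This is only a minimal sketch and not his parser: it does no real Fortran parsing and simply assumes identifiers are plain alphanumeric tokens. The function name `suggest_corrections` and the thresholds `min_common` and `cutoff` are made up for illustration. It uses `difflib.get_close_matches` from the standard library to pair each rarely seen name with a frequently seen near-identical one.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Assumption: an identifier is a letter followed by letters or digits.
IDENT = re.compile(r"\b[A-Z][A-Z0-9]*\b")


def suggest_corrections(source, min_common=3, cutoff=0.8):
    """Map each rare identifier to a frequent, similar-looking one.

    min_common: a name seen at least this many times is trusted.
    cutoff: minimum difflib similarity ratio (0..1) to suggest a fix.
    """
    # Build the name/frequency table over the whole listing.
    counts = Counter(IDENT.findall(source.upper()))
    common = [name for name, n in counts.items() if n >= min_common]

    suggestions = {}
    for name, n in counts.items():
        if n >= min_common:
            continue  # trusted spelling, nothing to suggest
        # Look for the closest trusted name; OCR slips such as O -> C
        # usually leave most of the characters intact.
        match = get_close_matches(name, common, n=1, cutoff=cutoff)
        if match:
            suggestions[name] = match[0]
    return suggestions
```

You would feed it the whole hand-tidied listing as one string and review the returned mapping by hand rather than applying it blindly. Short names that legitimately differ by one character (I1 and I2, say) can show up as false positives, so both thresholds will likely need tuning for a real listing.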