Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-05 Thread Ben Bongalon
licly > so going forward I can help the next person. > > Keith > > > Original message ---- > From: Ben Bongalon > Date: 1/5/21 11:56 PM (GMT-05:00) > To: Keith M > Cc: tesseract-ocr@googlegroups.com > Subject: Re: [tesseract-ocr] advice for OCR'i

Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-05 Thread kmongm
f I see more success with the training. I'll also make my models available publicly so going forward I can help the next person. Keith Original message From: Ben Bongalon Date: 1/5/21 11:56 PM (GMT-05:00) To: Keith M Cc: tesseract-ocr@googlegroups.com Subject: Re: [tesseract

Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-05 Thread Ben Bongalon
The link you cited prescribes a method where you must provide an image file for each line of text in your groundtruth data. So if you print out pages of sample BASIC programs on your dot-matrix printer, you would then: 1. scan the pages, 2. crop each tex
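To make the "one image file per line of ground-truth text" idea concrete, here is a minimal sketch of the bookkeeping. It assumes the naming convention where each cropped line image `NAME.png` sits next to a transcription file `NAME.gt.txt`; the link Ben refers to is not shown in this archive, so that convention, the `listing_` prefix, and both function names are assumptions for illustration only. The scanning and cropping steps are not covered.

```python
from pathlib import Path


def write_ground_truth(lines, out_dir, prefix="listing"):
    """Write one .gt.txt file per transcribed line of the listing.

    Returns (expected_image_path, gt_path) pairs. The image paths are only
    the names the cropped line images are expected to have (an assumed
    convention); this function does not create any images.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pairs = []
    for i, text in enumerate(lines, start=1):
        stem = f"{prefix}_{i:04d}"
        gt_path = out / f"{stem}.gt.txt"
        # One line of text per file, terminated by a single newline.
        gt_path.write_text(text.rstrip("\n") + "\n", encoding="utf-8")
        pairs.append((out / f"{stem}.png", gt_path))
    return pairs


def missing_images(pairs):
    """List the expected line images that have not been cropped yet."""
    return [img for img, _ in pairs if not img.exists()]
```

Calling `missing_images(write_ground_truth(lines, "gt"))` after a cropping session would show which line images still need to be produced before training.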

Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-05 Thread Keith M
Ben, Thanks for the interest and chiming in. Yes, I used tesseract 5.0, eng, BASIC command keywords in eng.user-words, white-listed only allowed characters, and loading/not loading user dictionary/freq. I haven't tried training yet. I could probably find and even generate, assuming new ink
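The setup Keith describes (the `eng` model, a user-words file, a character whitelist, and dictionaries switched on or off) can be sketched as a small helper that assembles the Tesseract command line. This is only an illustration: the exact whitelist, the page segmentation mode (`--psm 6`), and the file names are assumptions, not settings quoted in the thread.

```python
import string


def whitelist_chars():
    """Characters assumed to occur in a BASIC listing (an assumption)."""
    return string.ascii_uppercase + string.digits + '"$%()*+,-./:;<=>?'


def user_words_text(keywords):
    """Format a user-words file body: one word per line."""
    return "\n".join(keywords) + "\n"


def build_tesseract_args(image_path, out_base, user_words_path,
                         use_dictionaries=True):
    """Assemble a tesseract command line as a list of arguments.

    Nothing is executed here; pass the result to subprocess.run() to
    actually run OCR.
    """
    args = [
        "tesseract", image_path, out_base,
        "-l", "eng",
        "--psm", "6",  # assumed: treat the page as a single block of text
        "--user-words", user_words_path,
        "-c", "tessedit_char_whitelist=" + whitelist_chars(),
    ]
    if not use_dictionaries:
        # Turn off the built-in word list and the frequent-word list.
        args += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return args
```

Building the argument list separately from running it makes it easy to compare runs with and without the dictionaries on the same scan.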

Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-05 Thread Ben Bongalon
Hi Keith, Interesting project. Having looked at the sample OCR results that Alex posted, I think the poor recognition from Tesseract is more likely due to the underlying language model used (I'm assuming you used 'eng'?). For example, the "test1" OCR result correctly transcribes the variables

Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-04 Thread Keith M
Hello again Alex, Thanks for the conversation. I have someone who has offered to modify a similar, but slightly different, font for me. This would potentially allow some optimization on recognition. For instance, Abbyy FineReader accepts a font file, and providing a matching one, it's suppose

Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-04 Thread Alex Santos
Hi Keith, I read your reply with great interest because your case appears to be rather unique: you are trying to OCR lines and lines of dot matrix characters, and it’s an interesting project to translate those old BASIC listings to a PDF or a txt file. So I followed your links and your adven

[tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

2020-12-13 Thread Keith M
Hi there, I've been circling a problem with OCR'ing 90 pages of 30-year-old BASIC code. I've been working on optimizing my scanning settings and pre-processing, stuck in Photoshop for hours messing around. Long couple of days with this stuff! I've been through tessdoc, through the FAQ, through w