Alex,

Thanks for replying; I appreciate the time, especially the command line with the various options specified!
I've spent hours and hours googling, both before posting here and afterwards. There's SOME information out there, but no real smoking gun. Most of the ideas in the first 10 pages of Google results have not panned out in terms of EFFECTIVE results. https://github.com/ameera3/OCR_Expiration_Date looks pretty interesting, but it felt overly complicated to me.

Other responses in-line.

On 1/1/2021 3:07 PM, Alex Santos wrote:
> To overcome [the fact that the dots when scanned in hi-res are individual] you might need to preprocess the scanned images with some image editing software to find a sweet spot. I would probably start by doing a high contrast medium resolution scan, then add some gaussian blur to effectively marry the dots into a continuous shape, rather than individual dots, and then use some leveling tool to tighten the soft blur around the edges.

Spent a few hours messing around with this: https://jeffreymorgan.io/articles/improve-dot-matrix-ocr-performance-tutorial/

I get the idea, and if I read you right, you're saying basically the same thing. However, it really didn't pan out. Yes, the characters look more like traditional text, but there was no dramatic improvement in recognition. Part of the problem is that there are so many variables, and it's hard to isolate minor improvements. (A rough sketch of the blur-then-levels preprocessing I tried is included below, after the results.)

> I attached a zip file with two tests based on the sample image you provided. I didn't get a good chance to make all the comparisons, but I created a PNG with some gaussian blur, and then contracting the levels gave me what appear to be decent results. I also scaled the processed image to 200% and saved it as a TIF.

Thanks for doing this.

Here's my current process, which is yielding very good results (a sketch of the whole pipeline follows the results comparison below):

* Use Windows scanning software (Linux works too, but is more cumbersome) with a Fujitsu iX500 scanner: Black and White setting, adjusted 75% dark, 1200 dpi.
* Use pdftoppm with the -gray option to spit out a *.pgm file at full resolution.
* Use unpaper (https://github.com/unpaper/unpaper) with default options to pre-process the scanned image. This really helps!
* Convert to *.png and resize to 50%. Doing this because AWS Textract can't take such a large image: 8.5x11 at 1200 dpi is 10,200 x 13,200!
* Use AWS Textract (https://aws.amazon.com/textract/) to perform the OCR. I can't recommend this service enough. It's practically free, and super easy to use: about 10 lines of Python to call it from Linux. You get feedback per line/word/block/page with confidence values. Average confidence value is 98%+.

I'm going to type up a more comprehensive document, but here are some basic results from LIMITED testing, compared using Text Compare in Beyond Compare 4:

* Amazon Textract: Only 1 wrong character in 1020, and two other smaller, excusable defects (an extra ":" detected and a one-letter mistake). This simply works out of the box with zero configuration.
* Tesseract: With a character whitelist and BASIC keywords added to eng.user-words (flags sketched below). I definitely can't get under about 12 lines' worth of mistakes; approximately 80% accuracy with this one test document. I feel like there's room for optimization, but I'm not sure I'm going to chase it.
* Alex test1: 25 different lines (not great).
* Alex test2: 18 different lines (a little worse than my best Tesseract run with any configuration).
* Abbyy FineReader 15: Pretty horrible results.
* Abbyy Cloud OCR: Better than the application, but I can't easily evaluate the results.
* ReadIris 17: Pretty horrible results.
Without sounding too much like an Amazon commercial (no relation beyond being a happy customer), Amazon Textract has a companion feature called A2I, which routes low-confidence recognition lines to human review, for example via Amazon Mechanical Turk. I'm not using A2I, but I **am** going to manually route my results through MTurk. It's a couple of extra manual steps, and I have to pay for the human review (maybe $50 by the time I'm done), but I think it's neat, and I like learning about new technology.

Hope the group finds this info useful.

Thanks,
Keith

On Friday, January 1, 2021 at 11:32:40 PM UTC-5 Keith M wrote:
> Ger,
>
> Thanks for taking the time to reply.
>
> On 1/1/2021 4:00 PM, Ger Hobbelt wrote:
>> Another technique specifically for dot-matrix might be to blend multiple copies of the scan at small offsets. The idea here is that back in the old days of dot matrix, a few DTP applications had printing modes which would print dot patterns several times on the same line, but ever so slightly offset from one another to 'fill the character up'. The poor man's way to print BOLD characters that way was to print the same line multiple times at slight offsets.
>
> The printer's manual actually details so much of this internal working. Besides schematics, BOM lists, descriptions of the theory of operation, etc., I had forgotten the level of detail we used to get when we bought a multi-hundred-dollar product.
>
>> Hence to simulate this sort of 'gap closing', one could scan at higher resolution, then offset the image multiple times in various directions by "half a printer dot" (or less) and blend the copies using a blending mode like Photoshop Darken.
>
> I **believe** that morphological dilation is similar to what you're talking about here.
>
> "Dilation [...] adds a layer of pixels to both the inner and outer boundaries of regions."
>
> from https://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures/ImageProcessing-html/topic4.htm
>
> I tried a few different techniques similar to what you've mentioned. While conceptually it should help, practically speaking I saw only minimal improvement.
>
> While it's still a work in progress, I'm describing my current best efforts/results in the other reply here.
>
> Thanks,
> Keith
>
> On Friday, January 1, 2021 at 10:03:37 PM UTC-5 shree wrote:
>> Please see the old thread at https://groups.google.com/g/tesseract-ocr/c/ApM_TqwV7aE/m/z5jZV0I0AgAJ for a link to a completed project for dot matrix.
>>
>> On Monday, December 14, 2020 at 12:11:00 PM UTC+5:30 Keith M wrote:
>>> Hi there,
>>>
>>> I've been circling a problem with OCR'ing 90 pages of 30-year-old BASIC code. I've been working on optimizing my scanning settings and pre-processing, stuck in Photoshop for hours messing around. Long couple of days with this stuff!
>>>
>>> I've been through tessdoc, through the FAQ, through Wikipedia reading about morphological operators, and through PPAs for 5.0.0-alpha-833-ga06c.
>>>
>>> I'm getting OK results so far, but I need to process more images, and my workflow is tedious.
>>>
>>> Sample image here: https://www.techtravels.org/wp-content/uploads/2020/12/FNBBS-02_crop.png
>>>
>>> It's a 150dpi image extracted via pdftoppm -png from a 1200dpi scan. While it's not super clear to me why, higher-res scans are resulting in WORSE OCRs.
>>>
>>> *TLDR; What should be the ideal configuration of tesseract for my application? Disable the dictionary?
>>> Can I add BASIC commands and keywords to eng.user-words? From the manual, the "CONFIG FILES AND AUGMENTING WITH USER DATA" section??*
>>>
>>> I could use some help, thanks!
>>>
>>> Keith