Re: [tesseract-ocr] Microscopy label, poor recognition

'Martin Weihrauch' via tesseract-ocr Tue, 21 Dec 2021 02:57:48 -0800

Thank you so much for your efforts!

Merlijn Wajer wrote on Tuesday, 21 December 2021 at 11:53:44 UTC+1:


> Hi Martin,
>
> Some of the advice below applies to Tesseract 5 only...
>
> On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote:
> > 
> > 
> > I have an image (the label of a microscopy slide) which I thought would be
> > easy to OCR, because it is easily readable for humans. I am using the
> > latest Tesseract V5 from the command line under Windows. However, with
> > tesseract image.jpg image.txt --oem 1 --psm x
> >
> > where x is any of the "--psm" values I tried, the results are poor: it
> > misses the bottom line with "LOT40446" and reads "+" as a "4" after
> > binarization of the image I post here. Is there anything I can do to
> > improve the results?
> > 
> > I tried:
> > 
> > - Binarizing the image
> > 
> > - Setting DPI to 300 dpi
> > 
> > With these two changes, it produced:
> > 
> > *| +125 PROCock tai*
> > 
> > * | 12/03/2021*
> > 
> > *| 36729/21 344*
>
> This seems to work decently for reading the text you pasted above:
>
> > $ tesseract --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> > | +125 PROCock tai
> > 
> > | 12/03/2021
> > | 36729/21 3+4
>
> But it still doesn't pick up the other text, which seems more like a
> segmentation problem. You can experiment with other psm values
> (with --psm 11 it finds '40446').
> You can try the other thresholding_method values (0, 1, 2) as well:
>
> > $ tesseract --psm 11 --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> > ay els
> > 
> > 12/03/2021
> > 
> > 36729/21 3+4
> > 
> > LOT
> > 
> > 40446
>
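
The psm and thresholding_method experiments suggested above could be
scripted roughly as follows. This is only a sketch: the helper names
build_command and sweep are made up for illustration, the list of psm
values is an arbitrary selection, and the flags simply mirror the command
quoted above. It assumes a tesseract binary is on the PATH.

```python
import itertools
import subprocess


def build_command(image, psm, method, dpi=600, lang="eng"):
    """Return the tesseract argument list for one (psm, thresholding_method) pair."""
    return [
        "tesseract",
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-c", f"thresholding_method={method}",
        "-l", lang,
        image,
        "-",  # "-" as output base sends the recognized text to stdout
    ]


def sweep(image, psm_values=(3, 4, 6, 11, 12), methods=(0, 1, 2)):
    """Run every combination and return {(psm, method): recognized_text}."""
    results = {}
    for psm, method in itertools.product(psm_values, methods):
        proc = subprocess.run(
            build_command(image, psm, method),
            capture_output=True,
            text=True,
        )
        # A failed run leaves stdout empty; it is kept so gaps stay visible.
        results[(psm, method)] = proc.stdout.strip()
    return results
```

Calling sweep("/tmp/JBOBF.jpg") and printing each entry makes it easy to
see which combination recovers both the dates and the "LOT40446" line.
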
> If the segmentation isn't what you hoped for, you could also try
> manually segmenting the image, or at least cropping it a bit more (to
> make it more clear) before passing it to Tesseract.
>
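
One possible way to do the manual segmentation mentioned above is to cut
the label into horizontal strips and OCR each strip on its own. The sketch
below assumes the text lines are roughly evenly spaced and that Pillow is
installed; band_boxes and crop_bands are hypothetical helper names, and the
number of strips and the overlap would need tuning for the real label.

```python
def band_boxes(width, height, n_bands, overlap=0):
    """Split a width x height image into n_bands horizontal strips.

    Returns (left, upper, right, lower) boxes. Neighbouring strips are
    extended by `overlap` pixels so a line on a boundary is not cut in half.
    """
    if n_bands < 1:
        raise ValueError("n_bands must be at least 1")
    step = height / n_bands
    boxes = []
    for i in range(n_bands):
        upper = max(0, round(i * step) - overlap)
        lower = min(height, round((i + 1) * step) + overlap)
        boxes.append((0, upper, width, lower))
    return boxes


def crop_bands(path, n_bands, overlap=0):
    """Save each strip of the image at `path` as a PNG and return the file names."""
    from PIL import Image  # Pillow, assumed to be installed

    img = Image.open(path)
    names = []
    for i, box in enumerate(band_boxes(img.width, img.height, n_bands, overlap)):
        name = f"{path}.band{i}.png"
        img.crop(box).save(name)
        names.append(name)
    return names
```

Each strip can then be passed to Tesseract separately, for example with
--psm 7 (treat the image as a single text line) if a strip really does
contain only one line.
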
> For microfiche labels (not microscopy), I resorted to manual
> segmentation (with prior knowledge of the material) and also had to
> retrain Tesseract to deal with dot matrix fonts, but you don't seem to
> need that. Probably with a bit more tweaking of either image cleanup or
> segmentation you can get pretty decent results.
>
> Regards,
> Merlijn
>

