Re: [tesseract-ocr] Microscopy label, poor recognition

Merlijn B.W. Wajer Tue, 21 Dec 2021 02:53:44 -0800

Hi Martin,

Some of the advice below applies to Tesseract 5 only...

On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote:
>  
> 
> I have an image (label of a microscopy slide), which I thought would be 
> easy to OCR, because it is easily readable for humans. I am using the 
> latest Tesseract V5 from the command line under Windows. However, with
> tesseract image.jpg image.txt --oem 1 --psm x 
> 
> where "--psm x" stands for each of the values I tried, the results are poor (it 
> misses the bottom line with "LOT40446" and thinks "+" is a "4" after 
> binarization of the image I post here). Is there anything I can do to 
> improve the results? 
> 
> I tried:
> 
> - Binarizing the image
> 
> - Setting DPI to 300 dpi
> 
> With these latter, it produced: 
> 
> *| +125 PROCock tai*
> 
> * | 12/03/2021*
> 
> *| 36729/21 344*

This seems to work decently for reading the text you pasted above:

> $ tesseract --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> | +125 PROCock tai
> 
> | 12/03/2021
> | 36729/21 3+4

But it still doesn't pick up the other text, which looks more like a
segmentation problem. You can experiment with other psm values
(with --psm 11 it finds '40446').
You can try the other thresholding_method values (0, 1, 2) as well:

> $ tesseract --psm 11 --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> ay els
> 
> 12/03/2021
> 
> 36729/21 3+4
> 
> LOT
> 
> 40446
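
If you want to compare several combinations side by side, something like
the following Python sketch may help. It only wraps the same command line
shown above; the function names and the particular list of psm values are
my own choices for illustration, not anything Tesseract prescribes, and it
assumes the tesseract binary is on your PATH.

```python
import itertools
import subprocess


def build_command(image_path, psm, thresholding_method, dpi=600, lang="eng"):
    """Return the tesseract argument list for one parameter combination."""
    return [
        "tesseract",
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-c", f"thresholding_method={thresholding_method}",
        "-l", lang,
        image_path,
        "-",  # "-" as output base sends the recognised text to stdout
    ]


def sweep(image_path, psm_values=(3, 4, 6, 11, 12), methods=(0, 1, 2)):
    """Run tesseract once per (psm, thresholding_method) pair.

    The psm values are an arbitrary selection to try; adjust as needed.
    Returns a dict mapping (psm, method) to the text tesseract printed.
    """
    results = {}
    for psm, method in itertools.product(psm_values, methods):
        proc = subprocess.run(
            build_command(image_path, psm, method),
            capture_output=True,
            text=True,
        )
        results[(psm, method)] = proc.stdout
    return results
```

Calling sweep("/tmp/JBOBF.jpg") and printing each entry makes it easy to
see which combination picks up the "LOT" line.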

If the segmentation isn't what you hoped for, you could also try
manually segmenting the image, or at least cropping it a bit more (to
make it clearer) before passing it to Tesseract.
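
As a rough illustration of manual segmentation, the sketch below crops
fixed bands out of the label and runs Tesseract on each band with --psm 7
(treat the image as a single text line). The REGIONS fractions are
hypothetical placeholders that would have to be measured on your own
labels, the function names are mine, and the cropping assumes the
third-party Pillow library is installed.

```python
import subprocess

# Hypothetical layout: (left, top, right, bottom) as fractions of the
# image size. These numbers are placeholders, not measured values.
REGIONS = {
    "line1": (0.0, 0.00, 1.0, 0.25),
    "line2": (0.0, 0.25, 1.0, 0.50),
    "line3": (0.0, 0.50, 1.0, 0.75),
    "lot": (0.0, 0.75, 1.0, 1.00),
}


def to_pixel_box(fractions, width, height):
    """Convert a fractional box to the integer pixel box Pillow expects."""
    left, top, right, bottom = fractions
    return (
        int(left * width),
        int(top * height),
        int(right * width),
        int(bottom * height),
    )


def ocr_regions(image_path, regions=REGIONS):
    """Crop each region to its own PNG and OCR it as a single text line."""
    # Pillow is imported here so the box arithmetic above stays usable
    # even where Pillow is not installed.
    from PIL import Image

    texts = {}
    with Image.open(image_path) as img:
        width, height = img.size
        for name, fractions in regions.items():
            crop_path = f"{image_path}.{name}.png"
            img.crop(to_pixel_box(fractions, width, height)).save(crop_path)
            proc = subprocess.run(
                ["tesseract", "--psm", "7", "--dpi", "600",
                 "-l", "eng", crop_path, "-"],
                capture_output=True,
                text=True,
            )
            texts[name] = proc.stdout.strip()
    return texts
```

Fixed bands only make sense when every label has the same layout; if the
layout varies, the regions would have to be found some other way.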

For microfiche labels (not microscopy), I resorted to manual
segmentation (with prior knowledge of the material) and also had to
retrain Tesseract to deal with dot matrix fonts, but you don't seem to
need that. Probably with a bit more tweaking of either image cleanup or
segmentation you can get pretty decent results.

Regards,
Merlijn

