Thank you so much for your efforts!

Merlijn Wajer wrote on Tuesday, 21 December 2021 at 11:53:44 UTC+1:
> Hi Martin,
>
> Some of the advice below applies to Tesseract 5 only...
>
> On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote:
> >
> > I have an image (label of a microscopy slide), which I thought would be
> > easy to OCR, because it is easily readable for humans. I am using the
> > latest Tesseract V5 as a command line under Windows However, with
> > tesseract image.jpg image.txt --oem 1 --psm x
> >
> > with "--psm x" x being any number, which I tried, the results are poor (it
> > misses the bottom line with "LOT40446" and thinks "+" is a "4" after
> > binarization of the image I post here. Is there anything I can do to
> > improve the results?
> >
> > I tried:
> >
> > - Binarizing the image
> >
> > - Setting DPI to 300 dpi
> >
> > With these latter, it produced:
> >
> > *| +125 PROCock tai*
> >
> > * | 12/03/2021*
> >
> > *| 36729/21 344*
>
> This seems to work decent for reading the text you pasted above:
>
> > $ tesseract --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> > | +125 PROCock tai
> >
> > | 12/03/2021
> > | 36729/21 3+4
>
> But it still doesn't pick up the other text, which seems more like
> segmentation problem. You can try to experiment with other psm values
> (with --psm 11 it finds '40446').
> You can try other thresholding_method's (0, 1, 2) as well:
>
> > $ tesseract --psm 11 --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> > ay els
> >
> > 12/03/2021
> >
> > 36729/21 3+4
> >
> > LOT
> >
> > 40446
>
> If the segmentation isn't what you hoped for, you could also try
> manually segmenting the image, or at least cropping it a bit more (to
> make it more clear) before passing it to Tesseract.
>
> For microfiche labels (not microscopy), I resorted to manual
> segmentation (with prior knowledge of the material) and also had to
> retrain Tesseract to deal with dot matrix fonts, but you don't seem to
> need that.
> Probably with a bit more tweaking of either image cleanup or
> segmentation you can get pretty decent results.
>
> Regards,
> Merlijn
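
Since the advice above boils down to trying several --psm values and the three thresholding_method settings, here is a minimal sketch of how that sweep could be scripted. It only reuses flags that appear in the commands quoted above (--psm, --dpi, -c thresholding_method=N, -l eng, and "-" for stdout). The helper names build_command and sweep are hypothetical, and the particular list of psm values is an arbitrary starting point, not a recommendation from the thread.

```python
import itertools
import subprocess


def build_command(image, psm, thresholding_method, dpi=600, lang="eng"):
    """Build a tesseract argument list in the same order as the quoted commands.

    build_command is a hypothetical helper, not part of Tesseract.
    """
    return [
        "tesseract",
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-c", f"thresholding_method={thresholding_method}",
        "-l", lang,
        image,
        "-",  # "-" as output base sends the recognized text to stdout
    ]


def sweep(image, psm_values=(3, 4, 6, 11, 12), thresholding_methods=(0, 1, 2)):
    """Run tesseract once per (psm, thresholding_method) pair.

    Yields (psm, thresholding_method, recognized_text). The default psm
    values are an assumption chosen for illustration only.
    """
    for psm, method in itertools.product(psm_values, thresholding_methods):
        result = subprocess.run(
            build_command(image, psm, method),
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
            check=False,  # keep sweeping even if one combination fails
        )
        yield psm, method, result.stdout
```

Assuming tesseract is on the PATH, something like `for psm, method, text in sweep("/tmp/JBOBF.jpg"): print(psm, method, text.strip())` would print every result, so you can see which combination picks up both "LOT" and "40446".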