Hi Zdenko, thanks for the suggestion. Although still not perfect, "--oem 0" did produce the best results yet and I have been able to correct the 15-or-so errors manually (this was one of five hundred images that needed digitising). Still puzzled as to why these errors are there but I guess it'll have to do.
Thanks again,
Giorgos

On Tuesday, July 13, 2021 at 10:51:23 AM UTC+3 zdenop wrote:

> Use the legacy engine for this type of input:
>
> tesseract digits.jpg - --oem 0
> Estimating resolution as 769
> 2.565
> 2.597
> 2.614
> 2.528
> 2.441
> 2.564
> 2.530
> '2.479
> 2.601
> 2.601
> 2.569
> 2.555
>
> 2.437
> 2.531
> 2.592
> '2.385
> 2.618
> 2.738
> 2.766
> 2.473
> 2.624
> 2.611
> 2.749
> 2.730
>
> Zdenko
>
> On Tue, 13 Jul 2021 at 9:38, Giorgos Papageorgiou wrote:
>
>> I am having issues getting tesseract to recognise a column of numbers in
>> what I naively assume should be a straightforward problem. Most of the
>> issues come from mis-recognition of the decimal point: tesseract either
>> skips it or mistakes it for a digit. I call tesseract 4.1.1 with the options
>> "-c tessedit_char_whitelist=-.0123456789 --psm 4 -l eng --oem 2" and I want
>> to get a column of numbers in tabular form. After pre-processing my image,
>> I have something of the sort:
>> [image: 20.jpg]
>> which is then recognised as:
>>
>> 2.565
>> 2597
>> 2.614
>> 2528
>> 2.441
>> 2564
>> 2.530
>> 24479
>> 2.601
>> 2.601
>> 2.569
>> 24555
>> 2.437
>> 2.531
>> 2.592
>> 2.385
>> 2.618
>> 2.738
>> 2.766
>> 24473
>> 2.624
>> 2.611
>> 2.749
>> 2.730
>>
>> I can't afford to skip decimal points, and there is no fixed pattern to
>> where the decimal points are (so I can't drop "." or "-" from the list of
>> allowed characters). Can someone advise whether this is a pre-processing or
>> a tesseract issue, and how I could improve OCR here?
>>
>> Thanks

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/eddc1eaf-fc05-4fa4-94f6-d5c016a579b0n%40googlegroups.com.
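Since the remaining errors in this thread were corrected by hand, one way to find them faster across many images might be to flag any OCR line that is not a plain decimal number. The following is only a minimal sketch, not part of tesseract: the function name `flag_suspect_lines` and the pattern are hypothetical, and it assumes every valid reading contains exactly one decimal point with digits on both sides and an optional leading minus sign. It would catch a dropped decimal point, a decimal point read as a digit (e.g. "24479"), or a stray apostrophe, but not a digit misread as another digit.

```python
import re

# Assumption: a valid reading is an optional minus sign, digits, one decimal
# point, then digits. Loosen this if some values are legitimately integers.
VALID_NUMBER = re.compile(r"-?\d+\.\d+")


def flag_suspect_lines(ocr_text):
    """Return (line_number, text) pairs for lines that are not plain decimals."""
    suspects = []
    for line_number, raw in enumerate(ocr_text.splitlines(), start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines between blocks are not treated as errors
        if not VALID_NUMBER.fullmatch(text):
            suspects.append((line_number, text))
    return suspects


# Sample lines taken from the two tesseract outputs quoted in this thread.
sample = "2.565\n2597\n'2.479\n24555\n\n2.437\n"
for line_number, text in flag_suspect_lines(sample):
    print(f"line {line_number}: {text!r}")
```

The flagged line numbers refer to lines of the OCR text output, so they can be checked against the source image before correcting by hand.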

