Hi Zdenko,

Thanks for the suggestion. Although still not perfect, "--oem 0" produced 
the best results yet, and I was able to correct the 15 or so remaining 
errors manually (this was one of five hundred images that needed 
digitising). I am still puzzled as to why these errors occur, but I guess 
it'll have to do.
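
For the remaining images, a small check along these lines might help find 
the lines that need manual correction. It is only a sketch: the function 
name and the pattern are my own assumptions, and it presumes every valid 
reading is an optional minus sign, digits, one decimal point, and more 
digits, which holds for the column quoted below but may not hold for 
every image.

```python
import re

# Assumed shape of a valid reading: optional minus sign, digits, one decimal
# point, digits. This is a guess based on the sample column, not a Tesseract rule.
VALID = re.compile(r"-?\d+\.\d+")


def flag_suspect_lines(ocr_text):
    """Return (line_number, text) pairs for non-empty lines that fail VALID."""
    suspects = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        text = line.strip()
        # Blank lines are skipped; anything else must match the whole pattern.
        if text and not VALID.fullmatch(text):
            suspects.append((number, text))
    return suspects


if __name__ == "__main__":
    # Sample mixes good readings with the kinds of errors seen in this thread.
    sample = "2.565\n2597\n2.614\n'2.479\n\n24479\n"
    for number, text in flag_suspect_lines(sample):
        print(f"line {number}: {text!r}")
```

This only catches readings whose shape is wrong (a missing decimal point, 
a decimal point read as a digit, or a stray apostrophe). A misread digit 
that still yields a well-formed number would pass unnoticed.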

thanks again

Giorgos

On Tuesday, July 13, 2021 at 10:51:23 AM UTC+3 zdenop wrote:

> Use legacy engine for this type of input:
>
> tesseract digits.jpg - --oem 0
> Estimating resolution as 769
> 2.565
> 2.597
> 2.614
> 2.528
> 2.441
> 2.564
> 2.530
> '2.479
> 2.601
> 2.601
> 2.569
> 2.555
>
> 2.437
> 2.531
> 2.592
> '2.385
> 2.618
> 2.738
> 2.766
> 2.473
> 2.624
> 2.611
> 2.749
> 2.730
>
> Zdenko
>
>
> On Tue, 13 Jul 2021 at 9:38, Giorgos Papageorgiou <[email protected]> wrote:
>
>> I am having issues getting tesseract to recognise a column of numbers in 
>> what I naively assume should be a straightforward problem. Most of the 
>> issues come from a mis-recognition of the decimal point - it either skips 
>> it, or mistakes it for a number. I call tesseract 4.1.1 with the options 
>> " -c tessedit_char_whitelist=-.0123456789 --psm 4 -l eng --oem 2" and I 
>> would like to get a column of numbers in tabular form. After 
>> pre-processing my image, I have something like this:
>>  [image: 20.jpg]
>> which is then recognised as:
>>
>> 2.565
>> 2597
>> 2.614
>> 2528
>> 2.441
>> 2564
>> 2.530
>> 24479
>> 2.601
>> 2.601
>> 2.569
>> 24555
>> 2.437
>> 2.531
>> 2.592
>> 2.385
>> 2.618
>> 2.738
>> 2.766
>> 24473
>> 2.624
>> 2.611
>> 2.749
>> 2.730
>>
>> I can't afford to lose decimal points, and their position follows no 
>> fixed pattern (so I can't drop "." or "-" from the list of allowed 
>> characters). Can someone advise whether this is a pre-processing or 
>> tesseract issue and how I could improve OCR here?
>>
>> Thanks
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/31697a83-777c-4eae-92b6-04ad75ba4ab1n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/31697a83-777c-4eae-92b6-04ad75ba4ab1n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
