Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

'Yash Mistry' via tesseract-ocr Fri, 24 Jun 2022 00:22:56 -0700

Hi Lorenzo,

Thank you for the suggestions.


The first approach you suggested is not feasible for me because there is no
certainty that a specific type of data will appear at a particular position.

I am interested in the second approach. I have been looking for a Tesseract
feature that returns all possible predictions for a specific letter,
but I haven't found one yet.

Can you please tell me where you found this functionality in Tesseract,
and in which version?

Thank you

On Tuesday, June 7, 2022 at 1:45:48 PM UTC+5:30 Lorenzo Blz wrote:

> Hi Yash,
> in my experience you are going to see a lot of these errors on similar
> characters.
>
>
> Given only the pre-processed text, I might make the same mistake myself.
>
>
> What I do is to fix these letters according to a pattern, in this case 
> WDDDDDDD
>
> and I replace:
>
> S <-> 8
> O <-> 0
> I <-> 1
> i <-> 1
> l <-> 1
> z  <->  2
> Z  <->  2
> etc.
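
A minimal sketch of the pattern-based replacement described above. It assumes that in WDDDDDDD a "W" marks a letter position and a "D" a digit position, and that digits read in a letter position map back to uppercase letters (the whitelist in the original question is uppercase only). The function name and the two tables are hypothetical, for illustration only.

```python
# Look-alike substitutions taken from the list above.
# TO_DIGIT is used where the pattern expects a digit, TO_LETTER where it
# expects a letter. The reverse mapping is an assumption: 1 could equally
# be I, i or l, and 2 could be z or Z; uppercase is chosen here.
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1", "l": "1", "z": "2", "Z": "2"}
TO_LETTER = {"8": "S", "0": "O", "1": "I", "2": "Z"}


def fix_by_pattern(text: str, pattern: str) -> str:
    """Swap look-alike characters so that text fits pattern (W=letter, D=digit)."""
    if len(text) != len(pattern):
        # The layout does not match, so do not guess.
        return text
    fixed = []
    for ch, kind in zip(text, pattern):
        if kind == "D" and not ch.isdigit():
            ch = TO_DIGIT.get(ch, ch)
        elif kind == "W" and not ch.isalpha():
            ch = TO_LETTER.get(ch, ch)
        fixed.append(ch)
    return "".join(fixed)


print(fix_by_pattern("81003716", "WDDDDDDD"))
```

This only works when the expected layout of the field is known in advance; characters with no entry in the tables are left unchanged.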
>
> Another option, though I'm not 100% sure it is possible with the latest
> version, is to ask tesseract for the whole list of predictions for each
> token with confidence. For the first token you'd get something like:
>
> S: 0.6839
> 8: 0.2123
> B: 0.1445
> ...
>
> and, again according to a pattern, you select the best matching one (you
> need to use the choiceIterator on the result object, iterating at level
> SYMBOL). This second approach is more elegant, but I do not think it gives
> you much more than the simpler approach.
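
A rough sketch of the second approach, assuming the tesserocr Python wrapper. The calls used (PyTessBaseAPI, GetIterator, iterate_level, GetChoiceIterator) follow the example in the tesserocr README. Whether the LSTM engine returns more than one choice per symbol depends on the Tesseract version and settings (the lstm_choice_mode variable may be relevant), so treat that part as untested. The function names symbol_choices and pick_by_class are hypothetical, and tesserocr reports confidences as percentages rather than the 0 to 1 values shown above.

```python
def symbol_choices(image_path):
    """Return, for each symbol, a list of (text, confidence) alternatives."""
    # Imported lazily so the selection logic below can be used without tesserocr.
    from tesserocr import PyTessBaseAPI, PSM, RIL, iterate_level

    results = []
    with PyTessBaseAPI(lang="eng", psm=PSM.SINGLE_LINE) as api:
        api.SetVariable("save_blob_choices", "T")
        api.SetImageFile(image_path)
        api.Recognize()
        for symbol in iterate_level(api.GetIterator(), RIL.SYMBOL):
            choices = [(c.GetUTF8Text(), c.Confidence())
                       for c in symbol.GetChoiceIterator()]
            results.append(choices)
    return results


def pick_by_class(choices, kind):
    """Pick the most confident choice that fits kind ('W' letter, 'D' digit)."""
    if not choices:
        return ""
    wanted = str.isalpha if kind == "W" else str.isdigit
    matching = [c for c in choices if c[0] and wanted(c[0])]
    # Fall back to the overall best guess when nothing fits the expected class.
    pool = matching or choices
    return max(pool, key=lambda c: c[1])[0]


# The example scores from above, for the first token:
first = [("S", 0.6839), ("8", 0.2123), ("B", 0.1445)]
print(pick_by_class(first, "W"), pick_by_class(first, "D"))
```

As with the simpler replacement, the selection still needs to know which class is expected at each position.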
>
> Of course a little bit of model fine tuning helps but will not fix these 
> problems 100% and it takes a lot of time to do it properly.
>
>
> I recommend using tesserocr, which is a real API/library wrapper (not a
> command line wrapper...); it gives you access to the whole API and, if used
> properly, it is a lot faster.
>
>
>
> Bye
>
> Lorenzo
>
> Il giorno mar 7 giu 2022 alle ore 09:50 'Yash Mistry' via tesseract-ocr <
> tesser...@googlegroups.com> ha scritto:
>
>> I am facing a challenge extracting the correct letter from a word when
>> characters look alike, e.g. 5 & S, I & 1, 8 & S.
>>
>> I applied image pre-processing techniques such as blurring, erosion, dilation,
>> noise normalisation, removal of unnecessary components and text detection on
>> the input image, but even after this much pre-processing Tesseract OCR isn't
>> giving the correct result.
>>
>> Please check attached images,
>>
>> *Original Image*
>>
>>
>> *[image: image.png]*
>>
>> *Pre-processed Image*
>>
>> [image: image (1).png]
>>
>> *Detected Text*
>>
>>
>> *[image: image (2).png]*
>>
>>
>> *[image: image (3).png]*
>>
>> *Tesseract Configuration*
>>
>> -l eng --oem 1 --psm 7 -c 
>> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n" 
>> load_system_dawg=false load_freq_dawg=false
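
A small sketch of how the options above might be passed to the tesseract executable from Python. As far as I know, each parameter on the command line takes its own -c flag, and bare var=value tokens after the first may be read as config file names instead. The helper name build_tesseract_args is hypothetical, and the "\n" at the end of the whitelist is left out on the assumption that it is not needed.

```python
import subprocess


def build_tesseract_args(image_path, outputbase="stdout"):
    """Build the argument list for the tesseract CLI with the settings above."""
    whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return [
        "tesseract", image_path, outputbase,
        "-l", "eng",
        "--oem", "1",
        "--psm", "7",
        # One -c flag per parameter.
        "-c", f"tessedit_char_whitelist={whitelist}",
        "-c", "load_system_dawg=false",
        "-c", "load_freq_dawg=false",
    ]


def run_tesseract(image_path):
    """Run tesseract and return the recognised text (needs tesseract installed)."""
    result = subprocess.run(build_tesseract_args(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

With pytesseract the same options would go into the config string, again with a separate -c in front of each parameter.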
>>
>> *Result of OCR*: TITLENUMBER 81003716
>>
>> As we can see, OCR extracts S as 8 even after pre-processing and text
>> detection.
>>
>> Is there any way we can overcome this problem?
>>
>> *Tesseract Version*: tesseract 5.1.0-32-gf36c0
>>
>> Note: I asked the same question in the pytesseract GitHub repo and was
>> advised to post it here.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

