Re: [tesseract-ocr] The pictures captured by the camera did not identify well after preprocessing

Zdenko Podobny Thu, 16 Sep 2021 07:57:53 -0700

A few hints:

   1. Use a format other than JPG for images you want to OCR.
   2. Try to take images at a higher resolution (e.g. so there is clear
   space between letters).
   3. Use greyscale colors.
   4. Use a white (light) background.
   5. For non-textual information (not real words, e.g. codes), the legacy
   engine works better (LSTM tends to "see words").
   6. Try to pass Tesseract homogeneous blocks (lines, paragraphs).
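
As a rough illustration of hints 3 and 4, here is a minimal Python sketch that converts pixels to greyscale and inverts the image when the background looks dark. It is not taken from any Tesseract tool: the function names are hypothetical, pixels are plain (R, G, B) tuples so that no imaging library is needed, and the "mean below 128 means dark background" rule is an assumption that only holds when the background covers most of the image.

```python
# Sketch only: pixels are plain (R, G, B) tuples, no imaging library required.

def to_grey(rgb):
    """Convert one (R, G, B) pixel to a 0-255 grey level (ITU-R BT.601 weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def grey_on_light_background(rows):
    """Return greyscale rows, inverted if the image is mostly dark.

    Assumption: the background covers more than half of the pixels, so a
    mean grey level below 128 is read as "dark background, light text".
    """
    grey = [[to_grey(px) for px in row] for row in rows]
    total = sum(sum(row) for row in grey)
    count = sum(len(row) for row in grey)
    if count and total / count < 128:
        # Dark background: flip every level so the text becomes dark on light.
        grey = [[255 - g for g in row] for row in grey]
    return grey


# Hypothetical 2x3 image: black background with a single white "ink" pixel.
dark = [[(0, 0, 0), (255, 255, 255), (0, 0, 0)],
        [(0, 0, 0), (0, 0, 0), (0, 0, 0)]]
print(grey_on_light_background(dark))  # [[255, 0, 255], [255, 255, 255]]
```

With a real camera image you would read the pixels with an imaging library and save the result in a lossless format (hint 1) before handing it to Tesseract.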


In my opinion, you need to expect that OCR results will not be 100%
accurate in cases like this. Maybe training would help (for the legacy
engine), but I would focus first on the hints mentioned above.
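
To make hints 5 and 6 concrete, here is a small Python sketch that assembles a Tesseract command line for one cropped, homogeneous block, with the engine (--oem) and page segmentation mode (--psm) chosen per block. The helper name build_tesseract_cmd and its parameters are hypothetical; the flags themselves (-l, --psm, --oem, -c tessedit_char_blacklist=...) are regular Tesseract options. Note that, as far as I know, --oem 0 only works with a traineddata file that still contains the legacy model.

```python
import shlex


def build_tesseract_cmd(image, psm, oem, lang=None, blacklist=None):
    """Build an argv list that prints the OCR result to stdout ("-").

    Hypothetical helper: psm 6 treats the image as one uniform block of
    text, psm 8 as a single word; oem 0 selects the legacy engine.
    """
    cmd = ["tesseract", image, "-"]
    if lang:
        cmd += ["-l", lang]
    cmd += ["--psm", str(psm), "--oem", str(oem)]
    if blacklist:
        # Characters that should never appear in the output.
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return cmd


# One command per cropped region, using the file names from this mail.
for image, psm in [("camera_part1.png", 6),
                   ("camera_part2a.png", 8),
                   ("camera_part2b.png", 8)]:
    print(shlex.join(build_tesseract_cmd(image, psm=psm, oem=0)))
```

If Tesseract is installed, each list can be passed to subprocess.run(cmd, capture_output=True, text=True) to collect the recognized text.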

[image: camera_part1.png]
> tesseract camera_part1.png - --psm 6 --oem 0
ACBEDFHGIKJLNTHOP
RQSUTVXWYaZbdcef

[image: camera_part2a.png]
> tesseract camera_part2a.png - --psm 8 --oem 0
sonppppPPPFFFppp

[image: camera_part2b.png]
> tesseract camera_part2b.png - --psm 8 --oem 0
"&*()+—,.:;<>=?/


Zdenko


On Thu, 16 Sep 2021 at 10:31, vis li <liwei9...@gmail.com> wrote:

> Thanks for your answer.
> The text in the picture is a test case. The reason I use this test
> case is that the actual text is produced by an STM32 microcontroller.
> It produces text like "E2PROM ADDR6". The text itself may not be normal
> language ...
> 'zth' is the library I trained with the Microsoft YaHei Standard font.
> I have also used the eng library,
> which is the official language data file downloaded for the corresponding
> version of Tesseract.
> It was not as accurate as the one I trained myself.
>
>
> On Thursday, 16 September 2021 at 4:04:36 PM UTC+8, <Lorenzo Blz> wrote:
>
>> Hi Vli,
>> I think you should test this on something similar to your actual text,
>> not on the alphabet or random strings.  With real text you are not going to
>> see () or <> that may be mistaken for an O.
>>
>> The sequence of characters may influence the output, in other words try
>> it on real text. You can also blacklist the characters you do not need.
>>
>> To be honest, the result does not seem bad to me. Special characters are
>> the most difficult ones.
>>
>> Also this font is not easy to read, look at the M letter for example. If
>> you can, change the font or try to capture the image at higher resolution
>> before cleaning it.
>>
>> What language is zth? This looks like Latin text. Did you try eng?
>>
>>
>> Lorenzo
>>
>> On Thu, 16 Sep 2021 at 07:59, vis li <liwe...@gmail.com> wrote:
>>
>>> Tesseract version: 4.1.1
>>> Platform: Windows 10
>>>
>>> <https://user-images.githubusercontent.com/51877381/133545017-12e2b715-be45-4198-8035-9838c5375ea9.png>[image:
>>> testa.png]
>>>
>>> <https://user-images.githubusercontent.com/51877381/133545026-66cdd822-6885-4561-aa8c-d13496573a62.png>[image:
>>> testb.png]
>>> Page.getText():
>>>
>>> ACBEDFHGIKJLNHOP
>>> RQSUTV¥WYaZbdcef
>>>
>>> 1ppp000012121010
>>> &*(O+-,.:; O=%/
>>>
>>> As shown above, the result has some errors.
>>> I know that my image has some defects, but how can I improve this
>>> situation?
>>> I have binarized the picture and tried to raise the DPI to 300.
>>> Because the pictures are captured by a camera, I am worried whether they
>>> can meet the standard for web pictures.
>>>
>>> I have used LSTM mode, and my recognition language data file was trained
>>> with LSTM and the Microsoft YaHei Standard font.
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/96ce0479-bc22-477d-9d5b-a6408509121fn%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/96ce0479-bc22-477d-9d5b-a6408509121fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

