Re: [tesseract-ocr] The pictures captured by the camera did not identify well after preprocessing

vis li Thu, 16 Sep 2021 23:27:57 -0700

Thanks for your suggestions.
I will try some of them to improve my program and hardware.
The results after your image processing look much clearer than the images I provided,
which clearly helps Tesseract identify the text correctly.
Because the picture quality I can provide is not very high, I do not
require 100% accuracy;
however, I hope the recognition results can be close to the level you achieved and
stable.
Thank you very much. Sorry for the late reply; I will need some time to
verify the results.


Liwei

On Thursday, September 16, 2021 at 10:57:55 PM UTC+8, zdenop wrote:

> A few hints:
>
>    1. Use a format other than JPEG if you want to OCR the image.
>    2. Try to take images at a higher resolution (e.g. so that there is clear
>    space between letters).
>    3. Use greyscale images.
>    4. Use a white (light) background.
>    5. For non-textual information (not real words, e.g. code), the legacy engine
>    works better (LSTM tends to "see words").
>    6. Try to pass Tesseract homogeneous blocks (lines, paragraphs).
>
> In my opinion, you need to expect that OCR results will not be 100% accurate in
> cases like this. Maybe training would help (for the legacy engine), but I
> would focus first on the hints mentioned above.
>
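Hints 3 and 4 can be illustrated with a minimal, dependency-free sketch. It is not part of the original reply. It assumes the image is already loaded as rows of (r, g, b) tuples in the range 0-255, and the function names, the BT.601 luma weights, and the "mean below 128 means dark background" heuristic are all illustrative assumptions. In practice an imaging library would do this work, and the result would be saved in a lossless format such as PNG (hint 1).

```python
from statistics import mean

# BT.601 luma weights: one common choice for RGB -> grey conversion (assumption).
LUMA = (0.299, 0.587, 0.114)


def to_greyscale(rgb_rows):
    """Convert rows of (r, g, b) tuples (0-255) to rows of grey levels (0-255)."""
    return [
        [round(r * LUMA[0] + g * LUMA[1] + b * LUMA[2]) for r, g, b in row]
        for row in rgb_rows
    ]


def ensure_light_background(grey_rows):
    """Invert the image when it is mostly dark.

    Heuristic (assumption): the background covers most of the image, so a
    mean grey level below 128 suggests light text on a dark background.
    """
    flat = [p for row in grey_rows for p in row]
    if flat and mean(flat) < 128:
        return [[255 - p for p in row] for row in grey_rows]
    return grey_rows
```

For example, `ensure_light_background(to_greyscale(rows))` would turn a light-on-dark display capture into dark text on a light background before it is passed to Tesseract.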
> [image: camera_part1.png]
> > tesseract camera_part1.png - --psm 6 --oem 0
> ACBEDFHGIKJLNTHOP
> RQSUTVXWYaZbdcef
>
> [image: camera_part2a.png]
> > tesseract camera_part2a.png - --psm 8 --oem 0
> sonppppPPPFFFppp
>
> [image: camera_part2b.png]
> > tesseract camera_part2b.png - --psm 8 --oem 0
> "&*()+—,.:;<>=?/
>
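The three invocations above differ only in the image name and the page segmentation mode. As a rough sketch (not from the original message), the same command lines could be assembled from Python; the helper name `tesseract_command` is hypothetical, and only the flags already shown above (`--psm`, `--oem`, `-` for standard output) plus the standard `-l` language option are used.

```python
import shlex


def tesseract_command(image, psm, oem, lang=None):
    """Build an argument list equivalent to the command lines shown above.

    image: path to the input image
    psm:   page segmentation mode (6 and 8 are the values used above)
    oem:   OCR engine mode (0 selects the legacy engine)
    lang:  optional language name passed with -l
    """
    cmd = ["tesseract", image, "-", "--psm", str(psm), "--oem", str(oem)]
    if lang:
        cmd += ["-l", lang]
    return cmd


# Render the list as a shell-style string for inspection.
command_line = shlex.join(tesseract_command("camera_part1.png", 6, 0))
```

The list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Tesseract is installed. Note that `--oem 0` only works when the traineddata file in use contains the legacy model.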
>
> Zdenko
>
>
> On Thu, 16 Sep 2021 at 10:31, vis li <liwe...@gmail.com> wrote:
>
>> Thanks for your answer.
>> The text in the picture is a test case. The reason I use this test
>> case is that the actual text is produced by an STM32 microcontroller,
>> which produces text like "E2PROM ADDR6". The text itself may not be
>> ordinary language.
>> 'zth' is the language data I trained myself with the Microsoft YaHei Standard font.
>> I have also tried the eng language data,
>> which is the official file downloaded for the corresponding
>> version of Tesseract.
>> It was not as accurate as the one I trained myself.
>>
>>
>> On Thursday, September 16, 2021 at 4:04:36 PM UTC+8, Lorenzo Blz wrote:
>>
>>> Hi Vli,
>>> I think you should test this on something similar to your actual text,
>>> not on the alphabet or random strings. With real text you are not going to
>>> see () or <>, which may be mistaken for an O.
>>>
>>> The sequence of characters may influence the output; in other words, try
>>> it on real text. You can also blacklist the characters you do not need.
>>>
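As a hedged illustration of the blacklist suggestion (not part of the original message): Tesseract accepts configuration variables through `-c`, and `tessedit_char_blacklist` is the variable that excludes characters from the output. The helper name `blacklist_args` and the choice of characters below are assumptions made for the example.

```python
def blacklist_args(chars):
    """Return the -c option that asks Tesseract not to output the given characters."""
    return ["-c", "tessedit_char_blacklist=" + chars]


# Example: exclude brackets that were being confused with the letter O.
# The image name and language are placeholders taken from the thread.
cmd = ["tesseract", "testb.png", "-", "-l", "eng"] + blacklist_args("()<>")
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems with characters such as `<` and `>`.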
>>> To be honest, the result does not seem bad to me. Special characters are 
>>> the most difficult ones.
>>>
>>> Also, this font is not easy to read; look at the letter M, for example. If
>>> you can, change the font or try to capture the image at a higher resolution
>>> before cleaning it.
>>>
>>> What language is zth? This looks like Latin text; did you try eng?
>>>
>>>
>>> Lorenzo
>>>
>>> On Thu, 16 Sep 2021 at 07:59, vis li <liwe...@gmail.com> wrote:
>>>
>>>> Tesseract Version: 4.1.1
>>>> Platform: Windows 10
>>>>
>>>> [image: testa.png]
>>>> <https://user-images.githubusercontent.com/51877381/133545017-12e2b715-be45-4198-8035-9838c5375ea9.png>
>>>>
>>>> [image: testb.png]
>>>> <https://user-images.githubusercontent.com/51877381/133545026-66cdd822-6885-4561-aa8c-d13496573a62.png>
>>>>
>>>> Page.getText():
>>>>
>>>> ACBEDFHGIKJLNHOP
>>>> RQSUTV¥WYaZbdcef
>>>>
>>>> 1ppp000012121010
>>>> &*(O+-,.:; O=%/
>>>>
>>>> As shown, the result has some errors.
>>>> I know that my images have some defects, but how can I improve the
>>>> situation?
>>>> I have binarized the picture and tried to raise the DPI to
>>>> 300.
>>>> Because the pictures are captured by a camera, I am worried about whether
>>>> they meet the standard for web pictures.
>>>>
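The message does not say which binarization method was used. One common choice is Otsu's method, sketched here in plain Python purely as an illustration; the function names are hypothetical, and the input is assumed to be greyscale rows of integers in the range 0-255.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximizes between-class variance.

    Pixels <= t form one class and pixels > t form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(grey_rows):
    """Map each pixel to 0 (dark) or 255 (light) using the Otsu threshold."""
    t = otsu_threshold([p for row in grey_rows for p in row])
    return [[255 if p > t else 0 for p in row] for row in grey_rows]
```

A single global threshold like this assumes fairly even lighting; camera captures with shadows or glare may need a locally adaptive method instead.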
>>>> I have used LSTM mode, and the language data file I use for recognition
>>>> was trained with LSTM and the Microsoft YaHei Standard font.
>>>>
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/96ce0479-bc22-477d-9d5b-a6408509121fn%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/96ce0479-bc22-477d-9d5b-a6408509121fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

