Tesseract is an OCR engine, so remove the graphic elements yourself and
send only the text areas to OCR.

Zdenko
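
A minimal sketch of the advice above, assuming the text rectangle is already
known and the image is available as a raw RGBA pixel buffer. It crops to that
rectangle and turns coloured pixels into plain black or white before OCR. The
names Rect, GrayImage and cropAndBinarize and the fixed threshold of 128 are
hypothetical illustrations; they are not part of Tesseract or tesseract.js.

```typescript
// Hypothetical helper types; not part of Tesseract or tesseract.js.
interface Rect { left: number; top: number; width: number; height: number; }
interface GrayImage { width: number; height: number; data: Uint8Array; }

// Crop an RGBA buffer to `rect` and binarize it:
// pixels darker than `threshold` become 0 (black), all others 255 (white).
function cropAndBinarize(
  rgba: Uint8Array,
  imageWidth: number,
  rect: Rect,
  threshold: number = 128 // assumed value; real images may need tuning
): GrayImage {
  const out = new Uint8Array(rect.width * rect.height);
  for (let y = 0; y < rect.height; y++) {
    for (let x = 0; x < rect.width; x++) {
      // Index of the source pixel (4 bytes per pixel: R, G, B, A).
      const i = ((rect.top + y) * imageWidth + (rect.left + x)) * 4;
      // Rec. 601 luma approximation of perceived brightness.
      const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
      out[y * rect.width + x] = luma < threshold ? 0 : 255;
    }
  }
  return { width: rect.width, height: rect.height, data: out };
}
```

The resulting buffer would still have to be encoded into an image format that
the OCR library accepts; that step is left out here.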


On Tue, 20 Apr 2021 at 10:40, Soul Green <soulu...@gmail.com> wrote:

> Omg thanks.
> I hadn't thought about checking *that* documentation. I've been using
> tesseract.js with node so I completely forgot that it was based on
> something else. How amateur.
> I also didn't know that tesseract did its own processing as well.
> Thanks again, I'll try everything there.
> On Tuesday, 20 April 2021 at 5:14:56 pm UTC+10 zdenop wrote:
>
>> Hint: read the documentation and stop guessing. You can start here:
>> https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md
>>
>> Zdenko
>>
>>
>> On Tue, 20 Apr 2021 at 9:11, Soul Green <soul...@gmail.com> wrote:
>>
>>> I am very new to coding so forgive me.
>>>
>>> I have been having an extremely low success rate with tesseract.
>>> Here are 3 examples, both pre- and post-processing:
>>>
>>> [image: red1.jpg][image: croppedred1.jpg]
>>> [image: yellow1.jpg][image: croppedyellow1.jpg]
>>> [image: blue1.jpg][image: croppedblue1.jpg]
>>> These were scanned as "a", "Ss30", and "moh" respectively.
>>> I consider the yellow one a success, as I can just regex the 30 out of
>>> the result, but I still don't understand how it could be so off for the
>>> rest.
>>>
>>> I've tried different traineddatas, even including one that I trained
>>> myself on over 200 data examples.
>>>
>>> I have three theories as to why I couldn't train it:
>>> 1. The different colours are processed differently, causing differently
>>> shaped characters. (Red looks bold and yellow looks thin)
>>> 2. The different sizes of the images cause the characters to be
>>> slightly differently shaped when cropped.
>>> 3. Tesseract assumes that the two lines of text are one, and reads them
>>> together.
>>>
>>> Could someone please give me a hint on what to try? I don't want to
>>> spend another day training it on just blue ones (for example) only to find
>>> that colour isn't the problem.
>>> Thanks
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/9d819bc5-cf07-4c28-91a6-61b142ccc324n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/9d819bc5-cf07-4c28-91a6-61b142ccc324n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
