Hi Zdenko,

Thanks. Your insights have been instrumental in helping me grasp the
concepts behind Tesseract.

I've been experimenting with various thresholding methods, such as Otsu
(0), LeptonicaOtsu (1), and Sauvola (2), and I've noticed that they yield
distinct outcomes when applied to my images. It seems that I might need to
develop custom preprocessing procedures tailored to the images (webpage
screenshots) before passing them to Tesseract.
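
For reference, a minimal sketch of how the three methods can be compared
from Python. It assumes Tesseract 5 is on the PATH and that the method is
selected through the `thresholding_method` config variable passed with
`-c`. The file name `screenshot.png` and the helper names `build_command`
and `run_ocr` are hypothetical, chosen only for illustration.

```python
import subprocess

# Method names and numeric values as listed above (assumed to map onto
# Tesseract 5's `thresholding_method` config variable).
THRESHOLDING_METHODS = {"Otsu": 0, "LeptonicaOtsu": 1, "Sauvola": 2}


def build_command(image_path, method, lang="eng", psm=3):
    """Return the argv list for one tesseract run that writes text to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--psm", str(psm),
        "-c", f"thresholding_method={THRESHOLDING_METHODS[method]}",
    ]


def run_ocr(image_path, method):
    """Run tesseract once and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_command(image_path, method),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `run_ocr("screenshot.png", name)` for each name in
`THRESHOLDING_METHODS` and comparing the returned text side by side shows
which method suits a given screenshot.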

Your guidance and suggestions are highly appreciated.


Best,

Haitao


On Mon, Jan 22, 2024 at 10:02 PM Zdenko Podobny <zde...@gmail.com> wrote:

> Hi,
>
> The most critical part is this:
> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, but I need
> to stress: Tesseract is an OCR *engine*, not an OCR *suite*.
> Unless your input page is a book page scan without a difficult
> structure, you need to do your part, such as image processing and
> document segmentation (detection of text blocks).
>
> This is the reason why you get "unsatisfactory" results if you send
> complicated images with non-uniform text, graphics, etc.
> However, if you use only the text part of the image for recognition, you
> can get very good results.
>
> Best regards,
>
> Zdenko
>
>
> On Mon, Jan 22, 2024 at 19:42 L ht <lhtao0...@gmail.com> wrote:
>
>> Hi Zdenko,
>>
>> Thanks for your response.
>> I have read the Tesseract User Manual (
>> https://tesseract-ocr.github.io/tessdoc/), but I have not read the code.
>>
>> I tried both tessdata_best and tessdata, and tried different values of
>> --psm, but I still cannot get more detections.
>>
>> To provide some context, when I applied Tesseract to the entire image, it
>> managed to identify only a few words, such as "Log in," "Username,"
>> "Password," and "Cancel," primarily within the central, well-lit portion.
>> However, when I cropped the image to retain either the upper or left
>> portions, Tesseract exhibited improved performance, successfully detecting
>> numerous words in those respective areas.
>>
>> Best,
>> Haitao
>>
>> On Sun, Jan 21, 2024 at 3:02 AM Zdenko Podobny <zde...@gmail.com> wrote:
>>
>>> Did you read the documentation or did you just set your expectations?
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sun, Jan 21, 2024 at 12:00 L ht <lhtao0...@gmail.com> wrote:
>>>
>>>> I am new to Tesseract, and I found that it does not work as expected.
>>>> I have attached one example.
>>>>
>>>> tesseract 5.3.2
>>>> tesseract 272525030292764523137280353496213864766.png - -l eng --psm 3 quiet
>>>> detects only these words:
>>>> "Log in
>>>> Username
>>>> Password
>>>> Cancel"
>>>>
>>>> I submitted this picture to several online image-to-text converters.
>>>> They work well, detecting most of the text in the picture.
>>>> For example, https://www.imagetotext.info/ claims that it uses
>>>> Tesseract.
>>>>
>>>> I am not sure whether I am using Tesseract correctly.
>>>> Could someone else test this picture and share their detection
>>>> result?
>>>> Thanks
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/e95fa7c6-7afb-4a08-8b11-a63a024c3c9bn%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/e95fa7c6-7afb-4a08-8b11-a63a024c3c9bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

