[tesseract-ocr] Re: Regarding for OCR guidance ...

Santhiya C Thu, 25 Jan 2024 04:08:15 -0800

Hi guys, I am about to start developing *OCR for image and PDF to text 
extraction*. What steps do I need to follow, and can you please point me to 
the best model? I have already used the pytesseract engine, but I did not 
get proper extraction ...


Best Regards,

Sandhiya
On Tuesday 23 January 2024 at 23:14:40 UTC+5:30 lhta...@gmail.com wrote:

> Hi Zdenko,
>
> Thanks. Your insights have been instrumental in helping me grasp the 
> concepts behind Tesseract.
>
> I've been experimenting with various thresholding methods, such as Otsu 
> (0), LeptonicaOtsu (1), and Sauvola (2), and I've noticed that they yield 
> distinct outcomes when applied to my images. It seems that I might need to 
> develop custom preprocessing procedures tailored to the images (webpage 
> screenshots) before passing them to Tesseract.
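
One way to compare the three thresholding methods named above on the same image is to run the tesseract command line once per method. This is only a minimal sketch: it assumes Tesseract 5.x is on the PATH and that the numbers 0, 1 and 2 are passed through the `thresholding_method` config variable with `-c`. The helper names `build_command` and `compare_methods` are made up for illustration.

```python
import subprocess

# Numbering as quoted in the message: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola.
THRESHOLDING_METHODS = {"otsu": 0, "leptonica_otsu": 1, "sauvola": 2}


def build_command(image_path, method, lang="eng", psm=3):
    """Return the tesseract argument list for one thresholding method.

    "stdout" as the output base makes tesseract print the text instead of
    writing a .txt file.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--psm", str(psm),
        "-c", f"thresholding_method={THRESHOLDING_METHODS[method]}",
    ]


def compare_methods(image_path):
    """Run OCR once per method and return {method name: recognized text}."""
    results = {}
    for method in THRESHOLDING_METHODS:
        proc = subprocess.run(
            build_command(image_path, method),
            capture_output=True, text=True, check=True,
        )
        results[method] = proc.stdout
    return results
```

Calling `compare_methods("screenshot.png")` (a placeholder file name) and counting the words returned per method gives a quick, rough indication of which binarization suits a given kind of screenshot.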
>
> Your guidance and suggestions are highly appreciated.
>
>
> Best,
>
> Haitao
>
>
> On Mon, Jan 22, 2024 at 10:02 PM Zdenko Podobny <zde...@gmail.com> wrote:
>
>> Hi,
>>
>> The most critical part is this: 
>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, but I need 
>> to stress: Tesseract is an OCR *engine*, not an OCR *suite*.
>> Unless your input is a book page scan without a difficult structure, you 
>> need to do your part, such as image processing and document segmentation 
>> (detection of text blocks).
>>
>> This is the reason why you get "unsatisfactory" results if you send 
>> complicated images with non-uniform text, graphics, etc.
>> However, if you use only the text part of the image for recognition, you 
>> can get very good results.
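
As a rough illustration of recognizing parts of an image instead of the whole screenshot, the sketch below splits the image into overlapping tiles and runs OCR on each tile separately. Fixed tiling is a crude stand-in for real text-block detection, not the segmentation described above. The helper names `tile_boxes` and `ocr_tiles` are hypothetical, and the OCR step assumes Pillow and pytesseract are installed.

```python
def tile_boxes(width, height, rows=2, cols=2, overlap=0.1):
    """Split a width x height image into rows x cols boxes.

    Each box is (left, top, right, bottom). Tiles are widened by `overlap`
    (a fraction of the tile size) so that text cut by a tile border still
    appears whole in at least one tile.
    """
    tile_w, tile_h = width / cols, height / rows
    pad_x, pad_y = tile_w * overlap, tile_h * overlap
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = max(0, int(c * tile_w - pad_x))
            top = max(0, int(r * tile_h - pad_y))
            right = min(width, int((c + 1) * tile_w + pad_x))
            bottom = min(height, int((r + 1) * tile_h + pad_y))
            boxes.append((left, top, right, bottom))
    return boxes


def ocr_tiles(image_path, rows=2, cols=2, psm=6):
    """OCR every tile on its own and return the list of recognized texts."""
    # Imported lazily so tile_boxes stays usable without these packages.
    from PIL import Image
    import pytesseract

    texts = []
    with Image.open(image_path) as image:
        for box in tile_boxes(image.width, image.height, rows, cols):
            tile = image.crop(box)
            # --psm 6 treats each tile as a single uniform block of text.
            texts.append(
                pytesseract.image_to_string(tile, lang="eng", config=f"--psm {psm}")
            )
    return texts
```

Because the tiles overlap, the same words can show up in two neighbouring results, so the output needs de-duplication before it is merged.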
>>
>> Best regards,
>>
>> Zdenko
>>
>>
>> On Mon, 22 Jan 2024 at 19:42, L ht <lhta...@gmail.com> wrote:
>>
>>> Hi Zdenko,
>>>
>>> Thanks for your response.
>>> I read the Tesseract User Manual (
>>> https://tesseract-ocr.github.io/tessdoc/), but I have not read the code.
>>>
>>> I tried both tessdata_best and tessdata, and different values of 
>>> --psm, but I still cannot get more detections.
>>>
>>> To provide some context, when I applied Tesseract to the entire image, 
>>> it managed to identify only a few words, such as "Log in," "Username," 
>>> "Password," and "Cancel," primarily within the central, well-lit portion. 
>>> However, when I cropped the image to retain either the upper or left 
>>> portions, Tesseract exhibited improved performance, successfully detecting 
>>> numerous words in those respective areas.
>>>
>>> Best,
>>> Haitao
>>>
>>> On Sun, Jan 21, 2024 at 3:02 AM Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> Did you read the documentation or did you just set your expectations?
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, 21 Jan 2024 at 12:00, L ht <lhta...@gmail.com> wrote:
>>>>
>>>>> I am new to Tesseract and found that it does not work as 
>>>>> expected. I attach one example.
>>>>>
>>>>> tesseract 5.3.2
>>>>> tesseract 272525030292764523137280353496213864766.png - -l eng --psm 3 quiet
>>>>> can only detect these words:
>>>>> "Log in
>>>>> Username
>>>>> Password
>>>>> Cancel"
>>>>>
>>>>> I submitted this picture to several online picture-to-text converters. 
>>>>> They work well, detecting most of the text in the picture.
>>>>> For example, https://www.imagetotext.info/, which claims to use 
>>>>> Tesseract.
>>>>>
>>>>> I am not sure whether I am using Tesseract correctly.
>>>>> Could someone else test this picture and share the detection 
>>>>> result?
>>>>> Thanks
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/e95fa7c6-7afb-4a08-8b11-a63a024c3c9bn%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/e95fa7c6-7afb-4a08-8b11-a63a024c3c9bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>

