Re: [tesseract-ocr] Tessarct won't recognise single characters

Iain Downs Sun, 14 Jul 2024 06:21:22 -0700

For those interested, the C# NuGet package Tesseract.OCR also ignores the 
page numbers in a simple test program.  The possibly slightly older and 
better-known C# package Tesseract does not load properly from NuGet.  That 
is probably something I'm doing, but I can't imagine what!


Iain

On Sunday, July 14, 2024 at 8:20:59 AM UTC+1 Iain Downs wrote:

> I have FINALLY got the C++ samples working in Visual Studio 2022. The code 
> I am using is the first Tesseract sample from here 
> <https://tesseract-ocr.github.io/tessdoc/Examples_C++.html> .
>
> Bizarrely, this simple code finds the page numbers at the bottom of the 
> page perfectly happily, whereas the tesseract executable did not.  This is 
> good news - though confusing...
>
> Thanks to all for your input on this - I think for the moment I'm enough 
> ahead that I can call this issue closed.  I will be seeing if I can 
> replicate this in C#, which is a more productive environment for me than C++.
>
> Iain
>
>
> On Sunday, July 14, 2024 at 7:47:47 AM UTC+1 Iain Downs wrote:
>
>> Apologies.  The Python file is in the Google Groups thread, but for some 
>> reason it didn’t come down with the email.
>>
>>  
>>
>> Also, I now have a sample program (nearly) working in C++.  My last step 
>> was to copy all the DLLs from the vcpkg install into the source directory; 
>> otherwise they weren’t found when running.  I’m left with setting the 
>> location of the language file and it should work.  But the Python will be 
>> helpful nonetheless.
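
On the language-file step, a minimal Python sketch of the check involved. It 
assumes the usual Tesseract 4/5 convention that the TESSDATA_PREFIX 
environment variable names the directory holding the `.traineddata` files; 
the helper name `configure_tessdata` and the example path are hypothetical, 
and nothing here was confirmed in this thread.

```python
import os
from pathlib import Path


def configure_tessdata(tessdata_dir, lang="eng"):
    """Point TESSDATA_PREFIX at tessdata_dir after checking the model exists.

    Assumption: TESSDATA_PREFIX should be the directory that directly
    contains <lang>.traineddata (Tesseract 4/5 convention).
    """
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        # Fail early with a clear message instead of a load error later.
        raise FileNotFoundError(f"missing language file: {model}")
    # Only affects this process and any child processes it starts.
    os.environ["TESSDATA_PREFIX"] = str(Path(tessdata_dir))
    return model
```

For example, `configure_tessdata(r"C:\some\path\tessdata")` (a made-up path) 
would raise if `eng.traineddata` is not in that folder.  In the C++ API the 
same directory can instead be passed as the first argument to `Init`.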
>>
>>  
>>
>> Iain
>>
>>  
>>
>> *From:* [email protected] [mailto:[email protected]] *On 
>> Behalf Of *Dominic Mukilan
>> *Sent:* 13 July 2024 17:42
>> *To:* [email protected]
>> *Subject:* Re: [tesseract-ocr] Tessarct won't recognise single characters
>>
>>  
>>
>> Attaching the python file, the supporting files, and requirements.txt
>>
>>  
>>
>> On Sat, 13 Jul 2024 at 21:56, Iain Downs <[email protected]> wrote:
>>
>> Can you give me some example code?  I'm currently trying to get Tesseract 
>> working for C++ in Visual Studio and it's a bit of a nightmare.  Python 
>> seems easier, though it's not one of my main languages - I can try it out 
>> though!
>>
>>  
>>
>> Iain
>>
>> On Saturday, July 13, 2024 at 11:20:54 AM UTC+1 [email protected] wrote:
>>
>> Hi,
>>
>> I tried your example with Tesseract for Python and it works well.
>>
>>  
>>
>> On Thu, 11 Jul 2024 at 20:35, Iain Downs <[email protected]> wrote:
>>
>> I'm trying to extract page numbers from scanned pages of text.  Page 
>> numbers are either at the top or at the bottom - sometimes with titles / 
>> authors / chapters.  Occasionally they are elsewhere, but I don't care 
>> about the exceptions.
>>
>>  
>>
>> I've installed Tesseract 5.4 (Windows) and run some tests using the 
>> executable.  I'm finding that if the page number is a single digit on the 
>> line, tesseract ignores it (but otherwise does a fantastic job of OCR even 
>> with skewed and noisy images).
>>
>>  
>>
>> I've isolated the single line and used that as input, and tesseract tells 
>> me 'the page is empty'.
>>
>>  
>>
>> Here is a sample of a single line with a '1' in it; the resolution is 300 dpi.
>>
>> [image: Image removed by sender. 101_bottom.jpg]
>>
>>  
>>
>> Ultimately I would be writing a program using tesseract, but in the first 
>> instance I'd like to see it work with the exe.
>>
>>  
>>
>> So, can I tell tesseract to be less fussy with individual characters and, 
>> if not, how would I do so programmatically - if possible?
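
One thing worth trying here, offered as a sketch and not as something 
confirmed in this thread: the executable defaults to page segmentation mode 
3 (fully automatic), and the `--psm` option can tell it to treat the image 
as a single text line (7) or a single character (10) instead.  The Python 
below just builds and runs that command line; the function names are 
hypothetical, and the digit whitelist is an assumption based on only page 
numbers being wanted.

```python
import shutil
import subprocess

# Page segmentation modes as listed by `tesseract --help-psm`.
PSM_SINGLE_LINE = 7   # treat the image as a single text line
PSM_SINGLE_CHAR = 10  # treat the image as a single character


def build_tesseract_command(image_path, psm=PSM_SINGLE_LINE, digits_only=True):
    """Return the argument list for one tesseract run that prints to stdout."""
    command = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    if digits_only:
        # Assumption: only digits are wanted, so restrict the output set.
        command += ["-c", "tessedit_char_whitelist=0123456789"]
    return command


def read_page_number(image_path, psm=PSM_SINGLE_LINE):
    """Run tesseract on a cropped header/footer strip and return its text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For the sample above, `read_page_number("101_bottom.jpg")` would run 
`tesseract 101_bottom.jpg stdout --psm 7 -c tessedit_char_whitelist=0123456789`. 
Whether that recovers the '1' on this particular image is untested.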
>>
>>  
>>
>> Thanks
>>
>>  
>>
>> Iain
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>>
>

