Have you tried compiling/building the example code that comes with tesseract?
That should give some reasonable initial results. I can't comment on the autoconf or cmake stuff that comes with tesseract, as I have my own C/C++ build rig for MSVC, but the only real nuisance, as far as I am concerned, is the pango lib, and you don't need that unless you want the training tool text2image to work as well.

Also, RTFC'ing the tesseract CLI source file itself might help, but, yeah, it ain't for rookies, shall we say. If you haven't got experience with other large "technical debt" codebases, then I can fully understand that it isn't easy to get tesseract to complete building.

On Sat, 13 Jul 2024, 18:26 Iain Downs, <i...@idcl.co.uk> wrote:

> Can you give me some example code? I'm currently trying to get tesseract
> working for C++ in Visual Studio and it's a bit of a nightmare. Python
> seems easier, though it's not one of my main languages. I can try it out,
> though!
>
> Iain
>
> On Saturday, July 13, 2024 at 11:20:54 AM UTC+1 renec...@gmail.com wrote:
>
>> Hi,
>> I tried your example with tesseract for Python and it works well.
>>
>> On Thu, 11 Jul 2024 at 20:35, Iain Downs <ia...@idcl.co.uk> wrote:
>>
>>> I'm trying to extract page numbers from scanned pages of text. Page
>>> numbers are either at the top or at the bottom, sometimes with titles /
>>> authors / chapters. Occasionally elsewhere, but I don't care about the
>>> exceptions.
>>>
>>> I've loaded tesseract 5.4 (Windows) and run some tests using the
>>> executable. I'm finding that if the page number is a single digit on the
>>> line, tesseract ignores it (but otherwise does a fantastic job of OCR,
>>> even with skewed and noisy images).
>>>
>>> I've isolated the single line and used that as input, and tesseract
>>> tells me 'the page is empty'.
>>>
>>> Here is a sample of a single line with a '1' in it; resolution is 300 dpi.
>>> [image: 101_bottom.jpg]
>>>
>>> Ultimately I would be writing a program using tesseract, but in the
>>> first instance I'd like to see it work with the exe.
>>>
>>> So, can I tell tesseract to be less fussy with individual characters and,
>>> if not, how would I do so programmatically, if possible?
>>>
>>> Thanks
>>>
>>> Iain
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com
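
For the single-digit line in the original question, one thing that may be worth trying with the exe before any C++ work is forcing the page segmentation mode, since the default mode expects a full page of text. Below is a minimal sketch in Python that only builds the command line, using the standard CLI options `--psm`, `--dpi` and `-c tessedit_char_whitelist=...`. The helper names `build_tesseract_args` and `ocr_line` are hypothetical, and the choice of `--psm 7` (single text line) plus a digits-only whitelist is an assumption that has not been checked against the sample image.

```python
import shutil
import subprocess


def build_tesseract_args(image_path, psm=7, dpi=300, whitelist="0123456789"):
    """Build the argv list for the tesseract CLI; nothing is executed here.

    psm=7 asks tesseract to treat the image as a single text line
    (psm=10 would treat it as a single character). These defaults are
    guesses for a cropped page-number line, not verified settings.
    """
    args = ["tesseract", image_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]
    if whitelist:
        # Restrict recognition to the given characters (digits by default).
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


def ocr_line(image_path, **kwargs):
    """Run tesseract if it is on PATH and return the stripped text, else None."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_args(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell and tried by hand.
    print(" ".join(build_tesseract_args("101_bottom.jpg")))
```

If the printed command gives a sensible result by hand, the same two settings map onto the C++ API as a page segmentation mode and a variable, which is what the CLI source does internally.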