Re: [tesseract-ocr] Tesseract on python code

2021-11-22 Thread Zdenko Podobny
OCR of source code with tesseract is a problem: - tesseract is not focused on keeping spaces/indentation - you have to reconstruct it yourself (e.g. by parsing hOCR output) - tesseract is focused more on "real" text, while source code is more symbolic with a lot of extra characters,
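
As a rough illustration of the "reconstruct it yourself" point, here is a minimal Python sketch that estimates indentation from the left edge of each line's bounding box in hOCR output. It is only a sketch under stated assumptions: the class names `ocr_line` and `ocrx_word` and the `bbox x0 y0 x1 y1` title format follow the hOCR convention, while `HocrLines`, `reconstruct`, `SAMPLE`, and the `char_width` parameter (average glyph width in pixels, which you would have to estimate for your own images) are hypothetical names introduced here, not part of tesseract.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 40 10 300 30; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collect [left_x, words] for each ocr_line element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one [left_x, [word, ...]] entry per line
        self._in_word = False  # True while inside an ocrx_word element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self.lines.append([int(match.group(1)), []])
        elif "ocrx_word" in classes:
            self._in_word = True

    def handle_endtag(self, tag):
        self._in_word = False

    def handle_data(self, data):
        if self._in_word and self.lines and data.strip():
            self.lines[-1][1].append(data.strip())


def reconstruct(hocr, char_width):
    """Rebuild text, turning each line's x offset into leading spaces.

    char_width is an assumed average character width in pixels.
    """
    parser = HocrLines()
    parser.feed(hocr)
    parser.close()
    lines = [(x, " ".join(words)) for x, words in parser.lines if words]
    if not lines:
        return ""
    margin = min(x for x, _ in lines)  # leftmost line defines column 0
    out = []
    for x, text in lines:
        indent = round((x - margin) / char_width)
        out.append(" " * indent + text)
    return "\n".join(out)


# Hand-written hOCR fragment for illustration only (not real tesseract output).
SAMPLE = """
<div class='ocr_page'>
 <span class='ocr_line' title='bbox 40 10 300 30; baseline 0 -4'>
  <span class='ocrx_word' title='bbox 40 10 80 30; x_wconf 95'>def</span>
  <span class='ocrx_word' title='bbox 90 10 300 30; x_wconf 93'>f(x):</span>
 </span>
 <span class='ocr_line' title='bbox 80 40 300 60; baseline 0 -4'>
  <span class='ocrx_word' title='bbox 80 40 150 60; x_wconf 94'>return</span>
  <span class='ocrx_word' title='bbox 160 40 200 60; x_wconf 96'>x</span>
 </span>
</div>
"""

print(reconstruct(SAMPLE, char_width=10))
```

This only recovers leading indentation. Spacing inside a line and misrecognised symbols would still need separate handling, and lines tagged with other hOCR line classes are ignored by this sketch.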

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread Zdenko Podobny
Here is some simple code that works for me (with tesseract 5 and leptonica 1.82) #include #include #include #include int main() { const char* datapath = "f:/Project-Personal/tessdata_best/tessdata"; std::string language_ = "eng"; std::string inputFile_ = "input.png"; const char*

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread Zdenko Podobny
This is my old snippet, so part of the code (opening the input image as PIX) is unnecessary for PDF rendering. Zdenko On 22 Nov 2021 at 14:28, Zdenko Podobny wrote: > Here is a simple code, that works for me (with tesseract 5 and leptonica > 1.82) > > #include > #include > #include > #inclu

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread Sarah Jane CHANNEL
this code can read text? On Mon, 22 Nov 2021, 21:28 Zdenko Podobny, wrote: > Here is a simple code, that works for me (with tesseract 5 and leptonica > 1.82) > > #include > #include > #include > #include > > int main() { > const char* datapath = "f:/Project-Personal/tessdata_best/tessdat

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread Zdenko Podobny
I do not understand your question: how is it related to the topic under discussion? Zdenko On 22 Nov 2021 at 14:34, Sarah Jane CHANNEL wrote: > this code can read text? > > On Mon, 22 Nov 2021, 21:28 Zdenko Podobny, wrote: > >> Here is a simple code, that works for me (with tesseract 5 and lepto

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread 'blaumedia' via tesseract-ocr
It works! I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.8.2, and with that both my code and yours worked flawlessly. It seems like 4.1.3 has a bug in it that has been fixed in 4.1.3. I hadn't tested 5.0 because I thought it would be more unstable. I additionally tested 4.1.3 + leptonica 1.8.2 (was on 1.

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread 'blaumedia' via tesseract-ocr
Hey zdenop, it turns out I can't rely on 5.0.0, because OpenCV seems to be compatible only with 4.x so far (OpenCV is another requirement of my project). Does your script from above work on tesseract 4.x for you? blaumedia wrote on Monday, 22 November 2021 at 18:51:38 UTC+1: > It works! > > I tr

Re: [tesseract-ocr] Tesseract on python code

2021-11-22 Thread J S
Thanks a lot Zdenko, I am disappointed but that's life :-( On Monday, 22 November 2021 at 12:42:23 UTC+1, zdenop wrote: > OCR of source code with tesseract is a problem: > >- tesseract is not focused on keeping spaces/indentation - you have to >reconstruct it by yourself (e.g. by parsin

Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

2021-11-22 Thread Zdenko Podobny
Hello, yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR there has been no change in the behaviour of the renderers (including TessPDFRenderer) since the 4.0-beta version. Also, I do not understand your problem with OpenCV - AFAIK tesseract is only an optional dependency and it uses only very