Zdenop! Great news! :D I recompiled OpenCV on my machine and that resolved the problem. Now I can use v5.0.0 together with OpenCV without any issues. It seems OpenCV was linked against old libs in /usr/local/lib (it always searched for libtesseract.so.4, but that file did not exist because I only installed v5). It was probably an easy problem for a C developer, but like I said, I'm just an entry-level Golang developer.
So thank you very, very much!

zdenop wrote on Monday, 22 November 2021 at 22:27:47 UTC+1:

> Hello,
>
> yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR
> there has been no change in the behaviour of the renderers (including
> TessPDFRenderer) since the 4.0-beta version.
> Also, I did not understand your problem with OpenCV - AFAIK tesseract is
> only an optional dependency there, and OpenCV uses only very limited
> tesseract features [1].
> Because you will use tesseract directly for creating the PDF anyway, it
> does not make sense to care about old tesseract support in OpenCV.
>
> [1]
> https://docs.opencv.org/4.5.4/d7/ddc/classcv_1_1text_1_1OCRTesseract.html
>
> Zdenko
>
> On Mon, 22 Nov 2021 at 19:54, 'blaumedia' via tesseract-ocr
> <tesser...@googlegroups.com> wrote:
>
>> Hey zdenop,
>>
>> it turns out I can't rely on 5.0.0, because OpenCV seems to be compatible
>> only with 4.x so far (OpenCV is another requirement of my project).
>> Does your script from above work with tesseract 4.x for you?
>>
>> blaumedia wrote on Monday, 22 November 2021 at 18:51:38 UTC+1:
>>
>>> It works!
>>>
>>> I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.82, and with it both my
>>> code and yours worked flawlessly. It seems that 4.1.3 has a bug that
>>> has been fixed in 5.0.0. I hadn't tested 5.0 before because I thought
>>> it would be less stable.
>>> I also tested 4.1.3 + leptonica 1.82 (I was on 1.7.x before), and the
>>> problem with the corrupt PDF still exists. But that's not a problem, I
>>> will use 5.0.0 instead.
>>>
>>> Thank you zdenop!
>>>
>>> zdenop wrote on Monday, 22 November 2021 at 14:33:15 UTC+1:
>>>
>>>> This is an old snippet of mine, so part of the code (opening the
>>>> input image as PIX) is not needed for PDF rendering.
>>>>
>>>> Zdenko
>>>>
>>>> On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <zde...@gmail.com> wrote:
>>>>
>>>>> Here is some simple code that works for me (with tesseract 5 and
>>>>> leptonica 1.82):
>>>>>
>>>>> #include <leptonica/allheaders.h>
>>>>> #include <tesseract/baseapi.h>
>>>>> #include <tesseract/renderer.h>
>>>>> #include <cerrno>
>>>>> #include <cstdio>
>>>>> #include <cstdlib>
>>>>> #include <cstring>
>>>>> #include <string>
>>>>>
>>>>> int main() {
>>>>>   const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>>>>   std::string language_ = "eng";
>>>>>   std::string inputFile_ = "input.png";
>>>>>   const char* outputbase = "output";  // the renderer appends ".pdf"
>>>>>
>>>>>   tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
>>>>>   // Init() returns 0 on success.
>>>>>   if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
>>>>>     fprintf(stderr, "Could not initialize tesseract.\n");
>>>>>     delete api;
>>>>>     return EXIT_FAILURE;
>>>>>   }
>>>>>
>>>>>   // Not needed for PDF rendering: ProcessPages() reads the input file itself.
>>>>>   PIX* sourceImg = pixRead(inputFile_.c_str());
>>>>>   if (!sourceImg) {
>>>>>     fprintf(stderr, "Leptonica can't process input file: %s\n",
>>>>>             inputFile_.c_str());
>>>>>     return EXIT_FAILURE;
>>>>>   }
>>>>>   api->SetImage(sourceImg);
>>>>>   api->SetInputName(inputFile_.c_str());
>>>>>   api->SetOutputName(outputbase);
>>>>>
>>>>>   tesseract::TessPDFRenderer* renderer =
>>>>>       new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>>>>>   if (!renderer->happy()) {
>>>>>     fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>>>>             strerror(errno));
>>>>>     delete renderer;
>>>>>     return EXIT_FAILURE;  // do not keep using the deleted renderer
>>>>>   }
>>>>>
>>>>>   bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
>>>>>
>>>>>   // Destroying the renderer closes the output file.
>>>>>   delete renderer;
>>>>>   api->End();
>>>>>   delete api;
>>>>>   pixDestroy(&sourceImg);
>>>>>
>>>>>   if (!succeed) {
>>>>>     fprintf(stderr, "Error during processing.\n");
>>>>>     return EXIT_FAILURE;
>>>>>   }
>>>>>   return 0;
>>>>> }
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr
>>>>> <tesser...@googlegroups.com> wrote:
>>>>>
>>>>>> Hi zdenop,
>>>>>>
>>>>>> thanks for your tip, but I'm using the ProcessPage*s* function, so
>>>>>> it should write the header and footer part of the file itself.
>>>>>> BUT I've experimented a bit with calling BeginDocument() before
>>>>>> ProcessPage() and EndDocument() after it, and the resulting file is
>>>>>> very different. Sadly, the file is still corrupt.
>>>>>>
>>>>>> So the problem seems to come from the BeginDocument()/EndDocument()
>>>>>> calls failing. But even there I'm seeing mysterious behaviour.
>>>>>> Using only EndDocument(), I get something like a footer at the end
>>>>>> of the file:
>>>>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
>>>>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>>>>
>>>>>> It suddenly stops at "Produce". When I use BeginDocument(),
>>>>>> ProcessPage() and then EndDocument(), however, the file ends with
>>>>>> raw bytes and there is no "endstream" or "endobj".
>>>>>> I've updated to the latest 4.1.3 version, but the problem still exists.
>>>>>>
>>>>>> I updated the bug branch at
>>>>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so
>>>>>> the problem is reproducible.
>>>>>> To disable the BeginDocument() call, one has to comment out
>>>>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>>>>> .
>>>>>>
>>>>>> I tried to use the code from the tesseract CLI 1:1, but it still
>>>>>> does not work...
>>>>>>
>>>>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>>>>
>>>>>>> This seems like the same problem as
>>>>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>>>>
>>>>>>> Did you use BeginDocument/EndDocument?
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr
>>>>>>> <tesser...@googlegroups.com> wrote:
>>>>>>>
>>>>>>>> Already described in this issue:
>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>>>>
>>>>>>>> I'm trying to generate a searchable PDF from a JPG image, but the
>>>>>>>> file that gets written is an invalid PDF file that can't be read
>>>>>>>> by any PDF reader.
>>>>>>>>
>>>>>>>> I have added a Docker image for reproducing the problem to the
>>>>>>>> issue, but here is the bash snippet for it:
>>>>>>>>
>>>>>>>> git clone g...@github.com:dnnspaul/gosseract.git
>>>>>>>> cd gosseract
>>>>>>>> git checkout tesseract/bug/3652
>>>>>>>>
>>>>>>>> docker build -t tessbug .
>>>>>>>> docker run -it -v $PWD/tmp:/tmp tessbug go run main.go
>>>>>>>>
>>>>>>>> When I pass the file to the tesseract CLI, the resulting PDF is
>>>>>>>> readable, but I can't find any difference between the CLI and my
>>>>>>>> snippet.
>>>>>>>>
>>>>>>>> Thanks in advance for any help! I'm very sorry, I'm more of a Golang
>>>>>>>> developer than a C++ developer, so I struggle with even the
>>>>>>>> simplest syntax, but I tried my best.
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>>>>>> .
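The symptoms described in the thread (a file that stops at "Produce", or one that ends without "endstream"/"endobj") both point to output that was cut off before the PDF trailer was written. A minimal sketch of a quick sanity check for that follows. It assumes only two general facts about the PDF format: a file starts with "%PDF-" and carries an "%%EOF" marker near its end. The helper names `looksLikeCompletePdf` and `readFile` are hypothetical and not part of Tesseract, and the 1024-byte tail window is an assumption borrowed from how lenient readers commonly search for the marker. Passing this check does not prove the PDF is valid; failing it does suggest the file is truncated.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of Tesseract): returns true if the bytes
// start with the "%PDF-" header and contain "%%EOF" in the last 1024 bytes.
bool looksLikeCompletePdf(const std::string& bytes) {
  // compare() clamps the length, so short or empty input simply fails.
  if (bytes.compare(0, 5, "%PDF-") != 0) return false;
  const std::size_t tailStart = bytes.size() > 1024 ? bytes.size() - 1024 : 0;
  return bytes.find("%%EOF", tailStart) != std::string::npos;
}

// Hypothetical helper: reads a whole file into memory in binary mode.
// Returns an empty string if the file cannot be opened.
std::string readFile(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

For example, `looksLikeCompletePdf(readFile("output.pdf"))` could be called after the renderer has been destroyed. If it returns false at that point, the output file is likely incomplete, which matches what the thread describes.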