Hello,

yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR the behaviour of the renderers (including TessPDFRenderer) has not changed since the 4.0-beta version.

Also, I do not understand your problem with OpenCV. AFAIK tesseract is only an optional dependency there, and OpenCV uses only a very limited set of tesseract features [1]. Since you will use tesseract directly for creating the PDF anyway, it does not make sense to worry about support for old tesseract versions in OpenCV.

[1] https://docs.opencv.org/4.5.4/d7/ddc/classcv_1_1text_1_1OCRTesseract.html

Zdenko

On Mon, 22 Nov 2021 at 19:54, 'blaumedia' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

> Hey zdenop,
>
> it turns out I can't rely on 5.0.0, because OpenCV seems to be
> compatible only with 4.x so far. (OpenCV is another requirement of my
> project.) Does your script from above work on tesseract 4.x for you?
>
> blaumedia wrote on Monday, 22 November 2021 at 18:51:38 UTC+1:
>
>> It works!
>>
>> I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.8.2, and with it both my
>> code and yours worked flawlessly. It seems that 4.1.3 has a bug that
>> has been fixed in 5.0.0. I hadn't tested 5.0 before, because I thought
>> it would be less stable.
>> I additionally tested 4.1.3 + leptonica 1.8.2 (I was on some 1.7.x
>> version before) and the problem with the corrupt PDF still exists. But
>> that's not a problem, I will use 5.0.0 instead.
>>
>> Thank you zdenop!
>>
>> zdenop wrote on Monday, 22 November 2021 at 14:33:15 UTC+1:
>>
>>> This is my old snippet, so part of the code is useless for PDF
>>> rendering (opening the input image as PIX).
>>>
>>> Zdenko
>>>
>>> On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> Here is some simple code that works for me (with tesseract 5 and
>>>> leptonica 1.82):
>>>>
>>>> #include <cerrno>
>>>> #include <cstdio>
>>>> #include <cstdlib>
>>>> #include <cstring>
>>>> #include <string>
>>>>
>>>> #include <leptonica/allheaders.h>
>>>> #include <tesseract/baseapi.h>
>>>> #include <tesseract/renderer.h>
>>>>
>>>> int main() {
>>>>   const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>>>   std::string language_ = "eng";
>>>>   std::string inputFile_ = "input.png";
>>>>   const char* outputbase = "output";
>>>>
>>>>   tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
>>>>   // Init() returns non-zero on failure.
>>>>   if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
>>>>     fprintf(stderr, "Could not initialize tesseract.\n");
>>>>     delete api;
>>>>     return EXIT_FAILURE;
>>>>   }
>>>>
>>>>   PIX* sourceImg = pixRead(inputFile_.c_str());
>>>>   if (!sourceImg) {
>>>>     fprintf(stderr, "Leptonica can't process input file: %s\n",
>>>>             inputFile_.c_str());
>>>>     return EXIT_FAILURE;
>>>>   }
>>>>   api->SetImage(sourceImg);
>>>>   api->SetInputName(inputFile_.c_str());
>>>>   api->SetOutputName(outputbase);
>>>>
>>>>   tesseract::TessPDFRenderer* renderer =
>>>>       new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>>>>   if (!renderer->happy()) {
>>>>     fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>>>             strerror(errno));
>>>>     delete renderer;
>>>>     return EXIT_FAILURE;  // do not continue with a deleted renderer
>>>>   }
>>>>
>>>>   bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
>>>>
>>>>   // Release everything; destroying the renderer closes its output file.
>>>>   delete renderer;
>>>>   api->End();
>>>>   pixDestroy(&sourceImg);
>>>>   delete api;
>>>>
>>>>   if (!succeed) {
>>>>     fprintf(stderr, "Error during processing.\n");
>>>>     return EXIT_FAILURE;
>>>>   }
>>>>   return 0;
>>>> }
>>>>
>>>> Zdenko
>>>>
>>>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <tesser...@googlegroups.com> wrote:
>>>>
>>>>> Hi zdenop,
>>>>>
>>>>> thanks for your tip, but I'm using the ProcessPage*s* function, so it
>>>>> should write the head and footer part of the file itself.
>>>>> However, I've played a bit with ProcessPage(), calling BeginDocument()
>>>>> before it and EndDocument() after it, and the resulting file differs a
>>>>> lot. Sadly, the file is still corrupt.
>>>>>
>>>>> So it seems the problem is caused by failing BeginDocument() /
>>>>> EndDocument() calls. But even there I'm experiencing mysterious bugs.
>>>>> Using only EndDocument(), I get something like a footer at the end of
>>>>> the file:
>>>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
>>>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>>>
>>>>> But it suddenly stops at "Produce". And when I use BeginDocument(),
>>>>> ProcessPage() and then EndDocument(), the file ends with binary data
>>>>> and there is no "endstream" or "endobj".
>>>>> I've updated to the latest 4.1.3 version, but the problem still exists.
>>>>>
>>>>> I updated the bug branch at
>>>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so
>>>>> the problem is reproducible.
>>>>> To disable the BeginDocument() call, one has to comment out
>>>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>>>>
>>>>> I tried to use the code from the tesseract CLI 1:1, but it still does
>>>>> not work...
>>>>>
>>>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>>>
>>>>>> This seems to be the same problem as
>>>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>>>
>>>>>> Did you use BeginDocument() and EndDocument()?
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <tesser...@googlegroups.com> wrote:
>>>>>>
>>>>>>> Already described in this issue:
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>>>
>>>>>>> I'm trying to generate a searchable PDF from a JPG image, but the
>>>>>>> file that gets written is an invalid PDF that can't be read by any
>>>>>>> PDF reader.
>>>>>>>
>>>>>>> I have added a Docker image to the issue for reproducing the
>>>>>>> problem, but here is the bash snippet for it:
>>>>>>>
>>>>>>> git clone g...@github.com:dnnspaul/gosseract.git
>>>>>>> cd gosseract
>>>>>>> git checkout tesseract/bug/3652
>>>>>>>
>>>>>>> docker build -t tessbug .
>>>>>>> docker run -it -v $PWD/tmp:/tmp tessbug go run main.go
>>>>>>>
>>>>>>> When I feed the file to the tesseract CLI, the resulting PDF is
>>>>>>> readable, but I can't find any difference between the CLI and my
>>>>>>> snippet.
>>>>>>>
>>>>>>> Thanks in advance for any help! I'm sorry, I'm more of a Go developer
>>>>>>> than a C++ developer, so I struggle with even the simplest syntax,
>>>>>>> but I tried my best.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
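
The symptoms described in the thread (output that stops at "Produce", or a file that ends in binary data with no "endstream" or "endobj") look like a truncated file. A minimal sketch of a framing check follows, assuming only the general PDF layout: a "%PDF-" header at the start and a "startxref" ... "%%EOF" trailer near the end. LooksLikeCompletePdf is a hypothetical helper and not part of tesseract or leptonica, and the 1024-byte tail window is an assumption based on common reader behaviour. Passing the check does not prove that a file is valid; failing it only indicates that the output was probably cut off.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a tesseract API): returns true when `bytes` has
// the outer framing of a complete PDF file, i.e. a header and a trailer.
bool LooksLikeCompletePdf(const std::string& bytes) {
  // A PDF file starts with a "%PDF-" header line.
  if (bytes.compare(0, 5, "%PDF-") != 0) return false;

  // Assumption: the trailer is searched for only in the last 1024 bytes.
  const std::size_t tail_start =
      bytes.size() > 1024 ? bytes.size() - 1024 : 0;

  // The trailer consists of "startxref", an offset, and the "%%EOF" marker.
  const std::size_t xref = bytes.find("startxref", tail_start);
  if (xref == std::string::npos) return false;
  return bytes.find("%%EOF", xref) != std::string::npos;
}
```

Reading the generated file into a std::string (for example with std::ifstream opened in binary mode) and passing it to the helper gives a quick way to compare the library output with the output of the tesseract CLI.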