This is an old snippet of mine, so part of the code (opening the input image as PIX) is unnecessary for PDF rendering.
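Since the symptom reported further down the thread is a PDF that simply stops before its proper ending, a quick structural check on the output can help narrow things down. The following is only a rough sketch and not part of Tesseract or Leptonica: `looksLikeCompletePdfData` and `looksLikeCompletePdf` are hypothetical helper names, and the 1024-byte tail window is an assumption. The check only looks for the `%PDF-` header and a `%%EOF` marker near the end of the file, which is necessary but by no means sufficient for a valid PDF.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not a Tesseract API): returns true if the buffer
// starts with the "%PDF-" header and contains the "%%EOF" marker somewhere
// in its last 1024 bytes. A truncated file usually fails the second test.
bool looksLikeCompletePdfData(const std::string& data) {
  const std::string header = "%PDF-";
  const std::string trailer = "%%EOF";
  if (data.compare(0, header.size(), header) != 0) {
    return false;  // missing or damaged header
  }
  const std::size_t tailStart = data.size() > 1024 ? data.size() - 1024 : 0;
  return data.find(trailer, tailStart) != std::string::npos;
}

// Hypothetical convenience wrapper: reads the whole file in binary mode
// and applies the check above. Returns false if the file cannot be opened.
bool looksLikeCompletePdf(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return false;
  }
  const std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
  return looksLikeCompletePdfData(data);
}
```

Calling `looksLikeCompletePdf("output.pdf")` after the renderer has been destroyed would be one way to confirm that the file was fully written; a `false` result suggests the document was never finalized or flushed to disk.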
Zdenko

On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <zde...@gmail.com> wrote:

> Here is a simple code sample that works for me (with tesseract 5 and
> leptonica 1.82):
>
> #include <leptonica/allheaders.h>
> #include <tesseract/baseapi.h>
> #include <tesseract/renderer.h>
> #include <cerrno>
> #include <cstdio>
> #include <cstdlib>
> #include <cstring>
> #include <string>
>
> int main() {
>   const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>   std::string language_ = "eng";
>   std::string inputFile_ = "input.png";
>   const char* outputbase = "output";
>
>   tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>   if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
>     fprintf(stderr, "Could not initialize tesseract.\n");
>     delete api;
>     return EXIT_FAILURE;
>   }
>
>   // Opening the image as PIX is not needed for PDF rendering, because
>   // ProcessPages() reads the input file itself.
>   PIX *sourceImg = pixRead(inputFile_.c_str());
>   if (!sourceImg) {
>     fprintf(stderr, "Leptonica can't process input file: %s\n",
>             inputFile_.c_str());
>     delete api;
>     return EXIT_FAILURE;
>   }
>   api->SetImage(sourceImg);
>   api->SetInputName(inputFile_.c_str());
>   api->SetOutputName(outputbase);
>
>   tesseract::TessPDFRenderer* renderer =
>       new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>   if (!renderer->happy()) {
>     fprintf(stderr, "Error, could not create PDF output file: %s\n",
>             strerror(errno));
>     delete renderer;
>     pixDestroy(&sourceImg);
>     delete api;
>     return EXIT_FAILURE;
>   }
>
>   bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0,
>                                    renderer);
>   // Deleting the renderer closes the output file and flushes buffered data.
>   delete renderer;
>   api->End();
>   pixDestroy(&sourceImg);
>   delete api;
>   if (!succeed) {
>     fprintf(stderr, "Error during processing.\n");
>     return EXIT_FAILURE;
>   }
>   return 0;
> }
>
> Zdenko
>
> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr
> <tesseract-ocr@googlegroups.com> wrote:
>
>> Hi zdenop,
>>
>> thanks for your tip, but I'm using the ProcessPage*s* function, so it
>> should write the header and footer parts of the file itself.
>> BUT I've played a bit with ProcessPage() plus BeginDocument() before and
>> EndDocument() after, and the resulting file differs considerably. Sadly,
>> the file is still corrupt.
>>
>> So it seems the problem is caused by the failing
>> BeginDocument()/EndDocument() functions.
>> But even there I'm experiencing mysterious bugs.
>> Using only EndDocument(), I get something like a footer at the end of
>> the file:
>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH: root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>
>> But it suddenly stops at "Produce". And when I use BeginDocument(),
>> ProcessPage() and then EndDocument(), the file ends with bytes and
>> there is no "endstream" or "endobj".
>> I've updated to the latest version, 4.1.3, but the problem still exists.
>>
>> I updated the bug branch in
>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the
>> problem is reproducible.
>> To disable the BeginDocument() call, one has to comment out
>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>
>> I tried to use the code from the tesseract CLI 1:1, but it still does not
>> work...
>>
>> On Sunday, 21 November 2021 at 13:18:52 UTC+1, zdenop wrote:
>>
>>> This seems like the same problem as
>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>
>>> Did you use BeginDocument and EndDocument?
>>>
>>> Zdenko
>>>
>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr
>>> <tesser...@googlegroups.com> wrote:
>>>
>>>> Described already in issue:
>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>
>>>> I'm trying to generate a searchable PDF from a JPG image, but
>>>> the output file is an invalid PDF that can't be read by any
>>>> PDF reader.
>>>>
>>>> I have added a Docker image for reproducing the problem in the
>>>> issue, but here is the bash snippet for it:
>>>>
>>>>     git clone g...@github.com:dnnspaul/gosseract.git
>>>>     cd gosseract
>>>>     git checkout tesseract/bug/3652
>>>>
>>>>     docker build -t tessbug .
>>>>     docker run -it -v $PWD/tmp:/tmp tessbug go run main.go
>>>>
>>>> When I feed the file to the tesseract CLI, the resulting PDF is
>>>> readable, but I can't find any difference between the CLI and my snippet.
>>>>
>>>> Thanks in advance for any help! I'm very sorry, I'm more of a GoLang
>>>> developer than a C++ developer, so I struggle with even the
>>>> simplest syntax, but I tried my best.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com