Here is a simple example that works for me (with Tesseract 5 and Leptonica 1.82):

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

int main() {
  const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
  std::string language_ = "eng";
  std::string inputFile_ = "input.png";
  const char* outputbase = "output";  // the renderer appends ".pdf"

  tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
  if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    delete api;
    return EXIT_FAILURE;
  }

  PIX* sourceImg = pixRead(inputFile_.c_str());
  if (!sourceImg) {
    fprintf(stderr, "Leptonica can't process input file: %s\n",
            inputFile_.c_str());
    api->End();
    delete api;
    return EXIT_FAILURE;
  }
  api->SetImage(sourceImg);
  api->SetInputName(inputFile_.c_str());
  api->SetOutputName(outputbase);

  tesseract::TessPDFRenderer* renderer =
      new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
  if (!renderer->happy()) {
    fprintf(stderr, "Error, could not create PDF output file: %s\n",
            strerror(errno));
    // Stop here: the renderer must not be used after it is deleted.
    delete renderer;
    pixDestroy(&sourceImg);
    api->End();
    delete api;
    return EXIT_FAILURE;
  }

  // ProcessPages() calls BeginDocument() and EndDocument() on the renderer.
  bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
  if (!succeed) {
    fprintf(stderr, "Error during processing.\n");
  }

  // Destroying the renderer closes the output file.
  delete renderer;
  pixDestroy(&sourceImg);
  api->End();
  delete api;
  return succeed ? EXIT_SUCCESS : EXIT_FAILURE;
}

Zdenko

On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

> Hi zdenop,
>
> thanks for your tip, but I'm using the ProcessPage*s* function, so it
> should write the header and footer parts of the file itself.
> BUT I've played a bit with ProcessPage() + BeginDocument() before and
> EndDocument() after, and the resulting file has big differences. Sadly, the
> file is still corrupt.
>
> So it seems the problem is caused by the failing BeginDocument/EndDocument
> functions. But even there I'm experiencing mysterious bugs.
> Using only EndDocument(), I get something like a footer at the end of the
> file:
> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>
> But it suddenly stops at "Produce". And when I use BeginDocument(),
> ProcessPage() and then EndDocument(), the file ends with bytes and
> there is no "endstream" or "endobj".
> I've updated to the latest 4.1.3 version, but the problem still exists.
>
> I updated the bug branch in
> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the
> problem is reproducible.
> To disable the BeginDocument call, one has to comment out
> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
> .
>
> I tried to use the code from the tesseract CLI 1:1, but it still does not
> work...
>
> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>
>> Seems like the same problem as
>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>
>> Did you use BeginDocument and EndDocument?
>>
>> Zdenko
>>
>>
>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> Described already in issue:
>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>
>>> I'm trying to generate a searchable PDF from a JPG image, but
>>> the file that gets output is an invalid PDF file that can't be read by any
>>> PDF reader.
>>>
>>> I have added a Docker image for reproducing the problem in the
>>> issue, but here is the bash snippet for it:
>>>
>>> git clone g...@github.com:dnnspaul/gosseract.git
>>> git checkout tesseract/bug/3652
>>>
>>> docker build -t tessbug .
>>> docker run -it -v $PWD/tmp:/tmp tessbug go run main.go
>>>
>>> When I pass the file to the tesseract CLI, the resulting PDF is
>>> readable, but I can't find any difference between the CLI and my snippet.
>>>
>>> Thanks in advance for any help!
>>> I'm very sorry, I'm more of a GoLang developer than a C++ developer, so I
>>> struggle with even the simplest syntax, but I tried my best.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>> .
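
P.S. Regarding the truncated output described in the quoted messages (the file ends with raw bytes and has no "endstream" or "endobj"): a quick structural check can help tell a complete PDF from one that was cut off before the trailer was written. The following is only a rough sketch, not part of Tesseract. The function names are made up for illustration, and looking for "%%EOF" in the last 1024 bytes is a heuristic I am assuming here, not a full validation of the file.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Returns true if the data begins with the "%PDF-" header.
bool hasPdfHeader(const std::string& data) {
  return data.compare(0, 5, "%PDF-") == 0;
}

// Returns true if the "%%EOF" marker appears in the last 1024 bytes.
// The window size is an assumption made for this sketch.
bool hasPdfTrailer(const std::string& data) {
  const std::size_t window = 1024;
  const std::size_t start = data.size() > window ? data.size() - window : 0;
  return data.find("%%EOF", start) != std::string::npos;
}

// Reads the whole file in binary mode; returns false if it cannot be opened.
bool readFile(const std::string& path, std::string* out) {
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return false;
  }
  out->assign(std::istreambuf_iterator<char>(in),
              std::istreambuf_iterator<char>());
  return true;
}

// Heuristic check: the file is readable, starts with the PDF header,
// and has an end-of-file marker near its end.
bool looksLikeCompletePdf(const std::string& path) {
  std::string data;
  return readFile(path, &data) && hasPdfHeader(data) && hasPdfTrailer(data);
}
```

Calling looksLikeCompletePdf("output.pdf") after a run should return false for a file that stops in the middle of a stream. If it does, the renderer was most likely never finished (EndDocument() not reached) or the file was never closed.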