Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

Zdenko Podobny Mon, 22 Nov 2021 05:33:14 -0800

This is an old snippet of mine, so part of the code (opening the input
image as PIX) is unnecessary for PDF rendering.


Zdenko


On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <zde...@gmail.com> wrote:

> Here is a simple example that works for me (with tesseract 5 and leptonica
> 1.82):
>
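> #include <leptonica/allheaders.h>
> #include <tesseract/baseapi.h>
> #include <tesseract/renderer.h>
> #include <cerrno>   // errno
> #include <cstdio>   // fprintf
> #include <cstdlib>  // exit, EXIT_FAILURE
> #include <cstring>  // strerror
> #include <string>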
>
> int main() {
>     const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>     std::string language_ = "eng";
>     std::string inputFile_ = "input.png";
>     const char* outputbase = "output";
>
>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>     if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
>         fprintf(stderr, "Could not initialize tesseract.\n");
>         exit(1);
>     }
>
>     PIX *sourceImg = pixRead(inputFile_.c_str());
>     if (!sourceImg) {
>         fprintf(stderr, "Leptonica can't process input file: %s\n",
>                 inputFile_.c_str());
>         return EXIT_FAILURE;
>     }
>     api->SetImage(sourceImg);
>     api->SetInputName(inputFile_.c_str());
>     api->SetOutputName(outputbase);
>
>     tesseract::TessPDFRenderer* renderer =
>         new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
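>     if (!renderer->happy()) {
>         fprintf(stderr, "Error, could not create PDF output file: %s\n",
>                 strerror(errno));
>         delete renderer;
>         pixDestroy(&sourceImg);
>         return EXIT_FAILURE;  // do not continue with a deleted renderer
>     }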
>
>     bool succeed =
>         api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
>     if (!succeed) {
>         fprintf(stderr, "Error during processing.\n");
>         return EXIT_FAILURE;
>     }
>
>     // Destroying the renderer closes the output file.
>     delete renderer;
>     api->End();
>     delete api;
>     pixDestroy(&sourceImg);
>     return 0;
> }
>
>
> Zdenko
>
>
> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
> tesseract-ocr@googlegroups.com> wrote:
>
>> Hi zdenop,
>>
>> Thanks for your tip, but I'm using the ProcessPage*s* function, so it
>> should write the header and footer of the file itself.
>> However, I also experimented with calling BeginDocument() before
>> ProcessPage() and EndDocument() after it, and the resulting file differs
>> considerably. Sadly, the file is still corrupt.
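
The BeginDocument() / ProcessPage() / EndDocument() pairing mentioned above can be sketched as a small scope guard. This is only a minimal sketch, not code from gosseract or tesseract: DocumentScope and RecordingRenderer are hypothetical names made up for illustration. The one assumption about a real renderer is that it offers BeginDocument(const char*) and EndDocument() methods returning bool, which is the shape tesseract::TessResultRenderer has as far as I know.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a renderer: it only records the calls it gets.
struct RecordingRenderer {
    std::vector<std::string> calls;
    bool BeginDocument(const char* title) {
        calls.push_back(std::string("begin:") + title);
        return true;
    }
    bool EndDocument() {
        calls.push_back("end");
        return true;
    }
};

// Calls BeginDocument() on construction and EndDocument() on destruction,
// so the two calls stay paired on every exit path of the enclosing scope.
template <typename Renderer>
class DocumentScope {
public:
    DocumentScope(Renderer& renderer, const char* title)
        : renderer_(renderer), open_(renderer.BeginDocument(title)) {}
    ~DocumentScope() {
        if (open_) renderer_.EndDocument();
    }
    DocumentScope(const DocumentScope&) = delete;
    DocumentScope& operator=(const DocumentScope&) = delete;

    // True when BeginDocument() reported success.
    bool ok() const { return open_; }

private:
    Renderer& renderer_;
    bool open_;
};

// Processes one placeholder "page" between the paired calls and returns
// the recorded call sequence.
std::vector<std::string> render_one_page() {
    RecordingRenderer renderer;
    {
        DocumentScope<RecordingRenderer> doc(renderer, "output");
        if (doc.ok()) {
            // With a real renderer, the ProcessPage() call would go here.
            renderer.calls.push_back("page");
        }
    }  // EndDocument() runs here
    return renderer.calls;
}
```

Calling render_one_page() records the begin, page and end steps in that order. With a real renderer the output file would, to my knowledge, still only be closed once the renderer object itself is destroyed, so the guard does not replace deleting the renderer.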
>>
>> So the problem seems to lie in the BeginDocument/EndDocument calls
>> failing. But even there I'm seeing strange behavior.
>> Using only EndDocument(), I get something like a footer at the end of
>> the file:
>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>
>> But it stops abruptly at "Produce". When I use BeginDocument(),
>> ProcessPage() and then EndDocument(), the file ends with raw bytes and
>> there is no "endstream" or "endobj".
>> I've updated to the latest version, 4.1.3, but the problem still exists.
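
A truncated file like the one described above can be spotted without a PDF reader. The following is a rough sketch under stated assumptions, not a validator: looks_like_complete_pdf and read_file are hypothetical helper names, and the check only looks for the "%PDF-" header at the start and the "%%EOF" marker at the end, which a complete PDF file is expected to have. A file can pass this check and still be broken in other ways.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads a whole file in binary mode; returns an empty string if the file
// cannot be opened.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Returns true when the bytes start with "%PDF-" and the last
// non-whitespace characters are the "%%EOF" marker.
bool looks_like_complete_pdf(const std::string& bytes) {
    const std::string header = "%PDF-";
    const std::string trailer = "%%EOF";
    if (bytes.compare(0, header.size(), header) != 0) {
        return false;
    }
    // Ignore newlines or spaces that follow the end-of-file marker.
    const std::size_t last = bytes.find_last_not_of(" \t\r\n");
    if (last == std::string::npos || last + 1 < trailer.size()) {
        return false;
    }
    return bytes.compare(last + 1 - trailer.size(), trailer.size(),
                         trailer) == 0;
}
```

For example, looks_like_complete_pdf(read_file("output.pdf")) should come back false for a file that stops in the middle of the trailer, where "output.pdf" is just a placeholder path.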
>>
>> I updated the bug branch in
>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the
>> problem is reproducible.
>> To disable the BeginDocument call, one has to comment out this line:
>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>
>> I also tried copying the code from the tesseract CLI one-to-one, but it
>> still does not work.
>>
>> On Sunday, 21 November 2021 at 13:18:52 UTC+1, zdenop wrote:
>>
>>> This seems like the same problem as
>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>
>>> Did you use BeginDocument and EndDocument?
>>>
>>> Zdenko
>>>
>>>
>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>>> tesser...@googlegroups.com> wrote:
>>>
>>>> Already described in this issue:
>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>
>>>> I'm trying to generate a searchable PDF from a JPG image, but the
>>>> output file is an invalid PDF that no PDF reader can open.
>>>>
>>>> I have added a Docker image for reproducing the problem to the
>>>> issue, but here is the bash snippet for it:
>>>>
>>>> *git clone g...@github.com:dnnspaul/gosseract.git*
>>>> *cd gosseract*
>>>> *git checkout tesseract/bug/3652*
>>>>
>>>> *docker build -t tessbug .*
>>>> *docker run -it -v $PWD/tmp:/tmp tessbug go run main.go*
>>>>
>>>> When I feed the file to the tesseract CLI, the resulting PDF is
>>>> readable, but I can't find any difference between the CLI and my snippet.
>>>>
>>>> Thanks in advance for any help! I'm sorry, I'm more of a Go developer
>>>> than a C++ developer, so I struggle with even the simplest syntax, but
>>>> I tried my best.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>

