Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

Zdenko Podobny Mon, 22 Nov 2021 05:29:01 -0800

Here is some simple code that works for me (with Tesseract 5 and Leptonica
1.82):


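#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <cerrno>   // errno
#include <cstdio>   // fprintf
#include <cstdlib>  // exit, EXIT_FAILURE
#include <cstring>  // strerror
#include <string>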

int main() {
    const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
    std::string language_ = "eng";
    std::string inputFile_ = "input.png";
    const char* outputbase = "output";

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    PIX *sourceImg = pixRead(inputFile_.c_str());
    if (!sourceImg) {
        fprintf(stderr, "Leptonica can't process input file: %s\n",
                inputFile_.c_str());
        return EXIT_FAILURE;
    }
    api->SetImage(sourceImg);
    api->SetInputName(inputFile_.c_str());
    api->SetOutputName(outputbase);

    tesseract::TessPDFRenderer* renderer =
        new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
    if (!renderer->happy()) {
        fprintf(stderr, "Error, could not create PDF output file: %s\n",
                strerror(errno));
        delete renderer;
        return EXIT_FAILURE;  // do not go on with a deleted renderer
    }

    bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return EXIT_FAILURE;
    }

    delete renderer;  // destroying the renderer closes the output file
    api->End();
    delete api;
    pixDestroy(&sourceImg);
    return 0;
}
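
To check quickly whether an output file was written completely, something like
the following might help. It is only a rough sketch and not part of Tesseract:
the names looksLikeCompletePdf and readFile are made up for illustration, and
the check relies only on the fact that a PDF file starts with "%PDF-" and has
the "%%EOF" marker near its end. It does not validate the objects in between.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not a Tesseract function): returns true when `bytes`
// starts with the "%PDF-" header and contains "%%EOF" near its end.
bool looksLikeCompletePdf(const std::string& bytes) {
    const std::string header = "%PDF-";
    const std::string trailer = "%%EOF";
    if (bytes.compare(0, header.size(), header) != 0) {
        return false;  // missing header: not a PDF, or the start was never written
    }
    // Look for the end-of-file marker only in the last 1024 bytes.
    const std::size_t tailStart = bytes.size() > 1024 ? bytes.size() - 1024 : 0;
    return bytes.find(trailer, tailStart) != std::string::npos;
}

// Hypothetical helper: reads a whole file in binary mode.
// Returns an empty string when the file cannot be opened.
std::string readFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling looksLikeCompletePdf(readFile("output.pdf")) after the renderer has
been destroyed should return false for a file that stops in the middle of the
trailer, as in the report quoted below.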


Zdenko


On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi zdenop,
>
> thanks for your tip, but I'm using the ProcessPage*s* function, so it
> should write the header and footer of the file itself.
> However, I've experimented with calling BeginDocument() before
> ProcessPage() and EndDocument() after it, and the resulting file differs
> considerably. Sadly, the file is still corrupt.
>
> So it seems the problem lies in failing BeginDocument()/EndDocument()
> calls. But even there I'm experiencing mysterious bugs.
> Using only EndDocument(), I get something like a footer at the end of the
> file:
> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>
> But it suddenly stops at "Produce". And when I use BeginDocument(),
> ProcessPage() and then EndDocument(), the file ends with raw bytes and
> there is no "endstream" or "endobj".
> I've updated to the latest version, 4.1.3, but the problem still exists.
>
> I updated the bug branch in
> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the
> problem is reproducible.
> To disable the BeginDocument() call, one has to comment out
> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
> .
>
> I tried to copy the code from the tesseract CLI 1:1, but it still does
> not work...
>
> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>
>> This seems like the same problem as
>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>
>> Did you use BeginDocument and EndDocument?
>>
>> Zdenko
>>
>>
>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> Described already in issue:
>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>
>>> I'm trying to generate a searchable PDF from a JPG image, but the file
>>> that gets written is an invalid PDF that can't be read by any PDF
>>> reader.
>>>
>>> I have added a Docker image for reproducing the problem in the
>>> issue, but here is the bash snippet for it:
>>>
>>> *git clone g...@github.com:dnnspaul/gosseract.git*
>>> *git checkout tesseract/bug/3652*
>>>
>>> *docker build -t tessbug .*
>>> *docker run -it -v $PWD/tmp:/tmp tessbug go run main.go*
>>>
>>> When I feed the file to the tesseract CLI, the resulting PDF is
>>> readable, but I can't find any difference between the CLI and my snippet.
>>>
>>> Thanks in advance for any help! I'm very sorry, I'm more of a Go
>>> developer than a C++ developer, so I struggle with even the simplest
>>> syntax, but I tried my best.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
