Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

'blaumedia' via tesseract-ocr Mon, 22 Nov 2021 09:51:44 -0800

It works!

I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.82, and with that combination 
both my code and yours worked flawlessly. It seems that 4.1.3 has a bug that 
has been fixed in 5.0.0. I hadn't tested 5.0 before because I thought it 
would be less stable.
I also tested 4.1.3 + leptonica 1.82 (I was on 1.7.x before), and the 
corrupt PDF problem is still there. But that's not a problem; I will use 
5.0.0 instead.


Thank you zdenop!
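
For anyone who hits the same symptom described further down the thread (the 
output stops mid-object and has no trailer), a quick structural check can 
tell a cut-off file apart from a viewer problem. This is only a minimal 
sketch: it assumes nothing more than that a complete PDF starts with "%PDF-" 
and carries "startxref" and "%%EOF" near its end. The names PdfCheck, 
check_pdf_bytes and check_pdf_file are made up for illustration and are not 
part of tesseract or gosseract. It does not validate the PDF, it only spots 
obviously truncated output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Result of a purely structural look at a PDF file; no parsing is attempted.
struct PdfCheck {
    bool has_header = false;     // data starts with "%PDF-"
    bool has_startxref = false;  // "startxref" keyword found in the tail
    bool has_eof = false;        // "%%EOF" marker found in the tail
    bool ok() const { return has_header && has_startxref && has_eof; }
};

// Inspect raw bytes: header at offset 0, trailer keywords in the last 1024 bytes.
PdfCheck check_pdf_bytes(const std::string& data) {
    PdfCheck result;
    result.has_header = data.compare(0, 5, "%PDF-") == 0;

    const std::size_t tail_len = std::min<std::size_t>(data.size(), 1024);
    assert(tail_len <= data.size());
    const std::string tail = data.substr(data.size() - tail_len);

    result.has_startxref = tail.find("startxref") != std::string::npos;
    result.has_eof = tail.find("%%EOF") != std::string::npos;
    return result;
}

// Read the whole file in binary mode and run the byte-level check on it.
// A file that cannot be opened yields an all-false result.
PdfCheck check_pdf_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return PdfCheck{};
    }
    const std::string data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    return check_pdf_bytes(data);
}
```

Calling check_pdf_file("output.pdf").ok() right after the renderer has been 
destroyed is one way to catch a truncated file before handing it to a PDF 
reader; a file with has_header set but has_eof unset matches the "stops in 
the middle" behaviour reported below.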

zdenop wrote on Monday, 22 November 2021 at 14:33:15 UTC+1:

> This is an old snippet of mine, so part of the code (opening the input 
> image as PIX) is not needed for PDF rendering.
>
> Zdenko
>
>
> On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <[email protected]> wrote:
>
>> Here is some simple code that works for me (with tesseract 5 and 
>> leptonica 1.82):
>>
>> #include <cerrno>
>> #include <cstdio>
>> #include <cstdlib>
>> #include <cstring>
>> #include <string>
>>
>> #include <leptonica/allheaders.h>
>> #include <tesseract/baseapi.h>
>> #include <tesseract/renderer.h>
>>
>> int main() {
>>     const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>     std::string language_ = "eng";
>>     std::string inputFile_ = "input.png";
>>     const char* outputbase = "output";
>>
>>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>>     // Init() returns 0 on success.
>>     if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
>>         fprintf(stderr, "Could not initialize tesseract.\n");
>>         delete api;
>>         return EXIT_FAILURE;
>>     }
>>
>>     PIX *sourceImg = pixRead(inputFile_.c_str());
>>     if (!sourceImg) {
>>         fprintf(stderr, "Leptonica can't process input file: %s\n",
>>                 inputFile_.c_str());
>>         api->End();
>>         delete api;
>>         return EXIT_FAILURE;
>>     }
>>     api->SetImage(sourceImg);
>>     api->SetInputName(inputFile_.c_str());
>>     api->SetOutputName(outputbase);
>>
>>     tesseract::TessPDFRenderer* renderer =
>>         new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>>     if (!renderer->happy()) {
>>         fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>                 strerror(errno));
>>         // Stop here; the renderer must not be used after it is deleted.
>>         delete renderer;
>>         api->End();
>>         delete api;
>>         pixDestroy(&sourceImg);
>>         return EXIT_FAILURE;
>>     }
>>
>>     // ProcessPages() calls BeginDocument()/EndDocument() on the renderer.
>>     bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
>>     if (!succeed) {
>>         fprintf(stderr, "Error during processing.\n");
>>     }
>>
>>     // Destroy the renderer before exiting so the output file gets closed.
>>     delete renderer;
>>     api->End();
>>     delete api;
>>     pixDestroy(&sourceImg);
>>     return succeed ? EXIT_SUCCESS : EXIT_FAILURE;
>> }
>>
>>
>> Zdenko
>>
>>
>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
>> [email protected]> wrote:
>>
>>> Hi zdenop,
>>>
>>> thanks for the tip, but I'm using the ProcessPage*s* function, so it 
>>> should write the header and footer of the file itself.
>>> BUT I've played a bit with calling BeginDocument() before ProcessPage() 
>>> and EndDocument() after it, and the resulting file differs a lot. Sadly, 
>>> the file is still corrupt.
>>>
>>> So it seems the problem lies in the failing BeginDocument/EndDocument 
>>> functions. But even there I'm seeing mysterious bugs.
>>> Using only EndDocument(), I get something like a footer at the end of 
>>> the file:
>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH: 
>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>
>>> But it suddenly stops at "Produce". And when I use BeginDocument(), 
>>> ProcessPage() and then EndDocument(), the file ends with raw bytes and 
>>> there is no "endstream" or "endobj".
>>> I've updated to the latest 4.1.3 version, but the problem still exists.
>>>
>>> I updated the bug branch in 
>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the 
>>> problem is reproducible.
>>> To disable the BeginDocument call, one has to comment out 
>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>> .
>>>
>>> I tried to use the code from the tesseract CLI 1:1, but it still does 
>>> not work...
>>>
>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>
>>>> seems like the same problem as 
>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>
>>>> Did you use BeginDocument and EndDocument?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>>>> [email protected]> wrote:
>>>>
>>>>> Described already in issue: 
>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>
>>>>> I'm trying to generate a searchable PDF from a JPG image, but the 
>>>>> file that gets written is an invalid PDF that can't be read by any 
>>>>> PDF reader.
>>>>>
>>>>> I have added a Docker image for reproducing the problem to the 
>>>>> issue, but here is the bash snippet for it:
>>>>>
>>>>> *git clone [email protected]:dnnspaul/gosseract.git*
>>>>> *cd gosseract*
>>>>> *git checkout tesseract/bug/3652*
>>>>>
>>>>> *docker build -t tessbug .*
>>>>> *docker run -it -v $PWD/tmp:/tmp tessbug go run main.go*
>>>>>
>>>>> When I pass the same file to the tesseract CLI, the resulting PDF is 
>>>>> readable, but I can't find any difference between the CLI and my 
>>>>> snippet.
>>>>>
>>>>> Thanks in advance for any help! I'm very sorry, I'm more of a Go 
>>>>> developer than a C++ developer, so I struggle with even the simplest 
>>>>> syntax, but I tried my best.
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
