Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

Zdenko Podobny Mon, 22 Nov 2021 13:27:46 -0800

Hello,

Yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR
the behaviour of the renderers (including TessPDFRenderer) has not changed
since the 4.0-beta version.
Also, I do not understand your problem with OpenCV. AFAIK tesseract is only
an optional dependency there, and OpenCV uses only a very limited set of
tesseract features[1]. Because you will use tesseract directly to create the
PDF anyway, it does not make sense to worry about support for old tesseract
versions in OpenCV.


[1]
https://docs.opencv.org/4.5.4/d7/ddc/classcv_1_1text_1_1OCRTesseract.html


Zdenko


On Mon, 22 Nov 2021 at 19:54, 'blaumedia' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hey zdenop,
>
> turns out I can't rely on 5.0.0, because OpenCV seems to be compatible
> only with 4.x so far. (OpenCV is another requirement of my project.)
> Does your script from above work on tesseract 4.x for you?
>
> blaumedia wrote on Monday, 22 November 2021 at 18:51:38 UTC+1:
>
>> It works!
>>
>> I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.8.2, and with it both my
>> code and yours worked flawlessly. It seems that 4.1.3 has a bug that has
>> been fixed in 5.0.0. I hadn't tested 5.0 before, because I thought it
>> would be more unstable.
>> I additionally tested 4.1.3 + leptonica 1.8.2 (I was on some 1.7.x version
>> before), and the problem with the corrupt PDF still exists. But that's not
>> a problem; I will use 5.0.0 instead.
>>
>> Thank you zdenop!
>>
>> zdenop wrote on Monday, 22 November 2021 at 14:33:15 UTC+1:
>>
>>> This is an old snippet of mine, so part of the code (opening the input
>>> image as PIX) is not needed for PDF rendering.
>>>
>>> Zdenko
>>>
>>>
>>> On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> Here is some simple code that works for me (with tesseract 5 and
>>>> leptonica 1.82):
>>>>
>>>> #include <leptonica/allheaders.h>
>>>> #include <tesseract/baseapi.h>
>>>> #include <tesseract/renderer.h>
>>>> #include <cerrno>
>>>> #include <cstdio>
>>>> #include <cstdlib>
>>>> #include <cstring>
>>>> #include <string>
>>>>
>>>> int main() {
>>>>     const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>>>     std::string language_ = "eng";
>>>>     std::string inputFile_ = "input.png";
>>>>     const char* outputbase = "output";  // renderer appends ".pdf"
>>>>
>>>>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>>>>     if (api->Init(datapath, language_.c_str(),
>>>>                   tesseract::OEM_LSTM_ONLY)) {
>>>>         fprintf(stderr, "Could not initialize tesseract.\n");
>>>>         delete api;
>>>>         return EXIT_FAILURE;
>>>>     }
>>>>
>>>>     // Check that Leptonica can read the input. ProcessPages() below
>>>>     // opens the file itself, so this image is not needed for rendering.
>>>>     PIX *sourceImg = pixRead(inputFile_.c_str());
>>>>     if (!sourceImg) {
>>>>         fprintf(stderr, "Leptonica can't process input file: %s\n",
>>>>                 inputFile_.c_str());
>>>>         api->End();
>>>>         delete api;
>>>>         return EXIT_FAILURE;
>>>>     }
>>>>     api->SetImage(sourceImg);
>>>>     api->SetInputName(inputFile_.c_str());
>>>>     api->SetOutputName(outputbase);
>>>>
>>>>     tesseract::TessPDFRenderer* renderer =
>>>>         new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>>>>     bool succeed = renderer->happy();
>>>>     if (!succeed) {
>>>>         fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>>>                 strerror(errno));
>>>>     } else {
>>>>         // ProcessPages() calls BeginDocument()/EndDocument() itself.
>>>>         succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0,
>>>>                                     renderer);
>>>>         if (!succeed) {
>>>>             fprintf(stderr, "Error during processing.\n");
>>>>         }
>>>>     }
>>>>
>>>>     // Release everything on both the success and the failure path.
>>>>     delete renderer;  // closes the output file
>>>>     api->End();
>>>>     delete api;
>>>>     pixDestroy(&sourceImg);
>>>>     return succeed ? EXIT_SUCCESS : EXIT_FAILURE;
>>>> }
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
>>>> tesser...@googlegroups.com> wrote:
>>>>
>>>>> Hi zdenop,
>>>>>
>>>>> thanks for your tip, but I'm using the ProcessPage*s* function, so it
>>>>> should write the header and footer parts of the file itself.
>>>>> BUT I had played a bit with BeginDocument() before and EndDocument()
>>>>> after ProcessPage(), and the resulting file differs a lot. Sadly, the
>>>>> file is still corrupt.
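
The difference between the two call styles described above can be pictured with a toy model. This is only a sketch under assumptions: `ToyRenderer`, `processPage` and `processPages` are made-up stand-ins, not Tesseract classes or functions, and the bracketed markers are placeholders for the real header, page and trailer bytes. It mirrors the behaviour described in this thread, where the multi-page call frames the document itself and the single-page call leaves the framing to the caller.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a document renderer. This is NOT Tesseract's
// TessPDFRenderer; it only mimics the begin / add page / end protocol.
class ToyRenderer {
public:
    void BeginDocument() { out_ << "[header]"; }
    void AddPage(const std::string& text) { out_ << "[page:" << text << "]"; }
    void EndDocument() { out_ << "[trailer]"; }
    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};

// Single-page entry point: emits the page only and leaves the document
// framing (header and trailer) to the caller.
void processPage(ToyRenderer& renderer, const std::string& text) {
    renderer.AddPage(text);
}

// Multi-page entry point: brackets all pages with header and trailer itself.
void processPages(ToyRenderer& renderer,
                  const std::vector<std::string>& pages) {
    renderer.BeginDocument();
    for (const std::string& page : pages) {
        processPage(renderer, page);
    }
    renderer.EndDocument();
}
```

In this model, `processPages(r, {"p1"})` and a hand-written `BeginDocument()`, `processPage(r, "p1")`, `EndDocument()` sequence produce the same framed output, while a bare `processPage()` call yields a page with no header or trailer.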
>>>>>
>>>>> So the problem seems to come from the BeginDocument()/EndDocument()
>>>>> calls failing. But even there I'm seeing mysterious behaviour.
>>>>> Using only EndDocument(), I get something like a footer at the end of
>>>>> the file:
>>>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
>>>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>>>
>>>>> But it suddenly stops at "Produce". And when I use BeginDocument(),
>>>>> ProcessPage() and then EndDocument(), the file ends with raw bytes
>>>>> and there is no "endstream" or "endobj".
>>>>> I've updated to the latest version, 4.1.3, but the problem still exists.
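
A truncated file like the one described above can be detected without opening a PDF reader. The following is a minimal sketch, assuming only the outer layout of the PDF format: the file starts with `%PDF-` and a complete file ends with a `startxref` entry followed by `%%EOF`. The names `looksLikeCompletePdf` and `readFile` are made up for illustration, and the check is a heuristic, not a PDF validator.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Returns true if `data` carries the outer markers of a complete PDF file:
// a "%PDF-" header at offset 0, a final "%%EOF" marker followed only by
// whitespace, and a "startxref" keyword somewhere before that marker.
bool looksLikeCompletePdf(const std::string& data) {
    if (data.compare(0, 5, "%PDF-") != 0) {
        return false;  // missing or damaged header
    }
    const std::size_t eof = data.rfind("%%EOF");
    if (eof == std::string::npos) {
        return false;  // trailer was never written
    }
    if (data.find_first_not_of(" \t\r\n", eof + 5) != std::string::npos) {
        return false;  // unexpected bytes after the end-of-file marker
    }
    return data.rfind("startxref", eof) != std::string::npos;
}

// Reads a whole file in binary mode; returns an empty string if the file
// cannot be opened.
std::string readFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

A call such as `looksLikeCompletePdf(readFile("output.pdf"))` (the file name is an example) returns false for a file that breaks off in the middle of an object, as in the screenshot.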
>>>>>
>>>>> I updated the bug branch at
>>>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so
>>>>> that the problem is reproducible.
>>>>> To disable the BeginDocument() call, one has to comment out
>>>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>>>> .
>>>>>
>>>>> I tried to use the code from the tesseract CLI 1:1, but it still does
>>>>> not work...
>>>>>
>>>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>>>
>>>>>> This seems to be the same problem as
>>>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>>>
>>>>>> Did you use BeginDocument() and EndDocument()?
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>>>>>> tesser...@googlegroups.com> wrote:
>>>>>>
>>>>>>> Already described in this issue:
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>>>
>>>>>>> I'm trying to generate a searchable PDF from a JPG image, but the
>>>>>>> file that gets written is an invalid PDF that can't be read by any
>>>>>>> PDF reader.
>>>>>>>
>>>>>>> I have added a Docker image to the issue for reproducing the
>>>>>>> problem, but here is the bash snippet for it:
>>>>>>>
>>>>>>> *git clone g...@github.com:dnnspaul/gosseract.git*
>>>>>>> *cd gosseract*
>>>>>>> *git checkout tesseract/bug/3652*
>>>>>>>
>>>>>>> *docker build -t tessbug .*
>>>>>>> *docker run -it -v $PWD/tmp:/tmp tessbug go run main.go*
>>>>>>>
>>>>>>> When I pass the same file to the tesseract CLI, the resulting PDF is
>>>>>>> readable, but I can't find any difference between the CLI and my
>>>>>>> snippet.
>>>>>>>
>>>>>>> Thanks in advance for any help! I'm very sorry, I'm more of a Go
>>>>>>> developer than a C++ developer, so I have some trouble with even the
>>>>>>> simplest syntax, but I tried my best.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>

