Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

'blaumedia' via tesseract-ocr Tue, 23 Nov 2021 07:11:20 -0800

Zdenop! Great news! :D
I recompiled OpenCV on my machine and somehow that resolved the problem. Now 
I can use v5.0.0 and OpenCV together without any problems. It seems OpenCV 
was linked against old libs in /usr/local/lib (it always searched for 
libtesseract.so.4, but that file didn't exist because I only installed v5). 
It was probably an easy problem for a C developer, but like I said, I'm just 
an entry-level Go developer.
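
A minimal sketch of how one might check which libtesseract versions are actually present in a library directory such as /usr/local/lib. The function name findLibraries and the prefix-matching approach are illustrative assumptions, not part of tesseract or OpenCV; the dynamic loader's real search order involves more than one directory.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper (not a tesseract or OpenCV API).
// Returns the names of all entries in `dir` whose file name starts with
// `prefix` (for example "libtesseract.so"), sorted alphabetically.
// Returns an empty list if `dir` does not exist or is not a directory.
std::vector<std::string> findLibraries(const std::filesystem::path& dir,
                                       const std::string& prefix) {
    namespace fs = std::filesystem;
    std::vector<std::string> names;
    std::error_code ec;
    if (!fs::is_directory(dir, ec)) {
        return names;
    }
    // Use the error_code overloads so an unreadable directory ends the loop
    // instead of throwing.
    for (fs::directory_iterator it(dir, ec), end; !ec && it != end;
         it.increment(ec)) {
        const std::string name = it->path().filename().string();
        if (name.compare(0, prefix.size(), prefix) == 0) {
            names.push_back(name);
        }
    }
    std::sort(names.begin(), names.end());
    return names;
}
```

Calling findLibraries("/usr/local/lib", "libtesseract.so") and seeing only .so.5 entries would be consistent with the situation described above, where a binary built against libtesseract.so.4 cannot find its library.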


So thank you very very much!
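
For anyone landing on this thread with the same symptom (an output file that ends in binary data with no "endstream" or "endobj"), here is a rough sketch of a quick truncation check. It only tests that the data starts with "%PDF-" and has a "%%EOF" marker near the end; the function names and the 1024-byte window are assumptions for illustration, and passing the check does not prove the PDF is valid.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: true if `bytes` starts with "%PDF-" and a "%%EOF"
// marker begins somewhere within the final 1024 bytes.
bool hasPdfFraming(const std::string& bytes) {
    const std::string header = "%PDF-";
    const std::string trailer = "%%EOF";
    if (bytes.compare(0, header.size(), header) != 0) {
        return false;
    }
    const std::size_t tailStart =
        bytes.size() > 1024 ? bytes.size() - 1024 : 0;
    return bytes.find(trailer, tailStart) != std::string::npos;
}

// Reads the whole file in binary mode and applies hasPdfFraming().
// Returns false if the file cannot be opened.
bool fileHasPdfFraming(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;
    }
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return hasPdfFraming(bytes);
}
```

A file that fails this check was most likely cut off before the renderer finished writing, which would match the missing footer described further down the thread.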

zdenop wrote on Monday, 22 November 2021 at 22:27:47 UTC+1:

> Hello,
>
> yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR 
> there has been no change in the behaviour of the renderers (including 
> TessPDFRenderer) since the 4.0-beta version.
> Also, I did not understand your problem with OpenCV - AFAIK tesseract is only 
> an optional dependency there, and OpenCV uses only very limited tesseract 
> features[1]. Because you will use tesseract directly for creating the PDF 
> anyway, it does not make sense to care about old tesseract support in OpenCV.
>
> [1] 
> https://docs.opencv.org/4.5.4/d7/ddc/classcv_1_1text_1_1OCRTesseract.html
>
>
> Zdenko
>
>
> On Mon, 22 Nov 2021 at 19:54, 'blaumedia' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Hey zdenop,
>>
>> it turns out I can't rely on 5.0.0, because OpenCV seems to be 
>> compatible only with 4.x so far. (OpenCV is another requirement of my project.)
>> Does your script from above work on tesseract 4.x for you?
>>
>> blaumedia wrote on Monday, 22 November 2021 at 18:51:38 UTC+1:
>>
>>> It works!
>>>
>>> I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.82, and with that, both my 
>>> code and yours worked flawlessly. It seems 4.1.3 has a bug that 
>>> has been fixed in 5.0.0. I hadn't tested 5.0 before, because I thought it 
>>> would be more unstable.
>>> I additionally tested 4.1.3 + leptonica 1.82 (I was on some 1.7.x version 
>>> before), and the problem with the corrupt PDF still exists. But that's not 
>>> a problem; I will use 5.0.0 instead.
>>>
>>> Thank you zdenop!
>>>
>>> zdenop wrote on Monday, 22 November 2021 at 14:33:15 UTC+1:
>>>
>>>> This is an old snippet of mine, so part of the code is unnecessary for PDF 
>>>> rendering (opening the input image as a PIX).
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <zde...@gmail.com> wrote:
>>>>
>>>>> Here is some simple code that works for me (with tesseract 5 and 
>>>>> leptonica 1.82):
>>>>>
>>>>> #include <cerrno>
>>>>> #include <cstdio>
>>>>> #include <cstdlib>
>>>>> #include <cstring>
>>>>> #include <string>
>>>>>
>>>>> #include <leptonica/allheaders.h>
>>>>> #include <tesseract/baseapi.h>
>>>>> #include <tesseract/renderer.h>
>>>>>
>>>>> int main() {
>>>>>     const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>>>>     std::string language_ = "eng";
>>>>>     std::string inputFile_ = "input.png";
>>>>>     const char* outputbase = "output";  // the renderer writes output.pdf
>>>>>
>>>>>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>>>>>     if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
>>>>>         std::fprintf(stderr, "Could not initialize tesseract.\n");
>>>>>         delete api;
>>>>>         return EXIT_FAILURE;
>>>>>     }
>>>>>
>>>>>     // Not required for PDF rendering: ProcessPages() reads the input
>>>>>     // file itself. Kept as an early check that Leptonica can decode it.
>>>>>     PIX *sourceImg = pixRead(inputFile_.c_str());
>>>>>     if (!sourceImg) {
>>>>>         std::fprintf(stderr, "Leptonica can't process input file: %s\n",
>>>>>                      inputFile_.c_str());
>>>>>         api->End();
>>>>>         delete api;
>>>>>         return EXIT_FAILURE;
>>>>>     }
>>>>>     api->SetImage(sourceImg);
>>>>>     api->SetInputName(inputFile_.c_str());
>>>>>     api->SetOutputName(outputbase);
>>>>>
>>>>>     tesseract::TessPDFRenderer* renderer =
>>>>>         new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>>>>>     bool succeed = renderer->happy();
>>>>>     if (!succeed) {
>>>>>         std::fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>>>>                      std::strerror(errno));
>>>>>     } else {
>>>>>         // ProcessPages() calls BeginDocument() and EndDocument() on the
>>>>>         // renderer, so they must not be called again here.
>>>>>         succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
>>>>>         if (!succeed) {
>>>>>             std::fprintf(stderr, "Error during processing.\n");
>>>>>         }
>>>>>     }
>>>>>
>>>>>     // The renderer's destructor closes the output file, so delete it
>>>>>     // on every path (the original skipped this and leaked the API too).
>>>>>     delete renderer;
>>>>>     api->End();
>>>>>     delete api;
>>>>>     pixDestroy(&sourceImg);
>>>>>     return succeed ? EXIT_SUCCESS : EXIT_FAILURE;
>>>>> }
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
>>>>> tesser...@googlegroups.com> wrote:
>>>>>
>>>>>> Hi zdenop,
>>>>>>
>>>>>> thanks for your tip, but I'm using the ProcessPage*s* function, so 
>>>>>> it should write the header and footer parts of the file itself.
>>>>>> BUT I've played a bit with calling BeginDocument() before ProcessPage() 
>>>>>> and EndDocument() after it, and the resulting file is very different. 
>>>>>> Sadly, the file is still corrupt.
>>>>>>
>>>>>> So it seems the problem comes from a failing BeginDocument()/EndDocument() 
>>>>>> call. But even there I'm seeing mysterious behaviour.
>>>>>> Using only EndDocument(), I get something like a footer at the end 
>>>>>> of the file:
>>>>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH: 
>>>>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>>>>
>>>>>> But it suddenly stops at "Produce". And when I use 
>>>>>> BeginDocument(), ProcessPage() and then EndDocument(), the file ends 
>>>>>> with binary data and there is no "endstream" or "endobj".
>>>>>> I've updated to the latest 4.1.3 version, but the problem still exists.
>>>>>>
>>>>>> I updated the bug branch in 
>>>>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so 
>>>>>> the problem is reproducible.
>>>>>> To disable BeginDocument(), one has to comment out 
>>>>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>>>>> .
>>>>>>
>>>>>> I tried to use the code from the tesseract CLI 1:1, but it still does 
>>>>>> not work...
>>>>>>
>>>>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>>>>
>>>>>>> This seems like the same problem as 
>>>>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>>>>
>>>>>>> Did you use BeginDocument() and EndDocument()?
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>>>>>>> tesser...@googlegroups.com> wrote:
>>>>>>>
>>>>>>>> Already described in this issue: 
>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>>>>
>>>>>>>> I'm trying to generate a searchable PDF from a JPG image, 
>>>>>>>> but the file that gets written is an invalid PDF that can't be 
>>>>>>>> read by any PDF reader.
>>>>>>>>
>>>>>>>> I have added a Docker image for reproducing the problem to the 
>>>>>>>> issue, but here is the bash snippet for it:
>>>>>>>>
>>>>>>>> git clone g...@github.com:dnnspaul/gosseract.git
>>>>>>>> cd gosseract
>>>>>>>> git checkout tesseract/bug/3652
>>>>>>>>
>>>>>>>> docker build -t tessbug .
>>>>>>>> docker run -it -v $PWD/tmp:/tmp tessbug go run main.go
>>>>>>>>
>>>>>>>> When I feed the file to the tesseract CLI, the resulting PDF 
>>>>>>>> is readable, but I can't find any difference between the CLI and my 
>>>>>>>> snippet.
>>>>>>>>
>>>>>>>> Thanks in advance for any help! I'm very sorry, I'm more of a Go 
>>>>>>>> developer than a C++ developer, so I have problems with even the 
>>>>>>>> simplest syntax, but I tried my best.
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>

