Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

Zdenko Podobny Mon, 22 Nov 2021 05:35:18 -0800

I do not understand your question. How is it related to the topic under discussion?


Zdenko


On Mon, 22 Nov 2021 at 14:34, Sarah Jane CHANNEL <kangchitan2...@gmail.com>
wrote:

> Can this code read text?
>
> On Mon, 22 Nov 2021, 21:28 Zdenko Podobny, <zde...@gmail.com> wrote:
>
>> Here is a simple example that works for me (with tesseract 5 and leptonica
>> 1.82):
>>
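>> #include <leptonica/allheaders.h>
>> #include <tesseract/baseapi.h>
>> #include <tesseract/renderer.h>
>> #include <cerrno>   // errno
>> #include <cstdio>   // fprintf, printf
>> #include <cstdlib>  // exit, EXIT_FAILURE
>> #include <cstring>  // strerror
>> #include <string>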
>>
>> int main() {
>>     const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>     std::string language_ = "eng";
>>     std::string inputFile_ = "input.png";
>>     const char* outputbase = "output";
>>
>>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>>     // Init() returns 0 on success, so a non-zero result is an error.
>>     if (api->Init(datapath, language_.c_str(),
>>                   tesseract::OEM_LSTM_ONLY)) {
>>         fprintf(stderr, "Could not initialize tesseract.\n");
>>         exit(1);
>>     }
>>
>>     PIX *sourceImg = pixRead(inputFile_.c_str());
>>     if (!sourceImg) {
>>         fprintf(stderr, "Leptonica can't process input file: %s\n",
>>                 inputFile_.c_str());
>>         return EXIT_FAILURE;
>>     }
>>     api->SetImage(sourceImg);
>>     api->SetInputName(inputFile_.c_str());
>>     api->SetOutputName(outputbase);
>>
>>     tesseract::TessPDFRenderer* renderer =
>>         new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
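>>     if (!renderer->happy()) {
>>         fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>                 strerror(errno));
>>         delete renderer;
>>         // Stop here: the renderer must not be used after it is deleted.
>>         return EXIT_FAILURE;
>>     }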
>>
>>     // ProcessPages() calls the renderer's BeginDocument() and
>>     // EndDocument() itself.
>>     bool succeed =
>>         api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
>>     if (!succeed) {
>>         fprintf(stderr, "Error during processing.\n");
>>         return EXIT_FAILURE;
>>     }
>>
>>     delete renderer;  // closes the output file
>>     api->End();
>>     delete api;
>>     pixDestroy(&sourceImg);
>>     return 0;
>> }
>>
>>
>> Zdenko
>>
>>
>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
>> tesseract-ocr@googlegroups.com> wrote:
>>
>>> Hi zdenop,
>>>
>>> thanks for your tip, but I'm using the ProcessPage*s* function, so it
>>> should write the header and footer of the file itself.
>>> However, I've experimented a bit with calling BeginDocument() before
>>> ProcessPage() and EndDocument() after it, and the resulting file is very
>>> different. Sadly, the file is still corrupt.
>>>
>>> So the problem seems to lie in the BeginDocument()/EndDocument() calls
>>> failing. But even there I'm seeing mysterious behavior.
>>> Using only EndDocument(), I get something like a footer at the end of
>>> the file:
>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH:
>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>
>>> But it suddenly stops at "Produce". When I use BeginDocument(),
>>> ProcessPage() and then EndDocument(), however, the file ends with raw
>>> bytes and there is no "endstream" or "endobj".
>>> I've updated to the latest version, 4.1.3, but the problem still exists.
>>>
>>> I updated the bug branch in
>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the
>>> problem is reproducible.
>>> To disable the BeginDocument() call, one has to comment out
>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>> .
>>>
>>> I tried to use the code from the tesseract CLI 1:1, but it still does not
>>> work...
>>>
>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>
>>>> This seems to be the same problem as
>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>
>>>> Did you use BeginDocument() and EndDocument()?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>>>> tesser...@googlegroups.com> wrote:
>>>>
>>>>> Already described in this issue:
>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>
>>>>> I'm trying to generate a searchable PDF from a JPG image, but the
>>>>> output file is an invalid PDF that can't be opened by any PDF
>>>>> reader.
>>>>>
>>>>> I have added a Docker image for reproducing the problem to the
>>>>> issue, but here is the bash snippet for it:
>>>>>
>>>>> *git clone g...@github.com:dnnspaul/gosseract.git*
>>>>> *git checkout tesseract/bug/3652*
>>>>>
>>>>> *docker build -t tessbug .*
>>>>> *docker run -it -v $PWD/tmp:/tmp tessbug go run main.go*
>>>>>
>>>>> When I feed the file to the tesseract CLI, the resulting PDF is
>>>>> readable, but I can't find any difference between the CLI and my snippet.
>>>>>
>>>>> Thanks in advance for any help! I'm very sorry, I'm more of a Go
>>>>> developer than a C++ developer, so I struggle with even the
>>>>> simplest syntax, but I tried my best.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>

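The truncation described in the thread (a file that stops at "Produce", or ends without "endstream" or "endobj") can be confirmed without a PDF reader by looking at the raw bytes: a complete PDF starts with "%PDF-" and ends with an "%%EOF" marker. Below is a minimal C++ sketch of such a check. It is not part of tesseract or gosseract; the names PdfCheck, checkPdfBytes and readFile are hypothetical, and the 1024-byte tail window is an arbitrary choice for this sketch. It only detects a missing header or end marker and does not validate the PDF structure in between.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Result of a coarse structural check on the raw bytes of a PDF file.
struct PdfCheck {
    bool has_header;  // data starts with "%PDF-"
    bool has_eof;     // "%%EOF" marker found near the end of the data
};

// Hypothetical helper: reads a whole file into a string (binary mode).
// Returns an empty string if the file cannot be opened.
std::string readFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: checks only the header and the end-of-file marker.
PdfCheck checkPdfBytes(const std::string& data) {
    PdfCheck result;
    result.has_header = data.compare(0, 5, "%PDF-") == 0;

    // Search for the end-of-file marker in the last 1024 bytes only
    // (window size is an assumption of this sketch).
    const std::size_t tail_len = 1024;
    const std::size_t start =
        data.size() > tail_len ? data.size() - tail_len : 0;
    result.has_eof = data.find("%%EOF", start) != std::string::npos;
    return result;
}
```

Calling checkPdfBytes(readFile("output.pdf")) on a file produced by the tesseract CLI and on one produced by the failing snippet should show whether the latter is missing its trailer: a truncated file would report has_header as true and has_eof as false.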
