Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

'blaumedia' via tesseract-ocr Sun, 21 Nov 2021 14:16:44 -0800

Hi zdenop,

thanks for the tip, but I'm using the ProcessPage*s* function, so it 
should write the header and trailer of the file itself.
However, I also tried calling BeginDocument() before ProcessPage() and 
EndDocument() after it, and the resulting file differs significantly. Sadly, 
the file is still corrupt.


So the problem seems to lie in the BeginDocument/EndDocument calls failing. 
But even there I'm seeing strange behavior.
Using only EndDocument(), I get something like a footer at the end of the 
file:
[image: screenshot showing the end of the generated PDF file]

However, it stops abruptly at "Produce". When I call BeginDocument(), 
ProcessPage(), and then EndDocument(), the file ends in raw bytes and 
contains no closing "endstream" or "endobj".
I've updated to the latest version, 4.1.3, but the problem persists.
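
A quick way to confirm this kind of truncation without opening a PDF reader 
is to look for the markers a complete PDF normally carries: the "%PDF-" 
header at the start, and "startxref" plus "%%EOF" near the end. The Go 
sketch below is only a rough diagnostic under those assumptions. It is not 
part of gosseract or tesseract, the name pdfProblems and the 1024-byte tail 
window are hypothetical choices made for illustration, and passing the check 
does not prove that a file is valid.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// pdfProblems reports which basic structural markers are missing from data.
// It is a heuristic: an empty result means the markers are present, not that
// the PDF is well-formed.
func pdfProblems(data []byte) []string {
	var problems []string

	// A PDF file starts with a "%PDF-" version header.
	if !bytes.HasPrefix(data, []byte("%PDF-")) {
		problems = append(problems, "missing %PDF- header")
	}

	// The closing markers sit at the end of the file, so only the tail is
	// searched (1024 bytes is an arbitrary window chosen for this sketch).
	tail := data
	if len(tail) > 1024 {
		tail = tail[len(tail)-1024:]
	}
	if !bytes.Contains(tail, []byte("startxref")) {
		problems = append(problems, "missing startxref near end of file")
	}
	if !bytes.Contains(tail, []byte("%%EOF")) {
		problems = append(problems, "missing %%EOF marker near end of file")
	}
	return problems
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pdfcheck <file.pdf>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	problems := pdfProblems(data)
	if len(problems) == 0 {
		fmt.Println("basic PDF markers found")
		return
	}
	for _, p := range problems {
		fmt.Println("problem:", p)
	}
}
```

Running this against both the CLI output and the output of the bridge code 
would show whether the trailer is missing entirely or only partially written, 
which narrows down whether the file was cut off before it was fully flushed.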

I updated the bug branch at 
https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the 
problem is reproducible.
To disable BeginDocument, one has to comment out 
https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187.

I tried to copy the code from the tesseract CLI 1:1, but it still does not 
work...

zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:

> seems like the same problem as 
> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>
> Did you use BeginDocument and EndDocument?
>
> Zdenko
>
>
> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Described already in issue: 
>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>
>> I'm trying to generate a searchable PDF from a JPG image, but the output 
>> file is an invalid PDF that no PDF reader can open.
>>
>> I have added a Docker image for reproducing the problem to the 
>> issue, but here is the bash snippet for it:
>>
>> *git clone g...@github.com:dnnspaul/gosseract.git*
>> *cd gosseract*
>> *git checkout tesseract/bug/3652*
>>
>> *docker build -t tessbug .*
>> *docker run -it -v $PWD/tmp:/tmp tessbug go run main.go*
>>
>> When I pass the same file to the tesseract CLI, the resulting PDF is 
>> readable, but I can't find any difference between the CLI and my snippet.
>>
>> Thanks in advance for any help! I'm sorry, I'm more of a Go developer 
>> than a C++ developer, so I struggle even with the simplest syntax, but I 
>> tried my best.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>
>

