Re: [tesseract-ocr] How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.

Zdenko Podobny Wed, 14 Apr 2021 10:03:25 -0700

Tesseract is an OCR engine; it does not change the input image.
To recompress a PDF you need other tools, e.g. jbig2enc [1] or mupdf [2].
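
As a rough illustration of the mupdf route, here is a minimal Python sketch that builds and runs a `mutool convert` call on the PDF Tesseract produced. The helper names (`mutool_recompress_cmd`, `recompress`) are hypothetical, and the `-O compress,garbage` output options are taken from the mutool convert manual as I understand it; check them against the documentation for your installed version. Since a scanned page is mostly one large image, stream-level recompression like this may give only modest savings compared with re-encoding the image itself (for example with jbig2enc for bilevel scans).

```python
import subprocess
from typing import List


def mutool_recompress_cmd(src: str, dst: str) -> List[str]:
    """Build a `mutool convert` command that rewrites a PDF.

    Assumption: the PDF writer accepts the `compress` and `garbage`
    output options via -O, as listed in the mutool convert manual.
    """
    return ["mutool", "convert", "-o", dst, "-O", "compress,garbage", src]


def recompress(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if mutool fails."""
    subprocess.run(mutool_recompress_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command; call recompress() to actually run mutool.
    print(" ".join(mutool_recompress_cmd("outputFile.pdf", "smaller.pdf")))
```

Comparing `os.path.getsize()` of the two files afterwards is an easy way to see whether the step is worth keeping for your documents.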


[1] https://github.com/agl/jbig2enc
[2] https://mupdf.com/docs/manual-mutool-convert.html

Zdenko


On Wed, 14 Apr 2021 at 15:26, Sharp Subbu <sharpsu...@gmail.com> wrote:

> Dear Merlijn,
>
> Thank you very much for your reply.
> We are doing a feasibility study on using Tesseract OCR features in our
> project on Windows 10 English 32/64-bit OS.
> As part of this study, I am trying to find out whether it is possible to
> compress / reduce the size of the PDF file created by Tesseract OCR
> (command line: tesseract input.tif outputFile pdf).
> To answer this question, I checked the Tesseract forums and the
> Tesseract APIs, but did not find any related information. Hence, I have
> posted the same question in the Tesseract Google forums.
> I received a nice reply from you on this. Thank you very much for
> that.
> First, please clarify whether the Tesseract OCR API supports reducing /
> compressing the OCRed PDF file. Is this support present in the Tesseract
> OCR source code or not?
>
> Kindly find the attached sample PDF file "Sample.pdf" for your reference.
> Kindly compress it and send the compressed pdf file.
>
> Thank you very much for your nice help.
> Subramanyam
>
>
> On Wednesday, April 14, 2021 at 6:27:43 PM UTC+5:30 Merlijn Wajer wrote:
>
>> Hi,
>>
>> On 14/04/2021 13:52, Sharp Subbu wrote:
>> > Dear friends,
>> >
>> > Kindly guide/help us to find solution for the below point:
>> > =============================
>> > How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.
>> > ===============================
>>
>> Not sure exactly what use case you have in mind (OS, etc), but I have a
>> suggestion, as I dealt with this in the recent past.
>>
>> I developed something similar to the foxit/luratech "PDF compression",
>> in Python and it is entirely open source. It uses the Tesseract hOCR
>> result files. This can lead to 3-15x compression ratios (sometimes more,
>> depending on the image formats that you use).
>>
>> It converts images to JPEG2000 for best compression (but slower loading
>> times) and also attempts to create a "foreground", "background" and
>> "mask" image (Mixed Raster Content [0]), which can significantly improve
>> compression. It inserts a text layer just like Tesseract does (the code
>> is a port of Tesseract's C++).
>>
>> Here is some info [1], and here is the source code [2].
>>
>> There is an "openjpeg-wip" branch that can use OpenJPEG instead of Kakadu
>> for image compression.
>>
>> Example usage to create a PDF from a set of images:
>>
>> recode_pdf --from-imagestack 'images/*.jp2' --hocr-file
>> combined_tesseract_results.html -o out.pdf -v --use-openjpeg -m 2
>>
>> There is also the --from-pdf option instead of --from-imagestack, but
>> that has only seen light testing.
>>
>> You can combine the hOCR result files using hocr-combine-stream [3].
>>
>> If this suits your use case, I'd be happy to help/assist here or off
>> list. There aren't many users of the software yet (the same offer
>> extends for others reading this list). If you have an example PDF that
>> you can send me, I'd be happy to try to send you a compressed PDF back.
>>
>> Cheers,
>> Merlijn
>>
>>
>> [0] https://en.wikipedia.org/wiki/Mixed_raster_content
>> [1] https://archive.org/~merlijn/projects/archive-pdf-tools/index.html
>> [2] https://git.archive.org/merlijn/archive-pdf-tools
>> [3] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
>>
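
The foreground/background/mask idea Merlijn mentions (Mixed Raster Content) can be illustrated with a toy Python sketch. This is not the algorithm used by archive-pdf-tools; the function names `split_mrc` and `merge_mrc` and the fixed threshold of 128 are assumptions made purely for illustration. The point is that the mask is a 1-bit image that compresses very well, while the two colour layers become smooth enough to be compressed much harder than the original page.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, pixel values 0..255


def split_mrc(page: Image, threshold: int = 128) -> Tuple[Image, Image, Image]:
    """Split a page into (mask, foreground, background) layers.

    Hypothetical fixed-threshold split; real tools typically use
    adaptive binarization and the OCR word boxes instead.
    """
    # mask is 1 where the pixel is dark (assumed to be text/ink)
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    dark = [px for row in page for px in row if px < threshold]
    light = [px for row in page for px in row if px >= threshold]
    # Fill the hidden parts of each layer with that layer's average value,
    # so the layer stays smooth where the mask hides it anyway.
    fg_fill = sum(dark) // len(dark) if dark else 0
    bg_fill = sum(light) // len(light) if light else 255
    foreground = [[px if m else fg_fill for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    background = [[bg_fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    return mask, foreground, background


def merge_mrc(mask: Image, foreground: Image, background: Image) -> Image:
    """Recombine the layers: foreground where mask is 1, else background."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


if __name__ == "__main__":
    page = [[250, 20, 248], [252, 30, 251]]  # light paper, one dark stroke
    mask, fg, bg = split_mrc(page)
    print(mask)
    print(merge_mrc(mask, fg, bg) == page)
```

In this sketch the layers are stored losslessly, so merging them reproduces the page exactly; in a real MRC PDF the foreground and background would be downsampled and lossily compressed (e.g. as JPEG2000), which is where the size reduction comes from.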

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8y7V20HQ4pTRBq3TL6Z_cBfQwb_%3D-0oKtceER93Bk_Kww%40mail.gmail.com.
