Tesseract is an OCR engine and it does not change the input image. To recompress a PDF you need other tools, e.g. jbig2enc [1], mupdf [2]...
[1] https://github.com/agl/jbig2enc
[2] https://mupdf.com/docs/manual-mutool-convert.html

Zdenko

On Wed, 14 Apr 2021 at 15:26, Sharp Subbu <sharpsu...@gmail.com> wrote:

> Dear Merlijn,
>
> Thank you very much for your reply.
> We are doing a feasibility study on using Tesseract OCR features in our
> project on Windows 10 English 32/64-bit OS.
> As part of this study, I am trying to find out whether it is possible to
> compress / reduce the size of the PDF file created by Tesseract OCR
> (command line: Tesseract input.tif outputFile pdf).
> To answer this question, I checked the Tesseract forums and the
> Tesseract APIs. I did not find any related information. Hence, I
> posted the same question in the Tesseract Google forum.
> Regarding this, I received a nice reply from you. Thank you very much for
> that.
> Firstly, please clarify whether the Tesseract OCR API supports reducing /
> compressing the OCRed PDF file. Is this support present in the Tesseract
> OCR source code or not?
>
> Kindly find the attached sample PDF file "Sample.pdf" for your reference.
> Kindly compress it and send the compressed PDF file.
>
> Thank you very much for your nice help.
> Subramanyam
>
>
> On Wednesday, April 14, 2021 at 6:27:43 PM UTC+5:30 Merlijn Wajer wrote:
>
>> Hi,
>>
>> On 14/04/2021 13:52, Sharp Subbu wrote:
>> > Dear friends,
>> >
>> > Kindly guide/help us to find a solution for the below point:
>> > =============================
>> > How to reduce the size of an OCRed PDF file using Tesseract OCR APIs.
>> > ===============================
>>
>> Not sure exactly what use case you have in mind (OS, etc), but I have a
>> suggestion, as I dealt with this in the recent past.
>>
>> I developed something similar to the foxit/luratech "PDF compression",
>> in Python and it is entirely open source. It uses the Tesseract hOCR
>> result files. This can lead to 3-15x compression ratios (sometimes more,
>> depending on the image formats that you use).
>>
>> It converts images to JPEG2000 for best compression (but slower loading
>> times) and also attempts to create a "foreground", "background" and
>> "mask" image (Mixed Raster Content [0]), which can significantly improve
>> compression. It inserts a text layer just like Tesseract does (the code
>> is a port of Tesseract's C++).
>>
>> Here is some info [1], and here is the source code [2].
>>
>> There is an "openjpeg-wip" branch that can use OpenJPEG instead of Kakadu
>> for image compression.
>>
>> Example usage to create a PDF from a set of images:
>>
>> recode_pdf --from-imagestack 'images/*.jp2' --hocr-file combined_tesseract_results.html -o out.pdf -v --use-openjpeg -m 2
>>
>> There is also the --from-pdf option instead of --from-imagestack, but
>> that has only seen light testing.
>>
>> You can combine the hOCR result files using hocr-combine-stream [3]
>>
>> If this suits your use case, I'd be happy to help/assist here or off
>> list. There aren't many users of the software yet (the same offer
>> extends for others reading this list). If you have an example PDF that
>> you can send me, I'd be happy to try to send you a compressed PDF back.
>>
>> Cheers,
>> Merlijn
>>
>>
>> [0] https://en.wikipedia.org/wiki/Mixed_raster_content
>> [1] https://archive.org/~merlijn/projects/archive-pdf-tools/index.html
>> [2] https://git.archive.org/merlijn/archive-pdf-tools
>> [3] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/3993df63-515e-42b9-9e31-ffd5eb0f2d32n%40googlegroups.com
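
For anyone who wants to script the workflow discussed in this thread, here is a minimal Python sketch. It was not posted by any of the participants. It only assembles the two command lines that appear above (the tesseract call and Merlijn's recode_pdf example) and runs them when the tools are found on PATH. The function names and file names are hypothetical, asking Tesseract for "hocr" output instead of "pdf" is an assumption about how the hOCR input would be produced, and the recode_pdf flags are copied as-is from the example, so they may differ between versions or branches.

```python
import shutil
import subprocess


def tesseract_hocr_cmd(image, out_base):
    """Build argv for a Tesseract run that writes hOCR instead of a PDF.

    Assumption: hOCR is the input recode_pdf needs, as described in the thread.
    """
    return ["tesseract", image, out_base, "hocr"]


def recode_pdf_cmd(image_glob, hocr_file, out_pdf):
    """Build argv mirroring the recode_pdf example quoted in the thread.

    The flags are copied from that example and are not checked against
    any particular release of archive-pdf-tools.
    """
    return [
        "recode_pdf",
        "--from-imagestack", image_glob,
        "--hocr-file", hocr_file,
        "-o", out_pdf,
        "-v", "--use-openjpeg", "-m", "2",
    ]


def compression_ratio(original_bytes, compressed_bytes):
    """Return how many times smaller the compressed file is."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


def run_if_available(cmd):
    """Run cmd only if its executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    steps = [
        tesseract_hocr_cmd("input.tif", "page"),
        recode_pdf_cmd("images/*.jp2", "combined_tesseract_results.html", "out.pdf"),
    ]
    for cmd in steps:
        status = "ran" if run_if_available(cmd) else "skipped (tool not found)"
        print(" ".join(cmd), "->", status)
```

Comparing the size of the PDF written by Tesseract with the size of out.pdf via compression_ratio gives a quick check against the 3-15x range mentioned above.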