On 2020-04-09 10:16 AM, emanuel stiebler via cctalk wrote:
> Hi All,
> somebody scanned documents for me in .pdfs.
> Looking into them, they are pages of jpgs embedded in .pdf ..
> (100 pages resulting in 350MBytes ...)
>
> Any easy way to convert them into some b/w .pdf file?
> It is all text, no drawings ...
>
> Pointers?
>
> Thanks
>

Typically I extract using pdfimages

    $ pdfimages
    pdfimages version 4.00
    Copyright 1996-2017 Glyph & Cog, LLC
    Usage: pdfimages [options] <PDF-file> <image-root>

You can then use GraphicsMagick to threshold to bilevel (a suitable
threshold can be found by inspecting or histogramming the image, e.g. in
Photoshop).

    gm mogrify -threshold XX% -monochrome

(or `gm convert` can convert each page to TIFF for the next step)

Then I'd go via TIFF, combining all pages and compressing them with G4
using `tiffcp -c g4`. If you want a PDF instead of a multipage TIFF, you
can then transcode to PDF with `tiff2pdf`. tiffcp and tiff2pdf are libtiff
utilities.

There might be a shortcut using different tools but those are the tools I
use.

--Toby
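
The steps above could be strung together roughly as follows. This is only a sketch, not a tested recipe: the function name `scans_to_bw_pdf`, the `page` image root and the 50% default threshold are all made up for illustration, and it assumes the embedded images really are JPEGs (so that `pdfimages -j` writes `.jpg` files) and that `gm convert` writes 1-bit TIFFs that `tiffcp` will accept for G4.

```shell
#!/bin/sh
# Sketch: turn a PDF of embedded JPEG scans into a bilevel, G4-compressed PDF.
# scans_to_bw_pdf is a hypothetical helper name, not part of any of the tools.
scans_to_bw_pdf() {
    in_pdf=$1
    out_pdf=$2
    threshold=${3:-50%}   # assumed default; choose a real value from a histogram

    # Scratch directory; left in place if a step fails, for inspection.
    work=$(mktemp -d) || return 1

    # 1. Extract the embedded images; -j saves JPEG data as .jpg files.
    pdfimages -j "$in_pdf" "$work/page" || return 1

    # 2. Threshold each page to bilevel and write it out as TIFF.
    for img in "$work"/page-*.jpg; do
        [ -e "$img" ] || continue
        gm convert "$img" -threshold "$threshold" -monochrome \
            "${img%.jpg}.tif" || return 1
    done

    # 3. Combine all pages into one multipage TIFF with G4 compression.
    tiffcp -c g4 "$work"/page-*.tif "$work/all.tif" || return 1

    # 4. Transcode the multipage TIFF to PDF.
    tiff2pdf -o "$out_pdf" "$work/all.tif" || return 1

    rm -rf "$work"
}
```

It would be called as, say, `scans_to_bw_pdf scan.pdf scan-bw.pdf 55%` (file names invented). Try the threshold on a page or two first, since the right value depends on the scans.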