I am working on a project that involves turning text pages from scanned microfilm into searchable PDFs.
My workflow is as follows:

(1) Import raw scan images (*.tif) into Abbyy FineReader v. 12 Professional for some basic image editing, including split, deskew, rough crop, and some visual cleanup (e.g. microfilm dust). Export as multipage .tif. (Most documents are 2 or 3 pages; a small percentage are 7-8 pages.)

(2) Import the edited images into IrfanView 4.58 for further editing, normally as follows:
    (a) Auto crop borders (ctrl-shift-Y).
    (b) Change canvas size (shift-V), using Method 1 to set the top and left margins and then Method 2 to pad the right and bottom margins, to achieve a standard starting corner and page size.
    (c) Light editing to clean up any stray marks (copy/paste white background color to mask the marks).
    (d) Repeat as necessary for subsequent pages.
    NOTE: As far as I can tell, changes to a multipage tif file have to be saved page by page in IrfanView, or they will be lost when moving to another page.

(3) Run the edited tif file through Tesseract v5.0.1.20220118 using this format on the Windows 10 command line:

    tesseract input.tif input pdf --psm 4

The resulting PDF files were as expected, except for their size relative to the input tif files. The input files were both two pages long and contained approximately the same amount of text: 3,296 characters for 56143 and 3,194 for 56145.

    56143.tif   369k
    56143.pdf   998k   (2.7 times the size of the tif file)
    56145.tif   206k
    56145.pdf    94k   (about half the size of the tif file)

I'm not terribly concerned about reducing the PDF file sizes, but I'm baffled that the PDF file size seems to have no relation to the input file size. I don't know if this is really a Tesseract issue, but since that is the software that actually generated the PDFs, I thought this would be a good place to start.

Thanks,
Art in Northern Virginia
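
One way to compare the two inputs is to look at how each page is stored inside the tif, since bit depth and compression affect file size independently of how much text is on the page. Whether that accounts for the PDF sizes above is only a guess. The following is a minimal Python sketch, not part of Tesseract or IrfanView, that reads a few standard TIFF 6.0 tags from each page using only the standard library. The function name tiff_page_info and the printed layout are made up for illustration, and it assumes a classic (non-BigTIFF) file.

```python
import struct
import sys

# Tag numbers and compression codes from the TIFF 6.0 specification.
TAGS = {256: "ImageWidth", 257: "ImageLength", 258: "BitsPerSample",
        259: "Compression", 277: "SamplesPerPixel"}
COMPRESSION = {1: "none", 2: "CCITT RLE", 3: "CCITT Group 3", 4: "CCITT Group 4",
               5: "LZW", 6: "JPEG (old style)", 7: "JPEG", 8: "Deflate",
               32773: "PackBits", 32946: "Deflate (legacy code)"}


def tiff_page_info(data):
    """Return one dict of basic tags per page (IFD) of a classic TIFF.

    Hypothetical helper for illustration; BigTIFF is not handled.
    """
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack_from(order + "H", data, 2)[0] != 42:
        raise ValueError("not a classic TIFF file")
    (offset,) = struct.unpack_from(order + "I", data, 4)
    pages, seen = [], set()
    while offset and offset not in seen:  # 'seen' guards against looping IFD chains
        seen.add(offset)
        (count,) = struct.unpack_from(order + "H", data, offset)
        # Defaults defined by the TIFF specification when a tag is absent.
        page = {"BitsPerSample": 1, "Compression": 1, "SamplesPerPixel": 1}
        for i in range(count):
            entry = offset + 2 + 12 * i
            tag, ftype, n = struct.unpack_from(order + "HHI", data, entry)
            if tag not in TAGS or ftype not in (3, 4):
                continue  # only SHORT (3) and LONG (4) values are needed here
            fmt = order + ("H" if ftype == 3 else "I")
            pos = entry + 8
            if n * struct.calcsize(fmt) > 4:
                # Values that do not fit in 4 bytes are stored at an offset.
                (pos,) = struct.unpack_from(order + "I", data, pos)
            # Keep only the first value (e.g. the first channel's bit depth).
            (page[TAGS[tag]],) = struct.unpack_from(fmt, data, pos)
        pages.append(page)
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)
    return pages


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            pages = tiff_page_info(f.read())
        for number, page in enumerate(pages, start=1):
            raw = (page.get("ImageWidth", 0) * page.get("ImageLength", 0)
                   * page["BitsPerSample"] * page["SamplesPerPixel"]) // 8
            name = COMPRESSION.get(page["Compression"], str(page["Compression"]))
            print(f"{path} page {number}: {page['BitsPerSample']} bit x "
                  f"{page['SamplesPerPixel']} sample(s), compression {name}, "
                  f"about {raw} bytes uncompressed")
```

Saved under a name of your choosing (tiff_info.py here is hypothetical), it could be run as "python tiff_info.py 56143.tif 56145.tif". If the two files turn out to differ in bit depth or compression, that would at least be a concrete difference to mention alongside the sizes.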