I am working on a project that involves turning text pages from scanned 
microfilm into searchable PDFs.

My workflow is like this:

(1) Import raw scan images (*.tif) into Abbyy FineReader v. 12 Professional 
for some basic image editing including split, deskew, rough crop, and some 
visual cleanup e.g. microfilm dust. Export as multipage .tif. (Most 
documents are 2 or 3 pages; a small percentage are 7-8 pages.)
(2) Import edited images into IrfanView 4.58 for further editing, normally as 
follows:
   (a) auto crop borders (ctrl-shift-Y)
   (b) change canvas size (shift-V) using Method 1 to set top and left 
margins and then Method 2 to pad the right and bottom margins to achieve a 
standard starting corner and page size.
   (c) light editing to clean up any stray marks (copy/paste white 
background color to mask marks).
   (d) repeat as necessary for subsequent pages. NOTE: As far as I can 
tell, changes in multipage tif files have to be saved individually in 
IrfanView or changes will be lost when moving to another page.
(3) Run edited tif file through Tesseract v5.0.1.20220118 using this format 
on the Windows 10 command line:   tesseract input.tif input pdf --psm 4
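
For a batch of files, step (3) could be scripted rather than typed once per document. The following is only a minimal sketch in Python, assuming tesseract is on the PATH and all edited .tif files sit in one folder; the function names build_tesseract_cmd and ocr_folder are hypothetical, and the arguments simply mirror the command line above.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(tif_path, psm=4):
    """Build the argument list for: tesseract input.tif input pdf --psm 4"""
    tif = Path(tif_path)
    # The output base is the input name without ".tif";
    # Tesseract adds the ".pdf" extension itself.
    output_base = tif.with_suffix("")
    return ["tesseract", str(tif), str(output_base), "pdf", "--psm", str(psm)]


def ocr_folder(folder):
    """Run Tesseract on every .tif file in the folder (assumed layout)."""
    for tif in sorted(Path(folder).glob("*.tif")):
        # check=True stops the batch if Tesseract reports an error.
        subprocess.run(build_tesseract_cmd(tif), check=True)
```

Calling ocr_folder(".") from the folder holding the edited files would then write one PDF next to each .tif.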

The resulting PDF files were as expected, except for the size relative to 
the input tif files.

The input files were both two pages long and contained approximately the 
same amount of text: 3,296 characters for 56143 and 3,194 for 56145.

56143.pdf   998k (2.7 times the size of the tif file)
56143.tif   369k 
56145.pdf    94k (half the size of the tif file)
56145.tif   206k
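
The ratios quoted in the listing can be checked with a few lines of Python. This is just arithmetic on the sizes as listed (in k); the helper name pdf_to_tif_ratio is made up for illustration.

```python
# File sizes in k, copied from the listing above.
sizes_k = {
    "56143": {"pdf": 998, "tif": 369},
    "56145": {"pdf": 94, "tif": 206},
}


def pdf_to_tif_ratio(entry):
    """Return the PDF size divided by the TIF size."""
    return entry["pdf"] / entry["tif"]


for name, entry in sizes_k.items():
    print(f"{name}: PDF is {pdf_to_tif_ratio(entry):.2f}x the TIF size")
# 56143: PDF is 2.70x the TIF size
# 56145: PDF is 0.46x the TIF size
```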

I'm not terribly concerned about reducing the PDF file sizes, but I'm just 
baffled by why the PDF file size seems to have no relation to the input 
file size.

I don't know if this is really a Tesseract issue, but since that is the 
software that actually generated the PDFs, I thought this would be a good 
place to start.

Thanks,
Art in Northern Virginia


