[tesseract-ocr] OCRs produced by Tesseract differ wildly in size

ArtmanDC Mon, 21 Mar 2022 12:24:40 -0700

I am working on a project that involves turning text pages from scanned 
microfilm into searchable PDFs.

My workflow is like this:

(1) Import raw scan images (*.tif) into Abbyy FineReader v. 12 Professional
for some basic image editing including split, deskew, rough crop, and some
visual cleanup e.g. microfilm dust. Export as multipage .tif. (Most
documents are 2 or 3 pages; a small percentage are 7-8 pages.)
(2) Import edited images to Irfanview 4.58 for further editing, normally as
follows
(a) auto crop borders (ctrl-shift-Y)
(b) change canvas size (shift-V) using Method 1 to set top and left
margins and then Method 2 to pad the right and bottom margins to achieve
standard starting corner and page size.
(c) light editing to clean up any stray marks (copy/paste white
background color to mask marks).
(d) repeat as necessary for subsequent pages. NOTE: As far as I can
tell, changes in multipage tif files have to be saved individually in
IrfanView or changes will be lost when moving to another page.
(3) Run edited tif file through Tesseract v5.0.1.20220118 using this format
on the Windows 10 command line: tesseract input.tif input pdf --psm 4
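
For anyone wanting to reproduce step (3) over a folder of files, it could be
scripted roughly as below. This is only a sketch: the helper names
build_command and ocr_folder are hypothetical, it assumes tesseract is on the
PATH, and it keeps the same argument order as the command line above.

```python
import subprocess
from pathlib import Path


def build_command(tif_path, psm=4):
    """Build the argument list for one Tesseract run (hypothetical helper).

    Mirrors: tesseract input.tif input pdf --psm 4
    """
    tif = Path(tif_path)
    # The output base is the input name without its extension;
    # Tesseract appends ".pdf" itself.
    output_base = tif.with_suffix("")
    return ["tesseract", str(tif), str(output_base), "pdf", "--psm", str(psm)]


def ocr_folder(folder):
    """Run Tesseract on every .tif in a folder (assumes tesseract is on PATH)."""
    for tif in sorted(Path(folder).glob("*.tif")):
        subprocess.run(build_command(tif), check=True)
```

Calling ocr_folder on the directory of edited tif files would write one PDF
next to each input file.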

The resulting PDF files were as expected, except for the size relative to
the input tif files.

The input files were both two pages and approximately the same size: 3,296
characters for 56143 and 3,194 for 56145.

56143.pdf 998k (2.7 times the size of the tif file)
56143.tif 369k
56145.pdf 94k (half the size of the tif file)
56145.tif 206k
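
The ratios quoted above can be checked with a few lines of Python. This is
just the arithmetic on the four sizes listed, not a measurement of the files
themselves; the names sizes_kb and pdf_to_tif_ratio are made up for the
illustration.

```python
# File sizes in kilobytes, as reported in the listing above.
sizes_kb = {
    "56143": {"tif": 369, "pdf": 998},
    "56145": {"tif": 206, "pdf": 94},
}


def pdf_to_tif_ratio(entry):
    """Return the PDF size divided by the TIFF size for one document."""
    return entry["pdf"] / entry["tif"]


# Ratio per document, rounded to two decimal places.
ratios = {name: round(pdf_to_tif_ratio(entry), 2) for name, entry in sizes_kb.items()}

for name, ratio in ratios.items():
    print(f"{name}: PDF is {ratio} times the size of the tif")
```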

I'm not terribly concerned about reducing the PDF file sizes, but I'm just
baffled by why the PDF file size seems to have no relation to the input
file size.

I don't know if this is really a Tesseract issue, but since that is the
software that actually generated the PDF, I thought this would be a good
place to start.

Thanks,
Art in Northern Virginia
