I'm preparing images of text (JPG) for conversion to text files (TXT) with 
Tesseract OCR. I understand it is important to resize the images so that 
capital letters are about 30-32 pixels high. See "Optimal image resolution 
(dpi/ppi) for Tesseract 4.0.0 and eng.traineddata?":
<https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ?pli=1>

I am using Fiji/ImageJ to measure the height of capital letters in pixels. 
From https://imagej.nih.gov/ij/docs/pdfs/ImageJ.pdf:

   - Open image file
   - Enlarge text (zoom in) 
   - Draw a vertical line alongside the vertical stroke of a digit or a 
   straight-edged letter
   - Select Analyze>Set Scale (see image below)

[image: fiji first.png]

How should I count the pixels? Do I count the 'half pixels', where the pixel 
'block' is a half-tone? In other words, should my total include these 
half-tones when I estimate the true height?
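
To make the half-tone question concrete, here is a minimal sketch of two possible counting conventions, written in plain Python. It assumes a tightly cropped grayscale glyph (dark text on light paper, 0 = black, 255 = white) given as a list of pixel rows. The function name glyph_height and the 0.5 cutoff are my own illustrative choices, not anything defined by Tesseract or ImageJ.

```python
def glyph_height(rows, ink=0, paper=255, threshold=0.5):
    """Estimate glyph height from a cropped grayscale glyph.

    rows: list of rows, each a list of 0-255 gray values
          (dark text on light paper).
    Returns (whole_rows, fractional_height).
    """
    span = paper - ink
    # Coverage of a row: how close its darkest pixel is to full ink,
    # clamped to the range 0.0 (pure paper) .. 1.0 (pure ink).
    coverage = [
        min(1.0, max(0.0, (paper - min(row)) / span))
        for row in rows
    ]
    # Convention 1: a half-tone row counts as a full row only if it is
    # at least `threshold` of the way from paper to ink.
    whole_rows = sum(1 for c in coverage if c >= threshold)
    # Convention 2: each half-tone row contributes its partial coverage.
    fractional_height = sum(coverage)
    return whole_rows, fractional_height


# Hypothetical 7-row crop: three solid rows, a light half-tone row
# above them and a dark half-tone row below them.
sample = [
    [255, 255, 255],
    [255, 191, 255],
    [255, 0, 255],
    [255, 0, 255],
    [255, 0, 255],
    [255, 64, 255],
    [255, 255, 255],
]
whole, fractional = glyph_height(sample)
```

In this sample the two conventions happen to agree on a height of about 4 pixels. Since the target is a range (30-32 pixels), a difference of one pixel between conventions is unlikely to matter much, as long as the same convention is used for every image.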

Does anyone have a better procedure than this?

My aim is to come up with a resizing ratio that I can apply to a large 
collection of text images using a Python script. This is another step in 
preparing docs for Tesseract. 
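
For reference, a rough sketch of what such a script might look like. It assumes the Pillow library is installed, that every image in the folder has roughly the same measured capital letter height, and that 31 pixels (the middle of the 30-32 range) is an acceptable target. The names resize_ratio, scaled_size and resize_folder are hypothetical.

```python
from pathlib import Path

TARGET_CAP_HEIGHT = 31  # pixels; midpoint of the 30-32 range (assumption)


def resize_ratio(measured_cap_height, target=TARGET_CAP_HEIGHT):
    """Scale factor that brings the measured capital height to the target."""
    if measured_cap_height <= 0:
        raise ValueError("measured capital height must be positive")
    return target / measured_cap_height


def scaled_size(width, height, ratio):
    """New (width, height) in whole pixels, never smaller than 1x1."""
    return max(1, round(width * ratio)), max(1, round(height * ratio))


def resize_folder(src_dir, dst_dir, measured_cap_height):
    """Resize every *.jpg in src_dir by one shared ratio into dst_dir."""
    # Pillow is imported here so the two helpers above work without it.
    from PIL import Image

    ratio = resize_ratio(measured_cap_height)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.jpg")):
        with Image.open(path) as img:
            new_size = scaled_size(img.width, img.height, ratio)
            # LANCZOS is a common resampling choice for text scans.
            img.resize(new_size, Image.LANCZOS).save(dst / path.name)
```

For example, if capitals measure 62 pixels, resize_ratio(62) gives 0.5 and a 1000 x 800 image becomes 500 x 400. Note that saving back to JPG re-compresses the image, so writing to a lossless format may be worth considering.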

Any suggestions would be appreciated.
