I don't know how big your "lot of files" is, so sorry if this doesn't fit. I use IrfanView (free software) to convert JPG either to PBM or to B&W (2-colour) JPG, using the batch option (open an image, then press B on the keyboard). But maybe you already know IrfanView? It converts very well and quite quickly: I just converted 224 files (greyscale JPG, 256 greys) in 40 seconds. If you only have a few hundred files, it could be a neat solution. But if you have thousands or tens of thousands, it could certainly get heavy.
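
If the collection is too big for a GUI batch run, the same JPG-to-b&w conversion can be scripted. What follows is only a rough sketch, assuming Python with the Pillow library installed; the function names, the folder arguments and the fixed threshold of 128 are my own placeholders, not anything IrfanView or Tesseract prescribes, and a different threshold may suit your scans better.

```python
from pathlib import Path

from PIL import Image


def to_bilevel(img, threshold=128):
    """Return a 1-bit copy of img.

    Pixels at or above `threshold` (0-255, an assumed default) become
    white, everything darker becomes black. No dithering is applied.
    """
    gray = img.convert("L")
    return gray.point(lambda v: 255 if v >= threshold else 0, mode="1")


def convert_folder(src_dir, dst_dir, threshold=128):
    """Convert every *.jpg in src_dir to a 1-bit .pbm file in dst_dir.

    Returns the list of files written. Folder names are hypothetical.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for jpg in sorted(src.glob("*.jpg")):
        with Image.open(jpg) as img:
            out = dst / (jpg.stem + ".pbm")
            to_bilevel(img, threshold).save(out)
            written.append(out)
    return written
```

Calling something like `convert_folder("scans", "scans_pbm")` would then process a whole folder in one go; on case-sensitive file systems the `*.jpg` pattern would need adjusting for names ending in `.JPG` or `.jpeg`.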
On Tuesday, 14 March 2023 at 23:25:49 UTC+1, da...@mranderson.co.nz wrote:

> Thank you for your input. I appreciate the PBM file type has its uses. But
> my source material is JPG. And there are a lot of files!
>
> On Wed, Mar 15, 2023 at 10:41 AM 'Isidore Paris' via tesseract-ocr
> <tesser...@googlegroups.com> wrote:
>
>> I get the best result with PBM images, i.e. b&w. Doing it that way, there
>> would be no half-tones… (Don't know if this could help…)
>>
>> On Monday, 13 March 2023 at 23:17:23 UTC+1, da...@mranderson.co.nz
>> wrote:
>>
>>> I'm preparing text images (JPG) for Tesseract OCR conversion to text
>>> files (TXT). I note that it is important to resize my image docs so that
>>> capital letters are about 30-32 pixels in height. See Optimal image
>>> resolution (dpi/ppi) for Tesseract 4.0.0 and eng.traineddata?
>>> <https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ?pli=1>
>>>
>>> I am using Fiji/ImageJ to count capital letter height in pixels.
>>> From https://imagej.nih.gov/ij/docs/pdfs/ImageJ.pdf
>>>
>>> - Open image file
>>> - Enlarge text (zoom in)
>>> - Draw parallel vertical line beside vertical of number or straight
>>> edge letter
>>> - Select Analyze>Set Scale (see image below)
>>>
>>> [image: fiji first.png]
>>>
>>> How to count pixels? Do I count the 'half pixels'? Where the pixel
>>> 'block' is a half-tone? In other words, for my total count, do I estimate
>>> the true height by including these half-tones.
>>>
>>> Does anyone have a better procedure than this?
>>>
>>> My aim is to come up with a resizing ratio that I can apply to a large
>>> collection of text files using a Python script. This being another step
>>> along the way to preparing docs for Tesseract.
>>>
>>> Any suggestions would be appreciated.
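
On the resizing ratio asked about in the original question: once a capital-letter height has been measured, the ratio is just the target height divided by the measured height. A minimal sketch, again assuming Python with Pillow; the 31-pixel target is simply my midpoint of the 30-32 range quoted in the thread, and the function names and the choice of Lanczos resampling are my own assumptions, not a Tesseract recommendation.

```python
from PIL import Image


def scale_factor(measured_cap_height_px, target_cap_height_px=31.0):
    """Ratio that brings the measured capital height to the target height.

    The measurement may be fractional (for example if half-tone edge
    rows are counted as half a pixel).
    """
    if measured_cap_height_px <= 0:
        raise ValueError("measured height must be positive")
    return target_cap_height_px / measured_cap_height_px


def resize_for_ocr(img, measured_cap_height_px, target_cap_height_px=31.0):
    """Return a copy of img scaled so capitals are about the target height."""
    f = scale_factor(measured_cap_height_px, target_cap_height_px)
    new_size = (max(1, round(img.width * f)), max(1, round(img.height * f)))
    return img.resize(new_size, Image.LANCZOS)
```

At these sizes, counting or skipping one half-tone edge row changes the measured height by about one pixel in thirty, so the ratio moves by only a few percent either way.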