I don't know how big your "lot of files" is, so apologies if this doesn't
fit your case:
I use IrfanView (free software) to convert JPG files either to PBM or to
black-and-white (2-colour) JPG, using its batch mode (open an image, then
press B on the keyboard). But maybe you already know IrfanView?
The conversion quality is very good and it is quite fast: I just converted
224 files (greyscale JPGs, 256 grey levels) in 40 seconds.
If you only have a few hundred files, it could be a neat solution. But if
you have thousands or tens of thousands, it would probably be heavy going.
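
If IrfanView proves too heavy for thousands of files, the same kind of
conversion could probably be scripted. What follows is only a minimal Python
sketch, assuming the third-party Pillow library is installed. The threshold
of 128, the "*.jpg" pattern and the names is_ink, to_bilevel and
convert_folder are my own illustration, not something IrfanView or Tesseract
prescribes:

```python
from pathlib import Path


def is_ink(value, threshold=128):
    """Assumed rule: grey levels below the threshold count as black ink."""
    return value < threshold


def to_bilevel(grey_rows, threshold=128):
    """Tiny pure-Python illustration: 1 = black, 0 = white (PBM convention)."""
    return [[1 if is_ink(v, threshold) else 0 for v in row] for row in grey_rows]


def convert_folder(src_dir, dst_dir, threshold=128):
    """Write a 1-bit PBM for every *.jpg in src_dir (requires Pillow)."""
    from PIL import Image  # third-party: pip install Pillow

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # Note: the pattern is case-sensitive on some systems (*.JPG is skipped).
    for jpg in sorted(Path(src_dir).glob("*.jpg")):
        with Image.open(jpg) as img:
            grey = img.convert("L")  # 256 grey levels
            # Push every pixel to pure black or pure white, then store as 1-bit.
            hard = grey.point(lambda v: 0 if is_ink(v, threshold) else 255)
            hard.convert("1").save(dst / (jpg.stem + ".pbm"))
```

Usage would be something like convert_folder("scans", "scans_pbm"), where
both folder names are hypothetical. A fixed threshold is the simplest choice;
unevenly lit scans may need something smarter.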

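As for the resizing ratio asked about in the original question below, the
arithmetic is just target height divided by measured height. A rough Python
sketch, again assuming Pillow for the batch part. The target of 31 px is
simply the middle of the 30-32 px range quoted below, and resize_ratio,
scaled_size and resize_folder are hypothetical names of my own:

```python
from pathlib import Path


def resize_ratio(measured_cap_height_px, target_cap_height_px=31.0):
    """Ratio that brings the measured capital-letter height to the target."""
    if measured_cap_height_px <= 0:
        raise ValueError("measured capital height must be positive")
    return target_cap_height_px / measured_cap_height_px


def scaled_size(width, height, ratio):
    """New (width, height) in whole pixels, never smaller than 1x1."""
    return max(1, round(width * ratio)), max(1, round(height * ratio))


def resize_folder(src_dir, dst_dir, ratio):
    """Apply one ratio to every *.jpg in src_dir (requires Pillow)."""
    from PIL import Image  # third-party: pip install Pillow

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for jpg in sorted(Path(src_dir).glob("*.jpg")):
        with Image.open(jpg) as img:
            img.resize(scaled_size(*img.size, ratio)).save(dst / jpg.name)
```

For example, capitals measured at 62 px give a ratio of 0.5, so a 2000x3000
scan becomes 1000x1500. This only makes sense if all the files in a folder
were scanned at the same resolution with similar type size.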

On Tuesday, 14 March 2023 at 23:25:49 UTC+1, da...@mranderson.co.nz
wrote:

> Thank you for your input. I appreciate the PBM file type has its uses. But 
> my source material is JPG. And there are a lot of files! 
>
> On Wed, Mar 15, 2023 at 10:41 AM 'Isidore Paris' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> I get the best results with PBM images, i.e. black and white. That way
>> there are no half-tones. (I don't know whether this helps.)
>>
>>
>>
>> On Monday, 13 March 2023 at 23:17:23 UTC+1, da...@mranderson.co.nz
>> wrote:
>>
>>> I'm preparing text images (JPG) for Tesseract OCR conversion to text 
>>> files (TXT). I note that it is important to resize my image docs so that 
>>> capital letters are about 30-32 pixels in height. See Optimal image 
>>> resolution (dpi/ppi) for Tesseract 4.0.0 and eng.traineddata? 
>>> <https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ?pli=1>
>>>
>>> I am using Fiji/ImageJ to measure capital letter height in pixels. 
>>> From https://imagej.nih.gov/ij/docs/pdfs/ImageJ.pdf 
>>>
>>>    - Open image file
>>>    - Enlarge text (zoom in) 
>>>    - Draw a vertical line parallel to the vertical stroke of a digit or 
>>>    straight-edged letter
>>>    - Select Analyze>Set Scale (see image below)
>>>
>>> [image: fiji first.png]
>>>
>>> How should I count the pixels? Do I count the 'half pixels', where the 
>>> pixel 'block' is a half-tone? In other words, for my total count, do I 
>>> estimate the true height by including these half-tones? 
>>>
>>> Does anyone have a better procedure than this?
>>>
>>> My aim is to come up with a resizing ratio that I can apply to a large 
>>> collection of text files using a Python script. This is another step 
>>> along the way to preparing docs for Tesseract. 
>>>
>>> Any suggestions would be appreciated.
>>>
>> -- 
>> You received this message because you are subscribed to a topic in the 
>> Google Groups "tesseract-ocr" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/tesseract-ocr/bZh3j_i8MYU/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to 
>> tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/2434d564-f2b5-40df-b180-8465bc9c5c42n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/2434d564-f2b5-40df-b180-8465bc9c5c42n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

