Re: [tesseract-ocr] Re: Counting True Pixel Height of a Capital Letter

David R Anderson Thu, 16 Mar 2023 11:32:50 -0700

Thank you. Although, after thinking about it. I think that converting JPG
to PBM in order to get full pixels may be compromising what I am trying to
achieve.


The Tesseract doc  Improving the quality of the output
<https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#image-processing>
mentions “Willus
Dotkom” (Under rescaling) with this link Optical Image Resolution
<https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ> This
page (Optical Image Resolution) is about the pixel height of capital
letters/numbers of source documents.

I need an accurate count (not adjusted because it's easier) The image I
have included in this post shows my measuring 'ruler' on the right of the
*14 *

I am looking for a rule or method that is an accepted scientific approach
regarding the counting of pixels.

On Thu, Mar 16, 2023 at 9:06 PM 'Isidore Paris' via tesseract-ocr <
[email protected]> wrote:

> I don't know how big is your "lot of files"… So, sorry if this doesn't
> match:
> I use IRFANVIEW (free software) to convert JPG either in PBM or in B&W (2
> COLORS) JPG, with the batch option (open an image and then press letter B
> on keyboard) – but maybe you already know Irfanview…?
> It converts very good and quite quickly. I just converted 224 files (jpg
> b&w 256 colors (greys)) in 40 seconds.
> If you only have some hundreds of files, it could be a pretty solution…
> But if you have thousands or 10.000's, surely it could be heavy…
>
>
> Il giorno martedì 14 marzo 2023 alle 23:25:49 UTC+1 [email protected]
> ha scritto:
>
>> Thank you for your input. I appreciate the PBM file type has its uses.
>> But my source material is JPG. And there are a lot of files!
>>
>> On Wed, Mar 15, 2023 at 10:41 AM 'Isidore Paris' via tesseract-ocr <
>> [email protected]> wrote:
>>
>>> I get the best result with PBM images, i.e b&w. Doing that way, there
>>> would be no half-tones… (Don't know if this could help…)
>>>
>>>
>>>
>>> Il giorno lunedì 13 marzo 2023 alle 23:17:23 UTC+1
>>> [email protected] ha scritto:
>>>
>>>> I'm preparing text images (JPG) for Tesseract OCR conversion to text
>>>> files (TXT) I note that it is important to resize my image docs so that
>>>> capital letters are about 30-32 pixels in height. See Optimal image
>>>> resolution (dpi/ppi) for Tesseract 4.0.0 and eng.traineddata?
>>>> <https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ?pli=1>
>>>>
>>>> I am using the Fiji/ImageJ to count capital letter height in pixels.
>>>> From https://imagej.nih.gov/ij/docs/pdfs/ImageJ.pdf
>>>>
>>>>    - Open image file
>>>>    - Enlarge text (zoom in)
>>>>    - Draw parallel vertical line beside vertical of number or straight
>>>>    edge letter
>>>>    - Select Analyze>Set Scale (see image below)
>>>>
>>>> [image: fiji first.png]
>>>>
>>>> How to count pixels? Do I count the 'half pixels'? Where the pixel
>>>> 'block' is a half-tone? In other words, for my total count, do I estimate
>>>> the true height by including these half-tones.
>>>>
>>>> Does anyone have a better procedure than this?
>>>>
>>>> My aim is to come up with a resizing ratio that I can apply to a large
>>>> collection of text files using a Python script. This being another step
>>>> along the way to preparing docs for Tesseract.
>>>>
>>>> Any suggestions would be appreciated.
>>>>
>>> --
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/tesseract-ocr/bZh3j_i8MYU/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/2434d564-f2b5-40df-b180-8465bc9c5c42n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/2434d564-f2b5-40df-b180-8465bc9c5c42n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>> --
> You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/tesseract-ocr/bZh3j_i8MYU/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5e6e70e8-b1ab-4b1d-8923-9f8f357210f2n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/5e6e70e8-b1ab-4b1d-8923-9f8f357210f2n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAKu11d2herZ-2HmMt2%2B%3DmrT8zD2B_HgwetZG1-2Xm8Avsqo5Kg%40mail.gmail.com.

Re: [tesseract-ocr] Re: Counting True Pixel Height of a Capital Letter

Reply via email to