[tesseract-ocr] Re: Article scanning: hocr output wrong after font training?

Tom Morris Thu, 04 Jan 2024 11:29:34 -0800

I believe it's returning what it considers to be the best-matching model 
(i.e. the "lang"), but, if my experiments with eng+fra are any indication, 
that identification isn't reliable. If it has trouble distinguishing two 
Romance languages using the same character set, I doubt it can be counted on 
to distinguish two closely related fonts from the same family.


Tom

p.s. My earlier comment about the lang= attribute being only at the 
paragraph level was wrong. It outputs both a paragraph default and per-word 
overrides for words which don't match the paragraph.
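
To see which model each word was attributed to, the hOCR can be walked with 
the paragraph-level lang= treated as the default and any word-level lang= 
as an override. The following is only a minimal sketch using Python's 
standard library. The `WordLangParser` class, the `word_langs` helper and 
the `SAMPLE` fragment are made up for illustration, and it assumes words 
are `ocrx_word` spans nested inside an element carrying the default lang=.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they are kept off the stack.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class WordLangParser(HTMLParser):
    """Collect (word_text, effective_lang) pairs from hOCR markup."""

    def __init__(self):
        super().__init__()
        self._lang_stack = [None]  # effective lang for each open element
        self._current = None       # (lang, text_parts) while inside a word
        self._word_depth = 0       # open elements inside the current word
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # An explicit lang= wins; otherwise inherit from the enclosing element.
        lang = attrs.get("lang") or self._lang_stack[-1]
        self._lang_stack.append(lang)
        if self._current is not None:
            self._word_depth += 1  # nested markup such as <em> or <strong>
        elif tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._current = (lang, [])
            self._word_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self._lang_stack) > 1:
            self._lang_stack.pop()
        if self._current is not None:
            self._word_depth -= 1
            if self._word_depth == 0:
                lang, parts = self._current
                self.words.append(("".join(parts).strip(), lang))
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)


def word_langs(hocr_text):
    """Return a list of (word, lang) tuples for every ocrx_word span."""
    parser = WordLangParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written fragment in the shape described above (not real output).
SAMPLE = """
<p class='ocr_par' id='par_1_1' lang='eng'>
 <span class='ocr_line' id='line_1_1'>
  <span class='ocrx_word' id='word_1_1'>the</span>
  <span class='ocrx_word' id='word_1_2' lang='fra'><em>coup</em></span>
 </span>
</p>
"""

if __name__ == "__main__":
    for word, lang in word_langs(SAMPLE):
        print(lang, word)
```

On a real file, something like `word_langs(open("output.hocr", 
encoding="utf-8").read())` should work, where `output.hocr` stands for 
whatever file the tesseract run produced.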

On Wednesday, January 3, 2024 at 10:17:42 AM UTC-5 sco...@gmail.com wrote:

> You are indeed correct that font attribute recognition is only available 
> via the legacy engine (e.g. *--oem 0*). Conversely, when using the LSTM 
> engine and tesstrain, the output ends up with tags that look like 
> *lang="..."*, yet the tesstrain documentation does not seem to say that we 
> are training tesseract *only* for a language and not for a font. I would 
> assume that if it's recognizing the "language", it's doing something 
> similar to recognizing the font; am I wrong? Or does the *lang* 
> attribute indicate the first result that potentially matched, and not the 
> *closest* result that matched (e.g. the closest trained font), which 
> would indeed make it different from *x_font*?
>
> On Tuesday, January 2, 2024 at 8:37:17 PM UTC-5 tfmo...@gmail.com wrote:
>
>> Font attribute recognition is a legacy-engine-only feature, i.e. it 
>> doesn't exist in the new LSTM engine for Tess 4/5.
>>
>> On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:
>>
>>
>> The problem is, even after training a few different ways with *tesstrain* 
>> (e.g. adjusting *exposure* options, *char_spacing* options, etc.), when I 
>> output to hocr (e.g. using the command *tesseract 
>> sherlock-holmes-example.png output -l 
>> ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1 
>> hocr*) it still seems to get the font info wrong (see attached files for 
>> a sample input and output).
>>
>> As an example, I was hoping the word "*coup-de-maitres*" would be 
>> recognized with *lang='ITC-New-Baskerville-Std-Italic'*, but it isn't. 
>> Conversely, the word "testifying" shows with 
>> *lang='ITC-New-Baskerville-Std-Italic'*, but it is not italic.
>>
>>
>> You appear to be training the font as a language, which is why it's 
>> getting output with the `lang=` tag. That's wrong; if it were actually 
>> recognizing a font and outputting it as such, it would appear as 
>> `x_font <font>` in the title. The hOCR will also contain <em> tags for 
>> italic words if an italic font is recognized.
>>
>> I tried using `--oem 0` with the eng model from 
>> https://github.com/tesseract-ocr/tessdata and it did output <strong> and 
>> <em> tags, but in the wrong places, and its accuracy on the text wasn't as 
>> good as the LSTM model's. When I used eng+fra, it output language tags, but 
>> at the paragraph level, not the word level, and they were mostly wrong. 
>> I've attached the output.
>>
>> You can read more about the state of play of getting font attributes out 
>> of the current model here (it's possible, but don't look for it any time 
>> soon):
>>
>> https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-3278142444
>>
>> Tom
>>
>
