I believe it's returning what it considers to be the best matching model (i.e. "lang"), but, if my experiments with eng+fra are any indication, the recognition isn't reliable. If it has trouble distinguishing English from French, two languages that share the Latin alphabet, I doubt it can be counted on to distinguish two closely related fonts from the same family.
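If you want to check this on your own output, here is a minimal sketch that lists the language assigned to each word. It assumes the hOCR puts a `lang` attribute on the `ocr_par` element and repeats it on an `ocrx_word` only when that word differs from its paragraph. The class name `WordLangParser`, the helper `word_langs` and the sample markup are made up for illustration; they are not part of Tesseract.

```python
from html.parser import HTMLParser


class WordLangParser(HTMLParser):
    """Collect (word_text, effective_lang) pairs from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.par_lang = None      # lang of the most recent ocr_par, if any
        self.word_lang = None     # effective lang of the word being read
        self.depth_in_word = 0    # > 0 while inside an ocrx_word element
        self.text = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if self.depth_in_word:
            # Nested tag inside a word, e.g. <em> or <strong>.
            self.depth_in_word += 1
        elif "ocr_par" in classes:
            self.par_lang = a.get("lang")
        elif "ocrx_word" in classes:
            # A word-level lang overrides the paragraph default.
            self.word_lang = a.get("lang") or self.par_lang
            self.depth_in_word = 1
            self.text = []

    def handle_endtag(self, tag):
        if self.depth_in_word:
            self.depth_in_word -= 1
            if self.depth_in_word == 0:
                word = "".join(self.text).strip()
                self.words.append((word, self.word_lang))

    def handle_data(self, data):
        if self.depth_in_word:
            self.text.append(data)


def word_langs(hocr):
    """Return a list of (word, lang) tuples for an hOCR string."""
    parser = WordLangParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical fragment, not taken from any real Tesseract run.
SAMPLE = """
<p class='ocr_par' id='par_1_1' lang='eng'>
 <span class='ocr_line' id='line_1_1'>
  <span class='ocrx_word' id='word_1_1'>the</span>
  <span class='ocrx_word' id='word_1_2' lang='fra'><em>maison</em></span>
 </span>
</p>
"""

print(word_langs(SAMPLE))  # [('the', 'eng'), ('maison', 'fra')]
```

Comparing that list against words you know to be italic (or French) gives a quick count of how often the per-word tag is wrong.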
Tom

p.s. My earlier comment about the lang= attribute being only at the paragraph level was wrong. It outputs both a paragraph default and per-word overrides for words which don't match the paragraph.

On Wednesday, January 3, 2024 at 10:17:42 AM UTC-5 sco...@gmail.com wrote:

> You are indeed correct that font attribution recognition is only available via the legacy engine (e.g. *--oem 0*), but conversely when using the LSTM and tesstrain although it outputs in the end tags that look like *lang="..."*, the tesstrain documentation does not seem to indicate we are training tesseract *only* for a language, and not just for a font, e.g. I would assume that if it's recognizing the "language", it's doing something similar to when it's recognizing the font, am I wrong? Or does the *lang* attribute indicate the first result that potentially matched, and not the *closest* result that matched (e.g. the closest trained font), which would indeed make it different than *x_font*?
>
> On Tuesday, January 2, 2024 at 8:37:17 PM UTC-5 tfmo...@gmail.com wrote:
>
>> Font attribute recognition is a legacy engine thing only, ie it doesn't exist in the new LSTM engine for Tess 4/5.
>>
>> On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:
>>
>> The problem is, even after training a few different ways with *tesstrain* (e.g. adjusting *exposure* options, *char_spacing* options, etc), when I output to hocr (e.g. using the command *tesseract sherlock-holmes-example.png output -l ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1 hocr*) it still seems to get the font info wrong (see attached files for a sample input and output).
>>
>> As an example, I was hoping the word "*coup-de-maitres*" would be recognized with *lang='ITC-New-Baskerville-Std-Italic'*, but it isn't. Conversely, the word "testifying" shows with *lang='ITC-New-Baskerville-Std-Italic'*, but it is not italic.
>>
>> You appear to be training the font as a language, which is why it's getting output with the `lang=` tag. That's wrong and it should be `x_font <font>` in the title, if it's actually recognizing it as a font and outputting it as such. The HOCR will also contain <em> tags for italic words if an italic font is recognized.
>>
>> I tried using `--oem 0` with the eng model from https://github.com/tesseract-ocr/tessdata and it did output <strong> and <em> tags, but in the wrong places and its accuracy on the text wasn't as good as the LSTM model. When I used eng+fra, it output language tags, but at the paragraph level, not the word level, and they were mostly wrong. I've attached the output.
>>
>> You can read more about the state of play of getting font attributes out of the current model here (it's possible, but don't look for it any time soon):
>>
>> https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-3278142444
>>
>> Tom