You are indeed correct that font attribute recognition is only available via the legacy engine (e.g. *--oem 0*). Conversely, when using the LSTM engine and tesstrain, the output ends up with tags that look like *lang="..."*, yet the tesstrain documentation does not seem to indicate that we are training tesseract *only* for a language and not for a font. I would assume that if it's recognizing the "language", it's doing something similar to what it does when recognizing the font. Am I wrong? Or does the *lang* attribute indicate the first result that potentially matched, and not the *closest* result that matched (e.g. the closest trained font), which would indeed make it different from *x_font*?
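In case it helps to compare runs, here is a rough sketch (Python, standard library only) of how one could list the *lang* and *x_font* that each word ends up with in an hOCR file. It assumes words are `ocrx_word` spans and that font info, when present, appears as an `x_font` property in the `title` attribute. Since *lang* can sit on the paragraph rather than on the word (as in Tom's eng+fra run), each word inherits the nearest enclosing *lang*. The names `title_property`, `WordAttrs` and `word_attrs` are my own helpers, not part of tesseract, and the sample fragment (including the font name `Foo_Italic`) is made up to exercise the parser, not real output.

```python
from html.parser import HTMLParser


def title_property(title, name):
    """Return the value of one hOCR title property (e.g. 'x_font'), or None."""
    for prop in title.split(";"):
        key, _, value = prop.strip().partition(" ")
        if key == name:
            return value.strip()
    return None


class WordAttrs(HTMLParser):
    """Collect (text, lang, x_font) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, lang, x_font) tuples
        self._open = []    # (tag, effective lang) for each open element
        self._word = None  # (depth, lang, x_font, text parts) while inside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self._open[-1][1] if self._open else None
        lang = attrs.get("lang") or inherited
        self._open.append((tag, lang))
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes and self._word is None:
            font = title_property(attrs.get("title") or "", "x_font")
            self._word = (len(self._open), lang, font, [])

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed void tags.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break
        # The word ends once its own span has been popped off the stack.
        if self._word and len(self._open) < self._word[0]:
            _, lang, font, parts = self._word
            self.words.append(("".join(parts).strip(), lang, font))
            self._word = None

    def handle_data(self, data):
        if self._word:
            self._word[3].append(data)


def word_attrs(hocr_text):
    """Parse hOCR markup and return a list of (text, lang, x_font) tuples."""
    parser = WordAttrs()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Made-up fragment for illustration only (not real tesseract output).
sample = """
<p class='ocr_par' lang='ITC-New-Baskerville-Std' title='bbox 0 0 100 20'>
 <span class='ocr_line' title='bbox 0 0 100 20'>
  <span class='ocrx_word' title='bbox 0 0 40 20; x_wconf 93'>testifying</span>
  <span class='ocrx_word' lang='ITC-New-Baskerville-Std-Italic'
        title='bbox 45 0 100 20; x_wconf 90; x_font Foo_Italic'><em>coup-de-maitres</em></span>
 </span>
</p>
"""

for text, lang, font in word_attrs(sample):
    print(text, lang, font)
```

To run it on a real file, something like `word_attrs(open("output.hocr", encoding="utf-8").read())` should work, where `output.hocr` is whatever the tesseract command wrote. A word whose `x_font` comes back as `None` simply had no font property in its title.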
On Tuesday, January 2, 2024 at 8:37:17 PM UTC-5 tfmo...@gmail.com wrote:

> Font attribute recognition is a legacy engine thing only, i.e. it doesn't
> exist in the new LSTM engine for Tess 4/5.
>
> On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:
>
> The problem is, even after training a few different ways with *tesstrain*
> (e.g. adjusting *exposure* options, *char_spacing* options, etc.), when I
> output to hocr (e.g. using the command *tesseract
> sherlock-holmes-example.png output -l
> ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1
> hocr*) it still seems to get the font info wrong (see attached files for a
> sample input and output).
>
> As an example, I was hoping the word "*coup-de-maitres*" would be
> recognized with *lang='ITC-New-Baskerville-Std-Italic'*, but it isn't.
> Conversely, the word "testifying" shows with
> *lang='ITC-New-Baskerville-Std-Italic'*, but it is not italic.
>
> You appear to be training the font as a language, which is why it's
> getting output with the `lang=` tag. That's wrong and it should be `x_font
> <font>` in the title, if it's actually recognizing it as a font and
> outputting it as such. The HOCR will also contain <em> tags for italic
> words if an italic font is recognized.
>
> I tried using `--oem 0` with the eng model from
> https://github.com/tesseract-ocr/tessdata and it did output <strong> and
> <em> tags, but in the wrong places, and its accuracy on the text wasn't as
> good as the LSTM model. When I used eng+fra, it output language tags, but
> at the paragraph level, not the word level, and they were mostly wrong.
> I've attached the output.
>
> You can read more about the state of play of getting font attributes out
> of the current model here (it's possible, but don't look for it any time
> soon):
>
> https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-3278142444
>
> Tom