Hmmm -- makes sense (although unfortunate). Would you offer any suggestions as to next steps I could take from here? E.g. it seems my options are:
1. I can go back and train the legacy engine (e.g. *--oem 0*) on the fonts as well (I've been using this guide: https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/), and hope the results improve enough to be usable
2. I can use some sort of post-processing step after tesseract to detect italics / bold / etc. (although I'm not sure what tools/software/library I'd use for this, so I'd really need suggestions)
3. I could wait and hope that adding WordFontAttributes back to the non-legacy engine becomes a priority on the roadmap
4. Something else perhaps?

I don't mind putting in the work of learning / training / etc.; the main thing I'd be hesitant about is individually correcting and cleaning up the ~20,000 or more articles that need to be parsed.

Let me know what you think!

On Thursday, January 4, 2024 at 2:29:22 PM UTC-5 tfmo...@gmail.com wrote:

> I believe it's returning what it considers to be the best matching model
> (ie "lang"), but, if my experiments with eng+fra are any indication, the
> recognition isn't reliable. If it has trouble distinguishing two Romance
> languages using the same character set, I doubt it can be counted on to
> distinguish two closely related fonts from the same family.
>
> Tom
>
> p.s. My earlier comment about the lang= attribute being only at the
> paragraph level was wrong. It outputs both a paragraph default and per-word
> overrides for words which don't match the paragraph.
>
> On Wednesday, January 3, 2024 at 10:17:42 AM UTC-5 sco...@gmail.com wrote:
>
>> You are indeed correct that font attribution recognition is only
>> available via the legacy engine (e.g. *--oem 0*), but conversely when
>> using the LSTM and tesstrain although it outputs in the end tags that look
>> like *lang="..."*, the tesstrain documentation does not seem to indicate
>> we are training tesseract *only* for a language, and not just for a
>> font, e.g.
>> I would assume that if it's recognizing the "language", it's
>> doing something similar to when it's recognizing the font, am I wrong? Or
>> does the *lang* attribute indicate the first result that potentially
>> matched, and not the *closest* result that matched (e.g. the closest
>> trained font), which would indeed make it different than *x_font*?
>>
>> On Tuesday, January 2, 2024 at 8:37:17 PM UTC-5 tfmo...@gmail.com wrote:
>>
>>> Font attribute recognition is a legacy engine thing only, ie it doesn't
>>> exist in the new LSTM engine for Tess 4/5.
>>>
>>> On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:
>>>
>>> The problem is, even after training a few different ways with
>>> *tesstrain* (e.g. adjusting *exposure* options, *char_spacing* options,
>>> etc), when I output to hocr (e.g. using the command *tesseract
>>> sherlock-holmes-example.png output -l
>>> ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1
>>> hocr*) it still seems to get the font info wrong (see attached files
>>> for a sample input and output).
>>>
>>> As an example, I was hoping the word "*coup-de-maitres*" would be
>>> recognized with *lang='ITC-New-Baskerville-Std-Italic'*, but it isn't.
>>> Conversely, the word "testifying" shows with
>>> *lang='ITC-New-Baskerville-Std-Italic'*, but it is not italic.
>>>
>>> You appear to be training the font as a language, which is why it's
>>> getting output with the `lang=` tag. That's wrong and it should be
>>> `x_font <font>` in the title, if it's actually recognizing it as a font and
>>> outputting it as such. The HOCR will also contain <em> tags for italic
>>> words if an italic font is recognized.
>>>
>>> I tried using `--oem 0` with the eng model from
>>> https://github.com/tesseract-ocr/tessdata and it did output <strong>
>>> and <em> tags, but in the wrong places and its accuracy on the text wasn't
>>> as good as the LSTM model.
>>> When I used eng+fra, it output language tags,
>>> but at the paragraph level, not the word level, and they were mostly wrong.
>>> I've attached the output.
>>>
>>> You can read more about the state of play of getting font attributes out
>>> of the current model here (it's possible, but don't look for it any time
>>> soon):
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-3278142444
>>>
>>> Tom
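
On option 2 above, one cheap post-processing pass is to read the style hints that already appear in the hOCR (the per-word `lang=` attribute, `x_font` in the `title`, and the `<em>` / `<strong>` tags mentioned in the thread) instead of detecting italics from pixels. Below is a minimal Python sketch of that idea using only the standard library. The class name, the function name, and the sample hOCR string are made up for illustration, and the sketch assumes each word is a `span` with class `ocrx_word`. It only reports what Tesseract wrote, so it inherits any misplacement of those tags.

```python
import re
from html.parser import HTMLParser


class HocrWordStyles(HTMLParser):
    """Collect text, lang, x_font, italic and bold flags for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # dict for the word being read, or None
        self._depth = 0       # nesting depth of <span> inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            classes = (attrs.get("class") or "").split()
            if tag == "span" and "ocrx_word" in classes:
                # x_font is only present when Tesseract emitted font info.
                font = re.search(r"x_font\s+([^;]+)", attrs.get("title") or "")
                self._current = {
                    "text": "",
                    "lang": attrs.get("lang"),  # per-word override, if any
                    "font": font.group(1).strip() if font else None,
                    "italic": False,
                    "bold": False,
                }
                self._depth = 1
            return
        if tag == "span":
            self._depth += 1
        elif tag == "em":
            self._current["italic"] = True
        elif tag == "strong":
            self._current["bold"] = True

    def handle_endtag(self, tag):
        if self._current is None or tag != "span":
            return
        self._depth -= 1
        if self._depth == 0:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def extract_word_styles(hocr_text):
    """Return one dict per word found in the hOCR markup."""
    parser = HocrWordStyles()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, written by hand for this example.
SAMPLE = """
<p class='ocr_par' lang='eng'>
 <span class='ocrx_word' id='word_1_1' title='bbox 10 10 90 30; x_wconf 91; x_font Example_Italic; x_fsize 12'><em>coup-de-maitres</em></span>
 <span class='ocrx_word' id='word_1_2' title='bbox 100 10 180 30; x_wconf 95' lang='fra'>testifying</span>
</p>
"""

if __name__ == "__main__":
    for word in extract_word_styles(SAMPLE):
        print(word)
```

The sketch records only the per-word `lang=` override. A word without one would take the paragraph default, which this code does not track.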