[tesseract-ocr] Re: Article scanning: hocr output wrong after font training?

Scott Goci Wed, 03 Jan 2024 07:17:46 -0800

You are indeed correct that font attribute recognition is only available 
via the legacy engine (e.g. *--oem 0*). Conversely, when using the LSTM 
engine and tesstrain, the output ends up with tags that look like 
*lang="..."*, but the tesstrain documentation does not seem to say that we 
are training tesseract *only* for a language and not for a font. I would 
assume that if it's recognizing the "language", it's doing something 
similar to what it does when recognizing the font. Am I wrong? Or does the 
*lang* attribute indicate the first result that potentially matched, and 
not the *closest* result that matched (e.g. the closest trained font), 
which would indeed make it different from *x_font*?
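
To make the distinction between the two attributes concrete, here is a minimal Python sketch that lists, for each word in an hOCR file, the *lang* attribute on the word span and any *x_font* entry in its title. The sample markup is hand-written for illustration (it is not real tesseract output), the font name in it is hypothetical, and the helper name *word_attrs* is my own. It assumes words are emitted as spans with class *ocrx_word*.

```python
import re
from html.parser import HTMLParser


class _WordAttrParser(HTMLParser):
    """Collects text, lang and x_font for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any
        self._depth = 0       # nesting depth of tags inside that word (e.g. <em>)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._current is not None:
            self._depth += 1
            return
        if tag == "span" and "ocrx_word" in (attr.get("class") or "").split():
            # x_font, when present, is a "; "-separated property in the title
            match = re.search(r"x_font\s+([^;]+)", attr.get("title") or "")
            self._current = {
                "text": "",
                "lang": attr.get("lang"),
                "x_font": match.group(1).strip() if match else None,
            }
            self._depth = 0

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self._current["text"] = self._current["text"].strip()
        self.words.append(self._current)
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def word_attrs(hocr):
    """Return one dict per word: {'text', 'lang', 'x_font'}."""
    parser = _WordAttrParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written illustrative snippet; "Some_Font_Italic" is a made-up name.
SAMPLE_HOCR = """
<p class='ocr_par' lang='ITC-New-Baskerville-Std'>
 <span class='ocrx_word' title='bbox 10 10 90 30; x_wconf 91'
       lang='ITC-New-Baskerville-Std-Italic'>testifying</span>
 <span class='ocrx_word' title='bbox 100 10 240 30; x_wconf 88'>coup-de-maitres</span>
 <span class='ocrx_word'
       title='bbox 250 10 300 30; x_wconf 90; x_font Some_Font_Italic; x_fsize 12'><em>example</em></span>
</p>
"""

for word in word_attrs(SAMPLE_HOCR):
    print(word["text"], "lang =", word["lang"], "x_font =", word["x_font"])
```

Running something like this over the real output would show at a glance whether any word carries an *x_font* entry at all, or only a *lang* attribute.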


On Tuesday, January 2, 2024 at 8:37:17 PM UTC-5 tfmo...@gmail.com wrote:

> Font attribute recognition is a legacy engine thing only, i.e. it doesn't 
> exist in the new LSTM engine for Tess 4/5.
>
> On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:
>
>
> The problem is, even after training a few different ways with *tesstrain* 
> (e.g. adjusting *exposure* options, *char_spacing* options, etc.), when I 
> output to hocr (e.g. using the command *tesseract 
> sherlock-holmes-example.png output -l 
> ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c 
> hocr_font_info=1 hocr*) it still seems to get the font info wrong (see 
> attached files for a sample input and output). 
>
> As an example, I was hoping the word "*coup-de-maitres*" would be 
> recognized with *lang='ITC-New-Baskerville-Std-Italic'*, but it isn't. 
> Conversely, the word "testifying" shows with 
> *lang='ITC-New-Baskerville-Std-Italic'*, but it is not italic.
>
>
> You appear to be training the font as a language, which is why it's 
> getting output with the `lang=` tag. That's wrong and it should be `x_font 
> <font>` in the title, if it's actually recognizing it as a font and 
> outputting it as such. The HOCR will also contain <em> tags for italic 
> words if an italic font is recognized. 
>
> I tried using `--oem 0` with the eng model from 
> https://github.com/tesseract-ocr/tessdata and it did output <strong> and 
> <em> tags, but in the wrong places, and its accuracy on the text wasn't as 
> good as the LSTM model. When I used eng+fra, it output language tags, but 
> at the paragraph level, not the word level, and they were mostly wrong. 
> I've attached the output.
>
> You can read more about the state of play of getting font attributes out 
> of the current model here (it's possible, but don't look for it any time 
> soon):
>
> https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-3278142444
>
> Tom
>
