[tesseract-ocr] Re: Article scanning: hocr output wrong after font training?

Scott Goci Fri, 05 Jan 2024 06:30:10 -0800

Hmmm -- makes sense (although unfortunate).

Would you offer any suggestions as to next steps I could take from here? 
E.g. it seems my options are:


   1. I can go back and train the legacy engine (i.e. *--oem 0*) on the 
   fonts as well (I've been using this guide: 
   https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/) 
   and hope the results improve enough to be usable
   2. I can add some sort of post-processing step after tesseract to detect 
   italics / bold / etc. (although I'm not sure which tools or libraries I'd 
   use for this, so I'd really need suggestions)
   3. I could wait and hope that adding WordFontAttributes back to the 
   non-legacy engine becomes a priority on the roadmap
   4. Something else perhaps?
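
For option 2, one possible starting point is simply to read back what the hOCR already says about each word: the inherited lang= value (paragraph default plus per-word override), any x_font entry in the title, and whether the word is wrapped in <em> or <strong>. Below is a minimal sketch in Python, assuming the hOCR file is well-formed XHTML. The function name words_with_style and the dictionary keys are made up for illustration, and this only reports what tesseract emitted; it does not make the recognition itself any more reliable.

```python
import re
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}span'."""
    return tag.rsplit("}", 1)[-1]


def words_with_style(hocr_text):
    """Return one dict per ocrx_word element found in an hOCR document.

    Hypothetical helper: the keys (text, lang, font, italic, bold) are
    illustrative names, not part of hOCR or of tesseract.
    """
    root = ET.fromstring(hocr_text)
    rows = []

    def walk(node, lang):
        # A lang attribute on this element overrides the inherited one.
        lang = node.get("lang", lang)
        classes = (node.get("class") or "").split()
        if "ocrx_word" in classes:
            # x_font only appears in the title when font info was written.
            match = re.search(r"x_font\s+([^;]+)", node.get("title", ""))
            tags = {local(e.tag) for e in node.iter()}
            rows.append({
                "text": "".join(node.itertext()).strip(),
                "lang": lang,
                "font": match.group(1).strip() if match else None,
                "italic": "em" in tags,
                "bold": "strong" in tags,
            })
            return
        for child in node:
            walk(child, lang)

    walk(root, None)
    return rows
```

Calling words_with_style(open("output.hocr", encoding="utf-8").read()) on a file produced by the command from earlier in the thread would give a list that can be filtered or counted, e.g. to see how often the italic model is reported per article before deciding whether any of the options above is worth pursuing.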

I don't mind putting in the work of learning / training / etc.; the main 
thing I'd be hesitant about is having to individually correct and clean up 
the ~20,000 or more articles that need to be parsed.

Let me know what you think!
On Thursday, January 4, 2024 at 2:29:22 PM UTC-5 tfmo...@gmail.com wrote:

> I believe it's returning what it considers to be the best matching model 
> (i.e. "lang"), but, if my experiments with eng+fra are any indication, the 
> recognition isn't reliable. If it has trouble distinguishing two 
> languages using the same character set, I doubt it can be counted on to 
> distinguish two closely related fonts from the same family.
>
> Tom
>
> p.s. My earlier comment about the lang= attribute being only at the 
> paragraph level was wrong. It outputs both a paragraph default and per-word 
> overrides for words which don't match the paragraph.
>
> On Wednesday, January 3, 2024 at 10:17:42 AM UTC-5 sco...@gmail.com wrote:
>
>> You are indeed correct that font attribute recognition is only 
>> available via the legacy engine (i.e. *--oem 0*). Conversely, when 
>> using the LSTM engine and tesstrain, the output ends up with tags that 
>> look like *lang="..."*, but the tesstrain documentation does not seem to 
>> say that we are training tesseract *only* for a language and not for a 
>> font. I would assume that if it's recognizing the "language", it's 
>> doing something similar to what it does when recognizing the font; am I 
>> wrong? Or does the *lang* attribute indicate the first result that 
>> potentially matched, and not the *closest* result that matched (e.g. the 
>> closest trained font), which would indeed make it different from *x_font*?
>>
>> On Tuesday, January 2, 2024 at 8:37:17 PM UTC-5 tfmo...@gmail.com wrote:
>>
>>> Font attribute recognition is a legacy engine thing only, ie it doesn't 
>>> exist in the new LSTM engine for Tess 4/5.
>>>
>>> On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:
>>>
>>>
>>> The problem is, even after training a few different ways with 
>>> *tesstrain* (e.g. adjusting *exposure* options, *char_spacing* options, 
>>> etc.), when I output to hocr (e.g. using the command *tesseract 
>>> sherlock-holmes-example.png output -l 
>>> ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1 
>>> hocr*) it still seems to get the font info wrong (see attached files 
>>> for a sample input and output). 
>>>
>>> As an example, I was hoping the word "*coup-de-maitres*" would be 
>>> recognized with *lang='ITC-New-Baskerville-Std-Italic'*, but it isn't. 
>>> Conversely, the word "testifying" shows with 
>>> *lang='ITC-New-Baskerville-Std-Italic'*, but it is not italic.
>>>
>>>
>>> You appear to be training the font as a language, which is why it's 
>>> getting output with the `lang=` tag. That's wrong and it should be `x_font 
>>> <font>` in the title, if it's actually recognizing it as a font and 
>>> outputting it as such. The HOCR will also contain <em> tags for italic 
>>> words if an italic font is recognized. 
>>>
>>> I tried using `--oem 0` with the eng model from 
>>> https://github.com/tesseract-ocr/tessdata and it did output <strong> 
>>> and <em> tags, but in the wrong places, and its accuracy on the text wasn't 
>>> as good as the LSTM model. When I used eng+fra, it output language tags, 
>>> but at the paragraph level, not the word level, and they were mostly wrong. 
>>> I've attached the output.
>>>
>>> You can read more about the state of play of getting font attributes out 
>>> of the current model here (it's possible, but don't look for it any time 
>>> soon):
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-3278142444
>>>
>>> Tom
>>>
>>

