There is a related thread on Stack Overflow that might be helpful for your 
processing [1]. The thread is about italics and bolding; font detection 
seems a tougher challenge. This repository [2] has links to Adobe's work in the 
area and has an interesting implementation. In either case you would probably 
still want Tesseract to get the bounding boxes for the characters.
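
In case it is useful, here is a minimal sketch of reading bounding boxes back out of Tesseract's hOCR output using only the Python standard library. It is my own illustration, not code from either link: the names HocrWordParser and hocr_word_boxes and the sample markup are made up, and it only assumes that words appear as spans with class "ocrx_word" whose title attribute starts with "bbox x0 y0 x1 y1". It works at the word level; the same idea should carry over to character boxes.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []    # text fragments seen inside that span
        self._depth = 0    # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # a nested tag such as <em> or <strong>
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1  # closing a nested tag, word span still open
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def hocr_word_boxes(hocr):
    """Return a list of (word_text, (x0, y0, x1, y1)) from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Hand-written sample for illustration only, not real Tesseract output.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 12 60 38; x_wconf 96'>never</span>
  <span class='ocrx_word' title='bbox 70 12 120 38; x_wconf 93'><em>said</em></span>
</span>
"""

for text, box in hocr_word_boxes(SAMPLE_HOCR):
    print(text, box)
```

Each box could then be used to crop the word out of the page image and hand it to whatever italic/bold classifier you end up choosing.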

Best,

art
---
1. https://stackoverflow.com/questions/67577793/detecting-bold-and-italic-text-in-an-image
2. https://github.com/robinreni96/Font_Recognition-DeepFont

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf 
Of Scott Goci
Sent: Friday, January 5, 2024 12:48 PM
To: tesseract-ocr <tesseract-ocr@googlegroups.com>
Subject: [tesseract-ocr] Re: Article scanning: hocr output wrong after font 
training?

Hey Tom,

Overall thanks for your guidance here, I appreciate our back and forth!

RE: "[...] do you really *need* the italics?", I think there is actually a lot 
lost without font attributes (e.g. bold / italic / underline). Consider the 
following sentences / quotes:

  *   "I never said she stole the money"
  *   "I never said she stole the money"
  *   "I never said she stole the money"
  *   "I never said she stole the money"
The meaning of the above varies drastically depending on which word (if any) 
is italicized.

For other font attributes (e.g. bold/underline) the case for implementation 
isn't as strong, but I still believe we miss some things. E.g. consider the 
following:

  *   Not ten eggs, eaten eggs (here, underlining helps emphasize the 
specific part of the text that changes the meaning of the words at hand)
  *   Scott: What is your biggest accomplishment? (in an interview 
context, highlighting who is asking the question, especially if there is a 
different person responding)
----

I can definitely try other OCR packages, but as this is the biggest 
non-commercial OCR library, I assume other non-commercial OCR libraries might 
not yield results as good. I can also try commercial libraries as you suggest, 
although then I would be beholden to potentially large pricing schemes.

Let me know if you have any final thoughts, but otherwise I'll take the advice 
you've given and go from here!

On Friday, January 5, 2024 at 11:27:10 AM UTC-5 tfmo...@gmail.com wrote:
On Friday, January 5, 2024 at 9:30:05 AM UTC-5 sco...@gmail.com wrote:
Would you offer any suggestions as to next steps I could take from here? E.g. 
it seems my options are:

  1.  I can go back and train the legacy engine (e.g. --oem 0) on the fonts as 
well (I've been using this guide: 
https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/),
 and hope the results improve enough
  2.  I can use some sort of post-processing step after tesseract to detect 
italics / bold / etc (although I'm not sure what tools/software/library I'd use 
here, so I'd really need suggestions)
  3.  I could wait and hope the roadmap for adding back WordFontAttributes to 
the non-legacy engine becomes a priority
  4.  Something else perhaps?

I'm afraid I don't have any magic solutions (or even good suggestions). The 
only thing I can offer is to perhaps not be so fixated on Tesseract as a 
solution.

- would a different OCR package (including commercial) give you better results?
- do you really *need* the italics?
- could you implement a crowdsourced annotation facility that let people add 
the italics later?

Good luck!

Tom
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a279b97d-feca-4650-a22e-c8e8cc4a39c2n%40googlegroups.com.

