There is a related thread on Stack Overflow that might be helpful for your processing [1]. That thread is about detecting italics and bold, but font detection seems a tougher challenge. This repository [2] links to Adobe's work in the area and has an interesting implementation. In either case, you would probably still want Tesseract to get the bounding boxes for the characters.
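In case it helps, here is a minimal sketch of the bounding-box step, assuming you already have Tesseract's character box output as text (one line per symbol: symbol, left, bottom, right, top, page, with the origin at the bottom-left of the image). The names `CharBox` and `parse_box_lines` are made up for illustration. It only converts the boxes to top-left-origin pixel coordinates so each character can be cropped and passed to whatever bold/italic classifier you end up using; it does not detect any font attributes itself.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """One character box in top-left-origin pixel coordinates (hypothetical helper type)."""
    char: str
    left: int
    top: int
    right: int
    bottom: int


def parse_box_lines(box_text: str, image_height: int) -> List[CharBox]:
    """Parse box-file text ("symbol left bottom right top page") into CharBox items.

    Tesseract box coordinates use a bottom-left origin, so the y values are
    flipped against the image height to get the usual top-left origin.
    """
    boxes: List[CharBox] = []
    for line in box_text.splitlines():
        # Split from the right so the five numeric fields are always last.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        try:
            left, bottom, right, top = (int(p) for p in parts[1:5])
        except ValueError:
            continue  # skip lines whose coordinates are not integers
        boxes.append(
            CharBox(
                char=symbol,
                left=left,
                top=image_height - top,
                right=right,
                bottom=image_height - bottom,
            )
        )
    return boxes


if __name__ == "__main__":
    sample = "I 10 20 18 40 0\nn 22 20 34 33 0"
    for box in parse_box_lines(sample, image_height=100):
        print(box)
```

If you use pytesseract, `pytesseract.image_to_boxes(image)` should return text in this format, and with Pillow each box can be cropped via `image.crop((box.left, box.top, box.right, box.bottom))`. Treat both as assumptions to check against your own setup.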
Best,
art

---
1. https://stackoverflow.com/questions/67577793/detecting-bold-and-italic-text-in-an-image
2. https://github.com/robinreni96/Font_Recognition-DeepFont

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf Of Scott Goci
Sent: Friday, January 5, 2024 12:48 PM
To: tesseract-ocr <tesseract-ocr@googlegroups.com>
Subject: [tesseract-ocr] Re: Article scanning: hocr output wrong after font training?

Hey Tom,

Overall, thanks for your guidance here, I appreciate our back and forth!

RE: "[...] do you really *need* the italics?", I think there is actually a lot lost without font attributes (e.g. bold / italic / underline). Consider the following sentences / quotes:

* "I never said she stole the money"
* "I never said she stole the money"
* "I never said she stole the money"
* "I never said she stole the money"

The context of the above varies drastically depending on which word (if any) is italicized.

For other font attributes (e.g. bold / underline) the case for implementation isn't as strong, but I still believe we miss some things. E.g. consider the following:

* Not ten eggs, eaten eggs (e.g. here, underlining helps emphasize a specific area of text that changes the context of the word at hand)
* Scott: What is your biggest accomplishment? (e.g. in an interview context, highlighting who is asking the question, especially if there is a different person responding)

----

I can definitely try other OCR packages, but as this is the biggest non-commercial OCR library, I assume other non-commercial OCR libraries might not yield results as good. I can also try commercial libraries as you suggest, although then I am beholden to potentially large pricing schemes.

Let me know if you have any final thoughts, but otherwise I'll take the advice you've given and go from here!

On Friday, January 5, 2024 at 11:27:10 AM UTC-5 tfmo...@gmail.com wrote:

On Friday, January 5, 2024 at 9:30:05 AM UTC-5 sco...@gmail.com wrote:

Would you offer any suggestions as to next steps I could take from here? E.g. it seems my options are:

1. I can go back and train the legacy engine (e.g. --oem 0) on the fonts as well (I've been using this guide: https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/), and hope the results improve enough that I get pretty good results
2. I can use some sort of post-processing step after tesseract to detect italics / bold / etc (although I'm not sure what tools/software/library I'd use here, so I'd really need suggestions)
3. I could wait and hope the roadmap for adding back WordFontAttributes to the non-legacy engine becomes a priority
4. Something else perhaps?

I'm afraid I don't have any magic solutions (or even good suggestions). The only thing I can offer is to perhaps not be so fixated on Tesseract as a solution.

- would a different OCR package (including commercial) give you better results?
- do you really *need* the italics?
- could you implement a crowdsourced annotation facility that lets people add the italics later?

Good luck!

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a279b97d-feca-4650-a22e-c8e8cc4a39c2n%40googlegroups.com.