Thank you, Merlijn, this is very helpful.

I'm very interested in IA's process, so I'll take a deep dive into those tools. This confirms my suspicion that there's no way to use an off-the-shelf OCR editor on a PDF whose text layer uses the glyphless font. I'll explore these hOCR editor options.

All the best,
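For anyone following the hOCR route discussed in this thread, here is a minimal sketch of a first review pass: it reads an hOCR page and lists the words whose recognition confidence falls below a threshold, so they can be checked by hand before the PDF is generated. It assumes the hOCR contains `ocrx_word` spans whose `title` attribute carries `bbox` and `x_wconf` properties, which is what Tesseract's hOCR output (for example from `tesseract page.png page hocr`) normally looks like. The names `HocrWordParser`, `parse_title`, and `words_to_review`, the threshold of 80, and the sample markup are all made up for illustration and are not part of any tool mentioned here.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Parse an hOCR title attribute such as 'bbox 36 92 96 116; x_wconf 95'."""
    props = {"bbox": None, "conf": None}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        if fields[0] == "bbox" and len(fields) == 5:
            props["bbox"] = tuple(int(v) for v in fields[1:])
        elif fields[0] == "x_wconf" and len(fields) == 2:
            props["conf"] = float(fields[1])
    return props


class HocrWordParser(HTMLParser):
    """Collect text, bounding box, and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span
        self._depth = 0       # nesting depth inside the current word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1  # nested markup inside a word, e.g. <strong>
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._current = {"text": "", **parse_title(attrs.get("title") or "")}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def words_to_review(hocr_text, threshold=80.0):
    """Return words whose x_wconf is below the (arbitrary) threshold."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return [w for w in parser.words
            if w["conf"] is not None and w["conf"] < threshold]


# Hand-written sample in the shape of Tesseract hOCR output (illustrative only).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 300 116'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 41'>w0rld</span>
 </span>
</div>
"""

if __name__ == "__main__":
    for word in words_to_review(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

The sketch only reports suspect words; writing corrections back into the hOCR file and combining it with the source images is left to the editors and PDF tooling linked in the quoted message.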
On Fri, Mar 8, 2024 at 7:03 AM Merlijn B.W. Wajer <merl...@archive.org> wrote:

> Hi Mark,
>
> On 07/03/2024 20:53, Mark Pellegrino wrote:
> > I found more info here:
> > https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
> >
> > Glyphless appears to be an 'invisible font' and all that Tesseract
> > supports. It seems like the solution is to use Tesseract to generate
> > hOCR, then use another tool to combine the source image with the hOCR?
> >
> > Does anyone have a simple workflow for editing/correcting Tesseract OCR
> > documents that they can share?
>
> If you're looking to do OCR and PDF generation separately, you might
> want to look into the Internet Archive's PDF generation tooling, which
> is designed to do exactly this (plus some aggressive compression):
> https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm
> the author of the tooling)
>
> As for viewing and editing hOCR, there's a lot of different tools
> around, not all fully functional (I haven't tried most of these):
>
> * https://www.not-implemented.de/hocr-proofreader/
> * https://github.com/kba/hocrjs
> * https://github.com/GeReV/hocr-editor-ts /
>   https://github.com/GeReV/HocrEditor
>
> There are also some GUI tools that I recall for editing hOCR, but they
> might require you to convert to another format first.
>
> Regards,
> Merlijn
>
> > Thanks again,
> >
> > On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
> >
> >     Hello,
> >     I'm trying to check PDFs made with Tesseract 5.2 for correctness
> >     using an OCR editor but am unable to open them in either Abbyy or
> >     Acrobat.
> >
> >     If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
> >     the software just hangs and crashes. I can open Tesseract PDFs with
> >     Acrobat Pro, but when I enable the 'Make OCR text visible' option
> >     in Preflight, all of the text layer turns into unreadable black
> >     boxes. The font used shows as 'GlyphLessFont' and appears to be
> >     embedded in the file.
> >
> >     It doesn't matter what training data I use, or what the source image
> >     was, I always get these results. Any other non-Tesseract made PDF
> >     works just fine. I'm guessing that the issue is a missing font? I
> >     don't have much of an understanding about how embedded PDF fonts
> >     work and I haven't found anything about this in the Tesseract docs.
> >     Can someone please point me in the right direction? Thanks.
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to tesseract-ocr+unsubscr...@googlegroups.com
> > <mailto:tesseract-ocr+unsubscr...@googlegroups.com>.
> > To view this discussion on the web visit
> > https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
> > <https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer>.