Thank you Merlijn, this is very helpful. I'm very interested in IA's
process, so I'll take a deep dive into those tools. This confirms my
suspicion that there's no way to use an off-the-shelf text editor with a
glyphless font. I'll explore these hOCR editor options.

All the best,
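
For anyone following along, here is a rough sketch of the first half of the
split workflow discussed below: ask Tesseract for hOCR instead of a PDF, then
list the words it was least sure about so they can be checked in an hOCR
editor. The function and class names and the confidence threshold of 80 are
my own illustrative choices, not part of Tesseract or of any tool mentioned
in this thread. It assumes the hOCR uses the usual `ocrx_word` spans with
`bbox` and `x_wconf` in the `title` attribute, and it does not handle spans
nested inside a word span.

```python
import re
import subprocess
from html.parser import HTMLParser


def tesseract_hocr_command(image_path, output_base, lang="eng"):
    """Build a command line asking Tesseract for hOCR output.

    "hocr" is the stock Tesseract config name; recent versions should
    write <output_base>.hocr next to the given output base.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


class HocrWords(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def low_confidence_words(hocr_text, threshold=80.0):
    """Return the words whose x_wconf is below the (arbitrary) threshold."""
    parser = HocrWords()
    parser.feed(hocr_text)
    return [w for w in parser.words
            if w["conf"] is not None and w["conf"] < threshold]


# Possible usage, assuming tesseract is on PATH and page.png exists:
#   subprocess.run(tesseract_hocr_command("page.png", "page"), check=True)
#   with open("page.hocr", encoding="utf-8") as f:
#       for word in low_confidence_words(f.read()):
#           print(word["conf"], word["bbox"], word["text"])
```

The resulting hOCR file, once corrected, would then be handed together with
the source image to a separate PDF generation tool such as the one Merlijn
mentions below.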

On Fri, Mar 8, 2024 at 7:03 AM Merlijn B.W. Wajer <merl...@archive.org>
wrote:

> Hi Mark,
>
> On 07/03/2024 20:53, Mark Pellegrino wrote:
> > I found more info here:
> >
> > https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
> >
> > Glyphless appears to be an 'invisible font' and all that Tesseract
> > supports. It seems like the solution is to use Tesseract to generate
> > hOCR, then use another tool to combine the source image with the hOCR?
> >
> > Does anyone have a simple workflow for editing/correcting Tesseract OCR
> > documents that they can share?
>
> If you're looking to do OCR and PDF generation separately, you might
> want to look into the Internet Archive's PDF generation tooling, which
> is designed to do exactly this (plus some aggressive compression):
> https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm
> the author of the tooling)
>
> As for viewing and editing hOCR, there are a lot of different tools
> around, not all fully functional (I haven't tried most of these):
>
> * https://www.not-implemented.de/hocr-proofreader/
> * https://github.com/kba/hocrjs
> * https://github.com/GeReV/hocr-editor-ts /
> https://github.com/GeReV/HocrEditor
>
> There are also some GUI tools that I recall for editing hOCR, but they
> might require you to convert to another format first.
>
> Regards,
> Merlijn
>
>
> >
> > Thanks again,
> >
> > On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
> >
> >     Hello,
> >     I'm trying to check PDFs made with Tesseract 5.2 for correctness
> >     using an OCR editor but am unable to open them in either Abbyy or
> >     Acrobat.
> >
> >     If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
> >     the software just hangs and crashes. I can open Tesseract PDFs with
> >     Acrobat Pro, but when I enable the 'Make OCR text visible' option
> >     in Preflight, all of the text layer turns into unreadable black
> >     boxes. The font used shows as 'GlyphLessFont' and appears to be
> >     embedded in the file.
> >
> >     It doesn't matter what training data I use, or what the source image
> >     was, I always get these results. Any other non-Tesseract made PDF
> >     works just fine. I'm guessing that the issue is a missing font? I
> >     don't have much of an understanding about how embedded PDF fonts
> >     work and I haven't found anything about this in the Tesseract docs.
> >     Can someone please point me in the right direction? Thanks.
> >
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to tesseract-ocr+unsubscr...@googlegroups.com
> > <mailto:tesseract-ocr+unsubscr...@googlegroups.com>.
> > To view this discussion on the web visit
> >
> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
> <
> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer
> >.
>

