Hi Mark,

On 08/03/2024 20:24, Mark Pellegrino wrote:
> Thank you Merlijn, this is very helpful. I'm very interested in IA's process so I'll have a deep dive through those tools. This confirms my suspicions that there's no way to use an off-the-shelf text editor with a glyphless font. I'll explore these hOCR editor options. All the best,
As I understand it, the main reason that there is no 'editor' for PDFs with text is that the text in PDFs is inherently not structured in a hierarchical manner, so by going from hOCR (or another format) -> PDF text you lose a lot of structure. Even the PDF text reading order might differ per PDF renderer - it's just text rendered in a coordinate space, so it's not a particularly good fit for 'editing'.
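To make the point about lost structure a bit more concrete, here is a rough Python sketch. Everything in it is made up for illustration: the `Word` class, the two-column sample page and both reading functions are assumptions, not part of hOCR, Tesseract or any PDF library, and real renderers use more elaborate heuristics than a plain coordinate sort.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units (origin at the top of the page)


# Hypothetical two-column page kept as a hierarchy, roughly the way hOCR
# nests things: paragraphs -> lines -> words.
page = [
    [  # paragraph in the left column
        [Word("Left", 10, 10), Word("one", 40, 10)],
        [Word("left", 10, 30), Word("two", 40, 30)],
    ],
    [  # paragraph in the right column
        [Word("Right", 110, 10), Word("one", 150, 10)],
        [Word("right", 110, 30), Word("two", 150, 30)],
    ],
]


def flatten(page):
    """Drop the hierarchy and keep only positioned words, as PDF text does."""
    return [word for paragraph in page for line in paragraph for word in line]


def read_structured(page):
    """Reading order taken from the hierarchy itself."""
    return " ".join(word.text for word in flatten(page))


def read_by_position(words):
    """Naive reading order from coordinates only: top to bottom, then left to right."""
    ordered = sorted(words, key=lambda word: (word.y, word.x))
    return " ".join(word.text for word in ordered)


structured = read_structured(page)
positional = read_by_position(flatten(page))
print(structured)
print(positional)
```

With the hierarchy, the two columns are read one after the other. With only coordinates and this naive sort, the columns get interleaved line by line, and a different heuristic could give yet another order.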
Regards, Merlijn