Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-04-11 Thread Jeremiah
PRImA folks. I haven’t done much correction of hand-written materials but Aletheia seems flexible for a Windows environment and exports the PAGE format. You also can start

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-04-10 Thread Greg Jay
allows the use of the Aletheia editor [1] from the PRImA folks. I haven’t done much correction of hand-written materials but Aletheia seems flexible for a Windows environment and exports the PAGE format. You also can start

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-04-10 Thread Mark Pellegrino
for a Windows environment and exports the PAGE format. You also can start with hOCR and/or roundtrip between ALTO, hOCR, PAGE, and other XML formats with the ocr-fileformat project [2], which include

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-31 Thread Jeremiah
ng. Merlijn and the IA folks have great tools for combining hOCR and images to make a lightweight PDF if that’s your end-goal [3]. Best,

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Merlijn B.W. Wajer
Hi Mark, On 08/03/2024 20:24, Mark Pellegrino wrote: Thank you Merlijn, this is very helpful. I'm very interested in IA's process so I'll have a deep dive through those tools. This confirms my suspicions that there's no way to use an off-the-shelf text editor with a glyphless font. I'll explo

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Mark Pellegrino
Thank you Merlijn, this is very helpful. I'm very interested in IA's process so I'll have a deep dive through those tools. This confirms my suspicions that there's no way to use an off-the-shelf text editor with a glyphless font. I'll explore these hOCR editor options. All the best, On Fri, Mar

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Mark Pellegrino
Thanks Zdenko, PyMuPDF is an intriguing option. I'll check it out further. On Fri, Mar 8, 2024 at 6:14 AM Zdenko Podobny wrote: Hello, I am not sure if OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/) allows redaction. If you would like to implement a text layer by yourself with cust

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Merlijn B.W. Wajer
Hi Mark, On 07/03/2024 20:53, Mark Pellegrino wrote: I found more info here: https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277 Glyphless appears to be an 'invisible font' and all that Tesseract supports. It seems like the solution is to use Tesseract to generate hO
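
For anyone following the hOCR route discussed in this thread, here is a minimal sketch of how the verification step could start: read the word spans out of an hOCR file and flag the low-confidence ones for manual review. It uses only the Python standard library. The helper names (`HocrWordParser`, `parse_hocr_words`, `low_confidence_words`), the sample markup, and the threshold of 80 are assumptions made for illustration, not something posted in the thread. It relies on the `ocrx_word` class and the `bbox` / `x_wconf` properties that Tesseract writes into the `title` attribute, and it assumes word spans do not contain nested spans (which is not true if character boxes are enabled).

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumption: no nested <span> inside a word, so the first
        # closing </span> ends the current word.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr_text):
    """Return a list of dicts with 'text', 'bbox' and 'conf' keys."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def low_confidence_words(hocr_text, threshold=80.0):
    """Return the words whose x_wconf is below the (hypothetical) threshold."""
    return [
        w for w in parse_hocr_words(hocr_text)
        if w["conf"] is not None and w["conf"] < threshold
    ]


# Made-up hOCR fragment in the shape Tesseract produces.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
<span class='ocr_line' title='bbox 36 90 300 120'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Post</span>
<span class='ocrx_word' id='word_1_2' title='bbox 104 92 150 116; x_wconf 41'>0CR</span>
</span></div>"""

if __name__ == "__main__":
    for word in low_confidence_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

The flagged words keep their bounding boxes, so a reviewer (or an hOCR editor) can jump to the matching region of the page image before the corrected hOCR is turned into a PDF text layer.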

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Zdenko Podobny
Hello, I am not sure if OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/) allows redaction. If you would like to implement a text layer by yourself with a custom font, have a look at PyMuPDF: - https://github.com/pymupdf/PyMuPDF/discussions/775 (Adding text layer to a scanned PDF) - https://