Re: [tesseract-ocr] libtesseract skip OCR, just create invisible text layer

Merlijn Wajer Wed, 12 Jul 2023 09:57:04 -0700

Hi,

On Tue Jul 4 22:40:16 2023 lbr <lbr7...@gmail.com> wrote:
> I'm trying to create a searchable pdf out of a scanned one. I want to
> use Textract as an OCR engine instead of Tesseract. Is there a way to
> make libtesseract skip the OCR step and just create the invisible text
> layer (with the extracted chars from Textract) and apply it to the
> input pdf?
>
> I read that libtesseract is what ocrmypdf uses to create the invisible
> text layer for searchable pdfs.


You can use archive-pdf-tools to do this: 
https://github.com/internetarchive/archive-pdf-tools

It has a Python port of Tesseract's text layer generation and takes hOCR as
input (you can convert other OCR formats to hOCR). Note that its output is not
yet 100% identical to Tesseract's: I am still trying to find the difference/bug
in my port.
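
As a rough illustration of the "convert other OCR formats to hOCR" step, here
is a minimal Python sketch that turns word boxes into a one-page hOCR document.
The input layout (lines of words with normalized left/top/width/height values,
similar in spirit to the bounding boxes Textract reports) and the names
words_to_hocr and _bbox are assumptions made for this example; they are not
part of archive-pdf-tools. A real converter would likely also need paragraph
and block elements, and possibly confidence values.

```python
from html import escape


def _bbox(box, page_width, page_height):
    """Scale a normalized box to integer pixel coordinates (x0, y0, x1, y1)."""
    x0 = round(box["left"] * page_width)
    y0 = round(box["top"] * page_height)
    x1 = round((box["left"] + box["width"]) * page_width)
    y1 = round((box["top"] + box["height"]) * page_height)
    return x0, y0, x1, y1


def words_to_hocr(lines, page_width, page_height):
    """Build a one-page hOCR document as a string.

    `lines` is a hypothetical structure: a list of lines, each a list of
    dicts with "text", "left", "top", "width", "height" (box values in 0..1).
    """
    out = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'/></head><body>",
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>",
    ]
    for line_no, words in enumerate(lines, start=1):
        if not words:
            continue  # nothing to emit for an empty line
        boxes = [_bbox(w, page_width, page_height) for w in words]
        # The line box is the union of its word boxes.
        lx0 = min(b[0] for b in boxes)
        ly0 = min(b[1] for b in boxes)
        lx1 = max(b[2] for b in boxes)
        ly1 = max(b[3] for b in boxes)
        out.append(
            f"<span class='ocr_line' id='line_1_{line_no}' "
            f"title='bbox {lx0} {ly0} {lx1} {ly1}'>"
        )
        for word_no, (w, b) in enumerate(zip(words, boxes), start=1):
            out.append(
                f"<span class='ocrx_word' id='word_1_{line_no}_{word_no}' "
                f"title='bbox {b[0]} {b[1]} {b[2]} {b[3]}'>"
                f"{escape(w['text'])}</span>"
            )
        out.append("</span>")
    out.append("</div></body></html>")
    return "\n".join(out)
```

The page width and height should be the pixel dimensions of the scanned page
image, so that the boxes line up with the image the text layer is placed over.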

I am the author, so feel free to reach out if you have any questions.

Regards,
Merlijn
-- 
Sent from my Motorola Droid 4 running Maemo Leste (Beowulf)

