[tesseract-ocr] OCRmyPDF and Tesseract not making PDFs searchable

Filippos Koliopanos Mon, 03 Jul 2023 12:24:41 -0700

Hello,

I have been trying to make PDFs searchable using OCRmyPDF and Tesseract, 
but despite following recommended steps, I have been unable to get the 
desired results.

Here is a summary of the issues I have faced:

1. Initially, I ran OCRmyPDF on a PDF document (created by exporting a PNG
image to PDF via GIMP) using the command
`ocrmypdf -l eng OCR_test_eng.pdf outputOCR.pdf`.
The process completed without errors, but the output PDF was not searchable.

2. I then updated my Tesseract to version
5.3.1+git6228-24da4c71-1ppa1~jammy1, hoping it might resolve the problem.
However, the issue persisted.

3. I also attempted using the `--force-ocr` option with OCRmyPDF, but the
output PDF remained unsearchable. Interestingly, for a scanned PDF
document, OCRmyPDF indicated that the document already had text, even
though it was not searchable.

4. To rule out problems with OCRmyPDF, I tried using pdfsandwich for OCR.
However, it reported that Tesseract was unable to produce a PDF output
file, suggesting that the problem might be with Tesseract itself.

5. I am running these commands on Ubuntu 22.04.2 LTS.
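
One way to narrow this down might be to test each layer separately: Tesseract
alone on the original image, then OCRmyPDF, checking each output with
`pdftotext` (from poppler-utils) instead of a PDF viewer's search box. The
sketch below is only an illustration under assumptions: the image file name
`OCR_test_eng.png`, the output base `tess_direct`, and the helper names
`diagnostic_commands` and `run_all` are hypothetical and not part of the steps
above.

```python
import shutil
import subprocess


def diagnostic_commands(image, pdf_in, pdf_out, lang="eng", tess_base="tess_direct"):
    """Build the checks as argument lists, in the order they should be tried."""
    return [
        # Which Tesseract is on PATH, and is the language data installed?
        ["tesseract", "--version"],
        ["tesseract", "--list-langs"],
        # Tesseract alone: the trailing "pdf" config writes <tess_base>.pdf
        ["tesseract", image, tess_base, "-l", lang, "pdf"],
        # Dump any text layer to stdout ("-" means stdout for pdftotext)
        ["pdftotext", tess_base + ".pdf", "-"],
        # The OCRmyPDF run from step 3, followed by the same text check
        ["ocrmypdf", "--force-ocr", "-l", lang, pdf_in, pdf_out],
        ["pdftotext", pdf_out, "-"],
    ]


def run_all(commands):
    """Run each command, printing its exit code and output; skip missing tools."""
    for cmd in commands:
        print("$ " + " ".join(cmd))
        if shutil.which(cmd[0]) is None:
            print("  skipped: " + cmd[0] + " is not on PATH")
            continue
        result = subprocess.run(cmd, capture_output=True, text=True, errors="replace")
        print("  exit code:", result.returncode)
        print((result.stdout + result.stderr).strip())


if __name__ == "__main__":
    # File names are assumptions: the PNG is the image the test PDF was exported from.
    run_all(diagnostic_commands("OCR_test_eng.png", "OCR_test_eng.pdf", "outputOCR.pdf"))
```

If `pdftotext` prints nothing for the Tesseract-only PDF, the problem is more
likely in the Tesseract installation or its language data than in OCRmyPDF. If
it does print text, the PDF has a text layer and the viewer's search may be
what is failing.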

I have had no success with previous attempts at using Tesseract for OCR on
Linux, and I'm hoping to finally resolve this issue. Any guidance would be
greatly appreciated.

Best,
Filippos