Re: [tesseract-ocr] tesseract 4 on Debian Bullseye

Zdenko Podobny Sun, 23 Jan 2022 22:27:26 -0800

Please send the file that tesseract actually receives, the TIFF ;-)
The first thing you always need to check is the tesseract input. The input to
your script (the PDF) is not important at this stage.
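
A minimal sketch of what checking the tesseract input could look like, assuming ImageMagick's `identify` is installed alongside `convert`. The helper name `check_ocr_input` is hypothetical, and which image property (if any) matters for this problem is not established in this thread.

```shell
#!/bin/bash
# Hypothetical helper: sanity-check an image before handing it to tesseract.
check_ocr_input() {
    local img=$1
    # The file must exist and be non-empty.
    if [ ! -s "$img" ]; then
        echo "missing or empty: $img" >&2
        return 1
    fi
    echo "size: $(wc -c < "$img") bytes"
    # If ImageMagick is available, print format, geometry, bit depth,
    # and the channel list for each frame in the file.
    if command -v identify >/dev/null 2>&1; then
        identify -format '%m %wx%h depth=%z channels=%[channels]\n' "$img"
    fi
    return 0
}
```

For example, `check_ocr_input TIFFIn/PDFIn001.tiff` could be run just before the `tesseract` call.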

Zdenko


On Mon, 24 Jan 2022 at 4:44, Rich M <ramac...@gmail.com> wrote:

> Please provide details for reproducing problem: input image, output pdf,
> tesseract details (tesseract -v)
> tesseract-ocr:   Installed: 4.1.1-2.1
> convert, provided by imagemagick:  Installed: 8:6.9.11.60+dfsg-1.3 (The
> problem could also lie with convert, but I get the same results when I
> convert the PDF with GIMP.)
> My OS is Linux, Debian Bullseye (stable)
>
> I execute the script by
> $ ./PDF2SearchablePDF.sh ShockDataMeasurementsLessonsLearned.pdf
>
> The source PDF
> ShockDataMeasurementsLessonsLearned.pdf
>
> Split PDF pg 1
> PDFIn001.pdf
>
> Split PDF pg1 converted to .tiff with convert (imagemagick)
> PDFIn001.tiff
>
> Pg 1 after processing with tesseract
> PDFIn001Searchable.pdf
>
> Bash script:
> ###
> #!/bin/bash
> # Split the PDF into pages, rasterize each page, OCR it, and merge the results.
> SourcePDF=$1
> mkdir -p PDFIn PDFOut TIFFIn
> pdfseparate "$SourcePDF" PDFIn/PDFIn%03d.pdf
> #pdfseparate InputDoc02.pdf PDFIn/PDFIn%03d.pdf
> echo "$SourcePDF"
> # List the per-page PDFs by bare file name.
> cd PDFIn
> ls PDFIn*.pdf >../list.txt
> cd ..
>
> for FIL in $(<list.txt)
> do
> # Rasterize the page at 300 dpi, then OCR the TIFF into a searchable PDF.
> convert -density 300 "PDFIn/${FIL}" "TIFFIn/${FIL/.pdf/}.tiff"
> #gs -q -dNOPAUSE -r300x300 -sDEVICE=tiff32nc -sOutputFile=TIFFIn/${FIL/.pdf/}.tiff PDFIn/${FIL} -c quit
> tesseract "TIFFIn/${FIL/.pdf/}.tiff" "PDFOut/${FIL/.pdf/}" -l eng pdf
> done
>
> pdfunite PDFOut/PDFIn*.pdf OutputPDF.pdf
> ###
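
One thing that may be worth testing, although nothing in this thread confirms it is the cause: TIFFs that `convert` renders from a PDF can carry an alpha channel. The sketch below is a variation of the loop above that flattens any transparency onto white before OCR. `page_stem` and `ocr_page` are hypothetical names, and the directory layout is the one the script creates.

```shell
#!/bin/bash
# Hypothetical helper: "PDFIn/PDFIn001.pdf" -> "PDFIn001".
page_stem() {
    local name=${1##*/}   # drop the directory part
    echo "${name%.pdf}"   # drop the .pdf suffix
}

# Hypothetical helper: rasterize one page and OCR it into a searchable PDF.
ocr_page() {
    local page=$1 stem
    stem=$(page_stem "$page")
    # -background white -alpha remove flattens any transparency onto white;
    # -alpha off then drops the alpha channel from the output TIFF.
    convert -density 300 "$page" -background white -alpha remove -alpha off \
        "TIFFIn/${stem}.tiff" || return 1
    tesseract "TIFFIn/${stem}.tiff" "PDFOut/${stem}" -l eng pdf
}

# Usage, run from the directory that holds PDFIn, TIFFIn and PDFOut:
#   for page in PDFIn/PDFIn*.pdf; do ocr_page "$page"; done
```

Iterating over the glob directly avoids the intermediate list.txt, at the cost of differing slightly from the original script.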
>
>
> On Fri, Jan 21, 2022 at 5:48 PM Rich M <ramac...@gmail.com> wrote:
>
>> Sure. I'll need to find a test file that doesn't contain private
>> information.
>>
>> Before seeing your response now, I ran my script on a file that I had
>> converted to a searchable PDF last year and the output file was very poor.
>> Out of curiosity, I changed the converted image from .tiff to .png and the
>> result was very good. I'm wondering if it's something with the convert
>> package.
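
The TIFF versus PNG observation can be checked page by page. A rough sketch, assuming `convert` and `tesseract` are on the PATH; `count_chars` and `ocr_char_count` are hypothetical names, and the character count is only a crude proxy for OCR quality.

```shell
#!/bin/bash
# Hypothetical helper: count the non-whitespace characters read from stdin.
count_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical helper: rasterize one single-page PDF to the given format and
# report how many characters tesseract recognizes in the result.
ocr_char_count() {
    local page=$1 fmt=$2 dir count status
    dir=$(mktemp -d) || return 1
    convert -density 300 "$page" "$dir/page.$fmt"
    status=$?
    if [ "$status" -eq 0 ]; then
        # "stdout" as the output base makes tesseract print the text.
        count=$(tesseract "$dir/page.$fmt" stdout -l eng 2>/dev/null | count_chars)
        echo "$fmt: $count characters"
    fi
    rm -rf "$dir"
    return "$status"
}

# Usage:
#   ocr_char_count PDFIn/PDFIn001.pdf tiff
#   ocr_char_count PDFIn/PDFIn001.pdf png
```

A large gap between the two counts for the same page would point at the intermediate image rather than at tesseract itself.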
>>
>> Rich
>>
>> On Wednesday, January 19, 2022 at 11:18:20 PM UTC-7 zdenop wrote:
>>
>>> Please provide details for reproducing problem: input image, output pdf,
>>> tesseract details (tesseract -v)
>>>
>>> Zdenko
>>>
>>>
>>> On Thu, 20 Jan 2022 at 5:03, Rich M <rama...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm fairly new to tesseract and had written a bash script on Debian
>>>> Buster (the previous release) using tesseract 3, which worked very well.
>>>> I've since upgraded my OS to the next stable release, Bullseye, which also
>>>> upgraded tesseract to v4. After the upgrade, tesseract no longer works,
>>>> and I need help troubleshooting the issue.
>>>>
>>>> Basically the important line of the script is
>>>> tesseract PDFIn001.tiff PDFOut001 -l eng pdf
>>>>
>>>> Then in the terminal,
>>>> Tesseract Open Source OCR Engine v4.1.1 with Leptonica
>>>>
>>>> The resulting PDF file is 2.4kB and appears to be empty or corrupted.
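
One way to tell a missing text layer apart from a corrupted file, sketched under the assumption that `pdftotext` from poppler-utils (the package that also ships `pdfseparate` and `pdfunite`) is installed. `has_visible_text` and `pdf_has_text` are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: succeed if stdin contains any non-whitespace character.
has_visible_text() {
    grep -q '[^[:space:]]'
}

# Hypothetical helper: succeed if the PDF carries any extractable text.
# "-" tells pdftotext to write the extracted text to stdout.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | has_visible_text
}

# Usage:
#   pdf_has_text PDFOut001.pdf && echo "text layer present" \
#       || echo "no extractable text"
```

This only says whether any text came out, not whether it is correct.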
>>>>
>>>> With the previous Debian release, I didn't need to install any
>>>> "training". Is that what I'm missing?
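
Whether the language data is installed can be checked with `tesseract --list-langs`. A small sketch; `lang_listed` is a hypothetical helper name. On Debian the English data normally comes from the tesseract-ocr-eng package.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the language code appears as a whole line
# in a `tesseract --list-langs` style listing read from stdin.
lang_listed() {
    grep -qxF -- "$1"
}

# Usage (needs tesseract on the PATH; 2>&1 because some versions print the
# listing on stderr):
#   tesseract --list-langs 2>&1 | lang_listed eng \
#       || echo "eng data not installed"
```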
>>>>
>>>> Thanks,
>>>> Rich
>>>>
>>>> I don't recall seeing the Leptonica line in the terminal output before
>>>> the upgrade.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/3a998a4a-6a6c-4062-84ca-8719adfb05ffn%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/3a998a4a-6a6c-4062-84ca-8719adfb05ffn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>  PDFIn001.tiff
>> <https://drive.google.com/file/d/1j58uFxxfBZNLn_sUcDxfq6qIcBnd350u/view?usp=drive_web>
>>
>
