Please send the tesseract-relevant file (the TIFF) ;-). The first thing you always need to check is the tesseract input. The input of your script (the PDF) is not important at this stage.
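As a rough illustration of checking the tesseract input on its own, the sketch below looks at one page image and runs tesseract on it with text sent to the terminal instead of a PDF. The function names `tiff_for_pdf` and `check_tesseract_input` are made up for this example and are not part of the script in the thread; the `TIFFIn/` layout is taken from that script. It assumes `identify` (ImageMagick) and `tesseract` are on the PATH and simply skips whichever one is missing.

```shell
#!/bin/sh
# Sketch only: inspect the image that tesseract actually receives.
# tiff_for_pdf and check_tesseract_input are hypothetical helper names.

# Map a split page such as PDFIn001.pdf to the TIFF path used by the script.
tiff_for_pdf() {
    printf 'TIFFIn/%s.tiff\n' "$(basename "$1" .pdf)"
}

# Report basic facts about one image and show the first lines of OCR text.
check_tesseract_input() {
    img=$1
    if [ ! -s "$img" ]; then
        echo "missing or empty: $img" >&2
        return 1
    fi
    # ImageMagick: one summary line per image in the file.
    if command -v identify >/dev/null 2>&1; then
        identify "$img"
    fi
    # Using "stdout" as the output base prints recognized text to the terminal.
    if command -v tesseract >/dev/null 2>&1; then
        tesseract "$img" stdout -l eng | head -n 20
    fi
}
```

For the page discussed here it could be called as `check_tesseract_input "$(tiff_for_pdf PDFIn001.pdf)"`. If the text printed to the terminal is already empty or garbled, the problem is in the TIFF or in the OCR step, not in the PDF output.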
Zdenko

On Mon, Jan 24, 2022 at 4:44, Rich M <ramac...@gmail.com> wrote:

> Please provide details for reproducing problem: input image, output pdf,
> tesseract details (tesseract -v)
> tesseract-ocr: Installed: 4.1.1-2.1
> convert, provided by imagemagick: Installed: 8:6.9.11.60+dfsg-1.3 (It
> could also be an issue with convert, but I've converted the PDF with GIMP,
> but get the same results.)
> My OS is Linux, Debian Bullseye (stable)
>
> I execute the script by
> $ ./PDF2SearchablePDF.sh Sh ShockDataMeasurementsLessonsLearned.pdf
>
> The source PDF
> ShockDataMeasurementsLessonsLearned.pdf
>
> Split PDF pg 1
> PDFIn001.pdf
>
> Split PDF pg1 converted to .tiff with convert (imagemagick)
> PDFIn001.tiff
>
> Pg 1 after processing with tesseract
> PDFIn001Searchable.pdf
>
> Bash script:
> ###
> #!/bin/bash
> SourcePDF=$1
> mkdir PDFIn PDFOut TIFFIn
> pdfseparate $SourcePDF PDFIn/PDFIn%03d.pdf
> #pdfseparate InputDoc02.pdf PDFIn/PDFIn%03d.pdf
> echo $1
> cd PDFIn
> ls PDFIn*.pdf >../list.txt
> cd ..
>
> for FIL in $(<list.txt)
> do
> convert -density 300 PDFIn/${FIL} TIFFIn/${FIL/.pdf/}.tiff
> #gs -q -dNOPAUSE -r300x300 -sDEVICE=tiff32nc -sOutputFile=TIFFIn/${FIL/.pdf/}.tiff PDFIn/${FIL} -c quit
> tesseract TIFFIn/${FIL/.pdf/}.tiff PDFOut/${FIL/.pdf/} -l eng pdf
> done
>
> pdfunite PDFOut/PDFIn*.pdf OutputPDF.pdf
> ###
>
>
> On Fri, Jan 21, 2022 at 5:48 PM Rich M <ramac...@gmail.com> wrote:
>
>> Sure. I'll need to find a test file that doesn't contain private
>> information.
>>
>> Before seeing your response now, I ran my script on a file that I had
>> converted to a searchable PDF last year and the output file was very poor.
>> Out of curiosity, I changed the converted image from .tiff to .png and the
>> result was very good. I'm wondering if it's something with the convert
>> package.
>>
>> Rich
>>
>> On Wednesday, January 19, 2022 at 11:18:20 PM UTC-7 zdenop wrote:
>>
>>> Please provide details for reproducing problem: input image, output pdf,
>>> tesseract details (tesseract -v)
>>>
>>> Zdenko
>>>
>>>
>>> On Thu, Jan 20, 2022 at 5:03, Rich M <rama...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm fairly new to tesseract and had written a bash script in Debian
>>>> Buster (previous release) using tesseract 3 which worked very well. I've
>>>> since upgraded my OS to the next stable release, Bullseye, which also
>>>> upgraded tesseract to V4. After the upgrade, tesseract isn't "working" any
>>>> longer. I need help troubleshooting the issue.
>>>>
>>>> Basically the important line of the script is
>>>> tesseract PDFIn001.tiff PDFOut001 -l eng pdf
>>>>
>>>> Then in the terminal,
>>>> Tesseract Open Source OCR Engine v4.1.1 with Leptonica
>>>>
>>>> The resulting PDF file is 2.4kB and appears to be empty or corrupted.
>>>>
>>>> With the previous Debian release, I didn't need to install any
>>>> "training". Is that what I'm missing?
>>>>
>>>> Thanks,
>>>> Rich
>>>>
>>>> I don't recall seeing the response in the terminal about Leptonica.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/3a998a4a-6a6c-4062-84ca-8719adfb05ffn%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/3a998a4a-6a6c-4062-84ca-8719adfb05ffn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>> PDFIn001.tiff
>> <https://drive.google.com/file/d/1j58uFxxfBZNLn_sUcDxfq6qIcBnd350u/view?usp=drive_web>
>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zRu%2Bu52PiwvD1qHbx20JMdz-h-igK8UfrVm14qJawQug%40mail.gmail.com.