Also, I should mention that I think I have seen this on some of my VMs that have not been updated and still run 4.1.1 (also built from source), with the same document. Sorry for the spam, I wish I could edit. I think it may be tied to leptonica specifically, or to something else in the environment. The same version of Tesseract was working before I updated Ubuntu to 20.04, which leads me to think the cause is some kind of dependency.
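For anyone trying to narrow this down: the only difference noted further down the thread between the failing TIFF and the edited one is the colour depth (16 BPP in the original), so it may help to dump the relevant TIFF tags of both files side by side. Below is a minimal sketch that does this with the Python standard library only. The function name `read_tiff_tags` is made up for illustration, it reads only the first IFD of a classic (non-BigTIFF) file, and nothing here confirms that bit depth is the actual cause of the "selectDefaultPdfEncoding" error.

```python
import struct

# Tag IDs as defined in the TIFF 6.0 specification.
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",  # 1 = none, 5 = LZW
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}


def read_tiff_tags(path):
    """Return selected tags from the first IFD of a classic TIFF file.

    Hypothetical helper for illustration; BigTIFF and later pages are ignored.
    """
    with open(path, "rb") as f:
        data = f.read()

    # The first two bytes give the byte order: "II" little, "MM" big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    result = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack(order + "HHI", data[entry:entry + 8])
        if tag not in TAGS or ftype not in (3, 4):  # 3 = SHORT, 4 = LONG
            continue
        fmt, size = ("H", 2) if ftype == 3 else ("I", 4)
        if n * size <= 4:
            start = entry + 8  # values fit inside the entry itself
        else:
            # Otherwise the value field holds an offset to the values.
            (start,) = struct.unpack(order + "I", data[entry + 8:entry + 12])
        values = struct.unpack(order + fmt * n, data[start:start + n * size])
        result[TAGS[tag]] = values[0] if n == 1 else list(values)
    return result
```

Calling `read_tiff_tags()` on the failing ocrIn_1.tif and on a TIFF of the same page produced by another application, then comparing the two dictionaries, should show whether BitsPerSample, SamplesPerPixel or PhotometricInterpretation differ. Note that an image viewer's "16 BPP" may mean bits per pixel rather than bits per sample, so the raw tag values are worth checking directly.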
On Tuesday, June 7, 2022 at 9:02:38 AM UTC-5 Lucas L. wrote:

> Sure, I will write that up. Thanks for helping, zdenop. Would you happen
> to know which is the most recent version that does not exhibit this issue
> so I can switch to that?
>
> On Tuesday, June 7, 2022 at 12:27:08 AM UTC-5 zdenop wrote:
>
>> Can you please create an issue at
>> https://github.com/tesseract-ocr/tesseract/issues?
>>
>> I confirm a problem with recent tesseract and leptonica, so it should be
>> fixed for the next release...
>>
>> Zdenko
>>
>> On Mon, 6 Jun 2022 at 22:47, Lucas L. <infinit...@gmail.com> wrote:
>>
>>> OK, I have a sample document to share now. I've pulled out one page from
>>> a document exhibiting this error that does not have any identifying
>>> information on it.
>>> In the process of doing this, I noticed that the same original document
>>> (they usually come in as PDFs) split into TIFFs by other applications
>>> (i.e., FoxIt) doesn't seem to run into issues. The TIFFs are not invalid
>>> when I look at them on my personal PC. However, when the document goes
>>> through our pipeline and is split into TIFFs in preparation for being
>>> OCR'd, Tesseract throws the "selectDefaultPdfEncoding" error mentioned
>>> above. Unfortunately, unless I know exactly what about this document is
>>> causing this, I won't be able to address it in our pipeline.
>>>
>>> On Monday, June 6, 2022 at 12:00:45 PM UTC-5 Lucas L. wrote:
>>>
>>>> No luck, sadly. When I edited the image in Irfanview to block out the
>>>> sensitive parts and tried to OCR it again, the error didn't occur. I'm
>>>> not sure what changed in the .tiff image file. Any ideas on what kind of
>>>> image metadata can possibly cause this "selectDefaultPdfEncoding" error?
>>>>
>>>> The only difference I can notice between the two files is that the
>>>> original has 16 BPP color depth. They both have LZW compression.
>>>>
>>>> On Monday, June 6, 2022 at 11:47:31 AM UTC-5 Lucas L. wrote:
>>>>
>>>>> Oh yeah, here's the output of tesseract -v:
>>>>>
>>>>> tesseract 5.1.0
>>>>>  leptonica-1.79.0
>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
>>>>>  Found AVX2
>>>>>  Found AVX
>>>>>  Found FMA
>>>>>  Found SSE4.1
>>>>>  Found OpenMP 201511
>>>>>  Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
>>>>>
>>>>> On Monday, June 6, 2022 at 11:46:30 AM UTC-5 Lucas L. wrote:
>>>>>
>>>>>> It seems to be specific to the document in question. However, I'm
>>>>>> afraid I can't post the document because it has sensitive information
>>>>>> on it. I guess I can try to scrub the info using an image editing tool
>>>>>> and see if the error still occurs.
>>>>>>
>>>>>> On Monday, June 6, 2022 at 11:21:25 AM UTC-5 zdenop wrote:
>>>>>>
>>>>>>> Can you please share ocrIn_1.tif + info which tessdata version you
>>>>>>> use?
>>>>>>> + output of 'tesseract -v'
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>> On Mon, 6 Jun 2022 at 17:53, Lucas L. <infinit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 in the Ubuntu
>>>>>>>> 20.04 VMs we use to OCR documents. Both versions were built from
>>>>>>>> source on that VM. 4.1.1 worked, but 5.1 throws an error that I
>>>>>>>> can't seem to find anywhere else online:
>>>>>>>>
>>>>>>>> sudo -u userx tesseract --loglevel ALL --oem 1 -l eng /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif /opt/.../pdfprocessor/test/test pdf
>>>>>>>> Error in selectDefaultPdfEncoding: type selection failure
>>>>>>>> Error during processing.
>>>>>>>>
>>>>>>>> I have tried the training data from both "tessdata" and
>>>>>>>> "tessdata_best" and got the same error. Any help would be appreciated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Lucas LeBlanc
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com.