Also, I should mention that I think I have seen this error on some of my 
not-yet-updated VMs running 4.1.1, also built from source, on the same 
document. Sorry for the spam; I wish I could edit my earlier post. I think it 
may be tied to Leptonica specifically, or to something else in the 
environment. The same version of Tesseract was working before I updated 
Ubuntu to version 20.04, which leads me to think the cause is some kind of 
dependency.
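
For anyone who wants to compare the failing TIFF with a re-saved copy, here 
is a minimal Python sketch that dumps a few baseline tags (bit depth, 
compression, samples per pixel) from the first image directory of a file. It 
is only a sketch under assumptions: the helper name read_tiff_tags is made up 
for illustration, the tag numbers come from the TIFF 6.0 specification, and 
it handles classic TIFF only (not BigTIFF). It does not explain the 
selectDefaultPdfEncoding error by itself; it only makes differences such as 
the 16 BPP depth mentioned further down the thread easier to spot.

```python
import struct

# Baseline tag numbers from the TIFF 6.0 specification.
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}
# TIFF field types handled here: 1 = BYTE, 3 = SHORT, 4 = LONG.
TYPE_FMT = {1: "B", 3: "H", 4: "I"}
TYPE_SIZE = {1: 1, 3: 2, 4: 4}


def read_tiff_tags(data: bytes) -> dict:
    """Return selected tags from the first IFD of a classic TIFF.

    Hypothetical helper for illustration; not part of Tesseract or Leptonica.
    """
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("unsupported TIFF variant (BigTIFF?)")

    (n_entries,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(n_entries):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, typ, count = struct.unpack_from(order + "HHI", data, entry)
        if tag not in TAGS or typ not in TYPE_FMT:
            continue
        # Values of up to 4 bytes are stored inline in the entry;
        # larger ones are stored elsewhere and the field holds an offset.
        if TYPE_SIZE[typ] * count <= 4:
            pos = entry + 8
        else:
            (pos,) = struct.unpack_from(order + "I", data, entry + 8)
        values = struct.unpack_from(f"{order}{count}{TYPE_FMT[typ]}", data, pos)
        tags[TAGS[tag]] = values[0] if count == 1 else list(values)
    return tags
```

Reading both versions of a page as bytes (open the file in "rb" mode) and 
printing read_tiff_tags(...) for each shows the tags side by side. In the 
TIFF 6.0 specification a Compression value of 5 means LZW.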

On Tuesday, June 7, 2022 at 9:02:38 AM UTC-5 Lucas L. wrote:

> Sure, I will write that up. Thanks for helping, zdenop. Would you happen 
> to know which is the most recent version that does not exhibit this issue 
> so I can switch to that?
>
> On Tuesday, June 7, 2022 at 12:27:08 AM UTC-5 zdenop wrote:
>
>> Can you please create an issue at 
>> https://github.com/tesseract-ocr/tesseract/issues?
>>
>> I can confirm there is a problem with recent Tesseract and Leptonica, so 
>> it should be fixed for the next release...
>>
>> Zdenko
>>
>>
>> On Mon, June 6, 2022 at 22:47, Lucas L. <infinit...@gmail.com> wrote:
>>
>>> OK, I have a sample document to share now. I've pulled out one page from 
>>> a document exhibiting this error that does not have any identifying 
>>> information on it.
>>> In the process, I noticed that when the same original document (they 
>>> usually come in as PDFs) is split into TIFFs by other applications 
>>> (e.g., Foxit), the results don't seem to run into issues. The TIFFs do 
>>> not appear invalid when I look at them on my personal PC. However, when 
>>> the document goes through our pipeline and is split into TIFFs in 
>>> preparation for being OCR'd, Tesseract throws the 
>>> "selectDefaultPdfEncoding" error mentioned earlier. Unfortunately, unless 
>>> I know exactly what about this document is causing this, I won't be able 
>>> to address it in our pipeline.
>>>
>>> On Monday, June 6, 2022 at 12:00:45 PM UTC-5 Lucas L. wrote:
>>>
>>>> No luck, sadly. When I edited the image in IrfanView to block out the 
>>>> sensitive parts and tried to OCR it again, the error didn't occur. I'm 
>>>> not sure what changed in the .tiff image file. Any ideas on what kind of 
>>>> image metadata could cause this "selectDefaultPdfEncoding" error?
>>>>
>>>> The only difference I can see between the two files is that the 
>>>> original has a 16 BPP color depth. Both use LZW compression.
>>>>
>>>> On Monday, June 6, 2022 at 11:47:31 AM UTC-5 Lucas L. wrote:
>>>>
>>>>> Oh yeah, here's the output of tesseract -v:
>>>>>
>>>>> tesseract 5.1.0
>>>>>  leptonica-1.79.0
>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
>>>>>  Found AVX2
>>>>>  Found AVX
>>>>>  Found FMA
>>>>>  Found SSE4.1
>>>>>  Found OpenMP 201511
>>>>>  Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
>>>>>
>>>>> On Monday, June 6, 2022 at 11:46:30 AM UTC-5 Lucas L. wrote:
>>>>>
>>>>>> It seems to be specific to the document in question. However, I'm 
>>>>>> afraid I can't post the document because it has sensitive information 
>>>>>> on it. I guess I can try to scrub the info using an image editing tool 
>>>>>> and see if the error still occurs.
>>>>>>
>>>>>> On Monday, June 6, 2022 at 11:21:25 AM UTC-5 zdenop wrote:
>>>>>>
>>>>>>> Can you please share ocrIn_1.tif, along with which tessdata version 
>>>>>>> you use and the output of 'tesseract -v'?
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Mon, June 6, 2022 at 17:53, Lucas L. <infinit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 on the Ubuntu 
>>>>>>>> 20.04 VMs we use to OCR documents. Both versions were built from 
>>>>>>>> source on that VM. 4.1.1 worked, but 5.1 throws an error that I 
>>>>>>>> can't seem to find anywhere else online:
>>>>>>>>
>>>>>>>> sudo -u userx tesseract --loglevel ALL --oem 1 -l eng /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif /opt/.../pdfprocessor/test/test pdf
>>>>>>>> Error in selectDefaultPdfEncoding: type selection failure
>>>>>>>> Error during processing.
>>>>>>>>
>>>>>>>> I have tried the training data from both "tessdata" and 
>>>>>>>> "tessdata_best" and got the same error. Any help would be appreciated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Lucas LeBlanc
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>

