Carey Halton created TIKA-4047:
----------------------------------

             Summary: Various PDF Parsing errors
                 Key: TIKA-4047
                 URL: https://issues.apache.org/jira/browse/TIKA-4047
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.4.1
         Environment: Windows 11, using Tika server /tika/body API.
            Reporter: Carey Halton
         Attachments: ML100500495 error.txt, ML100500495.PDF,
                      ML100840685 error.txt, ML100840685.pdf,
                      ML22020A080 error.txt, ML22020A080.pdf

We are seeing various PDF parser errors for a few specific PDF files with Tika 
2.4.1. Could someone help us investigate whether these are bugs in the PDF 
parser or PDFBox that could be fixed so the files can be parsed (or tell us if 
they are already fixed in a later version), or whether these particular files 
are simply corrupted in a way that makes parsing impossible? I have attached 
the three PDF files, along with a txt file for each containing the exception 
message we are seeing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
