Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

via GitHub Tue, 09 Jul 2024 16:37:15 -0700


kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2218995629


   Thank you, that worked. After working through a few other hiccups, I have 
now run into a new issue.
   I am trying to parse a PPT file.
   
   ```
   import java.io.InputStream;

   import org.apache.tika.io.TikaInputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.parser.Parser;
   import org.apache.tika.parser.ocr.TesseractOCRConfig;
   import org.apache.tika.sax.BodyContentHandler;
   import org.apache.tika.sax.OfflineContentHandler;

       // Disable OCR for this parse
       TesseractOCRConfig config = new TesseractOCRConfig();
       config.setSkipOcr(true);
       ParseContext context = new ParseContext();
       context.set(TesseractOCRConfig.class, config);

       // (`writer` and `input` are defined outside this snippet)
       Parser parser = new AutoDetectParser();
       Metadata metadata = new Metadata();
       OfflineContentHandler handler =
           new OfflineContentHandler(new BodyContentHandler(writer));

       // Note: here we have to use TikaInputStream.get, otherwise certain
       // content types (e.g. 2007 pptx) might not be correctly detected by
       // the parser
       try (InputStream original = TikaInputStream.get(input, metadata)) {
         parser.parse(original, handler, metadata, context);
         // ==> The call above crashes with:
         //   Execution error (NoSuchMethodError) at org.apache.poi.util.IOUtils/toByteArray (IOUtils.java:241).
         //   'org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream$Builder org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.builder()'
       }
   ```
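
A `NoSuchMethodError` of this shape generally means the class found at runtime is a different version from the one the caller (here POI's `IOUtils`) was compiled against. One possible cause, which is an assumption and not something confirmed in this thread, is an older commons-io jar on the classpath that predates `UnsynchronizedByteArrayOutputStream.builder()`. The sketch below is a hypothetical diagnostic helper (the class name `ClasspathCheck` and its methods are made up for illustration, not part of Tika or POI). It prints which jar the class was loaded from and whether the method exists, using only reflection:

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Tika, POI, or Commons IO.
final class ClasspathCheck {

    /** Returns where a class was loaded from, or a placeholder when the JVM does not report it. */
    public static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    /** Returns true if the named class is loadable and has a public no-arg method with that name. */
    public static boolean hasPublicNoArgMethod(String className, String methodName) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            cls.getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class and method names taken from the error message above
        String name = "org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream";
        try {
            Class<?> cls = Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            System.out.println("Loaded from: " + locationOf(cls));
        } catch (ClassNotFoundException | LinkageError e) {
            System.out.println(name + " is not on the classpath");
        }
        System.out.println("builder() present: " + hasPublicNoArgMethod(name, "builder"));
    }
}
```

Run with the same classpath as the failing application. If `builder()` is reported as missing, the printed jar location shows which commons-io artifact is being picked up.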


