[ https://issues.apache.org/jira/browse/TIKA-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946321#comment-17946321 ]
Tilman Hausherr edited comment on TIKA-4398 at 4/22/25 8:02 AM:
----------------------------------------------------------------

Please check the test code.
{code:java}
// Parsers to exclude from the default parser set.
List<Class<? extends Parser>> excludeParsers = Arrays.asList(
        MP4Parser.class,
        AudioParser.class,
        Mp3Parser.class,
        MidiParser.class,
        FLVParser.class,
        CompressorParser.class,
        RarParser.class
);
// Build a DefaultParser without the excluded parsers and wrap it for auto-detection.
TikaConfig config = TikaConfig.getDefaultConfig();
Parser myParser = new DefaultParser(config.getMediaTypeRegistry(), new ServiceLoader(), excludeParsers);
Parser parser = new AutoDetectParser(config.getDetector(), myParser);
try {
    ContentHandler contentHandler = new BodyContentHandler();
    Metadata meta = new Metadata();
    ParseContext context = new ParseContext();
    // Register the same parser for embedded documents.
    context.set(Parser.class, parser);
    InputStream stream = new FileInputStream("output/01.docx");
    parser.parse(stream, contentHandler, meta, context);
    System.out.println(contentHandler.toString());
    System.out.println(meta.toString());
} catch (Throwable e) {
    e.printStackTrace();
}
{code}
It detects Content-Type=application/zip.

Meta info:
{noformat}
X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By=org.apache.tika.parser.pkg.PackageParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.pkg.PackageParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.xml.DcXMLParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.image.ImageParser
X-TIKA:detectedEncoding=ISO-8859-1
X-TIKA:encodingDetector=UniversalEncodingDetector
Content-Type=application/zip
{noformat}
My original code limited the embedded media parsers, and with it I can't get the xmlParser.


was (Author: JIRAUSER281021):
please check the test code.

List<Class<? extends Parser>> excludeParsers = Arrays.asList(
        MP4Parser.class,
        AudioParser.class,
        Mp3Parser.class,
        MidiParser.class,
        FLVParser.class,
        CompressorParser.class,
        RarParser.class
);
TikaConfig config = TikaConfig.getDefaultConfig();
Parser myParser = new DefaultParser(config.getMediaTypeRegistry(), new ServiceLoader(), excludeParsers);
Parser parser = new AutoDetectParser(config.getDetector(), myParser);
try {
    ContentHandler contentHandler = new BodyContentHandler();
    Metadata meta = new Metadata();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    InputStream stream = new FileInputStream("output/01.docx");
    parser.parse(stream, contentHandler, meta, context);
    System.out.println(contentHandler.toString());
    System.out.println(meta.toString());
} catch (Throwable e) {
    e.printStackTrace();
}

It detect the content-type=application/zip
```
meta info:
X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By=org.apache.tika.parser.pkg.PackageParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.pkg.PackageParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.xml.DcXMLParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.image.ImageParser
X-TIKA:detectedEncoding=ISO-8859-1
X-TIKA:encodingDetector=UniversalEncodingDetector
Content-Type=application/zip
```
my original code limited the embed media, can't get the xmlParser

> When extracting a docx file with Tika 3.1.0, the package parser was detected instead of the OOXML parser
> --------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4398
>                 URL: https://issues.apache.org/jira/browse/TIKA-4398
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 3.1.0
>         Environment: java17
>            Reporter: mannixli
>            Priority: Major
>         Attachments: 01.docx, image-2025-04-16-20-46-07-228.png, image-2025-04-22-11-26-09-936.png, image-2025-04-22-11-27-33-655.png, image-2025-04-22-11-37-15-401.png
>
>
> 3.0.0 detected ooxml parser



--
This message was sent by Atlassian Jira
(v8.20.10#820010)