Tien Nguyen Manh created TIKA-1365: -------------------------------------- Summary: Incorrectly MimeType detection for Apache Lucene web site Key: TIKA-1365 URL: https://issues.apache.org/jira/browse/TIKA-1365 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.5 Reporter: Tien Nguyen Manh
Tika 1.5 detect many page from apache lucene web site as xml, for example this page http://lucene.apache.org/core/discussion.html Here are error log:, it failed to parse becuase it use xml parser Apache Tika was unable to parse the document at http://lucene.apache.org/core/discussion.html. The full exception stack trace is included below: org.apache.tika.exception.TikaException: XML parse error at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320) at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293) at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247) at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018) -- This message was sent by Atlassian JIRA (v6.2#6252)