[ https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366176#comment-14366176 ]
ASF GitHub Bot commented on TIKA-1365:
--------------------------------------

GitHub user mkr opened a pull request:

    https://github.com/apache/tika/pull/35

    TIKA-1365: Lower priority for XML starting with comment

    TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mkr/tika TIKA-1365

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/35.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #35

----
commit f9655d44978af188018bee81b2d554770ddcd7f9
Author: Matthias Krueger <m...@mkr.io>
Date:   2015-03-17T21:45:36Z

    TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html

----

> Incorrectly MimeType detection for Apache Lucene web site
> ---------------------------------------------------------
>
>                 Key: TIKA-1365
>                 URL: https://issues.apache.org/jira/browse/TIKA-1365
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>            Reporter: Tien Nguyen Manh
>         Attachments: discussion.html
>
>
> Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example
> this page:
> http://lucene.apache.org/core/discussion.html
> Here is the error log; parsing failed because the XML parser was used:
> Apache Tika was unable to parse the document
> at http://lucene.apache.org/core/discussion.html.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: XML parse error
>       at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>       at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
>       at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
>       at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
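
For readers unfamiliar with priority-based magic detection, the idea behind the pull request above can be sketched roughly as follows. This is not Tika's code: the class name, the rules, and the priority numbers are all hypothetical, chosen only to show how giving the "starts with a comment" XML rule a lower priority than an HTML rule lets a page that begins with a comment be reported as text/html.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Illustrative sketch only; it does not use or mirror Tika's actual classes.
final class MagicPrioritySketch {

    /** A hypothetical magic rule: MIME type, priority, and a test on the file head. */
    static final class Rule {
        final String mimeType;
        final int priority;
        final Predicate<String> matcher;

        Rule(String mimeType, int priority, Predicate<String> matcher) {
            this.mimeType = mimeType;
            this.priority = priority;
            this.matcher = matcher;
        }
    }

    // Priorities are assumptions made for this example, not Tika's real values.
    static final List<Rule> RULES = Arrays.asList(
        // An explicit XML declaration is strong evidence of XML.
        new Rule("application/xml", 50, head -> head.startsWith("<?xml")),
        // An <html tag anywhere in the head suggests HTML.
        new Rule("text/html", 40,
                 head -> head.toLowerCase(Locale.ROOT).contains("<html")),
        // A leading comment is weak evidence of XML, so it ranks below HTML.
        new Rule("application/xml", 30, head -> head.startsWith("<!--"))
    );

    /** Returns the MIME type of the highest-priority rule that matches. */
    static String detect(byte[] data) {
        int n = Math.min(data.length, 8192);
        // ISO-8859-1 maps each byte to one char, so no byte is lost in decoding.
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        Rule best = null;
        for (Rule rule : RULES) {
            if (rule.matcher.test(head)
                    && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.mimeType;
    }

    public static void main(String[] args) {
        String html = "<!-- license header --><html><body></body></html>";
        String xml = "<!-- note --><root/>";
        System.out.println(detect(html.getBytes(StandardCharsets.UTF_8)));
        System.out.println(detect(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In this sketch, if the comment rule had a priority above the HTML rule, the first sample would be reported as application/xml, which is the symptom described in the issue.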