Tien Nguyen Manh created TIKA-1365:
--------------------------------------

             Summary: Incorrect MimeType detection for Apache Lucene web site
                 Key: TIKA-1365
                 URL: https://issues.apache.org/jira/browse/TIKA-1365
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.5
            Reporter: Tien Nguyen Manh


Tika 1.5 detects many pages from the Apache Lucene web site as XML, for
example this page:
http://lucene.apache.org/core/discussion.html

Here is the error log. Parsing failed because Tika used the XML parser:

Apache Tika was unable to parse the document
at http://lucene.apache.org/core/discussion.html.

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: XML parse error
        at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
        at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
        at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)
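
The report does not say why the detector chose XML for this page. As a purely
illustrative sketch of a caller-side guard (not Tika's detection logic, and not
a proposed fix), one could re-check the first bytes of the document whenever
the detected type is XML and fall back to HTML if an HTML doctype or root
element is present. The class HtmlSniffer and its methods looksLikeHtml and
refine are hypothetical names, and mapping the result to text/html (rather
than application/xhtml+xml) is a simplifying assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class HtmlSniffer {

    // Returns true if the document prefix contains an HTML doctype
    // or an <html root element (case-insensitive).
    static boolean looksLikeHtml(byte[] prefix) {
        // ISO-8859-1 maps every byte to a char, so decoding never fails.
        String head = new String(prefix, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        return head.contains("<!doctype html") || head.contains("<html");
    }

    // If the detected type is generic XML but the prefix looks like HTML,
    // return text/html; otherwise return the detected type unchanged.
    static String refine(String detectedType, byte[] prefix) {
        boolean xml = "application/xml".equals(detectedType)
                || "text/xml".equals(detectedType);
        if (xml && looksLikeHtml(prefix)) {
            return "text/html";
        }
        return detectedType;
    }
}
```

A caller would pass the type string reported by the detector together with
the first few hundred bytes of the stream, and use the returned type to pick
the parser.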



--
This message was sent by Atlassian JIRA
(v6.2#6252)
