On Jul 11, 2014, at 8:01am, Avi Hayun <avrah...@gmail.com> wrote:

> Hi,
>
> Scenario:
> 1. I use tika-core in my app
> 2. I use the following to detect the stream's media type:
>
> byte[] bytes = IOUtils.toByteArray(new URL("http://www.amazon.com/sitemap_video.xml"));
> String contentType = new Tika().detect(bytes);
>
> Obviously, when looking at the sitemap, it is of type application/xml
>
> BUT
>
> Tika returns a content type of text/plain instead of application/xml!?
>
> Upon debugging, I get to the following method:
> CompositeDetector.detect(InputStream input, Metadata metadata)...
>
> which returns the wrong content type.
>
> Does anyone have any idea how to solve it?

The returned content starts with

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.0">

which is why it isn't detected as XML, given the current set of strings used for matching in tika-mimetypes.xml.

You could put the returned Content-Type header into the metadata (it is text/xml for the above example), and then I think it would work.

But we should also beef up XML detection, e.g. with a pattern like <blah xmlns="

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
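
A rough sketch of the two suggestions above (matching a root element that carries an xmlns attribute, and falling back on the Content-Type header as a hint), written in plain Java rather than as Tika code. The class name XmlSniffer, the methods looksLikeXml and guessType, the regular expression, and the text/plain fallback are all made up for illustration; Tika's real matching is driven by tika-mimetypes.xml and may well end up looking different.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: illustrates the "<blah xmlns=" idea.
final class XmlSniffer {

    // Optional BOM and whitespace, an optional XML declaration, then a root
    // element that carries an xmlns (or xmlns:prefix) attribute.
    private static final Pattern ROOT_WITH_XMLNS = Pattern.compile(
            "\\uFEFF?\\s*(?:<\\?xml[^>]*\\?>\\s*)?"
            + "<[A-Za-z_][\\w.:-]*\\s+(?:[^>]*?\\s)?xmlns(?::[\\w.-]+)?\\s*=\\s*[\"']");

    // Looks only at the first 1024 bytes, decoded as UTF-8 (an assumption).
    static boolean looksLikeXml(byte[] prefix) {
        int len = Math.min(prefix.length, 1024);
        String head = new String(prefix, 0, len, StandardCharsets.UTF_8);
        return ROOT_WITH_XMLNS.matcher(head).lookingAt();
    }

    // Content sniffing first, then the server's Content-Type header as a
    // hint, then text/plain as an assumed last resort.
    static String guessType(byte[] prefix, String contentTypeHeader) {
        if (looksLikeXml(prefix)) {
            return "application/xml";
        }
        if (contentTypeHeader != null && !contentTypeHeader.trim().isEmpty()) {
            // Drop parameters such as "; charset=UTF-8".
            return contentTypeHeader.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        }
        return "text/plain";
    }
}
```

With this sketch, the urlset line quoted above is recognized from its content alone, and a body that gives no XML signal is typed from the header when one is supplied.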