On Jul 11, 2014, at 8:01am, Avi Hayun <avrah...@gmail.com> wrote:

> Hi,
> 
> Scenario:
> 1. I use tika-core in my app
> 2. I use the following to detect the stream's media type:
> 
> byte[] bytes = IOUtils.toByteArray(new URL("http://www.amazon.com/sitemap_video.xml"));
> String contentType = new Tika().detect(bytes);
> 
> Obviously, when looking at the sitemap, it is of type application/xml
> 
> BUT
> 
> Tika returns a content type of text/plain instead of application/xml!?
> 
> Upon debugging, I get to the following class:
> CompositeDetector.detect(InputStream input, Metadata metadata)...
> 
> Which returns the wrong content type.
> 
> Does anyone have any idea how to solve it?


The returned content starts with

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.0">

Which is why it isn't detected as XML, given the current set of strings being 
used for matching in tika-mimetypes.xml

You could put the returned Content-Type header, which is text/xml for the 
above example, into the metadata, and then I think it would work.
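
A rough sketch of that idea, assuming the server's Content-Type header is 
copied into a Metadata object and passed to Tika.detect(InputStream, Metadata). 
The class name is made up for illustration, the URL is the one from the 
question with the line wrap removed, and I haven't verified what this prints 
for that URL:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// Hypothetical class name, for illustration only.
public class DetectWithContentTypeHint {
    public static void main(String[] args) throws Exception {
        // URL taken from the original question (assumed to be one unbroken string).
        URL url = new URL("http://www.amazon.com/sitemap_video.xml");
        URLConnection connection = url.openConnection();

        // Pass the server-reported type to Tika as a hint, if there is one.
        Metadata metadata = new Metadata();
        String headerType = connection.getContentType(); // may be null
        if (headerType != null) {
            metadata.set(Metadata.CONTENT_TYPE, headerType);
        }

        // Detection reads only the start of the stream; buffering lets Tika
        // mark and reset it.
        try (InputStream in = new BufferedInputStream(connection.getInputStream())) {
            String contentType = new Tika().detect(in, metadata);
            System.out.println(contentType);
        }
    }
}
```

The hint only supplements the byte-level match, so whether it changes the 
result depends on the type definitions in the Tika version being used.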

But we should also beef up XML detection, e.g. with a pattern like <blah xmlns="
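
To make that pattern concrete, here is a standalone sketch of the heuristic 
using plain java.util.regex. It is not how Tika does its matching (that is 
driven by tika-mimetypes.xml), and the class name, method name, and 1024-byte 
limit are assumptions picked for the example:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
public class XmlnsSniffer {
    // Optional leading whitespace and XML declaration, then an opening tag
    // that carries an xmlns="..." (or xmlns='...') attribute before its '>'.
    private static final Pattern ROOT_WITH_XMLNS = Pattern.compile(
            "\\A\\s*(?:<\\?xml[^>]*\\?>\\s*)?"
            + "<[A-Za-z_][\\w.:-]*\\s[^>]*?\\bxmlns\\s*=\\s*[\"']");

    // Only the first 1024 bytes are inspected (an arbitrary limit for this sketch).
    public static boolean looksLikeXml(byte[] prefix) {
        int length = Math.min(prefix.length, 1024);
        String head = new String(prefix, 0, length, StandardCharsets.UTF_8);
        return ROOT_WITH_XMLNS.matcher(head).find();
    }

    public static void main(String[] args) {
        byte[] sitemap = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\""
                .getBytes(StandardCharsets.UTF_8);
        byte[] text = "just some plain text".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeXml(sitemap)); // true
        System.out.println(looksLikeXml(text));    // false
    }
}
```

A check like this is deliberately loose. It only says the content starts with 
an element that declares a default namespace, not that the document is 
well-formed XML.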

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr