Hi George,
One thing to try - in tika-mimetypes.xml, the entry for text/html has:
  <magic priority="40">
    <match value="&lt;!DOCTYPE HTML" type="string" offset="0:64"/>
    <match value="&lt;!doctype html" type="string" offset="0:64"/>
    <match value="&lt;HEAD" type="string" offset="0:64"/>
    <match value="&lt;head" type="string" offset="0:64"/>
(and so on)
Try replicating the <match> entries, but with type="little16" or type="big16"
(depending on your file's encoding).
You might also need to remove the BOM from the input stream.
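
A minimal sketch of the BOM removal, assuming plain java.io only. The class and method names (BomSkipper, skipUtf16Bom) are made up for illustration and are not part of Tika:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not a Tika class.
final class BomSkipper {

    // Returns a stream positioned just past a leading UTF-16 BOM
    // (FE FF = big-endian, FF FE = little-endian). If no BOM is
    // present, the bytes that were peeked at are pushed back so the
    // caller sees the stream unchanged.
    static InputStream skipUtf16Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];

        // Read up to two bytes; a single read() call may return fewer.
        int read = 0;
        while (read < 2) {
            int r = pushback.read(head, read, 2 - read);
            if (r < 0) {
                break; // end of stream before two bytes were available
            }
            read += r;
        }

        boolean bigEndianBom = read == 2
                && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF;
        boolean littleEndianBom = read == 2
                && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE;

        if (!bigEndianBom && !littleEndianBom && read > 0) {
            pushback.unread(head, 0, read); // not a BOM, restore the bytes
        }
        return pushback;
    }
}
```

Note that this only drops the two BOM bytes. Everything after them is still UTF-16, so the type="string" matches above still would not line up with the data.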
Let me know if that works. Feels like a Jira issue is warranted in either
case...
-- Ken
On Jun 17, 2014, at 1:03am, [email protected] wrote:
> I want to be able to detect when a file is html even when it is utf-16
> encoded. I can see from the default tika-mimetypes.xml that normally files
> with a BOM will be detected as text/plain, which is the case. I have tried
> creating my own versions of the html and text mime types in a
> custom-mimetypes.xml and these successfully overwrite the original ones but
> changing the priority of these does not force the utf-16 files to be
> identified as html. Even removing the BOM matches completely from the text
> mimetype in the custom-mimetypes.xml does not work.
>
> So I tried another approach by removing the BOM from the inputstream before
> detecting. However the utf-16 file is still not recognised as html, despite
> the text having multiple matches. It seems that the detect method does not
> realise what encoding is being used for the file. Is there a way to tell a
> detector what encoding a file is in to aid detection?
>
> Thanks
>
> George
>
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr