Re: Detecting html file which is utf-16 encoded

Ken Krugler Tue, 17 Jun 2014 06:02:26 -0700

Hi George,

One thing to try - in tika-mimetypes.xml, the entry for text/html has:


    <magic priority="40">
      <match value="&lt;!DOCTYPE HTML" type="string" offset="0:64"/>
      <match value="&lt;!doctype html" type="string" offset="0:64"/>
      <match value="&lt;HEAD" type="string" offset="0:64"/>
      <match value="&lt;head" type="string" offset="0:64"/>

(and so on)

Try replicating the <match> entries, but with type="little16" or type="big16" 
(depending on your file's encoding).
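To see why the plain type="string" entries cannot hit on a UTF-16 file, here is a minimal JDK-only sketch. It does not use Tika at all; the class and method names (Utf16MagicDemo, matchesWithin) are made up for illustration, and it assumes a string magic is compared byte for byte against the start of the stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16MagicDemo {

    // Returns true if 'needle' occurs in 'haystack' at any start offset from 0 to maxOffset.
    // This mimics a byte-wise magic match with offset="0:64" (an assumption about how
    // such a match behaves, not Tika's actual code).
    public static boolean matchesWithin(byte[] haystack, byte[] needle, int maxOffset) {
        for (int start = 0; start <= maxOffset && start + needle.length <= haystack.length; start++) {
            byte[] window = Arrays.copyOfRange(haystack, start, start + needle.length);
            if (Arrays.equals(window, needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "<head><title>t</title></head>";

        // The same five characters as single bytes and as UTF-16LE code units.
        byte[] singleBytePattern = "<head".getBytes(StandardCharsets.US_ASCII); // 3C 68 65 61 64
        byte[] utf16lePattern = "<head".getBytes(StandardCharsets.UTF_16LE);    // 3C 00 68 00 65 00 ...

        byte[] docUtf8 = doc.getBytes(StandardCharsets.UTF_8);
        byte[] docUtf16le = doc.getBytes(StandardCharsets.UTF_16LE);

        System.out.println(matchesWithin(docUtf8, singleBytePattern, 64));    // true
        System.out.println(matchesWithin(docUtf16le, singleBytePattern, 64)); // false
        System.out.println(matchesWithin(docUtf16le, utf16lePattern, 64));    // true
    }
}
```

The interleaved 00 bytes are what break the match, so whatever magic type you end up using has to produce the two-byte-per-character pattern.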

You might also need to remove the BOM from the input stream.
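Something along these lines might do for the BOM step. It is only a sketch using plain java.io; BomStripper and skipUtf16Bom are hypothetical names, not part of Tika, and it only looks for the two UTF-16 BOMs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomStripper {

    // Returns a stream positioned after a leading UTF-16 BOM (FE FF or FF FE) if one
    // is present; otherwise the bytes that were read are pushed back untouched.
    // Note: FF FE is also the start of a UTF-32LE BOM, which this sketch ignores.
    public static InputStream skipUtf16Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int n = 0;
        // Read up to two bytes, tolerating short reads and early end of stream.
        while (n < 2) {
            int r = pushback.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bigEndianBom = n == 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF;
        boolean littleEndianBom = n == 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE;
        if (!bigEndianBom && !littleEndianBom && n > 0) {
            pushback.unread(head, 0, n); // no BOM: restore what we consumed
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // FF FE followed by '<' encoded as UTF-16LE (3C 00).
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        InputStream stripped = skipUtf16Bom(new ByteArrayInputStream(data));
        System.out.println(Integer.toHexString(stripped.read())); // 3c
    }
}
```

You would then hand the returned stream to the detector instead of the original one.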

Let me know if that works. Feels like a Jira issue is warranted in either 
case...

-- Ken





On Jun 17, 2014, at 1:03am, [email protected] wrote:

> I want to be able to detect when a file is HTML even when it is UTF-16 
> encoded. I can see from the default tika-mimetypes.xml that files with a 
> BOM are normally detected as text/plain, and that is what happens here. I 
> have tried creating my own versions of the html and text mime types in a 
> custom-mimetypes.xml. These successfully override the original ones, but 
> changing their priority does not force the UTF-16 files to be identified 
> as HTML. Even removing the BOM matches completely from the text mime type 
> in the custom-mimetypes.xml does not work. 
> 
> So I tried another approach: removing the BOM from the input stream before 
> detecting. However, the UTF-16 file is still not recognised as HTML, despite 
> the text having multiple matches. It seems that the detect method does not 
> know what encoding is being used for the file. Is there a way to tell a 
> detector what encoding a file is in to aid detection?
> 
> Thanks
> 
> George
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
