On Aug 29, 2012, at 8:55am, chraj007 wrote:

> Hello,
> Im trying to parse a file whose content type is UTF-16. Im unable to
> parse the document using the following code. Please Help me.
>
> ContentHandler textHandler = new BodyContentHandler();
> TeeContentHandler teeHandler = new
> TeeContentHandler(textHandler);
> parser.parse(input, teeHandler, metadata, context);

Note that you don't need to use a TeeContentHandler here.

> String tt = textHandler.toString();
> //to print the text
>
> byte[] converttoBytes = tt.getBytes("UTF-16");
> String string = new String(converttoBytes, "utf-8");

The above code won't do what I think you're hoping it will do. The call to getBytes("UTF-16") returns the contents of the tt string as bytes encoded using UTF-16. The second call then builds a string from those bytes as if they were character data encoded using UTF-8 (which obviously isn't true).

> System.out.println(string);
>
> but its printing along with all html tags.

I'm unclear on what you mean by this. But as Jukka noted in his response, the issue is that you have a document which is encoded as UTF-8, but the HTML has

<meta http-equiv="Content-Type" content="text/html; charset=UTF-16">

Currently Tika treats this meta tag charset as the truth. See https://issues.apache.org/jira/browse/TIKA-539 for a discussion on this issue.

Regards,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
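
P.S. To make the getBytes()/new String() point above concrete, here is a minimal sketch that uses only the JDK and no Tika. The class name EncodingRoundTrip, the helper recode() and the sample text "Hello" are made up for illustration. It encodes a string as UTF-16 and then decodes the bytes once with the wrong charset and once with the right one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Hypothetical helper: encode text with one charset, then decode the
    // resulting bytes with another. Only a matching pair is guaranteed to
    // give the original text back.
    static String recode(String text, Charset encodeAs, Charset decodeAs) {
        byte[] bytes = text.getBytes(encodeAs);
        return new String(bytes, decodeAs);
    }

    public static void main(String[] args) {
        String tt = "Hello"; // stand-in for textHandler.toString()

        // Same shape as the code in the question: UTF-16 bytes read as UTF-8.
        String mismatched = recode(tt, StandardCharsets.UTF_16, StandardCharsets.UTF_8);

        // Matching charsets round-trip cleanly.
        String matched = recode(tt, StandardCharsets.UTF_16, StandardCharsets.UTF_16);

        System.out.println(tt.equals(mismatched)); // false
        System.out.println(tt.equals(matched));    // true
    }
}
```

In the mismatched case the UTF-16 byte order mark and the zero high bytes are not valid UTF-8 text for "Hello", so the result contains replacement and NUL characters. Either way, re-encoding an already decoded Java string cannot repair a wrong charset guess made at parse time; that has to be addressed where the bytes are first decoded.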