I just submitted a patch for https://issues.apache.org/jira/browse/TIKA-431
I'm hoping Reinhard, Erik, Jan & others can take a look at it, since it's bigger than I was expecting. It also addresses TIKA-539 (charset detection w/HTML), since the code being modified for the two issues was intertwined.

Regards,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr