Re: Detection of plain text files

Ken Krugler Tue, 25 Jun 2019 09:19:03 -0700

Hi Tim,

Seems like what we’d want is “isText()” vs what we’ve got, which is “isAscii()”.


Any thoughts on switching to what I thought was the older algorithm, of (a) not 
many unexpected control chars, and (b) a reasonable number of line ending chars?
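
To make (a) and (b) concrete, here is a rough sketch of that kind of check. It is not Tika's code and not the older algorithm itself. The class name, method names, and both thresholds are hypothetical, chosen only for illustration:

```java
public class PlainTextHeuristic {
    // Hypothetical thresholds, not taken from Tika.
    static final double MAX_CONTROL_RATIO = 0.02;       // (a) tolerated share of unexpected control bytes
    static final int MAX_BYTES_PER_LINE_ENDING = 1000;  // (b) expect roughly one line ending per this many bytes

    /** Counts bytes in 0x00-0x1F other than tab, LF, and CR. */
    static int countUnexpectedControls(byte[] data) {
        int count = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                count++;
            }
        }
        return count;
    }

    /** Counts line endings, treating CRLF as a single ending. */
    static int countLineEndings(byte[] data) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                count++;
            } else if (data[i] == '\r' && (i + 1 == data.length || data[i + 1] != '\n')) {
                count++; // bare CR; a CR followed by LF is counted at the LF
            }
        }
        return count;
    }

    /** Returns true if the buffer passes both checks (a) and (b). */
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        if (countUnexpectedControls(data) > data.length * MAX_CONTROL_RATIO) {
            return false;
        }
        // Short buffers may legitimately hold a single line with no ending.
        return data.length < MAX_BYTES_PER_LINE_ENDING
                || countLineEndings(data) >= data.length / MAX_BYTES_PER_LINE_ENDING;
    }
}
```

Bytes at or above 0x80 never count against the buffer here, so multibyte encodings such as Shift_JIS would not be penalized the way they are by a mostly-ASCII test. The flip side is that binary data with few low control bytes could slip through, so the thresholds would need tuning against real files.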

— Ken

> On Jun 25, 2019, at 6:56 AM, Tim Allison <talli...@apache.org> wrote:
> 
> Hi Ken,
>  I'm sorry for my delay.  I took a short chunk of Japanese and
> converted it to Shift_JIS.
> 
>  Your memory is largely correct (or we've changed the code base a
> bit).  The TextDetector makes a decision in favor of {{text/plain}} vs
> {{application/octet-stream}} via TextStatistics
> (https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
> if the bytes are:
> 
> a) mostly in the ascii range (btwn 0x20 and 128) and don't have too
> many control characters
> b) kind of look like UTF-8
> 
> In the example file I used, there were 0 control, 36 ascii (btwn 0x20
> and 128) and 0 safe terms, but the total character count was 218.  The
> isAscii() requires > 90% of the characters appear btwn 0x20 and
> 128...so the text detector failed.
> 
> In short, this is an area for improvement.  I suspect our current
> mechanism would also be pretty awful on UTF-16.
> 
> On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler <kkrugler_li...@transpac.com> 
> wrote:
>> 
>> Hi devs,
>> 
>> I’m trying to remember the history of how Tika’s current mime-type detection 
>> has evolved, regarding handling of plain text files.
>> 
>> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”) 
>> it gets returned as application/octet-stream.
>> 
>> I thought that previously we had something which would check if the file 
>> only had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars 
>> besides these), and a reasonable number of line ending chars, and if so then 
>> we’d return text/plain instead of application/octet-stream.
>> 
>> Thanks,
>> 
>> — Ken
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
