Hi Ken,
  I'm sorry for my delay.  I took a short chunk of Japanese and
converted it to Shift_JIS.

  Your memory is largely correct (or we've changed the code base a
bit).  The TextDetector decides in favor of {{text/plain}} over
{{application/octet-stream}} via TextStatistics
(https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
if the bytes either:

a) are mostly in the ASCII range (between 0x20 and 0x80) and don't have
too many control characters, or
b) roughly look like UTF-8

In the example file I used, there were 0 control bytes, 36 ASCII bytes
(between 0x20 and 0x80) and 0 safe control bytes, but the total byte
count was 218.  isMostlyAscii() requires that > 90% of the bytes fall
between 0x20 and 0x80; here it was 36/218, or about 17%, so the text
detector failed.
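
To make the arithmetic concrete, here is a minimal Java sketch of the check as described above. It is not the actual TextStatistics code: the class and method names are made up for illustration, the 2% cap on other control bytes is an assumed stand-in for "not too many", and "safe" is assumed to mean tab/LF/CR.

```java
// Illustrative sketch only; this is NOT Tika's TextStatistics code.
class AsciiRatioSketch {

    // Assumption: "not too many control characters" is modeled as < 2% of all bytes.
    static boolean passes(int control, int ascii, int safe, int total) {
        return total > 0
                && control * 100 < total * 2
                && (ascii + safe) * 100 > total * 90;
    }

    // Hypothetical helper: classifies each byte, then applies the thresholds.
    static boolean looksMostlyAscii(byte[] data) {
        int control = 0, ascii = 0, safe = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v < 0x80) {
                ascii++;
            } else if (v == '\t' || v == '\n' || v == '\r') {
                safe++;
            } else if (v < 0x20) {
                control++;
            }
            // bytes >= 0x80 count only toward the total
        }
        return passes(control, ascii, safe, data.length);
    }

    public static void main(String[] args) {
        // Counts from the example file: 0 control, 36 ASCII, 0 safe, 218 total.
        System.out.println(passes(0, 36, 0, 218)); // false: 36/218 is far below 90%

        // Shift_JIS bytes for a three-character Japanese word, hard-coded
        // so that no charset lookup is needed.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, (byte) 0x8C, (byte) 0xEA};
        System.out.println(looksMostlyAscii(sjis)); // false: only 0x7B is in the ASCII range
    }
}
```

Note that some Shift_JIS trail bytes (0x7B here) land in the ASCII range, which is likely why the example file still showed a few ASCII bytes.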

In short, this is an area for improvement.  I suspect our current
mechanism would also be pretty awful on UTF-16.
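
One reason to suspect this: for mostly Latin text, UTF-16 pairs every ASCII byte with a 0x00 byte, so about half the input falls in the control range. A small sketch of that byte profile follows; the class and method names are hypothetical, and it only counts bytes rather than calling Tika.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; it does not call Tika.
class Utf16ByteProfileSketch {

    // Counts bytes below 0x20 other than tab, LF and CR.
    static int countOtherControl(byte[] data) {
        int n = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // 18 characters become 36 bytes in UTF-16LE, each character followed by 0x00.
        byte[] utf16 = "plain English text".getBytes(StandardCharsets.UTF_16LE);
        int control = countOtherControl(utf16);
        System.out.println(control + " of " + utf16.length + " bytes are control bytes");
    }
}
```

With half the bytes being 0x00, such input could not satisfy a "> 90% ASCII, few control bytes" test.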

On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
> Hi devs,
>
> I’m trying to remember the history of how Tika’s current mime-type detection 
> has evolved, regarding handling of plain text files.
>
> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”) 
> it gets returned as application/octet-stream.
>
> I thought that previously we had something which would check if the file only 
> had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars besides 
> these), and a reasonable number of line ending chars, and if so then we’d 
> return text/plain instead of application/octet-stream
>
> Thanks,
>
> — Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
