Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

Ken Krugler Thu, 17 Jan 2019 11:26:02 -0800

Hi Mike,

I don’t see the attached script. Did it get stripped by the mailing list?


Below is a list of the language profiles that I believe are bundled with the 
language-detector jar that’s pulled in by Tika.

There is no “gr” profile. Note that the code for Greek is “el”, so the “el” result in your “gr” row is a correct detection.

Also, the Chinese profiles are “zh-CN” and “zh-TW” rather than just “zh”. 
Otherwise I’d expect detection to work for your test cases.
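
When comparing the script's expected codes against the detector's output, the naming differences noted above could be absorbed with a small alias table. The following is only a sketch: `LangCodeCheck`, `ALIASES` and `matches` are hypothetical names, not part of Tika or the language-detector library, and the alias entries are assumptions based on the profile names mentioned in this message.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper for a test script; not part of Tika or language-detector.
class LangCodeCheck {
    // Assumed aliases: code used in the test script -> profile codes
    // that should count as a correct detection.
    static final Map<String, Set<String>> ALIASES = Map.of(
        "gr", Set.of("el"),              // Greek is "el", not "gr"
        "zh", Set.of("zh-CN", "zh-TW")); // Chinese profiles carry a region suffix

    // True if the detected code equals the expected one or is a known alias of it.
    static boolean matches(String expected, String detected) {
        if (expected.equals(detected)) {
            return true;
        }
        return ALIASES.getOrDefault(expected, Set.of()).contains(detected);
    }

    public static void main(String[] args) {
        System.out.println(matches("gr", "el"));    // true
        System.out.println(matches("zh", "zh-CN")); // true
        System.out.println(matches("ko", "lt"));    // false
    }
}
```

With a check like this, the "gr" row would count as a pass, while the "ko" and "zh" rows that came back as "lt" would still be flagged.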

— Ken

af
an
ar
ast
be
bg
bn
br
ca
cs
cy
da
de
el
en
es
et
eu
fa
fi
fr
ga
gl
gu
he
hi
hr
ht
hu
id
is
it
ja
km
kn
ko
lt
lv
mk
ml
mr
ms
mt
ne
nl
no
oc
pa
pl
pt
ro
ru
sk
sl
so
sq
sr
sv
sw
ta
te
th
tl
tr
uk
ur
vi
yi
zh-CN
zh-TW
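
To see which test codes have no exact match in the list above, a quick membership check along these lines might help. It is a sketch under assumptions: `ProfileCheck` and `withoutProfile` are hypothetical names, and the profile codes are copied from the list above, which is itself only believed to match what the jar bundles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the profile codes are copied from the list in this message.
class ProfileCheck {
    static final Set<String> PROFILES = new HashSet<>(Arrays.asList((
        "af an ar ast be bg bn br ca cs cy da de el en es et eu fa fi fr ga gl gu "
        + "he hi hr ht hu id is it ja km kn ko lt lv mk ml mr ms mt ne nl no oc pa "
        + "pl pt ro ru sk sl so sq sr sv sw ta te th tl tr uk ur vi yi zh-CN zh-TW"
    ).split(" ")));

    // Returns the codes (in input order) that have no exact match among the profiles.
    static List<String> withoutProfile(List<String> codes) {
        List<String> missing = new ArrayList<>();
        for (String code : codes) {
            if (!PROFILES.contains(code)) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The codes used in the test results quoted below.
        List<String> tested = Arrays.asList(
            "ar", "de", "en", "es", "fr", "gr", "it", "ko", "nl", "ru", "zh");
        System.out.println(withoutProfile(tested)); // prints [gr, zh]
    }
}
```

Only "gr" and "zh" lack an exact match. "ko" does have a profile in this list, so the naming difference alone would not account for the "lt" result for Korean.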


> On Jan 17, 2019, at 9:39 AM, Mike Thomsen <mikerthom...@gmail.com> wrote:
> 
> I wrote a Groovy script (attached) to test a bunch of languages against the 
> LanguageDetector class, and these were the results:
> 
> ar    fa
> de    de
> en    en
> es    es
> fr    fr
> gr    el
> it    it
> ko    lt
> nl    nl
> ru    ru
> zh    lt
> 
> Is there something that needs to be done to enable the detection of Asian 
> languages or should I file this as a bug report?
> 
> Thanks,
> 
> Mike

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
