On Oct 25, 2011, at 6:12pm, Michael McCandless wrote:

> OK I posted the 3rd post about CLD, this time testing perf by
> comparing to Tika and language-detection (Google Code project):
> 
>    
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
> 
> Net/net all three do very well (>= 97% accuracy); I had to remove 4
> languages from consideration because we don't support them.
> 
> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).
> 
> Also, Tika's performance is substantially slower than the other two... not
> sure what's up.

I'm not surprised that Tika is slower than CLD, given the highly optimized 
nature of that code. Though 2 orders of magnitude is...painful.

I took a swing at this a while back, but didn't complete the patch.

The main issues I tried to solve were:

 - Tika processes all of the text in the document rather than sampling up to
some limit, which slows it down significantly for longer documents (a rough
sketch of a sampling cap is below).

 - The ProfilingWriter is very inefficient: every character processed does an
array copy, and every three characters triggers a new String() (an alternative
counting approach is sketched below).
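
To give a feel for the first issue, this is roughly the kind of sampling cap I
had in mind: a plain java.io.Writer wrapper that only forwards the first N
characters to whatever does the profiling. It's just an untested sketch; the
class name and the delegate wiring are made up, not anything that exists in
Tika today.

    import java.io.IOException;
    import java.io.Writer;

    /** Sketch: forwards at most 'limit' chars to a delegate writer. */
    public class CappedWriter extends Writer {
        private final Writer delegate;
        private int remaining;

        public CappedWriter(Writer delegate, int limit) {
            this.delegate = delegate;
            this.remaining = limit;
        }

        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            if (remaining <= 0) {
                return; // sample limit reached, silently drop the rest
            }
            int n = Math.min(len, remaining);
            delegate.write(cbuf, off, n);
            remaining -= n;
        }

        @Override public void flush() throws IOException { delegate.flush(); }
        @Override public void close() throws IOException { delegate.close(); }
    }

The parser would write the extracted text through this, with the real profiling
writer as the delegate, so long documents stop costing anything past the cap.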

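For the second issue, one way to avoid both the per-character array copy and the
String allocation is to keep the current trigram packed into a long (three
16-bit chars) and count those directly. Again just a rough sketch, not wired
into Tika's ProfilingWriter/LanguageProfile; a primitive long-to-int map (e.g.
from fastutil) would also get rid of the boxing that's left here.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: counts character trigrams without per-char copies or Strings. */
    public class TrigramCounter {
        private final Map<Long, Integer> counts = new HashMap<Long, Integer>();
        private long window; // last three chars, packed into the low 48 bits
        private int seen;    // total chars consumed so far

        public void addChar(char c) {
            window = ((window << 16) | c) & 0xFFFFFFFFFFFFL;
            if (++seen >= 3) {
                Integer cur = counts.get(window);
                counts.put(window, cur == null ? 1 : cur + 1);
            }
        }

        public Map<Long, Integer> getCounts() {
            return counts;
        }
    }
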
-- Ken

> http://blog.mikemccandless.com
> 
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
>> <kkrugler_li...@transpac.com> wrote:
>> 
>>> Sounds like a great idea - see the recent comment thread on 
>>> https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>> 
>>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>> 
>> Those do look related (if you swap charset in for language)!
>> 
>> It's tricky to know just how much to "trust" what the server
>> (Content-Type HTTP header) and content (http-equiv meta tag) say,
>> though I do like CLD's approach: they never fully "trust" what was
>> declared but rather use the declaration as a hint to boost language
>> priors.
>> 
>> And then to figure out what priors to assign for each hint they have
>> these tables trained from a large content set (10% of Base).
>> 
>> If we have access to a biggish crawl we could presumably do something
>> similar, i.e. record how often the hint is wrong and translate that
>> into appropriate prior boosts, so the declaration becomes a hint
>> instead of something we fully trust.
>> 
>> Does anyone know how ICU translates the encoding "hint" into priors
>> for each encoding?
>> 
>>> Also, what will you be using to test language detection? Wikipedia pages?
>> 
>> I'm using the corpus from here:
>> 
>>    
>> http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>> 
>> It's a random subset of europarl (1000 strings from each of 21 langs).
>> 
>> Wikipedia would be great too!
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


