Hi Mike,

Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
And there's also https://issues.apache.org/jira/browse/TIKA-539

Also, what will you be using to test language detection? Wikipedia pages?

-- Ken

On Oct 24, 2011, at 7:29pm, Michael McCandless wrote:

> I've only scratched the surface in figuring out how CLD
> works... excising the code and exposing a Python wrapper is much
> easier than actually understanding it!
>
> It has some neat features, like passing in three possible "hints":
>
>   * domain extension (fr boosts French)
>
>   * declared encoding
>
>   * declared language
>
> It uses these hints to set pre-computed priors for the top 3 languages.
>
> It can optionally "abstain" from guessing if it thinks it's not very
> confident for certain matches. It has an overall "reliable" bool that
> comes back, which is true if the match is high confidence (like Tika's
> isReasonablyCertain, though that's per-match).
>
> But, you can't [easily] limit up front the set of languages to test
> like you can with Tika (I think? You can just .addProfile() for each
> language you want? Hmm, though loading a LanguageProfile from a .ngp
> file looks like it's private inside LanguageIdentifier).
>
> I'm trying to test Tika vs CLD vs the Java language detection library
> (http://code.google.com/p/language-detection)... hoping to finish that
> soon and do a follow-on blog post.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Oct 24, 2011 at 9:45 AM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>> I took a quick look just now, though it's not really documented yet - in the
>> process of being separated from inside of Chrome.
>>
>> But looks like they store pre-calculated compression models for languages,
>> and then figure out which model works best on the text being analyzed (which
>> implies it has bytes with similar probabilistic distribution/sequencing).
>>
>> -- Ken
>>
>> On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:
>>
>>> Hi,
>>>
>>> I just found this blog post from Mike McCandless about Google's Compact
>>> Language Detection code used in Chrome:
>>> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
>>>
>>> There are probably some interesting things to explore in the Google code in
>>> order to improve Tika's language detection.
>>> Has anyone already taken a look at the Google CLD code?
>>> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
>>>
>>> Best regards
>>>
>>> Jérôme
>>>
>>> --
>>> @jcharron
>>> http://motre.ch/
>>> http://jcharron.posterous.com/
>>> http://www.shopreflex.fr/
>>> http://www.staragora.com/
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
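
P.S. To make the "hints" idea Mike describes a bit more concrete, here is a rough Python sketch of how hints could be folded into per-language scores as priors, with an overall "reliable" flag on the result. This is not CLD's code or its Python wrapper: the function name, the boost factor, the threshold and the domain-to-language table are all made up for illustration, and the declared-encoding hint is left out to keep it short.

```python
from typing import Dict, Optional, Tuple

# Hypothetical boost for a language favoured by a hint (not a CLD value).
HINT_BOOST = 2.0
# Hypothetical cut-off below which the guess is flagged as not reliable.
RELIABLE_THRESHOLD = 0.6
# Hypothetical mapping from domain extension to language code.
TLD_TO_LANG = {"fr": "fr", "de": "de", "es": "es"}


def detect_with_hints(
    raw_scores: Dict[str, float],
    tld: Optional[str] = None,
    declared_language: Optional[str] = None,
) -> Tuple[str, float, bool]:
    """Combine per-language scores with hint-based priors.

    Returns (best language, normalized probability, reliable flag).
    """
    if not raw_scores:
        raise ValueError("raw_scores must not be empty")

    # Start from a flat prior and boost once per hint that names a
    # language we actually have a score for.
    priors = {lang: 1.0 for lang in raw_scores}
    for hinted in (TLD_TO_LANG.get(tld or ""), declared_language):
        if hinted in priors:
            priors[hinted] *= HINT_BOOST

    # Weight each score by its prior, then normalize to a probability.
    weighted = {lang: raw_scores[lang] * priors[lang] for lang in raw_scores}
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")

    best = max(weighted, key=weighted.get)
    probability = weighted[best] / total
    return best, probability, probability >= RELIABLE_THRESHOLD


if __name__ == "__main__":
    scores = {"fr": 0.40, "en": 0.45, "de": 0.15}
    print(detect_with_hints(scores))
    print(detect_with_hints(scores, tld="fr", declared_language="fr"))
```

A caller that wants the "abstain" behaviour could simply treat a false reliable flag as "no answer" instead of using the best guess.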