Hierarchical Documents

2005-08-21 Thread Rohit Lodha
Hi all, Currently, Documents cannot contain other Documents. I have a graph of objects (Documents) to search. I could flatten them and search, but... Is there a nicer way to do it? Rohit

Re: NGram Language Categorization Source

2005-08-21 Thread Otis Gospodnetic
Hello,

It sounds like the LI acronym was confusing: it stands for Language Identification.

Otis

> > It was also found that the way you create ngram profiles (e.g. with or without surrounding spaces, single length or mixed length) affects the LI performance.
>
> LI???
>
> I haven't benchmarked
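The profile options mentioned in the quoted text (with or without surrounding spaces, single or mixed n-gram length) can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the code from ngramj, Nutch, or the Lucene patch: the function name `ngram_profile` and its parameters `sizes`, `pad`, and `top` are hypothetical.

```python
from collections import Counter


def ngram_profile(text, sizes=(1, 2, 3), pad=True, top=300):
    """Return the `top` most frequent character n-grams of `text`.

    Hypothetical helper for illustration only.
    sizes: n-gram lengths to collect; one value means single length,
           several values mean mixed length.
    pad:   if True, surround every word with one space before slicing,
           so word-initial and word-final n-grams stay distinguishable.
    """
    counts = Counter()
    for word in text.lower().split():
        token = f" {word} " if pad else word
        for n in sizes:
            for i in range(len(token) - n + 1):
                gram = token[i:i + n]
                # Skip n-grams that consist of padding only.
                if gram.strip():
                    counts[gram] += 1
    # Most frequent first; ties are broken alphabetically so the
    # resulting ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(ngram_profile(sample, sizes=(3,), pad=True, top=5))
    print(ngram_profile(sample, sizes=(1, 2, 3), pad=False, top=5))
```

Building one profile per language with each setting and comparing identification accuracy on held-out text would be one way to check how much these choices matter for a given corpus.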

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> * A Nutch implementation:
>   http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/
>
> * A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

A step in the right direction. It doesn't have other language categories created, though.

> * JTextCat (http

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> Erhm... Not to rain on your parade, but Googling for "ngram java" gives a lot of hits. http://sourceforge.net/projects/ngramj and also "languageidentifier" in Nutch are two examples of Open Source Java implementations. Each can be used with Lucene.

I think I've played with ngramj and found

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
>ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf > > > > Linguini: Language Identification for Multilingual Documents > > John M. Prager > > Prager also uses an n-gram approach, so you might