Re: Lucene and Google Web 1T 5 Gram

Rafael Turk Wed, 23 Apr 2008 18:21:25 -0700

Hi Mathieu,

*What do you want to do?*


A spell checker and related keyword suggestions.

If you want an ngram => popularity map, just use a Berkeley DB, and use this
information in your Lucene application. Lucene is an inverted index, Berkeley
DB an index.

*Great idea! Berkeley DB is definitely worth a try, simple and effective, but
I'll have to preprocess the data first. I was hoping to take advantage of
Lucene's built-in features.*
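
For reference, here is a rough sketch of the ngram => popularity map idea.
It is only an illustration under a few assumptions: it uses Python's standard
`dbm` module as a stand-in for Berkeley DB (any on-disk key-value store would
play the same role), it assumes each corpus row has the form
"ngram<TAB>count", and the helper names `build_ngram_db` and `lookup` are
made up for this example.

```python
import dbm


def build_ngram_db(lines, db_path):
    """Write each n-gram and its count into an on-disk key-value store.

    Assumes every row looks like "w1 w2 ... wn<TAB>count".
    """
    with dbm.open(db_path, "c") as db:  # "c": create the store if missing
        for line in lines:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip blank or malformed rows
            ngram, _, count = line.rpartition("\t")
            # dbm stores keys and values as bytes
            db[ngram.encode("utf-8")] = count.encode("utf-8")


def lookup(db_path, ngram):
    """Return the stored count for an n-gram, or 0 if it is not present."""
    with dbm.open(db_path, "r") as db:
        value = db.get(ngram.encode("utf-8"))
    return int(value) if value is not None else 0
```

With something like this, a spell checker could call `lookup` on each
candidate correction and rank the candidates by count, while Lucene stays
responsible for the actual search index.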
*[]s*
On Wed, Apr 23, 2008 at 10:16 AM, Mathieu Lecarme <[EMAIL PROTECTED]>
wrote:

> Rafael Turk wrote:
>
> > Hi Folks,
> >
> >   I'm trying to load Google Web 1T 5 Gram into Lucene. (This corpus
> > contains English word n-grams and their observed frequency counts. The
> > length of the n-grams ranges from unigrams (single words) to five-grams.)
> >
> >   I'm loading each ngram (each row is an ngram) as an individual
> > Document. This way I'll be able to search for each ngram separately, but
> > I'm ending up with huge indexes, which makes them very hard to load and
> > read.
> >
> >  Is there a better way to load and read ngrams in a Lucene index? Maybe
> > using a lower-level API?
> >
> >
> > More Info about Google Web 1T 5 Gram corpus at:
> > <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
> >
> > Thanks,
> >
> > Rafael
> >
> >
> >
>
> What do you want to do?
> If you want an ngram => popularity map, just use a Berkeley DB, and use
> this information in your Lucene application. Lucene is an inverted index,
> Berkeley DB an index.
>
> M.
