Thanks Julien, I'll definitely give it a try!
[]s
Rafael

On Wed, Apr 23, 2008 at 8:38 AM, Julien Nioche <[EMAIL PROTECTED]> wrote:
> Hi Rafael,
>
> We initially tried to do the same but ended up developing our own API for
> querying the Web 1T. You can find more details at
> http://digitalpebble.com/resources.html
> There could be a way to reuse elements from Lucene, e.g. the term index
> only, but I could not find an obvious way to achieve that.
>
> Best,
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> On 23/04/2008, Rafael Turk <[EMAIL PROTECTED]> wrote:
> >
> > Hi Folks,
> >
> > I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This
> > corpus contains English word n-grams and their observed frequency
> > counts. The length of the n-grams ranges from unigrams (single words)
> > to five-grams.)
> >
> > I'm loading each n-gram (each row is one n-gram) as an individual
> > Document. This way I'll be able to search for each n-gram separately,
> > but I'm ending up with huge indexes, which makes them very hard to
> > load and read.
> >
> > Is there a better way to load and read n-grams in a Lucene index?
> > Maybe using a lower-level API?
> >
> >
> > More info about the Google Web 1T 5-gram corpus at:
> > <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
> >
> > Thanks,
> >
> >
> > Rafael
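
For reference, here is a minimal sketch of the one-Document-per-n-gram
approach Rafael describes, written against a recent Lucene API (the field
names, the tab-separated row format, and the index path are assumptions,
and the API has changed considerably since the 2.x releases current in
2008):

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.nio.file.Paths;

    public class NgramIndexer {
        public static void main(String[] args) throws Exception {
            // Each corpus row is "<ngram>\t<count>", e.g.
            // "ceramics collectables collectibles\t55"
            try (FSDirectory dir = FSDirectory.open(Paths.get("ngram-index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new KeywordAnalyzer()));
                 BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    int tab = line.lastIndexOf('\t');
                    Document doc = new Document();
                    // StringField indexes the whole n-gram as a single,
                    // untokenized term, so each row is searchable as-is
                    doc.add(new StringField("ngram",
                            line.substring(0, tab), Field.Store.YES));
                    // The frequency count is stored for retrieval only,
                    // not indexed
                    doc.add(new StoredField("count",
                            Long.parseLong(line.substring(tab + 1))));
                    writer.addDocument(doc);
                }
            }
        }
    }

One Document per row is exactly what produces the very large indexes Rafael
mentions, since the corpus holds on the order of a billion n-grams; that is
the trade-off that motivates Julien's suggestion of a dedicated API, or of
reusing only Lucene's term index rather than full Documents.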