Sorry, I wrote this stuff, but forgot the naming. Look: http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: yu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, March 26, 2008 12:04:33 AM Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1 Hi Otis, I checked that contrib before and could not find NgramStemFilter. Am I missing other contrib? Thanks for the link! Jay Otis Gospodnetic wrote: > Hi Jay, > > Sorry, lapsus calami, that would be Lucene *contrib*. > Have a look: > http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > ----- Original Message ---- > From: Jay <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Tuesday, March 25, 2008 6:15:54 PM > Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1 > > Sorry, I could not find the filter in the 2.3 API class list (core + > contrib + test). I am not ware of lucene config file either. Could you > please tell me where it is in 2.3 release? > > Thanks! > > Jay > > Otis Gospodnetic wrote: > >> Jay, >> >> Have a look at Lucene config, it's all there, including tests. This filter >> will take a token such as "foobar" and chop it up into n-grams (e.g. foobar >> -> fo oo ob ba ar would be a set of bi-grams). You can specify the n-gram >> size and even min and max n-gram size. >> >> Otis >> -- >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch >> >> ----- Original Message ---- >> From: Jay <[EMAIL PROTECTED]> >> To: java-user@lucene.apache.org >> Sent: Tuesday, March 25, 2008 1:32:24 PM >> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1 >> >> Hi Uwe, >> >> I am curious what NGramStemFilter is? Is it a combination of porter >> stemming and word ngram identification? >> >> Thanks! >> >> Jay >> >> Uwe Goetzke wrote: >> >>> Hi Ivan, >>> No, we do not use StandardAnalyser or StandardTokenizer. >>> >>> Most data is processed by >>> fTextTokenStream = result = new >>> org.apache.lucene.analysis.WhitespaceTokenizer(reader); >>> result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter >>> modified that ö -> oe >>> result = new org.apache.lucene.analysis.LowerCaseFilter(result); >>> result = new org.apache.lucene.analysis.NGramStemFilter(result,2); >>> //just a bigram tokenizer >>> >>> We use our own queryparser. The bigramms are searched with a tolerant >>> phrase query, scoring in a doc the greatest bigramms clusters covering the >>> phrase token. >>> >>> Best Regards >>> >>> Uwe >>> >>> -----Ursprüngliche Nachricht----- >>> Von: Ivan Vasilev [mailto:[EMAIL PROTECTED] >>> Gesendet: Freitag, 21. März 2008 16:25 >>> An: java-user@lucene.apache.org >>> Betreff: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1 >>> >>> Hi Uwe, >>> >>> Could you tell what Analyzer do you use when you marked so big indexing >>> speedup? >>> If you use StandardAnalyzer (that uses StandardTokenizer) may be the >>> reason is in it. You can see the pre last report in the thread "Indexing >>> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter Jake >>> Mannix this is because now StandardTokenizer uses StandardTokenizerImpl >>> that now is generated by JFlex instead of JavaCC. >>> I am asking because I noticed a great speedup in adding documents to >>> index in our system. We have time control on this in the debug mode. NOW >>> THEY ARE ADDED 5 TIMES FASTER!!! >>> But in the same time the total process of indexing in our case has >>> improvement of about 8%. As our system is very big and complex I am >>> wondering if really the whole process of indexing is reduces so >>> remarkably and our system causes this slowdown or may be Lucene does >>> some optimizations on the index, merges or something else and this is >>> the reason the total process of indexing to be not so reasonably faster. >>> >>> Best Regards, >>> Ivan >>> >>> >>> >>> Uwe Goetzke wrote: >>> >>>> This week I switched the lucene library version on one customer system. >>>> The indexing speed went down from 46m32s to 16m20s for the complete task >>>> including optimisation. Great Job! >>>> We index product catalogs from several suppliers, in this case around >>>> 56.000 product groups and 360.000 products including descriptions were >>>> indexed. >>>> >>>> Regards >>>> >>>> Uwe >>>> >>>> >>>> >>>> ----------------------------------------------------------------------- >>>> Healy Hudson GmbH - D-55252 Mainz Kastel >>>> Geschaftsfuhrer Christian Konhauser - Amtsgericht Wiesbaden HRB 12076 >>>> >>>> Diese Email ist vertraulich. Wenn Sie nicht der beabsichtigte Empfanger >>>> sind, durfen Sie die Informationen nicht offen legen oder benutzen. Wenn >>>> Sie diese Email durch einen Fehler bekommen haben, teilen Sie uns dies >>>> bitte umgehend mit, indem Sie diese Email an den Absender zuruckschicken. >>>> Bitte loschen Sie danach diese Email. >>>> This email is confidential. If you are not the intended recipient, you >>>> must not disclose or use this information contained in it. If you have >>>> received this email in error please tell us immediately by return email >>>> and delete the document. >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>>> For additional commands, e-mail: [EMAIL PROTECTED] >>>> >>>> >>>> __________ NOD32 2913 (20080301) Information __________ >>>> >>>> This message was checked by NOD32 antivirus system. >>>> http://www.eset.com >>>> >>>> >>>> >>>> >>>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>> For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> ----------------------------------------------------------------------- >>> Healy Hudson GmbH - D-55252 Mainz Kastel >>> Geschäftsführer Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076 >>> >>> Diese Email ist vertraulich. Wenn Sie nicht der beabsichtigte Empfänger >>> sind, dürfen Sie die Informationen nicht offen legen oder benutzen. Wenn >>> Sie diese Email durch einen Fehler bekommen haben, teilen Sie uns dies >>> bitte umgehend mit, indem Sie diese Email an den Absender zurückschicken. >>> Bitte löschen Sie danach diese Email. >>> This email is confidential. If you are not the intended recipient, you must >>> not disclose or use this information contained in it. If you have received >>> this email in error please tell us immediately by return email and delete >>> the document. >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>> For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]