Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Otis Gospodnetic Tue, 25 Mar 2008 20:13:44 -0700

Sorry, I wrote this code but forgot the exact names.
See: 
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html
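
For readers following the thread, the n-gram chopping described in the quoted message further down (foobar -> fo oo ob ba ar, with a configurable min and max size) can be sketched in plain Java. This is only an illustration and does not use the Lucene API: the class name CharNGrams and the method ngrams are hypothetical, and the output order (all shorter grams first) is a choice made for this sketch, so it may differ from what the contrib filter emits.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of character n-grams; NOT the Lucene contrib class. */
final class CharNGrams {

    private CharNGrams() {}

    /**
     * Returns every substring of token whose length lies between
     * minGram and maxGram (inclusive), shorter grams first.
     */
    static List<String> ngrams(String token, int minGram, int maxGram) {
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of width n across the token, one character at a time.
            for (int start = 0; start + n <= token.length(); start++) {
                grams.add(token.substring(start, start + n));
            }
        }
        return grams;
    }
}
```

Calling CharNGrams.ngrams("foobar", 2, 2) yields fo, oo, ob, ba, ar, which matches the bi-gram example in the quoted message. With sizes 2 to 3 it additionally yields foo, oob, oba, bar.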


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: yu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 12:04:33 AM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Otis,
I checked that contrib package before and could not find NGramStemFilter. Am I 
missing another contrib package?
Thanks for the link!

Jay

Otis Gospodnetic wrote:
> Hi Jay,
>
> Sorry, lapsus calami, that would be Lucene *contrib*.
> Have a look:
> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Jay <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, March 25, 2008 6:15:54 PM
> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Sorry, I could not find the filter in the 2.3 API class list (core + 
> contrib + test). I am not aware of a Lucene config file either. Could you 
> please tell me where it is in the 2.3 release?
>
> Thanks!
>
> Jay
>
> Otis Gospodnetic wrote:
>   
>> Jay,
>>
>> Have a look at Lucene config, it's all there, including tests.  This filter 
>> will take a token such as "foobar" and chop it up into n-grams (e.g. foobar 
>> -> fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram 
>> size and even min and max n-gram size.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Jay <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Tuesday, March 25, 2008 1:32:24 PM
>> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>
>> Hi Uwe,
>>
>> I am curious what NGramStemFilter is. Is it a combination of Porter 
>> stemming and word n-gram identification?
>>
>> Thanks!
>>
>> Jay
>>
>> Uwe Goetzke wrote:
>>     
>>> Hi Ivan,
>>> No, we do not use StandardAnalyzer or StandardTokenizer.
>>>
>>> Most data is processed by:
>>>     fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>>>     // ISOLatin1AccentFilter modified so that ö -> oe
>>>     result = new ISOLatin2AccentFilter(result);
>>>     result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>>>     // just a bigram tokenizer
>>>     result = new org.apache.lucene.analysis.NGramStemFilter(result,2);
>>>
>>> We use our own query parser. The bigrams are searched with a tolerant 
>>> phrase query that scores, within each document, the largest clusters of 
>>> bigrams covering the phrase token.
>>>
>>> Best Regards
>>>
>>> Uwe
>>>
>>> -----Original Message-----
>>> From: Ivan Vasilev [mailto:[EMAIL PROTECTED] 
>>> Sent: Friday, 21 March 2008 16:25
>>> To: java-user@lucene.apache.org
>>> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>>
>>> Hi Uwe,
>>>
>>> Could you tell me which Analyzer you were using when you saw such a big 
>>> indexing speedup?
>>> If you use StandardAnalyzer (which uses StandardTokenizer), that may be the 
>>> reason. See the second-to-last report in the thread "Indexing 
>>> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake 
>>> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl, 
>>> which is generated by JFlex instead of JavaCC.
>>> I am asking because I noticed a great speedup in adding documents to the 
>>> index in our system. We time this step in debug mode. NOW 
>>> THEY ARE ADDED 5 TIMES FASTER!!!
>>> At the same time, the total indexing process in our case improved by 
>>> only about 8%. As our system is very big and complex, I am 
>>> wondering whether the whole indexing process really is that much faster 
>>> and our system is what slows it down, or whether Lucene performs 
>>> some optimizations on the index, merges, or something else that keeps 
>>> the total indexing time from improving as much.
>>>
>>> Best Regards,
>>> Ivan
>>>
>>>
>>>
>>> Uwe Goetzke wrote:
>>>       
>>>> This week I switched the Lucene library version on one customer system.
>>>> The indexing time went down from 46m32s to 16m20s for the complete task
>>>> including optimisation. Great Job!
>>>> We index product catalogs from several suppliers, in this case around
>>>> 56,000 product groups and 360,000 products including descriptions were
>>>> indexed.
>>>>
>>>> Regards
>>>>
>>>> Uwe
>>>>
>>>>
>>>>
>>>> -----------------------------------------------------------------------
>>>> Healy Hudson GmbH - D-55252 Mainz Kastel
>>>> Managing Director Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076
>>>>
>>>> This email is confidential. If you are not the intended recipient, you 
>>>> must not disclose or use this information contained in it. If you have 
>>>> received this email in error please tell us immediately by return email 
>>>> and delete the document.
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>>
