MultiSearcher query with Sort option
Hi, I am using a MultiSearcher to search 2 indexes. As part of my query, I am sorting the results based on a field (which is NOT_ANALYZED). However, I seem to be getting hits only from one of the indexes. If I change to Sort.INDEX_ORDER, I get results from both. Is this a known problem?

Thanks,
~preetham
RE: MultiSearcher query with Sort option
Hallo Preetham, never heard of this. Which Lucene version do you use? To check, try the search in a different way: instead of combining the two indexes with a MultiSearcher, open an IndexReader for each index and combine both readers into a MultiReader. This MultiReader can be used like a conventional single index and searched with IndexSearcher. If the error then disappears, there may be a bug; if not, something is wrong with your indexes. I always recommend using MultiSearcher only in distributed or parallel search scenarios, never just for combining two indexes.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Preetham Kajekar [mailto:preet...@cisco.com]
> Sent: Friday, April 10, 2009 9:43 AM
> To: java-user@lucene.apache.org
> Subject: MultiSearcher query with Sort option
>
> Hi,
> I am using a MultiSearcher to search 2 indexes. As part of my query, I
> am sorting the results based on a field (which is NOT_ANALYZED).
> However, I seem to be getting hits only from one of the indexes. If I
> change to Sort.INDEX_ORDER, I seem to be getting results from both. Is
> this a known problem?
>
> Thanks,
> ~preetham
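For reference, a bare-bones sketch of the MultiReader approach described above; the index paths and field names are placeholders, not taken from the original post:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MultiReaderCheck {
  public static void main(String[] args) throws Exception {
    // One reader per index, combined into a single logical index.
    IndexReader r1 = IndexReader.open(FSDirectory.getDirectory("/path/to/index1"));
    IndexReader r2 = IndexReader.open(FSDirectory.getDirectory("/path/to/index2"));
    IndexReader multi = new MultiReader(new IndexReader[] { r1, r2 });

    // Search it exactly like a single index, with the same Sort as before.
    IndexSearcher searcher = new IndexSearcher(multi);
    Sort sort = new Sort(new SortField("myField", SortField.STRING));
    TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), null, 100, sort);
    System.out.println("total hits: " + hits.totalHits);

    searcher.close();
    multi.close(); // this MultiReader constructor closes the sub-readers as well
  }
}

If the MultiReader-based search returns hits from both indexes while the MultiSearcher version does not, that points at MultiSearcher rather than at the indexes.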
Re: SpellChecker AlreadyClosedException issue
dir is a local variable inside a method, so it's not getting reused. Should I synchronise the whole method? I think that would slow things down in a concurrent environment. Thanks for your response.

Chris Hostetter wrote:
: My code looks like this:
:
: Directory dir = null;
: try {
:   dir = FSDirectory.getDirectory("/path/to/dictionary");
:   SpellChecker spell = new SpellChecker(dir); // exception thrown here
:   // ...
:   dir.close();
:
: This code works, but in a highly concurrent situation AlreadyClosedException
: is being thrown when I try to instantiate the SpellChecker:
: org.apache.lucene.store.AlreadyClosedException: this Directory is closed

if an error only happens under high concurrent load, it suggests that perhaps you have multiple threads attempting to close the directory. you haven't clarified whether "dir" is a local variable inside a method, or an instance variable in an object which is getting reused by multiple threads -- so it's hard to guess.

: I use lucene-core-2.4.1.jar and lucene-spellchecker-2.4.1.jar and I can
: reproduce the error in both windows and linux.

if you have a fully executable test case (instead of just an incomplete partial snippet) that you can share, people may be able to spot the problem, or at the very least run the test themselves to reproduce.

-Hoss

--
Ioannis Cherouvim
Software Engineer
mail: j...@eworx.gr
web: www.eworx.gr
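One common cause of this symptom is opening and closing a fresh Directory and SpellChecker per request while other threads are still using them. A sketch of the shared-instance alternative; the class name and path are made up, and whether concurrent lookups are safe in your SpellChecker version should be verified, but the key point is not to close a Directory that other threads may still be using:

import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SpellingService {
    // One Directory and one SpellChecker, created once and shared by all threads.
    private final Directory dir;
    private final SpellChecker spell;

    public SpellingService(String path) throws Exception {
        dir = FSDirectory.getDirectory(path);   // e.g. "/path/to/dictionary"
        spell = new SpellChecker(dir);
    }

    public String[] suggest(String word, int max) throws Exception {
        // Lookups against the shared instance; nothing is closed here.
        return spell.suggestSimilar(word, max);
    }

    public synchronized void shutdown() throws Exception {
        dir.close();   // close once, at application shutdown, not per request
    }
}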
RE: MultiSearcher query with Sort option
It should work: do not use Sort.INDEX_ORDER; instead create a SortField for index order with the reverse parameter set, wrap that SortField inside a Sort instance, and voila. I am not sure if it works, but it should. The same goes for score.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Preetham Kajekar [mailto:preet...@cisco.com]
> Sent: Friday, April 10, 2009 11:27 AM
> To: java-user@lucene.apache.org
> Subject: Re: MultiSearcher query with Sort option
>
> Hi,
> I just realized it was a bug in my code.
> On a related note, is it possible to Sort based on reverse index order?
>
> Thanks,
> ~preetham
Re: Exceptions in merge thread (while optimizing) causing problems with subsequent reopens
Actually it's perfectly fine for two threads to enter that code fragment (you obtain a write lock to protect the code so that "there can be only one"). Second off, even if you didn't have your write lock, the code should still be safe in that no index corruption is possible. Multiple threads may call optimize(), commit() etc. on an IndexWriter without harm. The reopen code is also "safe" (will not cause corruption), but you may accidentally have readers that you fail to close, or close readers that are in-use by in-flight searches). I'd recommend using the "lia.admin.SearcherManager" class from the upcoming Lucene in Action revision (it's in the book's source code, which you can download from http://www.manning.com/hatcher3/LIAsourcecode.zip) to manage/reopen the searcher. But one surefire way to cause index corruption is if two separate IndexWriters are open on the same index. This is normally not easy to do, since Lucene protects itself with the write lock in the index directory. So if you 1) turn off this locking (eg use NoLockFactory), and 2) accidentally allow two writers on once on the same index, you'll get corruption. So I'm not sure that we've actually explained your corruption? Mike On Fri, Apr 10, 2009 at 12:42 AM, Khawaja Shams wrote: > Mike, > I am sorry for wasting your time :). There were indeed two threads that > were performing this operation. Out of curiosity, which part of this is not > thread safe? An indexreader reopening while a commit is going on? Thanks > again for your help. > > Regards, > Khawaja > > > > On Thu, Apr 9, 2009 at 5:44 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> That code looks right. Are there multiple threads that may enter it? >> >> Can you show the code where you create the IndexWriter, add docs, etc? >> >> Can you call IndexWriter.setInfoStream for the entire life of the >> index, up until when the optimize error happens, and post back? >> >> Mike >> >> On Thu, Apr 9, 2009 at 8:33 PM, Khawaja Shams wrote: >> > Hi Michael, >> > Thanks for the quick response. I only have one IndexWriter, and there >> are >> > no other processes accessing this particular index. I have tried deleting >> > the entire index and reconstructing it, but the index corruption is >> > repeatable. Incidentally, there are no new writes since the last commit >> when >> > the merge happens. I have over-padded my code with ReadWrite locks to >> make >> > sure that no writes/read are happening between the commits, >> optimizations, >> > and reopening of the index. >> > >> > >> > Here is a snippet of the thread I use to maintain the Index ( I hope that >> I >> > am not doing something terribly wrong): >> > while (true) { >> > try { >> > getWriteLock(); >> > indexWriter.commit(); >> > if (shouldOptimize()) { >> > indexWriter.optimize(); >> > } >> > >> > IndexReader oldIR = indexSearcher.getIndexReader(); >> > IndexReader ir = oldIR.reopen(); >> > if (ir != oldIR) { >> > IndexSearcher oldIS = indexSearcher; >> > indexSearcher = new IndexSearcher(ir); >> > oldIS.close(); >> > oldIR.close(); >> > } catch (Throwable t) { >> > trace.error(t, t); >> > } finally { >> > releaseWriteLock(); >> > } >> > >> > >> > Regards, >> > Khawaja >> > >> > On Thu, Apr 9, 2009 at 5:05 PM, Michael McCandless < >> > luc...@mikemccandless.com> wrote: >> > >> >> These are serious corruption exceptions. >> >> >> >> Is it at all possible two writers are accessing the index at the same >> time? >> >> >> >> Can you describe more about how you're using Lucene? 
>> >> >> >> Mike >> >> >> >> On Thu, Apr 9, 2009 at 7:59 PM, Khawaja Shams >> wrote: >> >> > Hello, >> >> > I am having a problem with reopening the IndexReader with Lucene 2.4 >> ( I >> >> > updated to 2.4.1, but still no luck). The exception is preceded by an >> >> > exception in optimizing the index. I am not reopening the reader while >> >> the >> >> > commit or optimization is going on in the writer (optimizing happens >> in >> >> the >> >> > same thread, but much less often). The issues go away once I turn off >> >> > optimizations. I was also getting this problem before I turned off the >> >> use >> >> > of compound files. I would appreciate any guidance. >> >> > >> >> > Thanks! >> >> > >> >> > Regards, >> >> > Khawaja >> >> > >> >> > >> >> > 2009-04-09 15:57:47,033 (941820) [Index Maint Thread] ERROR >> >> > gov.nasa.ensemble.core.indexer.Indexer - java.io.IOException: >> background >> >> > merge hit exception: _8:C41258 _9:C11382 into _a [optimize] >> >> > java.io.IOException: background merge hit exception: _8:C41258 >> _9:C11382 >> >> > into _a [optimize] >> >> > at >>
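As an aside, the reopen block in the quoted snippet appears to have lost the closing brace of its if block in the mail. A sketch of how that section is presumably meant to read; getWriteLock(), releaseWriteLock(), shouldOptimize(), trace and the writer/searcher fields are the poster's own and assumed to exist:

try {
    getWriteLock();
    indexWriter.commit();
    if (shouldOptimize()) {
        indexWriter.optimize();
    }

    IndexReader oldIR = indexSearcher.getIndexReader();
    IndexReader newIR = oldIR.reopen();
    if (newIR != oldIR) {
        IndexSearcher oldIS = indexSearcher;
        indexSearcher = new IndexSearcher(newIR);
        oldIS.close();   // only safe if no in-flight searches still hold the old searcher
        oldIR.close();
    }                    // <-- this closing brace is missing in the quoted snippet
} catch (Throwable t) {
    trace.error(t, t);
} finally {
    releaseWriteLock();
}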
Re: MultiSearcher query with Sort option
This (reversing a SortField.FIELD_DOC) should work... if it doesn't it's a bug. SortField.FIELD_DOC and SortField.FIELD_SCORE are "first class" SortField objects. Mike On Fri, Apr 10, 2009 at 5:31 AM, Uwe Schindler wrote: > It should, do not use Sort.INDEX_ORDER, create a SortField with indexorder > and the reverse parameter, the SortField can be warpped inside a Sort > instance and voila. I am not sure, if it works, but it should. Same with > score. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > >> -Original Message- >> From: Preetham Kajekar [mailto:preet...@cisco.com] >> Sent: Friday, April 10, 2009 11:27 AM >> To: java-user@lucene.apache.org >> Subject: Re: MultiSearcher query with Sort option >> >> Hi, >> I just realized it was a bug in my code. >> On a related note, is it possible to Sort based on reverse index order ? >> >> Thanks, >> ~preetham >> >> Uwe Schindler wrote: >> > Hallo Preetham, >> > >> > never heard of this. What Lucene version do you use? >> > To check out, try the search in andifferent way: >> > Combine the two indexes not into a MultiSearcher, instead open an >> > IndexReader for both indexes and combine both readers to a MultiReader. >> This >> > MultiReader can be used like a conventional single index and searched >> with >> > IndexSearcher. If the error then disappears, there may be a bug. If not, >> > something with your indexes is wrong. >> > >> > I always recommend to only use MultiSearcher in distributed or parallel >> > search scenarios, never for just combining two indexes. >> > >> > Uwe >> > >> > - >> > Uwe Schindler >> > H.-H.-Meier-Allee 63, D-28213 Bremen >> > http://www.thetaphi.de >> > eMail: u...@thetaphi.de >> > >> > >> >> -Original Message- >> >> From: Preetham Kajekar [mailto:preet...@cisco.com] >> >> Sent: Friday, April 10, 2009 9:43 AM >> >> To: java-user@lucene.apache.org >> >> Subject: MultiSearcher query with Sort option >> >> >> >> Hi, >> >> I am using a MultiSearcher to search 2 indexes. As part of my query, I >> >> am sorting the results based on a field (which in NOT_ANALYSED). >> >> However, i seem to be getting hits only from one of the indexes. If I >> >> change to Sort.INDEX_ORDER, I seem to be getting results from both. Is >> >> this a know problem ? >> >> >> >> Thanks, >> >> ~preetham >> >> >> >> - >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> > >> > >> > >> > - >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> > For additional commands, e-mail: java-user-h...@lucene.apache.org >> > >> > >> > >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MultiSearcher query with Sort option
Hi Uwe, Thanks for your response. However, I could not find the API in SortField and Sort to achieve this. SortField can be wrapped inside a Sort, but you cannot specify to reverse the order . Thx, ~preetham Uwe Schindler wrote: It should, do not use Sort.INDEX_ORDER, create a SortField with indexorder and the reverse parameter, the SortField can be warpped inside a Sort instance and voila. I am not sure, if it works, but it should. Same with score. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Preetham Kajekar [mailto:preet...@cisco.com] Sent: Friday, April 10, 2009 11:27 AM To: java-user@lucene.apache.org Subject: Re: MultiSearcher query with Sort option Hi, I just realized it was a bug in my code. On a related note, is it possible to Sort based on reverse index order ? Thanks, ~preetham Uwe Schindler wrote: Hallo Preetham, never heard of this. What Lucene version do you use? To check out, try the search in andifferent way: Combine the two indexes not into a MultiSearcher, instead open an IndexReader for both indexes and combine both readers to a MultiReader. This MultiReader can be used like a conventional single index and searched with IndexSearcher. If the error then disappears, there may be a bug. If not, something with your indexes is wrong. I always recommend to only use MultiSearcher in distributed or parallel search scenarios, never for just combining two indexes. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Preetham Kajekar [mailto:preet...@cisco.com] Sent: Friday, April 10, 2009 9:43 AM To: java-user@lucene.apache.org Subject: MultiSearcher query with Sort option Hi, I am using a MultiSearcher to search 2 indexes. As part of my query, I am sorting the results based on a field (which in NOT_ANALYSED). However, i seem to be getting hits only from one of the indexes. If I change to Sort.INDEX_ORDER, I seem to be getting results from both. Is this a know problem ? Thanks, ~preetham - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MultiSearcher query with Sort option
Hi, I just realized it was a bug in my code. On a related note, is it possible to Sort based on reverse index order ? Thanks, ~preetham Uwe Schindler wrote: Hallo Preetham, never heard of this. What Lucene version do you use? To check out, try the search in andifferent way: Combine the two indexes not into a MultiSearcher, instead open an IndexReader for both indexes and combine both readers to a MultiReader. This MultiReader can be used like a conventional single index and searched with IndexSearcher. If the error then disappears, there may be a bug. If not, something with your indexes is wrong. I always recommend to only use MultiSearcher in distributed or parallel search scenarios, never for just combining two indexes. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Preetham Kajekar [mailto:preet...@cisco.com] Sent: Friday, April 10, 2009 9:43 AM To: java-user@lucene.apache.org Subject: MultiSearcher query with Sort option Hi, I am using a MultiSearcher to search 2 indexes. As part of my query, I am sorting the results based on a field (which in NOT_ANALYSED). However, i seem to be getting hits only from one of the indexes. If I change to Sort.INDEX_ORDER, I seem to be getting results from both. Is this a know problem ? Thanks, ~preetham - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MultiSearcher query with Sort option
Hi, I found the API in another post on the net. new *Sort*(new SortField(null, SortField.DOC, true)) The trick is to set the field to null. Thanks for the help. Preetham Kajekar wrote: Hi Uwe, Thanks for your response. However, I could not find the API in SortField and Sort to achieve this. SortField can be wrapped inside a Sort, but you cannot specify to reverse the order . Thx, ~preetham Uwe Schindler wrote: It should, do not use Sort.INDEX_ORDER, create a SortField with indexorder and the reverse parameter, the SortField can be warpped inside a Sort instance and voila. I am not sure, if it works, but it should. Same with score. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Preetham Kajekar [mailto:preet...@cisco.com] Sent: Friday, April 10, 2009 11:27 AM To: java-user@lucene.apache.org Subject: Re: MultiSearcher query with Sort option Hi, I just realized it was a bug in my code. On a related note, is it possible to Sort based on reverse index order ? Thanks, ~preetham Uwe Schindler wrote: Hallo Preetham, never heard of this. What Lucene version do you use? To check out, try the search in andifferent way: Combine the two indexes not into a MultiSearcher, instead open an IndexReader for both indexes and combine both readers to a MultiReader. This MultiReader can be used like a conventional single index and searched with IndexSearcher. If the error then disappears, there may be a bug. If not, something with your indexes is wrong. I always recommend to only use MultiSearcher in distributed or parallel search scenarios, never for just combining two indexes. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Preetham Kajekar [mailto:preet...@cisco.com] Sent: Friday, April 10, 2009 9:43 AM To: java-user@lucene.apache.org Subject: MultiSearcher query with Sort option Hi, I am using a MultiSearcher to search 2 indexes. As part of my query, I am sorting the results based on a field (which in NOT_ANALYSED). However, i seem to be getting hits only from one of the indexes. If I change to Sort.INDEX_ORDER, I seem to be getting results from both. Is this a know problem ? Thanks, ~preetham - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
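For completeness, a small usage sketch of that Sort; the searcher, query and hit count are placeholders:

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Reverse index order: a DOC-type SortField takes no field name (hence null) and reverse=true.
Sort reverseIndexOrder = new Sort(new SortField(null, SortField.DOC, true));

// Used like any other Sort, e.g. with the MultiSearcher from this thread:
TopDocs hits = multiSearcher.search(query, null, 50, reverseIndexOrder);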
Re: Query any data
I think I would tackle this in a slightly different manner. When you are creating this index, make sure that field has a default value. Make sure this value is something that could never appear in the index otherwise. Then, when you go to place this field into the index, either write out your actual value or the default one.

Then when you get the document back, you can look at that field and answer your question. You can also craft queries that specifically avoid entries that don't have a value in this field with a NOT clause.

Hope this helps,
Matt

Erick Erickson wrote:
> searching for fieldname:* will be *extremely* expensive as it will, by default,
> build a giant OR clause consisting of every term in the field. You'll throw
> MaxClauses exceptions right and left. I'd follow Tim's thread lead first
>
> Best
> Erick
>
> 2009/4/8 王巍巍
>
>> first you should change your queryparser to accept wildcard query by calling
>> method of QueryParser
>> setAllowLeadingWildcard
>> then you can query like this: fieldname:*
>>
>> 2009/4/9 Tim Williams
>>
>>> On Wed, Apr 8, 2009 at 11:45 AM, addman wrote:
>>>> Hi, Is it possible to create a query to search a field for any value? I just
>>>> need to know if the optional field contains any data at all.
>>>
>>> google for: lucene field existence
>>>
>>> There's no way built in, one strategy[1] is to have a 'meta field'
>>> that contains the names of the fields the document contains.
>>>
>>> --tim
>>>
>>> [1] - http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg07703.html
>>>
>> --
>> 王巍巍(Weiwei Wang)
>> Department of Computer Science
>> Gulou Campus of Nanjing University
>> Nanjing, P.R.China, 210093
>>
>> Mobile: 86-13913310569
>> MSN: ww.wang...@gmail.com
>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012
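A sketch of the sentinel-value approach Matt describes; the field name, sentinel string and surrounding variables are invented for illustration:

import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// At index time: always write the optional field, substituting a sentinel value
// that can never occur as real data.
String value = readOptionalValue();   // hypothetical helper; may return null
doc.add(new Field("optionalField",
        value != null ? value : "__EMPTY__",
        Field.Store.YES, Field.Index.NOT_ANALYZED));

// At search time, "has any data" becomes "does not have the sentinel":
BooleanQuery hasData = new BooleanQuery();
hasData.add(originalQuery, Occur.MUST);
hasData.add(new TermQuery(new Term("optionalField", "__EMPTY__")), Occur.MUST_NOT);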
SpellChecker in use with composite query
Hi, I have been playing around with the SpellChecker class and so far it looks really good. While developing a test case to show it working I came across a couple of issues which I have resolved, but I'm not certain whether this is the correct approach. I would therefore be grateful if anyone could tell me whether it is correct or I should try something else.

1) Multiple indexes: I have multiple indexes which store different documents based on certain subject matter. So in order to perform the spellchecking against all indexes I did something like this:

IndexReader spellReader = IndexReader.open(fsDirectory1);
IndexReader spellReader2 = IndexReader.open(fsDirectory2);
MultiReader multiReader = new MultiReader(new IndexReader[] {spellReader, spellReader2});
LuceneDictionary luceneDictionary = new LuceneDictionary(multiReader, "content");
Directory spellDirectory = FSDirectory.getDirectory(
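The message is cut off at that point. For readers following along, the usual continuation of this pattern looks roughly like the sketch below; the spell-index path and the misspelled word are placeholders, and this is not the original poster's code:

// Build one spelling index over the "content" terms of both source indexes.
Directory spellDirectory = FSDirectory.getDirectory("/path/to/spellindex");
SpellChecker spellChecker = new SpellChecker(spellDirectory);
spellChecker.indexDictionary(luceneDictionary);

// Later, at query time:
String[] suggestions = spellChecker.suggestSimilar("lucnee", 5);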
Re: Query any data
2009/4/10 Matthew Hall : > I think I would tackle this in a slightly different manner. > > When you are creating this index, make sure that that field has a > default value. Make sure this value is something that could never appear > in the index otherwise. Then, when you goto place this field into the > index, either write out your actual value, or the default one. > > Then when you get the document back, you can look at that field, and > solve your question. You can also craft queries that specifically avoid > entries that don't have a value in this field with a not clause. I think this is limited by... ... not being able to [easily] add new fields over time... you'd have to reindex all documents (to insert the new magic token) just to add a new field. ... requiring additional manipulation for appendable, updateable fields... when you append new data to a field, you'd have to go in and remove the special token. --tim - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: Wordnet indexing error
Thanks Otis, Yes, we figured that out! Since we do not intend to migrate to 2.4 yet, we used the syns2index source code from svn. The problem is now taken care of.

This part is for all - it brings us to the next questions:
1. Is there some contrib code available for using hypernyms and such, in addition to synonyms from Wordnet?
2. Is there some code to add a user-defined dictionary/ontology as an additional layer to Wordnet (some sort of multi-level)?

Thank you all in advance,
Sincerely,
Sithu

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Thursday, April 09, 2009 1:06 AM
To: java-user@lucene.apache.org
Subject: Re: Wordnet indexing error

Hi,

The simplest thing to do is to grab the latest Lucene and the latest jar for that Wordnet (syns2index) code. That should work for you (that UnIndexed method is an old method that doesn't exist any more).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: "Sudarsan, Sithu D."
> To: java-user@lucene.apache.org
> Sent: Wednesday, April 8, 2009 7:01:16 PM
> Subject: Wordnet indexing error
>
> Hi All,
>
> We're using Lucene 2.3.2 on Windows. When we try to generate the index for
> WordNet 2.0 using the Syns2Index class, the following error is thrown while
> indexing:
>
> java.lang.NoSuchMethodError:
> org.apache.lucene.document.Field.UnIndexed(Ljava/lang/String;Ljava/lang/String;)Lorg/apache/lucene/document/Field;
>
> Our code looks like this:
>
> String[] filelocations = {"path/to/prolog/file", "path/to/index"};
> try {
>   Syns2Index.main(filelocations);
> } catch
>
> The error typically happens at about line number 13 in the wn_s.pl file.
>
> No luck with WordNet 3.0 as well. We get the same error.
>
> Any fix or solutions?
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudar...@fda.hhs.gov
> sdsudar...@ualr.edu
RangeFilter performance problem using MultiReader
Hi, we are experiencing some problems using RangeFilters and we think there are some performance issues caused by MultiReader.

We have more or less 3M documents in 24 indexes and we read all of them using a MultiReader. If we do a search using only terms, there are no problems, but if we add to the same search terms a RangeFilter that extracts a large subset of the documents (e.g. 500K), it takes a lot of time to execute (about 15s).

In order to identify the problem, we have tried to consolidate the index: so now we have the same 3M docs in a single 10GB index. If we repeat the same search using this index, it takes only a small fraction of the previous time (about 2s).

Is there something we can do to improve search performance using RangeFilters with MultiReader, or is the only solution to have a single big index?

Thanks,
Raf
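To make the scenario concrete, the kind of search being described looks roughly like this; the field names, range bounds and the readers array are placeholders, not taken from the actual application:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// readers: an IndexReader[] with one entry per sub-index, opened elsewhere.
IndexReader multi = new MultiReader(readers);
IndexSearcher searcher = new IndexSearcher(multi);

// The term query alone is fast; adding a wide range filter is what gets slow
// on the MultiReader but not on the consolidated index.
TermQuery query = new TermQuery(new Term("type", "invoice"));
RangeFilter filter = new RangeFilter("date", "20080101", "20081231", true, true);
TopDocs hits = searcher.search(query, filter, 10);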
Re: RangeFilter performance problem using MultiReader
Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers. I think the only workaround is to merge your indexes down to a single index. But, Lucene trunk (not yet released) has fixed this, so that searching through your MultiReader should give you the same performance as searching on a single consolidated index -- if you test this (which would be awesome!) please report back and let us know how it went. Mike On Fri, Apr 10, 2009 at 10:38 AM, Raf wrote: > Hi, > we are experiencing some problems using RangeFilters and we think there are > some performance issues caused by MultiReader. > > We have more or less 3M documents in 24 indexes and we read all of them > using a MultiReader. > If we do a search using only terms, there are no problems, but it if we add > to the same search terms a RangeFilter that extracts a large subset of the > documents (e.g. 500K), it takes a lot of time to execute (about 15s). > > In order to identify the problem, we have tried to consolidate the index: so > now we have the same 3M docs in a single 10GB index. > If we repeat the same search using this index, it takes only a small > fraction of the previous time (about 2s). > > Is there something we can do to improve search performance using > RangeFilters with MultiReader or the only solution is to have only a single > big index? > > Thanks, > Raf > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 10:48 AM, Michael McCandless wrote: > Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms > (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers. Do we know why this is, and if it's fixable (the MultiTermEnum, not the higher level query objects)? Is it simply the maintenance of the priority queue, or something else? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Help to determine why an optimized index is proportionaly too big.
Chris Hostetter wrote:
: The second stage index failed an optimization with a disk full exception
: (I had to move it to another lucene machine with a larger disk partition
: to complete the optimization). Is there a reason why a 22 day index would
: be 10x the size of an 8 day index when the document indexing rate is
: fairly constant? Also, is there a way to shrink the index without
: regenerating it?

did you run CheckIndex after it failed to optimize the first time? the failure may have left old temp files around that aren't actually part of the index but are taking up space. (Actually: does CheckIndex warn about unused files in the index directory so people can clean them up? i'm not sure)

It doesn't. But Luke has a function to do this.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: RangeFilter performance problem using MultiReader
Hi Mike, thank you for your answer.

I have downloaded lucene-core-2.9-dev and I have executed my tests (both on the multireader and on the consolidated index) using this new version, but the performance is very similar to the previous one. The big index is 7-8 times faster than the multireader version.

Raf

On Fri, Apr 10, 2009 at 4:48 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms
> (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers.
> I think the only workaround is to merge your indexes down to a single
> index.
>
> But, Lucene trunk (not yet released) has fixed this, so that searching
> through your MultiReader should give you the same performance as
> searching on a single consolidated index -- if you test this (which
> would be awesome!) please report back and let us know how it went.
>
> Mike
Lucene SnowBall unexpected behavior for some terms
Hello, I was working with Lucene Snowball 2.3.2 and I switched to 2.4.0. After the switch I came across a case where Lucene doesn't do the lemmatization correctly. So far I have found only one case: spa/spas. "spas" is not getting lemmatized at all... BTW, I saw the same behavior on Solr 1.3. Does anybody have any idea why?
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 11:03 AM, Yonik Seeley wrote: > On Fri, Apr 10, 2009 at 10:48 AM, Michael McCandless > wrote: >> Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms >> (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers. > > Do we know why this is, and if it's fixable (the MultiTermEnum, not > the higher level query objects)? Is it simply the maintenance of the > priority queue, or something else? We never fully explained it, but we have some ideas... It's only if you iterate each term, and do a TermDocs.seek for each, that Multi*Reader seems to show the problem. Just iterating the terms seems OK (I have a 51 segment index, and I can iterate ~ 10M unique terms in ~8 seconds). But loading FieldCache, or doing eg RangeQuery, also does a MultiTermDocs.seek on each term, which in turn calls SegmentTermDocs.seek for each of the sub-readers in sequence. I *think* maybe for highly unique terms, where typically all segments but one actually have the term, the cost of invoking seek on those segments without the term is high. Really, somehow, we want to only call seek on those segments that have the term, which we know from the pqueue... Mike > -Yonik > http://www.lucidimagination.com > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 1:20 PM, Raf wrote: > Hi Mike, > thank you for your answer. > > I have downloaded lucene-core-2.9-dev and I have executed my tests (both on > multireader and on consolidated index) using this new version, but the > performance are very similar to the previous ones. > The big index is 7/8 times faster than multireader version. Hmmm, interesting! Can you provide more details about your tests? EG the code fragment showing your query, the creation of the MultiReader, how you run the search, etc.? Is the field that you're applying the RangeFilter on highly unique or rather redundant? Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
When I did some profiling I saw that the slowdown came from tons of extra seeks (single segment vs. multi-segment). What was happening was: the first couple of segments would have thousands of terms for the field, but as the segments logarithmically shrank in size, the number of terms per segment would drop dramatically - you basically end up with a long tail, e.g. 5000 4000 200 200 5 5 2. Because loading the FieldCache would enumerate every term, it would end up calling seek 5000 times against each segment - that appeared to be the slowdown for me. We fixed this with LUCENE-1483 because we load the FieldCache per segment, so instead of calling seek 5000 times for each segment, you call 5000 for the first, 4000 for the next, then 200, 200, 5 and 5. That can add up to huge savings due to the long tail of low-term segments.

I had thought we would also see the advantage with multi-term queries - you rewrite against each segment and avoid extra seeks (though not nearly as many as when enumerating every term). As Mike pointed out to me back when though: we still rewrite against the multi-reader and so see no real savings here. Unfortunately.

- Mark
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller wrote:
> I had thought we would also see the advantage with multi-term queries - you
> rewrite against each segment and avoid extra seeks (though not nearly as
> many as when enumerating every term). As Mike pointed out to me back when
> though: we still rewrite against the multi-reader and so see no real
> savings here. Unfortunately.

But, RangeQuery.rewrite is simply enumerating terms, which I think is working "OK". It's enumerating terms, then seeking a sister TermDocs to each term, that tickles the over-seeking problem. FieldCache does that, and RangeFilter on 2.4 does that, but RangeFilter (or RangeQuery with constant score mode) on 2.9 should not (they should do it per segment), which is why I'm baffled that Raf didn't see a speedup on upgrading.

Mike
Re: RangeFilter performance problem using MultiReader
Michael McCandless wrote:
> On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller wrote:
>> I had thought we would also see the advantage with multi-term queries - you
>> rewrite against each segment and avoid extra seeks (though not nearly as
>> many as when enumerating every term). As Mike pointed out to me back when
>> though: we still rewrite against the multi-reader and so see no real
>> savings here. Unfortunately.
>
> But, RangeQuery.rewrite is simply enumerating terms, which I think is
> working "OK". It's enumerating terms, then seeking a sister TermDocs to
> each term, that tickles the over-seeking problem. FieldCache does that,
> and RangeFilter on 2.4 does that, but RangeFilter (or RangeQuery with
> constant score mode) on 2.9 should not (they should do it per segment),
> which is why I'm baffled that Raf didn't see a speedup on upgrading.
>
> Mike

Ah, right - anything utilizing a filter will see the gain. It wouldn't be such a big gain unless there were a *lot* of matching terms though, right? FieldCache is so bad because it's every term. A smaller percentage of terms for a field won't be nearly the problem.

--
- Mark
http://www.lucidimagination.com
Re: RangeFilter performance problem using MultiReader
Michael McCandless wrote:
> which is why I'm baffled that Raf didn't see a speedup on upgrading.
>
> Mike

Another point is that he may not have such a nasty set of segments - Raf says he has 24 indexes, which sounds like he may not have the logarithmic sizing you normally see. If you have a somewhat normal term distribution across all 24 segments, the problem is not exacerbated nearly as much (along with not being so bad, since it's not using all of the terms for the field). 24 segments is bound to be quite a bit slower than an optimized index for most things - and 24 segments of similar size may also be worse than the usual 24 segments with logarithmically dropping sizes.

--
- Mark
http://www.lucidimagination.com
Re: RangeFilter performance problem using MultiReader
Mark Miller wrote: Michael McCandless wrote: which is why I'm baffled that Raf didn't see a speedup on upgrading. Mike Another point is that he may not have such a nasty set of segments - Raf says he has 24 indexes, which sounds like he may not have the logarithmic sizing you normally see. If you have somewhat normal term distribution for all 24 segments, the problem is not exasperated nearly as much (along with not being so bad as its not using all of the terms for the field). Better clarify this: it will still be a problem - you still have all the extra seeks - but they are not as many wasted seeks that we can avoid like the problem with the tailed logarithmic segments. 24 segments is bound to be quite a bit slower than an optimized index for most things - also 24 segments of similar size may also be worse than the normal 24 segments with log dropping size. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
Raf wrote: We have more or less 3M documents in 24 indexes and we read all of them using a MultiReader. Is this a multireader containing multireaders? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 3:14 PM, Mark Miller wrote: > Raf wrote: >> >> We have more or less 3M documents in 24 indexes and we read all of them >> using a MultiReader. >> > > Is this a multireader containing multireaders? Let's hear Raf's answer, but I think likely "yes". But this shouldn't be a problem because we recursively expand down to the segment readers in IndexSearcher.gatherSubReaders. Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 3:11 PM, Mark Miller wrote: > Mark Miller wrote: >> >> Michael McCandless wrote: >>> >>> which is why I'm baffled that Raf didn't see a speedup on >>> upgrading. >>> >>> Mike >>> >> >> Another point is that he may not have such a nasty set of segments - Raf >> says he has 24 indexes, which sounds like he may not have the logarithmic >> sizing you normally see. If you have somewhat normal term distribution for >> all 24 segments, the problem is not exasperated nearly as much (along with >> not being so bad as its not using all of the terms for the field). > > Better clarify this: it will still be a problem - you still have all the > extra seeks - but they are not as many wasted seeks that we can avoid like > the problem with the tailed logarithmic segments. Right, I think "uniqueness" of terms may be the driving factor. So, if segment sizes are all the same (no logarithmic tail), but terms are very unique, you'll still have N-1 SegmentTermEnums trying to seek to a term that they don't have. Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: RangeFilter performance problem using MultiReader
On Fri, Apr 10, 2009 at 3:06 PM, Mark Miller wrote: > 24 segments is bound to be quite a bit slower than an optimized index for > most things I'd be curious just how true this really is (in general)... my guess is the "long tail of tiny segments" gets into the OS's IO cache (as long as the system stays hot) and doesn't actually hurt things much. Has anyone tested this (performance of unoptimized vs optimized indexes, in general) recently? To be a fair comparison, there should be no deletions in the index. Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
exponential boosts
I need to have a scoring model of the form: s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN where "d" is a document, "q" is a query, "sK" is a scoring function, and "aK" is the exponential boost factor for that scoring function. As a simple example, I might have: s1 = TF-IDF score matching "text" field (e.g. a TermQuery) a1 = 1.0 s2 = TF-IDF score matching "author" field (e.g. a TermQuery) a2 = 0.1 s3 = PageRank score (e.g. a FieldScoreQuery) a3 = 0.5 It's important that the "aK" parameters are exponents in the scoring function and not just multipliers because it allows me to do a particular kind of optimized search for the best parameter values. How can I achieve this? My first thought was just that I should set the boost factor for each query, but the boost factor is just a multiplier, right? My second thought was to subclass CustomScoreQuery and override customScore, but as far as I can tell, CustomScoreQuery can only combine a Query with a ValueSourceQuery, while I need to combine a Query with another Query (e.g. the example above with two TermQuery scores). How should I go about this? Thanks in advance, Steve - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: exponential boosts
Perhaps you'd find it easier to implement the equivalent: log(s1(d, q))*a1 + ... + log(sN(d, q))*aN On Fri, Apr 10, 2009 at 12:56 PM, Steven Bethard wrote: > I need to have a scoring model of the form: > >s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN > > where "d" is a document, "q" is a query, "sK" is a scoring function, and > "aK" is the exponential boost factor for that scoring function. As a > simple example, I might have: > >s1 = TF-IDF score matching "text" field (e.g. a TermQuery) >a1 = 1.0 > >s2 = TF-IDF score matching "author" field (e.g. a TermQuery) >a2 = 0.1 > >s3 = PageRank score (e.g. a FieldScoreQuery) >a3 = 0.5 > > It's important that the "aK" parameters are exponents in the scoring > function and not just multipliers because it allows me to do a > particular kind of optimized search for the best parameter values. > > How can I achieve this? My first thought was just that I should set the > boost factor for each query, but the boost factor is just a multiplier, > right? > > My second thought was to subclass CustomScoreQuery and override > customScore, but as far as I can tell, CustomScoreQuery can only combine > a Query with a ValueSourceQuery, while I need to combine a Query with > another Query (e.g. the example above with two TermQuery scores). > > How should I go about this? > > Thanks in advance, > > Steve > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
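A quick, Lucene-free sanity check of that equivalence: for strictly positive scores, ranking by the product of powers and ranking by the weighted sum of logs give the same ordering, so the aK exponents can be applied as plain multipliers in log space. The score values below are made up:

public class LogEquivalence {
    // s[k]: per-function scores for one document; a[k]: the exponents/weights.
    static double product(double[] s, double[] a) {
        double r = 1.0;
        for (int k = 0; k < s.length; k++) r *= Math.pow(s[k], a[k]);
        return r;
    }

    static double logSum(double[] s, double[] a) {
        double r = 0.0;
        for (int k = 0; k < s.length; k++) r += a[k] * Math.log(s[k]);
        return r;
    }

    public static void main(String[] args) {
        double[] a = { 1.0, 0.1, 0.5 };      // made-up exponents
        double[] doc1 = { 2.0, 3.0, 0.7 };   // made-up positive scores
        double[] doc2 = { 1.5, 5.0, 0.9 };
        // Both comparisons print the same boolean: the two formulations agree on order.
        System.out.println(product(doc1, a) > product(doc2, a));
        System.out.println(logSum(doc1, a) > logSum(doc2, a));
    }
}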
RE: RangeFilter performance problem using MultiReader
You got a lot of answers and questions about your index structure. Now another idea; maybe this helps you to speed up your RangeFilter: what type of range do you want to query? From your index statistics, it looks like a numeric/date field on which you filter very large ranges. If the values are very fine-grained and so you hit a lot of terms for the range, you might consider using TrieRangeFilter, which is a new contrib module in the not-yet-released Lucene 2.9:

http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/trie/package-summary.html

The name and API may change before release (if it moves to core), but you can try it out; it is stable and currently runs on production websites! It works for int, long, double, float and Date values (if encoded using Date.getTime() as long).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Raf [mailto:r.ventag...@gmail.com]
> Sent: Friday, April 10, 2009 4:38 PM
> To: java-user@lucene.apache.org
> Subject: RangeFilter performance problem using MultiReader
>
> Hi,
> we are experiencing some problems using RangeFilters and we think there are
> some performance issues caused by MultiReader.
>
> We have more or less 3M documents in 24 indexes and we read all of them
> using a MultiReader.
> If we do a search using only terms, there are no problems, but if we add
> to the same search terms a RangeFilter that extracts a large subset of the
> documents (e.g. 500K), it takes a lot of time to execute (about 15s).
>
> In order to identify the problem, we have tried to consolidate the index: so
> now we have the same 3M docs in a single 10GB index.
> If we repeat the same search using this index, it takes only a small
> fraction of the previous time (about 2s).
>
> Is there something we can do to improve search performance using
> RangeFilters with MultiReader or the only solution is to have only a single
> big index?
>
> Thanks,
> Raf
Re: exponential boosts
On 4/10/2009 1:08 PM, Jack Stahl wrote: > Perhaps you'd find it easier to implement the equivalent: > > log(s1(d, q))*a1 + ... + log(sN(d, q))*aN Yes, that's fine too - that's actually what I'd be optimizing anyway. But how would I do that? If I took the query boost route, how do I get a TermQuery to produce a score in log-space, while keeping the boost in regular space? Or if I took the CustomScoreQuery route, how do I combine two Query scores (not a Query score and a ValueSourceQuery score)? Steve > > On Fri, Apr 10, 2009 at 12:56 PM, Steven Bethard wrote: > >> I need to have a scoring model of the form: >> >>s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN >> >> where "d" is a document, "q" is a query, "sK" is a scoring function, and >> "aK" is the exponential boost factor for that scoring function. As a >> simple example, I might have: >> >>s1 = TF-IDF score matching "text" field (e.g. a TermQuery) >>a1 = 1.0 >> >>s2 = TF-IDF score matching "author" field (e.g. a TermQuery) >>a2 = 0.1 >> >>s3 = PageRank score (e.g. a FieldScoreQuery) >>a3 = 0.5 >> >> It's important that the "aK" parameters are exponents in the scoring >> function and not just multipliers because it allows me to do a >> particular kind of optimized search for the best parameter values. >> >> How can I achieve this? My first thought was just that I should set the >> boost factor for each query, but the boost factor is just a multiplier, >> right? >> >> My second thought was to subclass CustomScoreQuery and override >> customScore, but as far as I can tell, CustomScoreQuery can only combine >> a Query with a ValueSourceQuery, while I need to combine a Query with >> another Query (e.g. the example above with two TermQuery scores). >> >> How should I go about this? >> >> Thanks in advance, >> >> Steve >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Sequential match query
Hello, I have 3 terms and I want to match them in order. I tried a wildcard query, but I am not getting any results back.

Terms: A C F
Doc: name:A B C D E F
Query: name:A*C*F

Any suggestions? Thanks for your help in advance.
Different Analyzer for different fields in the same document
Hello, Is there any way for different fields of a single document to have different analyzers?

I think one way of doing it is to create a custom analyzer which does field-specific analysis. Any other suggestions?
Re: Different Analyzer for different fields in the same document
John Seer wrote: Hello, There is any way that a single document fields can have different analyzers for different fields? I think one way of doing it to create custom analyzer which will do field spastic analyzes.. Any other suggestions? There is PerFieldAnalyzerWrapper http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html Koji - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
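A minimal usage sketch of that class; the field names, analyzer choices and index path are just examples:

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// StandardAnalyzer for every field, except those registered explicitly below.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("id", new KeywordAnalyzer());        // keep IDs as single tokens
analyzer.addAnalyzer("comments", new SimpleAnalyzer());   // different tokenization here

// Pass the wrapper wherever a single Analyzer is expected (IndexWriter, QueryParser, ...).
IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
        analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

The same wrapper should be used at query time so the per-field analysis matches what was indexed.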
Re: exponential boosts
On 4/10/2009 12:56 PM, Steven Bethard wrote: > I need to have a scoring model of the form: > > s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN > > where "d" is a document, "q" is a query, "sK" is a scoring function, and > "aK" is the exponential boost factor for that scoring function. As a > simple example, I might have: > > s1 = TF-IDF score matching "text" field (e.g. a TermQuery) > a1 = 1.0 > > s2 = TF-IDF score matching "author" field (e.g. a TermQuery) > a2 = 0.1 > > s3 = PageRank score (e.g. a FieldScoreQuery) > a3 = 0.5 > > It's important that the "aK" parameters are exponents in the scoring > function and not just multipliers because it allows me to do a > particular kind of optimized search for the best parameter values. > > How can I achieve this? My first thought was just that I should set the > boost factor for each query, but the boost factor is just a multiplier, > right? > > My second thought was to subclass CustomScoreQuery and override > customScore, but as far as I can tell, CustomScoreQuery can only combine > a Query with a ValueSourceQuery, while I need to combine a Query with > another Query (e.g. the example above with two TermQuery scores). My third thought was to create a wrapper class that takes a Query and an exponential boost factor. The wrapper class would delegate to the Query for all methods except .weight(). For .weight(), it would return a Weight wrapper that delegated to the Weight for all methods except .getValue(). For .getValue(), it would return the original value, raised to the appropriate exponent. But will that really work, or am I going to mess up the normalization or something else? Steve - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Different Analyzer for different fields in the same document
Thanks, this is a useful class for the future...

Koji Sekiguchi-2 wrote:
>
> John Seer wrote:
>> Hello,
>> Is there any way for different fields of a single document to have
>> different analyzers?
>>
>> I think one way of doing it is to create a custom analyzer which does
>> field-specific analysis.
>>
>> Any other suggestions?
>
> There is PerFieldAnalyzerWrapper
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
>
> Koji