NPE in IndexReader
Hello, while calling IndexReader.deleteDocument(int) I am getting an NPE:

java.lang.NullPointerException
    at org.apache.lucene.index.IndexReader.acquireWriteLock(IndexReader.java:658)
    at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:686)

In the acquireWriteLock method there is a call to 'segmentInfos.getVersion()', but segmentInfos appears to be null. I am working with the head revision from SVN.

Could someone suggest a workaround?

Regards, Éric Louvard.

--
Best regards, pp. Éric Louvard
HAUK & SASKO Ingenieurgesellschaft mbH, Zettachring 2, D-70567 Stuttgart
Phone: +49 711 725 89-19, Fax: +49 711 725 89-50
E-Mail: [EMAIL PROTECTED], www: www.hauk-sasko.de
Niederlassung Stuttgart; Firmensitz: Markstr. 77, 44801 Bochum
Registergericht: Amtsgericht Bochum, HRB 2532; Geschäftsführer: Dr.-Ing. Pavol Sasko
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
lucene suggest
Hello,

I would like to implement a suggest feature (like Google Suggest) using Lucene. I actually tried it with Lucene and it was successful, but I got stuck at one point: returning a list of results to the user that contains no duplicates. My question is: is there any way to remove the duplicates from the hits returned by the search, or should I handle it manually?

Thanks in advance.

Yours,
Heba
-
Moody friends. Drama queens. Your life? Nope! - their life, your story. Play Sims Stories at Yahoo! Games.
RE: lucene suggest
Hello Heba,

you need a Lucene field that serves as an identifier for the documents you index. Then, when re-indexing a document, you can first use that identifier to delete the old indexed copy. You have to take care of this yourself.

Regards,
Ard
RE: lucene suggest
Hi,

yes, there are ways and workarounds to remove duplicates based on one field. But you should not need them if you don't index duplicates in the first place: just issue a delete against the index right before you add the document to it.

Best Regards,
Kapil Chhabra
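The delete-before-add pattern Kapil describes can be sketched with the Lucene 2.x API roughly as follows (untested; the field names "id" and "subject" are assumptions, and IndexWriter.updateDocument wraps the delete-then-add into one call):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class AddWithoutDuplicates {
    // Re-indexes a document: any older document carrying the same id
    // is deleted before the new one is added, so the index never holds
    // two copies of the same logical document.
    public static void add(IndexWriter writer, String id, String subject)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.TOKENIZED));
        writer.updateDocument(new Term("id", id), doc);
    }
}
```

On versions before updateDocument existed, the same effect needs an explicit writer.deleteDocuments(new Term("id", id)) followed by writer.addDocument(doc).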
RE: lucene suggest
The documents are not duplicated; I mean the hits. Assume two documents have the same subject but different authors: if I search on the subject, the returned hits will contain duplicates. I was asking whether I can remove those duplicates from the hits.

Thanks in advance.

Yours,
Heba
RE: lucene suggest
:S If you have two hits, you have it twice in your index, simple as that. You have two Lucene Documents with the same subject but different authors, and you search on the subject, so obviously you get two hits. Check out Luke; it will help you understand your index.

Furthermore, you probably want to index the document only once, and simply add the same Lucene field twice, once per author.

Ard
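If the duplicate documents have to stay in the index, the post-processing Heba asks about can also be done by hand: walk the hits in score order and keep only the first occurrence of each subject. A minimal, Lucene-free sketch of just that filtering step (iterating the Hits object and reading the subject field is left out):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HitDeduper {
    // Keeps the first (highest-ranked) entry for each subject,
    // preserving the original hit order.
    public static List<String> dedupe(List<String> subjectsInScoreOrder) {
        Set<String> seen = new HashSet<String>();
        List<String> unique = new ArrayList<String>();
        for (String subject : subjectsInScoreOrder) {
            if (seen.add(subject)) {   // add() returns false for duplicates
                unique.add(subject);
            }
        }
        return unique;
    }
}
```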
Re: lucene suggest
On 8/21/07, Heba Farouk <[EMAIL PROTECTED]> wrote:
> the documents are not duplicated, i mean the hits (assume that 2 documents have the same subject but with different authors, so if i'm searching the subject, the returned hits will have duplicates)
> i was asking if i can remove duplicates from the hits??

You may not want to work with documents at all (where you have the duplicates), but rather with the terms in your index directly. Take a look at WildcardTermEnum etc.

Ciao,
Jens
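Jens's point — build suggestions from the term dictionary rather than from documents, so duplicates never appear in the first place — can be sketched without Lucene by letting a sorted set stand in for the term dictionary (in real code a prefix walk with TermEnum or WildcardTermEnum over the index would replace the TreeSet):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixSuggester {
    // The term dictionary stand-in: sorted, and each term appears once,
    // no matter how many documents contain it.
    private final TreeSet<String> terms = new TreeSet<String>();

    public void addTerm(String term) {
        terms.add(term);
    }

    // Returns up to max distinct terms starting with the given prefix.
    public List<String> suggest(String prefix, int max) {
        List<String> out = new ArrayList<String>();
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix) || out.size() >= max) {
                break;   // sorted order: past the prefix range, stop
            }
            out.add(t);
        }
        return out;
    }
}
```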
Multiple Documents sharing a common boost
Is it possible to have multiple documents share a common boost?

An example scenario is as follows. The set of documents is clustered into some set of clusters. Each cluster has a unique clusterId, so each document has a clusterId field that associates it with its cluster. Each cluster also has a property called its cluster score, and each document has to be boosted by its cluster's score. The number of clusters is very small in comparison to the number of documents (around 100 clusters). The cluster score is updated on a continual basis, so it can't be stored as the document boost of each individual document: we would end up updating every document's boost daily, which seems infeasible. We are trying to find a more efficient solution.

Thank you.
Re: Multiple Documents sharing a common boost
One solution is to keep meta-data in your index. Remember that documents do not all have to have the same fields. So you could index a document with a single field, "metadatanotafieldinanyotherdoc", that contains, say, a list of all of your clusters and their boosts. Read this document in at startup time and cache it away in your server. Thereafter, you have a set of boosts that can be applied at query time.

Of course this is useless if you wanted to boost at index time. But I know of no way to change the boost of a document without deleting and re-adding it with the new boost.

Best
Erick
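The query-time half of this suggestion — a cached clusterId-to-score map consulted while ranking — might look like the stripped-down sketch below. The parsing of the metadata document and the Lucene search itself are omitted, and all names here are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class ClusterBooster {
    // Cached at startup from the metadata document; refreshed when the
    // cluster scores change, without touching the per-document index.
    private final Map<String, Float> clusterBoosts = new HashMap<String, Float>();

    public void setBoost(String clusterId, float boost) {
        clusterBoosts.put(clusterId, boost);
    }

    // Multiplies a hit's raw score by its cluster's boost
    // (unknown clusters are left unboosted).
    public float boostedScore(float rawScore, String clusterId) {
        Float boost = clusterBoosts.get(clusterId);
        return boost == null ? rawScore : rawScore * boost.floatValue();
    }
}
```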
Re: Multiple Documents sharing a common boost
Do you mean that we generate a compound query by ANDing the original query with a query like

( (cluster_id=0)^boost_cluster0 OR (cluster_id=1)^boost_cluster1 ... )

But isn't this inefficient, considering that the number of clusters is in the hundreds?
Re: Multiple Documents sharing a common boost
Ahhh, I was assuming you didn't need to look at all clusters. Oops.

That said, the question is really whether this is "good enough" compared to re-indexing, and only some tests will determine that. I was surprised at how quickly a *large* number of ORs was processed by Lucene.

You could also think about implementing a HitCollector that boosts the raw score of each document based upon its cluster ID, but be careful not to read the full document in the HitCollector (you shouldn't have to, though: either build a map up front or get creative with filters).

You might find useful information by searching the mail archive for "faceting", as this seems like a similar topic. But I wouldn't go anywhere near anything custom until and unless I'd satisfied myself that the simple approach of letting Lucene handle a large set of OR clauses wasn't performant. Several very bright people have put significant effort into performance; I'd see if they've already done the hard part.

Erick
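For reference, the boosted-OR query discussed in this thread could be assembled with the Lucene 2.x API roughly as follows (a sketch, not tested; the field name "clusterId" is an assumption):

```java
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ClusterBoostQuery {
    // Wraps the user's query so that every hit also matches exactly one
    // clusterId clause, whose boost scales the hit's score.
    public static Query wrap(Query userQuery, Map<String, Float> clusterBoosts) {
        BooleanQuery byCluster = new BooleanQuery();
        for (Map.Entry<String, Float> e : clusterBoosts.entrySet()) {
            TermQuery tq = new TermQuery(new Term("clusterId", e.getKey()));
            tq.setBoost(e.getValue().floatValue());
            byCluster.add(tq, BooleanClause.Occur.SHOULD);
        }
        BooleanQuery combined = new BooleanQuery();
        combined.add(userQuery, BooleanClause.Occur.MUST);
        combined.add(byCluster, BooleanClause.Occur.MUST);
        return combined;
    }
}
```

With around 100 clusters this adds about 100 SHOULD clauses, which is well under BooleanQuery's default clause limit; whether it is fast enough is exactly the measurement Erick suggests making.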
benefit of combining fields into one vs. booleanQuery
Hi everyone,

My question: I have media items with a "title" field, a "caption" field and a "keywords" field, and I want to be able to search all three fields at the same time. For example, if I search for "black car", the boolean query looks like this combination of term queries:

(title=black OR keywords=black OR caption=black) AND (title=car OR keywords=car OR caption=car)

So if "black" is in the caption and "car" is in the title, I must still find the media item.

I'm afraid that those boolean queries will be slow when there are many terms in the query. I could, at index time, add a "fulltext" field to each media item that contains the title, caption and keywords concatenated; the query then becomes (fulltext=black AND fulltext=car), which is much simpler. But I must still be able to search only in title, only in caption or only in keywords, so I must keep the other fields as well, doubling the indexed terms.

Has someone done a similar thing? Is it worth it, or will the first boolean query remain fast enough?

Thx,

Antoine

--
Antoine Baudoux
Development Manager
[EMAIL PROTECTED]
Tel.: +32 2 333 58 44
GSM: +32 499 534 538
Fax: +32 2 648 16 53
Re: benefit of combining fields into one vs. booleanQuery
A three-field boolean query isn't very complex, so I don't think that's a problem, although it does depend a bit upon how many terms you allow. But I'd try the simplest thing first, which would be to put all the terms in a fulltext field as well as in the individual fields. Then get some performance measurements and some idea of the total size of the index, and make your decisions at that point. It's actually remarkably easy to switch from one of those solutions to the other if performance isn't what you need.

In general, I've had better luck not worrying about space and going for simple code, *then* changing things around if there's a problem.

Best
Erick
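The catch-all-field variant recommended above could be indexed like this (a Lucene 2.x sketch, untested; the "fulltext" field simply repeats the other fields' text and need not be stored):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MediaDocumentFactory {
    public static Document build(String title, String caption, String keywords) {
        Document doc = new Document();
        // The individual fields keep per-field search working.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("caption", caption, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("keywords", keywords, Field.Store.YES, Field.Index.TOKENIZED));
        // One concatenated, unstored field turns the cross-field search
        // into a simple single-field query: fulltext:black AND fulltext:car.
        doc.add(new Field("fulltext", title + " " + caption + " " + keywords,
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}
```

The trade-off is exactly the one Antoine describes: roughly double the indexed terms in exchange for simpler, flatter queries.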
Re: NPE in IndexReader
Eric Louvard wrote:
> Hello while calling IndexReader.deleteDocument(int) I am getting an NPE.
> [stack trace and details snipped]

Hi Eric,

could you please provide more information about how exactly you create the IndexReader? A unit test that hits this exception would be even better!

Thanks,
- Michael
Re: NPE in IndexReader
Hi, Eric,

I think I have the same problem. I found that in the latest MultiReader.java, the SegmentInfos is set to null:

public MultiReader(IndexReader[] subReaders) throws IOException {
  super(subReaders.length == 0 ? null : subReaders[0].directory(),
        null, false, subReaders);
}

However, segmentInfos is used in several places, causing NPEs. For example, in IndexReader.acquireWriteLock():

if (SegmentInfos.readCurrentVersion(directory) > segmentInfos.getVersion()) {

So I think MultiReader.java needs some adjustments.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
Re: NPE in IndexReader
: I found out in latest MultiReader.java, the "SegmentInfos" is set to null.
: However, segmentInfos are used in several places, causing NPEs.
: For example, in IndexReader.acquireWriteLock(),

MultiReader was refactored into two classes: MultiReader, which is now only constructed from other readers, and MultiSegmentReader, which is what IndexReader.open returns when a directory contains multiple segments. segmentInfos shouldn't be needed in the first case -- and doesn't make much sense there at all.

-Hoss
Re: NPE in IndexReader
Right now I am very confused. I agree segmentInfos is not needed in this case, but it's used in acquireWriteLock(). What should we do?

--
Chris Lu
Re: NPE in IndexReader
: I agree segmentInfos are not needed in this case. But it's used in
: acquireWriteLock(). What should we do?

This is one of the reasons why I was suggesting in a java-dev thread that *all* of the references to SegmentInfos be refactored out of IndexReader and into the subclasses -- any attempt to access the SegmentInfos in a MultiReader constructed from an IndexReader[] doesn't make sense, and it never has. In past releases, any operation in a MultiReader constructed from subReaders that attempted to use SegmentInfos (like acquireWriteLock) would never work properly (at best it would lock the first subReader).

...hmm, looking at this more and getting more confused... I was going to say that this was all a red herring and couldn't cause the NPE, since the only time acquireWriteLock is (or ever has been) called is when the IndexReader knows it owns the directory. But a quick skim of MultiReader on the trunk, trying to find where it tells the superclass that it doesn't own the directory, made me realize that it's not there: MultiReader extends MultiSegmentReader and passes information up the chain of super constructors about not closing the Directory on close, but no information is passed up about not owning the directory.

Skimming 2.2, I don't see how this ever worked "correctly". It wouldn't NPE, but it looks like any write-locking operation on such a MultiReader would always acquire the write lock on the *first* sub-reader, then delegate to the correct sub-reader. It appears that this is yet another flaw in the old implementation exposed by the refactoring done so far (and further motivation for continued refactoring until IndexReader is a glorified abstract class with no meat in it).

(Can someone sanity-check my assessment? I'm a little dizzy from flipping back and forth between various versions of IndexReader, MultiReader and MultiSegmentReader.)

-Hoss