Re: Can I do boosting based on term positions?
Cedric,

SpanFirstQuery could be a solution without payloads. You may want to give it your own Similarity.sloppyFreq().

Regards,
Paul Elschot

On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> Thanks for the quick response =)
>
> On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > Yes, it is easily doable through the "Payload" facility. During the
> > indexing process (mainly tokenization), you need to push this extra
> > information into each token. Then you can use BoostingTermQuery to
> > include the payload value in the score. You also need to implement
> > Similarity for this (mainly the scorePayload method).
>
> If I store, say, a custom boost factor as a payload, does that mean I
> will store one more byte per term per document in the index file? So
> the index file would be much larger?
>
> > Another way is to extend SpanTermQuery, which already calculates the
> > position of a match. You just need to use this position value in the
> > score calculation.
>
> I see that SpanTermQuery takes a TermPositions from the IndexReader,
> and I can get the term position from there. However, I am not sure how
> to incorporate it into the score calculation. Would you mind giving a
> little more detail on this?
>
> > One possible advantage of the SpanTermQuery approach is that you can
> > play around without re-creating indices every time.
> >
> > Thanks,
> > Shailendra Sharma,
> > CTO, Ver se' Innovation Pvt. Ltd.
> > Bangalore, India
> >
> > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > > Hi all,
> > >
> > > I was wondering if it is possible to do boosting by the search
> > > terms' position in the document. For example: search terms that
> > > appear in the first 100 words, the first 10% of words, or the first
> > > two paragraphs would be given a higher score.
> > >
> > > Is it achievable using the new Payload function in Lucene 2.2?
> > > Or are there any easier ways to achieve this?
> > >
> > > Regards,
> > > Cedric
>
> Thanks,
> Cedric
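To make that concrete, a custom Similarity for the SpanFirstQuery route could look roughly like the sketch below (Lucene 2.x API). The decay formula is an illustrative assumption, and exactly what gets passed as the distance depends on the span scorer, so treat this as a starting point rather than Paul's exact recipe:

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: reward tighter/earlier span matches by steepening sloppyFreq().
    // The default implementation returns 1/(distance+1); this variant decays
    // faster, so matches with a small distance dominate the score.
    public class EarlyMatchSimilarity extends DefaultSimilarity {
        public float sloppyFreq(int distance) {
            return 1.0f / ((float) distance * distance + 1.0f);
        }
    }

It would be hooked up with searcher.setSimilarity(new EarlyMatchSimilarity()) before searching a SpanFirstQuery such as new SpanFirstQuery(new SpanTermQuery(new Term("body", "lucene")), 100), which only matches spans ending within the first 100 positions.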
RE: IndexReader deletes more than expected
> If I'm reading this correctly, there's something a little wonky here. In
> your example code, you close the IndexWriter and then, without creating
> a new IndexWriter, you call addDocument again. This shouldn't be
> possible (what version of Lucene are you using?)

Yes, you are correct: I close the IndexWriter and then add more docs. What's wrong with that? It worked out fine, and the docs I add appear to NEW INSTANCES OF INDEX SEARCHERS after calling close on the IndexWriter. As for creating a new IndexWriter, I tried to, but I got a lock exception even though I was closing the IndexWriter instance before creating a new one. I don't know why! Furthermore, this is useless for a multithreaded app, because you can't know who is still writing to your index and who has closed his IndexWriter. Even checking whether the index is locked beforehand adds unnecessary overhead, which can be avoided, since it works for me and I can write with one single instance of IndexWriter.

> Assuming for the time being that you are creating the IndexWriter again,
> the other issue here is that you shouldn't be able to have a reader and
> a writer changing an index at the same time. There should be a lock
> failure. This should occur either in the Index

Well, I think I don't get the problems you expect because I use the Lucene version that is shipped with the Compass distribution (www.compassframework.org). In short, Compass is to Lucene what an ORM like Hibernate is to a DBMS like Oracle. It really works fine, but I couldn't understand why Compass hides the deleteDocuments(Term) method on the IndexWriter class (http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html#deleteDocuments(org.apache.lucene.index.Term)). This is why I used delete on a reader rather than on the same writer instance, the only one I have. I couldn't manage my index in one particular situation using Compass, because I had to store data not in the usual way (every row in a table being a record). So I think I have to ask the Compass team about that. Anyway, if you or the others have comments, please do share them.

> Might you be creating your IndexWriters (which you don't show) with the
> create flag always set to true? That will wipe your index each time,
> ignoring the locks and cause all sorts of weird results.

No, I don't create a new instance of IndexWriter. The only one I create is in the service constructor, so I create a new clean (empty) index only when the program starts up:

public LuceneServiceSHImp(String indexDirectory) throws IOException {
    this.indexDirectory = indexDirectory;
    standardAnalyzer = new StandardAnalyzer();
    indexWriter = new IndexWriter(new java.io.File(indexDirectory), standardAnalyzer, true);
    indexWriter.close();
}

> -Original Message-
> From: Ridwan Habbal [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 01, 2007 8:48 AM
> To: java-user@lucene.apache.org
> Subject: IndexReader deletes more than expected
>
> Hi, I got unexpected behavior while testing Lucene. To shortly address
> the problem: using IndexWriter, I add docs with a field named ID in
> consecutive order (1, 2, 3, 4, etc.), then close that index. I get a new
> IndexReader and call IndexReader.deleteDocuments(Term). The term is
> simply new Term("ID", "1"), and then I call close on the IndexReader.
> Things work out fine. But things go wrong if I add docs using the
> IndexWriter, close the writer, then create a new IndexReader to delete
> one of the docs already inserted, without closing that reader: while the
> IndexReader that performs the deletion is still not closed, I add more
> docs and commit the IndexWriter, and when I search I get all docs added
> in both phases (before and after calling deleteDocuments() on the
> IndexReader, because I haven't closed the IndexReader, although I have
> closed the IndexWriter). When I then close the IndexReader and query the
> index, it has deleted all docs added between opening and closing the
> reader, in addition to the doc specified in the Term (in this test case
> ID=1). I know I can avoid this by closing the IndexReader directly after
> deleting docs, but what about running this in a multithreaded app like a
> web application? Here is the code:
>
> IndexSearcher indexSearcher = new IndexSearcher(this.indexDirectory);
> Hits hitsB4InsertAndClose = null;
> hitsB4InsertAndClose = getAllAsHits(indexSearcher);
> int beforeInsertAndClose = hitsB4InsertAndClose.length();
> indexWriter.addDocument(getNewElement());
> indexWriter.addDocument(getNewElement());
> indexWriter.addDocument(getNewElement());
> indexWriter.close();
> IndexSearcher indexSearcherDel = new IndexSearcher(this.indexDirectory);
> indexSearcherDel.getIndexReader().deleteDocuments(new Term("ID",
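For anyone following the thread, the lock-safe sequencing being discussed looks roughly like this (Lucene 2.x API; variable names follow the code above and are otherwise placeholders). The point is that only one object should have the index open for modification at a time:

    // Sketch: keep reader-based deletes and writer-based adds strictly sequential.
    IndexReader reader = IndexReader.open(indexDirectory);
    reader.deleteDocuments(new Term("ID", "1"));
    reader.close();   // commits the deletes and releases the write lock

    // create=false so the existing index is opened rather than wiped
    IndexWriter writer = new IndexWriter(new java.io.File(indexDirectory),
                                         new StandardAnalyzer(), false);
    writer.addDocument(getNewElement());
    writer.close();   // a freshly opened IndexSearcher now sees adds and deletes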
Getting only the Ids, not the whole documents.
Hi all,

Can I get just a list of document IDs given a search criterion? To elaborate, here is my situation: I store 2 contracts in a file system index, each with some parameter name and value. Given a search criterion (paramValue='draft'), I need to get just an ArrayList of Strings containing the contract IDs. I don't need the Lucene documents, just the IDs.

Can this be done?

-thanks
RE: Getting only the Ids, not the whole documents.
What is the structure of your index? If you haven't already, add a new field to your index that stores the contractId. For all other fields, set the "store" flag to false while indexing. You can then safely retrieve the value of this contractId field from your search results.

Regards,
kapilChhabra
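A minimal sketch of that retrieval (Lucene 2.x Hits API, assuming an open IndexSearcher named searcher; the query and field names mirror the thread and are otherwise assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.TermQuery;

    // Sketch: pull only the stored contractId field out of each hit.
    Hits hits = searcher.search(new TermQuery(new Term("paramValue", "draft")));
    List ids = new ArrayList();
    for (int i = 0; i < hits.length(); i++) {
        ids.add(hits.doc(i).get("contractId"));  // null if the field is not stored
    }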
Re: Getting only the Ids, not the whole documents.
You should not store them in an array structure, since that will take up memory; a BitSet is the best structure to store them in.

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/
Re: Getting only the Ids, not the whole documents.
Yes, it decreases performance, but it's the only solution. I've spent many weeks looking for the best way to retrieve my own IDs and found this one last. Now I am storing the IDs in a BitSet structure, and it's fast enough:

public void collect(int id, float score) {
    idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID")));
}

Regards,
Mohammad
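Assembled into a complete collector, the idea above looks roughly like this (Lucene 2.x HitCollector API; the field name MyOwnID comes from the snippet above). Note that calling searcher.doc() inside collect() loads the stored document for every single hit, which is exactly the per-hit cost being discussed:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;

    // Sketch: gather application-level integer IDs into a BitSet during collection.
    public class IdBitSetCollector extends HitCollector {
        private final IndexSearcher searcher;
        private final BitSet idBitSet = new BitSet();

        public IdBitSetCollector(IndexSearcher searcher) {
            this.searcher = searcher;
        }

        public void collect(int doc, float score) {
            try {
                // Loads the stored fields for each hit: convenient but expensive.
                idBitSet.set(Integer.parseInt(searcher.doc(doc).get("MyOwnID")));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public BitSet getIds() {
            return idBitSet;
        }
    }

It would be run as searcher.search(query, new IdBitSetCollector(searcher)).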
RE: Getting only the Ids, not the whole documents.
Here's my index structure:

Document -> contractID  - id    (index AND store)
         -> paramName   - name  (index AND store)
         -> paramValue  - value (index BUT NOT store)

When I get back 2 hits, each document contains the ID and paramName. I have no interest in paramName (but I have to STORE it for some other reason). Can I not just get a plain Java String array of the contract IDs that matched?

-thanks for the prompt reply.
Re: Getting only the Ids, not the whole documents.
Hi,

The solution you suggested will definitely work, but it will also definitely slow down my search by an order of magnitude. The problem I am trying to solve is performance; that's why I need the collection of IDs and not the whole documents.

- thanks for the prompt reply.
Re: Getting only the Ids, not the whole documents.
Yes. If you extend HitCollector and override the collect() method with the following signature, you can get the IDs:

public void collect(int id, float score)

Regards,
Mohammad
Do AND + OR Search in Lucene
Hey Guys,

Quick question: I do this in my code for searching:

queryParser.setDefaultOperator(QueryParser.Operator.AND);

Lucene is OR by default, so I change it to AND for my requirements. Now I have a requirement to do OR as well: while doing AND, I'd like to include results from OR too, but ranked much lower than the AND results.

Is there a way to do this?

thanks,
AZ
Re: Do AND + OR Search in Lucene
You can create two queries from two query parsers, one with AND and the other with OR. After you create both of them, call setBoost() to give them different boost levels, then join them together in a BooleanQuery with BooleanClause.Occur.SHOULD. That should do the trick.
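A sketch of that combination (Lucene 2.x API; the field name, analyzer, and boost value are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    // Sketch: AND matches rank first, OR-only matches trail with a low boost.
    public Query buildAndOrQuery(String userInput) throws ParseException {
        QueryParser andParser = new QueryParser("content", new StandardAnalyzer());
        andParser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query andQuery = andParser.parse(userInput);

        QueryParser orParser = new QueryParser("content", new StandardAnalyzer());
        orParser.setDefaultOperator(QueryParser.OR_OPERATOR);
        Query orQuery = orParser.parse(userInput);
        orQuery.setBoost(0.1f);  // demote OR-only matches; tune to taste

        BooleanQuery combined = new BooleanQuery();
        combined.add(andQuery, BooleanClause.Occur.SHOULD);
        combined.add(orQuery, BooleanClause.Occur.SHOULD);
        return combined;
    }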
RE: High CPU usage during index and search
20,000 queries continuously? That sounds like a bit too much. Can you elaborate on what you need to do? You probably won't need that many queries.

Chew Yee Chuang wrote:
>
> Hi,
>
> Thanks for the links provided. Actually, I'd gone through those articles
> when I was developing the index and search functions for my application.
> I haven't tried a profiler yet, but I monitor the CPU usage and notice
> that whenever indexing or searching is performed, the CPU usage rises to
> 100%. Below I will try to elaborate more on what my application is doing
> and how I index and search.
>
> There are many concurrent processes running. First, the application
> writes the records it receives into a text file, with tabs separating
> the fields. The application points to a new file every 10 minutes and
> starts writing to it, so every file contains only 10 minutes of records,
> approximately 600,000 records per file. The indexing process then checks
> whether there is a text file to be indexed; if there is, the thread
> wakes up and starts indexing.
>
> The indexing process first adds documents to a RAMDir, then adds the
> RAMDir into an FSDir by calling addIndexesNoOptimize() when there are
> 100,000 documents (32 fields per doc) in the RAMDir. Only one
> IndexWriter (FSDir) is created, but several IndexWriters (RAMDir) are
> created during the whole process. Below is the configuration for the
> IndexWriters I mentioned:
>
> IndexWriter (RAMDir)
> - SimpleAnalyzer
> - setMaxBufferedDocs(1)
> - Field.Store.YES
> - Field.Index.NO_NORMS
>
> IndexWriter (FSDir)
> - SimpleAnalyzer
> - setMergeFactor(20)
> - addIndexesNoOptimize()
>
> As for searching, there are many queries (20,000) run continuously to
> generate the aggregate table for reporting purposes. All these queries
> run in a nested loop, and only one Searcher is created. I tried a
> searcher and a filter as well; the filter gave me better results, but
> both utilize a lot of CPU resources.
>
> Hope this info helps, and sorry for my bad English.
>
> Thanks
> eChuang, Chew
>
> -Original Message-
> From: karl wettin [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, July 31, 2007 5:54 PM
> To: java-user@lucene.apache.org
> Subject: Re: High CPU usage during index and search
>
> 31 jul 2007 kl. 05.25 skrev Chew Yee Chuang:
> > But I just noticed that when Lucene performs a search or indexes, the
> > CPU usage on my machine rises to 100%. Because of this issue, some of
> > my other backend processes eventually slow down. Just want to know:
> > has anyone faced this problem before? And are there any ideas on how
> > to overcome it?
>
> Did you run a profiler to see what it is that consumes all the
> resources? It is very hard to guess based on the information you
> supplied. Start here:
>
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>
> --
> karl
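For readers following along, the RAMDir-into-FSDir batching described above corresponds roughly to this sketch (Lucene 2.2-era API; the path, analyzer, and batch size are illustrative):

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Sketch: buffer a batch of documents in RAM, then merge into the disk index.
    Directory fsDir = FSDirectory.getDirectory("/path/to/index");
    IndexWriter fsWriter = new IndexWriter(fsDir, new SimpleAnalyzer(), false);
    fsWriter.setMergeFactor(20);

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
    // ... ramWriter.addDocument(doc) for each record, ~100,000 per batch ...
    ramWriter.close();

    fsWriter.addIndexesNoOptimize(new Directory[] { ramDir });  // merge without optimizing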
Re: extracting non-english text from word, pdf, etc....??
If you can already extract a token stream from those files, you can simply use different analyzers to analyze that token stream appropriately. Check out the Lucene-contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/

heybluez wrote:
>
> I know how to do English text with POI and PDFBox and so on. Now I want
> to start indexing non-English languages such as French and Spanish.
> Which extraction libs are available to me?
>
> I want to do:
>
> Excel
> Word
> PowerPoint
> PDF
> HTML
> RTF
>
> Thanks!
> Michael
Re: LUCENE-843 Release
Mike, as a committer, what do you think?

Thanks!

Peter Keegan wrote:
>
> I've built a production index with this patch and done some query stress
> testing with no problems. I'd give it a thumbs up.
>
> Peter
>
> On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
> >
> > Hi guys,
> >
> > Do you think LUCENE-843 is stable enough? If so, do you think it's
> > worth releasing as LUCENE 2.2.1? It would be nice, so that people can
> > take advantage of it right away without risking other breaking changes
> > in the HEAD branch or waiting until the 2.3 release.
> >
> > Thanks,
Re: Getting only the Ids, not the whole documents.
Hi,

Why don't you consider using a FieldSelector? LoadFirstFieldSelector can help you load only the first field of a document, without loading all the other fields. After that, you can keep the whole document if you like. It should improve performance.
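A sketch of the FieldSelector idea (Lucene 2.x API, assuming an open IndexReader named reader and a hit's doc number docId). LoadFirstFieldSelector loads whichever stored field comes first, so it only helps if the ID field is the first one added to the document:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.LoadFirstFieldSelector;

    // Sketch: load a single field per matched doc instead of the whole document.
    FieldSelector firstFieldOnly = new LoadFirstFieldSelector();
    Document doc = reader.document(docId, firstFieldOnly);
    String contractId = doc.get("contractId");  // assumes contractId is the first stored field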
Re: Using Nutch APIs in Lucene
Just use Nutch. If you look at the Crawl.java class in Nutch, you can pretty easily tear out the appropriate pieces. The question is, do you really need all of that? If so, why not just use Nutch?

-Grant

On Aug 2, 2007, at 2:32 AM, Srinivasarao Vundavalli wrote:

> How can we use Nutch APIs in Lucene? For example, using FetchedSegments
> we can get ParseText, from which we can get the content of the document.
> So can we use these classes (FetchedSegments, ParseText) in Lucene?
> If so, how do we use them? Thank you

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: LUCENE-843 Release
Honestly, I don't really think this is a good idea.

While LUCENE-843 has proven stable so far (knock on wood!), it is still a major change, and I do worry (less with time :) that maybe I broke something subtle somewhere.

While a few brave people have tested the trunk in their production worlds and seen good performance gains, that testing is still limited compared to a real release.

A point release (2.2.1) really is not supposed to contain major changes, just bug fixes, so I don't think we should violate that accepted practice.

I would rather see us finish up 2.3 and release it, and going forward do more frequent releases, instead of porting big changes back onto point releases.

Mike
Re: Solr's NumberUtils doesn't work
How did you encode your integer into a String? Did you use int2sortableStr?

is_maximum wrote:
>
> Hi,
> I am using NumberUtils to encode and decode numbers while indexing and
> searching. When I go to decode a number retrieved from the index, it
> throws an exception for some fields. The exception message is:
>
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
>     at java.lang.String.charAt(Unknown Source)
>     at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:125)
>     at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:37)
>     at com.payvand.lucene.util.ExtendedNumberUtils.decodeInteger(ExtendedNumberUtils.java:123)
>
> I don't know why this happens. I am wondering if it has something to do
> with character encoding. Have you had such a problem?
>
> thanks
>
> --
> Regards,
> Mohammad Norouzi
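For comparison, a round trip through NumberUtils looks like this sketch (Solr 1.x utility API; exact signatures may vary by Solr version). A stored value that was not produced by int2sortableStr, or that was mangled by an analyzer at index time, ends up shorter than the decoder expects and fails with exactly this kind of StringIndexOutOfBoundsException:

    import org.apache.solr.util.NumberUtils;

    // Sketch: encode an int to its sortable string form and decode it back.
    String encoded = NumberUtils.int2sortableStr(42);
    int decoded = NumberUtils.SortableStr2int(encoded, 0, encoded.length());
    // decoded == 42. Indexing the encoded value through a tokenizing analyzer
    // can alter or truncate it, after which SortableStr2int reads past the end.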
Re: LUCENE-843 Release
Thanks! Will look forward to 2.3 then.
RE: IndexReader deletes more than expected
Yes, you are right, thanks for the great reply! I skimmed it too quickly earlier, so I re-read it now and got the point you mean. I just tried Lucene 2.2.0 (I was using 2.0.0), and I could add, delete, and update docs smoothly! Based on the tests I've done so far, similar to the tests I presented in my first email, I don't have to worry about who added and who deleted, and I can get rid of the synchronized Java methods that led to slow app performance. I keep only one open instance of IndexWriter for the whole app. As I stated before, I suffered from lock exceptions; now I use flush() instead of close(). In contrast, I create a new IndexSearcher instance every time I search; I dislike opening, closing, and then reopening the index searcher over and over. I don't use IndexReader directly anymore, since I only use it indirectly through IndexSearcher. I won't try IndexModifier, since you told me that IndexWriter in 2.2.0 is much better. Do you think I'm doing well using IndexWriter this way (one instance for the whole app)?

One thing is still pending, though I need the Compass guys for it: whether they ship the new version of Lucene or not yet. I will check with them anyway; I can't have two different versions of jars for the same classes in the same package. Final question: I still haven't looked at Solr in detail, but is it strongly recommended when I have webapps? Please write back!

cya,
Rid

> Date: Wed, 1 Aug 2007 13:14:04 -0400
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: IndexReader deletes more than expected
>
> On 8/1/07, Ridwan Habbal <[EMAIL PROTECTED]> wrote:
> > but what about running it in a multithreaded app like a web application?
>
> If you are targeting a multithreaded webapp, then I strongly suggest you
> look into using either Solr or the LuceneIndexAccessor code. You will
> want to use some form of reference counting to manage your Readers and
> Writers.
>
> Also, you can now use IndexWriter (Lucene 2.0 and greater, I think) to
> delete. This allows for efficient mixing of deletes and adds by
> buffering the deletes and then opening an IndexReader to commit them
> later. This is much more efficient than IndexModifier.
>
> - Mark
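The setup described above, sketched minimally (Lucene 2.2-era API; the class and method names are placeholders, synchronization and error handling are omitted, and the exact commit call, flush() versus close()-and-reopen, varies by version):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;

    // Sketch: one long-lived IndexWriter for all writes,
    // one fresh IndexSearcher per search.
    public class LuceneService {
        private final String indexDirectory;
        private final IndexWriter writer;

        public LuceneService(String indexDirectory) throws IOException {
            this.indexDirectory = indexDirectory;
            // create=true wipes the index, so only appropriate at first startup
            this.writer = new IndexWriter(new File(indexDirectory),
                                          new StandardAnalyzer(), true);
        }

        public void update(Term idTerm, Document doc) throws IOException {
            writer.deleteDocuments(idTerm);  // buffered delete, no IndexReader needed
            writer.addDocument(doc);
            writer.flush();                  // make changes visible to new searchers
        }

        public IndexSearcher newSearcher() throws IOException {
            return new IndexSearcher(indexDirectory);
        }
    }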
Re: extracting non-english text from word, pdf, etc....??
Yea, I have seen those. I guess the question is: what do you all use to extract text from Word, Excel, PPT, and PDF? Can I use POI, PDFBox, and so on? This is what I use now to extract English.

Thanks,
Michael
Re: extracting non-english text from word, pdf, etc....??
Check out
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e
Re: Do AND + OR Search in Lucene
Alternatively, construct a parenthesized query that reflects what you want. If you do, make sure that OR is capitalized, or make REAL SURE you understand the Lucene query syntax and construct your query within that syntax.

Erick
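For example, with QueryParser's default OR operator, a raw query string along these lines does the whole job in one parse (the terms and boost value are placeholders):

    (+lucene +search) (lucene search)^0.1

The first clause matches only documents containing both terms; the second lets either term match but contributes little to the score, so OR-only hits trail the AND hits.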
Re: extracting non-english text from word, pdf, etc....??
Hey Michael,

Have you given it a try? I would think they would work, but I haven't actually done it. Set up a small test that reads in a PDF in French or Spanish and give it a try. You might have to worry about encodings or something, but the structure of the files should be the same, i.e. they are still valid Word, etc. documents.

-Grant
Re: extracting non-english text from word, pdf, etc....??
In terms of PDF documents: PDFBox should work just fine with any Latin-based languages; at this time, certain PDFs that have CJK characters can pose some issues. In general, English/French/Spanish should be fine. Some PDFs use custom encodings that make it impossible to extract text, and it comes out as gibberish. As a simple test: if Acrobat can extract the text, then PDFBox should be able to as well.

Ben
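For anyone wiring that up, extraction with PDFBox is roughly the following sketch (pre-Apache PDFBox 0.7.x package names; the file path is a placeholder):

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    // Sketch: pull the text out of a PDF; the result is a Java String, so
    // accented Latin characters survive if the PDF's encoding is extractable.
    PDDocument pdf = PDDocument.load("document.pdf");
    try {
        String text = new PDFTextStripper().getText(pdf);
        // feed `text` to the appropriate language Analyzer at index time
    } finally {
        pdf.close();
    }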
Re: Clustered Indexing on common network filesystem
Hi,

It's been a couple of days now and I haven't heard anything on this topic, while there has been substantial list traffic otherwise. Am I asking in the wrong place? Was I unclear? I know there are people out there that have used, or are using, Lucene in a clustered environment. I am just looking for any sort of feedback (general or specific) about clustering Lucene, as well as filesystem compatibility (Windows shares, NFS, etc.).

Thanks again,
-Zach

Zach Bailey wrote:
> Hello all,
>
> First, a little background: we are developing a clustered application
> that will in part leverage Lucene to provide index and search
> capabilities. We have already spent time investigating various index
> storage implementations (database vs. filesystem), and we've decided for
> performance reasons to go with a filesystem index storage scheme.
>
> That said, I have read back through the archives a bit and noticed that
> support for index storage on NFS is still experimental (e.g. the latest
> bugfixes have not made it out to an official, stable release). I realize
> most of the issues related to using a shared filesystem revolve around
> locking, and I haven't seen much about the maturity of locking on other
> network filesystems.
>
> I was wondering if anyone has tried any other networked filesystems or
> has any recommendations. We have clients who would be doing this on both
> Windows and Unix/Linux, so any insight there would be appreciated as
> well. It can be assumed that across any cluster the operating system
> would be homogeneous (i.e. all nodes are on Windows and would use
> Windows shares, or all nodes are on Linux and would use xyz filesystem).
>
> Thanks in advance,
> -Zach Bailey
Re: Clustered Indexing on common network filesystem
Why don't you check out Hadoop and Nutch? It should provide what you are looking for. Zach Bailey wrote: > > Hi, > > It's been a couple of days now and I haven't heard anything on this > topic, while there has been substantial list traffic otherwise. > > Am I asking in the wrong place? Was I unclear? > > I know there are people out there that have used/are using Lucene in a > clustered environment. I am just looking for any sort of feedback > (general or specific) about clustering lucene as well as filesystem > compatibility (windows shares, NFS, etc.). > > Thanks again, > -Zach > > Zach Bailey wrote: >> Hello all, >> >> First a little background - we are developing a clustered application >> that will in part leverage Lucene to provide index and search >> capabilities. We have already spent time investigating various index >> storage implementations (database vs. filesystem) and we've decided for >> performance reasons to go with a filesystem index storage scheme. >> >> That said, I have read back through the archives a bit and noticed that >> the support for index storage on NFS is still experimental (e.g. the >> latest bugfixes have not made it out to an official, stable release). I >> realize most of the issues related to using a shared file system revolve >> around locking, and I haven't seen much about the maturity of locking >> for other network filesystems. >> >> I was wondering if anyone has tried any other networked filesystems or >> had any recommendations. We have clients who would be doing this on both >> Windows and Unix/Linux so any insight there would be appreciated as well >> - it can be assumed that across any cluster the operating system use >> would be homogeneous (i.e. all nodes are on windows and would use >> windows shares, or all nodes are on linux and would use xyz filesystem). >> >> Thanks in advance, >> -Zach Bailey >> > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Clustered-Indexing-on-common-network-filesystem-tf4194135.html#a11966423 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch? It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Some quick info: NFS should work, but I think you'll want to be working off the trunk. Also, sharing an index over NFS is supposed to be slow. The standard so far, if you are not partitioning the index, is to use a unix/linux filesystem and hardlinks + rsync to efficiently share index changes across nodes (hard links for instant copy, rsync to only transfer changed index files, search the mailing list). If you look at solr you can see scripts that give an example of this. I don't think the scripts rely on solr. This kind of setup should be quick and simple to implement. Same with NFS. An RMI solution that allowed for index partitioning would probably be the longest to do. -Mark Zach Bailey wrote: Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch? It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
How do YOU detect corrupt indexes?
Hello, I've been asked to devise some way to discover and correct data in Lucene indexes that have been "corrupted." The word "corrupt", in this case, has a few different meanings, some of which strike me as exceedingly difficult to grok. What concerns me are the cases where we don't know that an index has been changed: A bit error in a stored field, for instance, is a form of corruption that we (ideally) should be able to identify, at the very least, and hopefully correct. This case seems particularly onerous, since this isn't going to throw an exception of any sort, any time. We have a fairly good handle on how to remedy problems that throw exceptions, so we should be alright with corruption where (say) an operator logs in and overwrites a file. I'm wondering how other Lucene users have tackled this problem in the past. Calculating checksums on the documents seems like one way to go about it: compute a checksum on the document and, in a background thread, compare the checksum to the data. Unfortunately we're building a large, federated system and it would take months to exhaustively check every document this way. Checksumming the files themselves might be too much: We're storing gigabytes of data per index and there is some churn to the data; in other words, the overhead for this method might be too high. Thanks for any help you might have. -Joseph Rose - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
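To make the per-document checksum idea concrete, here is a rough sketch. The field names and the choice of CRC32 are illustrative assumptions, not anything Lucene provides, and note this only guards stored fields, not the postings themselves:

    import java.io.UnsupportedEncodingException;
    import java.util.zip.CRC32;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DocChecksum {

        private static String crcOf(String text) throws UnsupportedEncodingException {
            CRC32 crc = new CRC32();
            crc.update(text.getBytes("UTF-8")); // fix the charset so index and verify agree
            return Long.toHexString(crc.getValue());
        }

        // Index time: store the body's checksum alongside the body itself.
        public static Document withChecksum(String body) throws UnsupportedEncodingException {
            Document doc = new Document();
            doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("bodyCrc", crcOf(body), Field.Store.YES, Field.Index.NO));
            return doc;
        }

        // Verification time, e.g. from a background sweep: false flags a bit error.
        public static boolean isIntact(Document stored) throws UnsupportedEncodingException {
            return crcOf(stored.get("body")).equals(stored.get("bodyCrc"));
        }
    }

A background sweep can then page through documents calling isIntact(), spreading the cost over time, which fits the constraint that an exhaustive pass over a large federated system may take months.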
Re: Clustered Indexing on common network filesystem
One more alternative, though I am not sure if anyone is using it. Apache Compass has added a plug-in to allow storing Lucene index files inside the database. This should work in a clustered environment as all nodes will share the same database instance. I am not sure the impact it will have on performance. Is anyone using DB for index storage? Any drawbacks of this approach? Regards, Rajesh --- Zach Bailey <[EMAIL PROTECTED]> wrote: > Thanks for your response -- > > Based on my understanding, hadoop and nutch are > essentially the same > thing, with nutch being derived from hadoop, and are > primarily intended > to be standalone applications. > > We are not looking for a standalone application, > rather we must use a > framework to implement search inside our current > content management > application. Currently the application search > functionality is designed > and built around Lucene, so migrating frameworks at > this point is not > feasible. > > We are currently re-working our back-end to support > clustering (in > tomcat) and we are looking for information on the > migration of Lucene > from a single node filesystem index (which is what > we use now and hope > to continue to use for clients with a single-node > deployment) to a > shared filesystem index on a mounted network share. > > We prefer to use this strategy because it means we > do not have to have > two disparate methods of managing indexes for > clients who run in a > single-node, non-clustered environment versus > clients who run in a > multiple-node, clustered environment. > > So, hopefully here are some easy questions someone > could shed some light on: > > Is this not a recommended method of managing indexes > across multiple nodes? > > At this point would people recommend storing an > individual index on each > node and propagating index updates via a JMS > framework rather than > attempting to handle it transparently with a single > shared index? > > Is the Lucene index code so intimately tied to > filesystem semantics that > using a shared/networked file system is infeasible > at this point in time? > > What would be the quickest time-to-implementation of > these strategies > (JMS vs. shared FS)? The most robust/least > error-prone? > > I really appreciate any insight or response anyone > can provide, even if > it is a short answer to any of the related topics, > "i.e. we implemented > clustered search using per-node indexing with JMS > update propagation and > it works great", or even something as simple as > "don't use a shared > filesystem at this point". > > Cheers, > -Zach > > testn wrote: > > Why don't you check out Hadoop and Nutch? It > should provide what you are > > looking for. > > - > To unsubscribe, e-mail: > [EMAIL PROTECTED] > For additional commands, e-mail: > [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Mark, Thanks so much for your response. Unfortunately, I am not sure the leader of the project would feel good about running code from trunk, save with an explicit endorsement from a majority of the developers or contributors for that particular code (do those people keep up with this list, anyway?) Is there any word on the possible timeframe the code required to work with NFS might be released? Thanks for your other insight about hardlinks and rsync. I will look into that; unfortunately it does not cover our userbase who may be clustering in a Windows Server environment. I still have not heard/seen any evidence (anecdotal or otherwise) about how well Lucene might work sharing indexes over a mounted Windows share. -Zach Mark Miller wrote: Some quick info: NFS should work, but I think you'll want to be working off the trunk. Also, sharing an index over NFS is supposed to be slow. The standard so far, if you are not partitioning the index, is to use a unix/linux filesystem and hardlinks + rsync to efficiently share index changes across nodes (hard links for instant copy, rsync to only transfer changed index files, search the mailing list). If you look at solr you can see scripts that give an example of this. I don't think the scripts rely on solr. This kind of setup should be quick and simple to implement. Same with NFS. An RMI solution that allowed for index partitioning would probably be the longest to do. -Mark Zach Bailey wrote: Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch?
It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Rajesh, I forgot to mention this, but we did investigate this option as well and even prototyped it for an internal project. It ended up being too slow for us. It was adding a lot of overhead even to small updates, IIRC, mainly due to the fact that the index was essentially stored as a filesystem in the database. As you can probably imagine, using a database as a filesystem is not very performant. Rajesh parab wrote: One more alternative, though I am not sure if anyone is using it. Apache Compass has added a plug-in to allow storing Lucene index files inside the database. This should work in a clustered environment as all nodes will share the same database instance. I am not sure the impact it will have on performance. Is anyone using DB for index storage? Any drawbacks of this approach? Regards, Rajesh --- Zach Bailey <[EMAIL PROTECTED]> wrote: Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch? It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
I have been meaning to write up a Wiki page on this general topic but have not quite made time yet ... Sharing an index with a shared filesystem will work; however, there are some caveats: * This is somewhat uncharted territory because it's fairly recent fixes to Lucene that have enabled the things below to work, and, it's not a heavily tested area. Please share your experience so we all can learn... * If the filesystem does not protect against deletion of open files (notably NFS does not, however SMB/CIFS does) then you will need to create a custom DeletionPolicy based on your app logic so writer & readers "agree" on when it's safe to delete prior commit points. This can be something simple like "readers always refresh at least once per hour so any commit point older than 1 hour may be safely deleted". * Locking: if your app can ensure only one writer is active at a time, you can disable locking in Lucene entirely. Else, it's best to use NativeFSLockFactory, if you can. * If you are using a filesystem that does not have coherent caching on directory listing (NFS clients often do not), and, different nodes can "become" the writer (vs a single dedicated writer node) then there is one known open issue that you'll hit once you make your own DeletionPolicy which I still have to port to trunk: http://issues.apache.org/jira/browse/LUCENE-948 But as Mark said, performance is likely quite poor and so you may want to take an approach like Solr (or, use Solr) whereby a single writer makes changes to the index. Then these changes are efficiently propagated to multiple hosts (hard link & rsync is one way but not the only way), and these hosts then search their private copy via their local filesystem. Mike "Zach Bailey" <[EMAIL PROTECTED]> wrote: > Mark, > > Thanks so much for your response. > > Unfortunately, I am not sure the leader of the project would feel good > about running code from trunk, save with an explicit endorsement from > a majority of the developers or contributors for that particular code > (do those people keep up with this list, anyway?) Is there any word on > the possible timeframe the code required to work with NFS might be > released? > > Thanks for your other insight about hardlinks and rsync. I will look > into that; unfortunately it does not cover our userbase who may be > clustering in a Windows Server environment. I still have not heard/seen > any evidence (anecdotal or otherwise) about how well Lucene might work > sharing indexes over a mounted Windows share. > > -Zach > > Mark Miller wrote: > > Some quick info: > > > > NFS should work, but I think you'll want to be working off the trunk. > > Also, sharing an index over NFS is supposed to be slow. The standard so > > far, if you are not partitioning the index, is to use a unix/linux > > filesystem and hardlinks + rsync to efficiently share index changes > > across nodes (hard links for instant copy, rsync to only transfer > > changed index files, search the mailing list). If you look at solr you > > can see scripts that give an example of this. I don't think the scripts > > rely on solr. This kind of setup should be quick and simple to > > implement. Same with NFS. An RMI solution that allowed for index > > partitioning would probably be the longest to do.
> > > > -Mark > > > > > > > > Zach Bailey wrote: > >> Thanks for your response -- > >> > >> Based on my understanding, hadoop and nutch are essentially the same > >> thing, with nutch being derived from hadoop, and are primarily > >> intended to be standalone applications. > >> > >> We are not looking for a standalone application, rather we must use a > >> framework to implement search inside our current content management > >> application. Currently the application search functionality is > >> designed and built around Lucene, so migrating frameworks at this > >> point is not feasible. > >> > >> We are currently re-working our back-end to support clustering (in > >> tomcat) and we are looking for information on the migration of Lucene > >> from a single node filesystem index (which is what we use now and hope > >> to continue to use for clients with a single-node deployment) to a > >> shared filesystem index on a mounted network share. > >> > >> We prefer to use this strategy because it means we do not have to have > >> two disparate methods of managing indexes for clients who run in a > >> single-node, non-clustered environment versus clients who run in a > >> multiple-node, clustered environment. > >> > >> So, hopefully here are some easy questions someone could shed some > >> light on: > >> > >> Is this not a recommended method of managing indexes across multiple > >> nodes? > >> > >> At this point would people recommend storing an individual index on > >> each node and propagating index updates via a JMS framework rather > >> than attempting to handle it transparently with a single shared index?
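As an illustration of the "readers refresh at least once per hour" policy Mike describes, here is a sketch against the IndexDeletionPolicy hook added in Lucene 2.2. The class and parameter names are mine, and keying off the segments file's modification time is an assumption that needs care on NFS, where attribute caching and clock skew can mislead:

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.index.IndexCommitPoint;
    import org.apache.lucene.index.IndexDeletionPolicy;
    import org.apache.lucene.store.Directory;

    // Keep every commit point younger than expirationMillis, so readers that
    // refresh at least that often never have files deleted out from under them.
    public class ExpirationTimeDeletionPolicy implements IndexDeletionPolicy {
        private final Directory dir;
        private final long expirationMillis;

        public ExpirationTimeDeletionPolicy(Directory dir, long expirationMillis) {
            this.dir = dir;
            this.expirationMillis = expirationMillis;
        }

        public void onInit(List commits) throws IOException {
            onCommit(commits);
        }

        public void onCommit(List commits) throws IOException {
            IndexCommitPoint newest = (IndexCommitPoint) commits.get(commits.size() - 1);
            long newestTime = dir.fileModified(newest.getSegmentsFileName());
            // The newest commit must always survive; prune older ones once expired.
            for (int i = 0; i < commits.size() - 1; i++) {
                IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
                if (newestTime - dir.fileModified(commit.getSegmentsFileName()) > expirationMillis) {
                    commit.delete();
                }
            }
        }
    }

The writer is then opened with this policy installed; under the one-hour refresh contract above, expirationMillis would be 3600 * 1000L.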
Re: Clustered Indexing on common network filesystem
"Zach Bailey" <[EMAIL PROTECTED]> wrote: > Unfortunately, I am not sure the leader of the project would feel good > about running code from trunk, save without an explicit endorsement from > a majority of the developers or contributors for that particular code > (do those people keep up with this list, anyway?) Is there any word on > the possible timeframe the code required to work with NFS might be > released? This person does keep up with the list :) On timframe ... there are tentative discussions now on the dev list on releasing 2.3 in a few months time, but by no means is this a hard schedule. I'll make sure LUCENE-948 is included in 2.3. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Getting only the Ids, not the whole documents.
If you are just retrieving your custom id and you have more stored fields (and they are not tiny), you certainly do want to use a field selector. I would suggest SetBasedFieldSelector. - Mark testn wrote: Hi, Why don't you consider using FieldSelector? LoadFirstFieldSelector has the ability to help you load only the first field in the document without loading all the fields. After that, you can keep the whole document if you like. It should help improve performance. is_maximum wrote: yes, it decreases the performance but it's the only solution. I've spent many weeks trying to find the best way to retrieve my own IDs but found this way as the last one. Now I am storing the ids in a BitSet structure and it's fast enough public void collect(...){ idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID"))); } On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote: Hi, The solution you suggested will definitely work but will definitely slow down my search by an order of magnitude. The problem I am trying to solve is performance, that's why I need the collection of IDs and not the whole documents. - thanks for the prompt reply. is_maximum wrote: yes if you extend your class from HitCollector and override the collect() method with the following signature you can get IDs public void collect(int id, float score) On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote: Hi all, Can I get just a list of document Ids given a search criterion? To elaborate here is my situation: I store 2 contracts in the file system index each with some parameterName and Value. Given a search criterion - (paramValue='draft'). I need to get just an ArrayList of Strings containing contract Ids. I don't need the Lucene documents, just the Ids. Can this be done? -thanks -- View this message in context: http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11960750 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ -- View this message in context: http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11961159 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
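To make Mark's suggestion concrete, here is a minimal sketch of loading a single stored field through SetBasedFieldSelector; the "MyOwnID" field name follows the earlier messages, and the class wrapper is just for illustration:

    import java.io.IOException;
    import java.util.Collections;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;
    import org.apache.lucene.index.IndexReader;

    public class IdOnlyLoader {
        // Load the "MyOwnID" stored field eagerly and nothing else, not even lazily.
        private static final FieldSelector ID_ONLY = new SetBasedFieldSelector(
                Collections.singleton("MyOwnID"), Collections.EMPTY_SET);

        public static String idOf(IndexReader reader, int docId) throws IOException {
            Document doc = reader.document(docId, ID_ONLY);
            return doc.get("MyOwnID");
        }
    }

Inside a HitCollector this replaces searcher.doc(id), which loads every stored field, with a read that skips everything except the id, which is where the time goes when documents carry large stored bodies.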
Re: Getting only the Ids, not the whole documents.
On Thursday 02 August 2007 19:28:48 Mohammad Norouzi wrote: > you should not store them in an Array structure since they will take up > memory. > the BitSet is the best structure to store them You can't store strings in a BitSet. What I would do is return a List but make a custom subclass of AbstractList which creates the strings on demand from the Hits object. We use similar tricks to convert Hits into a List of another more complex object type and it works great. You can cache the strings as they're retrieved if you're planning to use some strings much more than others. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
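A bare-bones version of Daniel's on-demand list might look like the following sketch, written against the Hits API of this era; the "ID" field name is an assumption, and the caching he mentions is left out for brevity:

    import java.io.IOException;
    import java.util.AbstractList;

    import org.apache.lucene.search.Hits;

    // Exposes the stored "ID" field of each hit as a read-only List of Strings;
    // each get() loads exactly one document, on demand.
    public class HitIdList extends AbstractList {
        private final Hits hits;

        public HitIdList(Hits hits) {
            this.hits = hits;
        }

        public Object get(int index) {
            try {
                return hits.doc(index).get("ID");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public int size() {
            return hits.length();
        }
    }

Because get() touches the index lazily, callers pay only for the entries they actually visit; adding a small cache over get() covers the case where some ids are read much more often than others.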
Re: Can I do boosting based on term positions?
I am doing implementation of SpanTermQuery for you, give me today. Sorry, I was out for meetings for 2 days. Enjoy, Shailendra On 8/3/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > Hi Paul, > > Isn't SpanFirstQuery only match those with position less than a > certain end position? > > I am rather looking for a query that would score a document higher for > terms appear near the start but not totally discard those with terms > appear near the end. > > Regards, > Cedric > > On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote: > > Cedric, > > > > SpanFirstQuery could be a solution without payloads. > > You may want to give it your own Similarity.sloppyFreq() . > > > > Regards, > > Paul Elschot > > > > On Thursday 02 August 2007 04:07, Cedric Ho wrote: > > > Thanks for the quick response =) > > > > > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote: > > > > Yes, it is easily doable through "Payload" facility. During indexing > > process > > > > (mainly tokenization), you need to push this extra information in > each > > > > token. And then you can use BoostingTermQuery for using Payload > value to > > > > include Payload in the score. You also need to implement Similarity > for > > this > > > > (mainly scorePayload method). > > > > > > If I store, say a custom boost factor as Payload, does it means that I > > > will store one more byte per term per document in the index file? So > > > the index file would be much larger? > > > > > > > > > > > Other way can be to extend SpanTermQuery, this already calculates > the > > > > position of match. You just need to do something to use this > position > > value > > > > in the score calculation. > > > > > > I see that SpanTermQuery takes a TermPositions from the indexReader > > > and I can get the term position from there. However I am not sure how > > > to incorporate it into the score calculation. Would you mind give a > > > little more detail on this? > > > > > > > > > > > One possible advantage of SpanTermQuery approach is that you can > play > > > > around, without re-creating indices everytime. > > > > > > > > Thanks, > > > > Shailendra Sharma, > > > > CTO, Ver se' Innovation Pvt. Ltd. > > > > Bangalore, India > > > > > > > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > > > > > > > Hi all, > > > > > > > > > > I was wondering if it is possible to do boosting by search terms' > > > > > position in the document. > > > > > > > > > > for example: > > > > > search terms appear in the first 100 words, or first 10% words, or > in > > > > > first two paragraphs would be given higher score. > > > > > > > > > > Is it achievable through using the new Payload function in lucene > 2.2? > > > > > Or are there any easier ways to achieve these ? > > > > > > > > > > > > > > > Regards, > > > > > Cedric > > > > > > > > > > > - > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > Thanks, > > > Cedric > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > -- > [EMAIL PROTECTED] >
Re: Can I do boosting based on term positions?
Hi Paul, Isn't SpanFirstQuery only match those with position less than a certain end position? I am rather looking for a query that would score a document higher for terms appear near the start but not totally discard those with terms appear near the end. Regards, Cedric On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote: > Cedric, > > SpanFirstQuery could be a solution without payloads. > You may want to give it your own Similarity.sloppyFreq() . > > Regards, > Paul Elschot > > On Thursday 02 August 2007 04:07, Cedric Ho wrote: > > Thanks for the quick response =) > > > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote: > > > Yes, it is easily doable through "Payload" facility. During indexing > process > > > (mainly tokenization), you need to push this extra information in each > > > token. And then you can use BoostingTermQuery for using Payload value to > > > include Payload in the score. You also need to implement Similarity for > this > > > (mainly scorePayload method). > > > > If I store, say a custom boost factor as Payload, does it means that I > > will store one more byte per term per document in the index file? So > > the index file would be much larger? > > > > > > > > Other way can be to extend SpanTermQuery, this already calculates the > > > position of match. You just need to do something to use this position > value > > > in the score calculation. > > > > I see that SpanTermQuery takes a TermPositions from the indexReader > > and I can get the term position from there. However I am not sure how > > to incorporate it into the score calculation. Would you mind give a > > little more detail on this? > > > > > > > > One possible advantage of SpanTermQuery approach is that you can play > > > around, without re-creating indices everytime. > > > > > > Thanks, > > > Shailendra Sharma, > > > CTO, Ver se' Innovation Pvt. Ltd. > > > Bangalore, India > > > > > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > > > > > Hi all, > > > > > > > > I was wondering if it is possible to do boosting by search terms' > > > > position in the document. > > > > > > > > for example: > > > > search terms appear in the first 100 words, or first 10% words, or in > > > > first two paragraphs would be given higher score. > > > > > > > > Is it achievable through using the new Payload function in lucene 2.2? > > > > Or are there any easier ways to achieve these ? > > > > > > > > > > > > Regards, > > > > Cedric > > > > > > > > - > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > Thanks, > > Cedric > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- [EMAIL PROTECTED]
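Paul's suggestion upthread (SpanFirstQuery plus your own Similarity.sloppyFreq()) can be sketched as below; the decay curve is an illustrative choice, not a recommendation:

    import org.apache.lucene.search.DefaultSimilarity;

    // Illustrative decay only: flatten the default 1/(distance+1) curve so spans
    // far from the document start are demoted gently instead of sharply.
    public class PositionDecaySimilarity extends DefaultSimilarity {
        public float sloppyFreq(int distance) {
            return (float) (1.0 / Math.sqrt(distance + 1.0));
        }
    }

Install it with searcher.setSimilarity(new PositionDecaySimilarity()). As for not totally discarding late matches, a SpanFirstQuery alone will still drop them, so one hedged workaround is to combine the SpanFirstQuery with a plain TermQuery for the same term in a BooleanQuery, both as optional (SHOULD) clauses: every matching document then scores on the TermQuery, and those whose terms fall inside the leading window get the extra boost.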
Performance improvements using writer.delete vs reader.delete
Hi, We're considering using the new IndexWriter.deleteDocuments call rather than the IndexReader.delete call. Are there any performance improvements that this may provide, other than the benefit of not having to switch between readers/writers? We've looked at LUCENE-565, but there's no clear view of performance enhancements over the old IndexReader call. Cheers, Andreas -- ATLASSIAN Our products help over 7,000 organisations in more than 88 countries to collaborate. http://www.atlassian.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance improvements using writer.delete vs reader.delete
Andreas Knecht wrote: > We're considering using the new IndexWriter.deleteDocuments call rather > than the IndexReader.delete call. Are there any performance > improvements that this may provide, other than the benefit of not having > to switch between readers/writers? > > We've looked at LUCENE-565, but there's no clear view of performance > enhancements over the old IndexReader call. I think Yonik's comment in 565 holds here - http://issues.apache.org/jira/browse/LUCENE-565#action_12432155 - if your application is already buffering deletes/updates and batching the deletes, you probably won't see a large improvement. But if your application does not buffer the deletes and does not batch them, then I believe moving to IndexWriter.delete() (and update()) should buy you a performance improvement, because IndexWriter would now buffer the deletes for you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
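As a sketch of the buffered single-writer pattern this implies (the "ID" field and the class name are illustrative assumptions, not anything from the original thread):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class SingleWriterUpdater {
        private final IndexWriter writer;

        public SingleWriterUpdater(String path) throws IOException {
            // One long-lived writer; no reader/writer switching needed for deletes.
            writer = new IndexWriter(path, new StandardAnalyzer());
        }

        public void delete(String id) throws IOException {
            writer.deleteDocuments(new Term("ID", id)); // buffered, applied at flush
        }

        public void update(String id, Document doc) throws IOException {
            writer.updateDocument(new Term("ID", id), doc); // atomic delete-then-add
        }

        public void close() throws IOException {
            writer.close(); // flushes any buffered deletes and added documents
        }
    }

The point is that deleteDocuments() and updateDocument() only record the terms to delete; the writer applies them in batches when it flushes, which is where the win over issuing one IndexReader delete round-trip per document comes from.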
Re: How do YOU detect corrupt indexes?
I'm not sure exactly how to understand "corrupted" indexes: do you mean indexes that cannot be read or used, or something else? Thanks, DT www.ejinz.com EjinZ Search Engine - Original Message - From: "Doron Cohen" <[EMAIL PROTECTED]> To: Sent: Friday, August 03, 2007 1:03 AM Subject: Re: How do YOU detect corrupt indexes? What is the anticipated cause of corruption? Malicious? Hardware fault? This somewhat reminds me of discussions in the list about encrypting the index. See LUCENE-737 and a discussion pointed by it. One of the opinions there was that encryption should be handled at a lower level (OS/FS). Wouldn't that hold here as well? Joe R wrote: Hello, I've been asked to devise some way to discover and correct data in Lucene indexes that have been "corrupted." The word "corrupt", in this case, has a few different meanings, some of which strike me as exceedingly difficult to grok. What concerns me are the cases where we don't know that an index has been changed: A bit error in a stored field, for instance, is a form of corruption that we (ideally) should be able to identify, at the very least, and hopefully correct. This case seems particularly onerous, since this isn't going to throw an exception of any sort, any time. We have a fairly good handle on how to remedy problems that throw exceptions, so we should be alright with corruption where (say) an operator logs in and overwrites a file. I'm wondering how other Lucene users have tackled this problem in the past. Calculating checksums on the documents seems like one way to go about it: compute a checksum on the document and, in a background thread, compare the checksum to the data. Unfortunately we're building a large, federated system and it would take months to exhaustively check every document this way. Checksumming the files themselves might be too much: We're storing gigabytes of data per index and there is some churn to the data; in other words, the overhead for this method might be too high. Thanks for any help you might have. -Joseph Rose - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How do YOU detect corrupt indexes?
What is the anticipated cause of corruption? Malicious? Hardware fault? This somewhat reminds me of discussions in the list about encrypting the index. See LUCENE-737 and a discussion pointed by it. One of the opinions there was that encryption should be handled at a lower level (OS/FS). Wouldn't that hold here as well? Joe R wrote: > > Hello, > > I've been asked to devise some way to discover and correct data in Lucene > indexes that have been "corrupted." The word "corrupt", in > this case, has a > few different meanings, some of which strike me as exceedingly > difficult to > grok. What concerns me are the cases where we don't know that > an index has > been changed: A bit error in a stored field, for instance, is a form of > corruption that we (ideally) should be able to identify, at the > very least, and > hopefully correct. This case seems particularly > onerous, since > this isn't going to throw an exception of any sort, any time. > > We have a fairly good handle on how to remedy problems that > throw exceptions, > so we should be alright with corruption where (say) an operator > logs in and > overwrites a file. > > I'm wondering how other Lucene users have tackled this problem > in the past. > Calculating checksums on the documents seems like one way to go about it: > compute a checksum on the document and, in a background thread, > compare the > checksum to the data. Unfortunately we're building a large, > federated system > and it would take months to exhaustively check every document this way. > Checksumming the files themselves might be too much: We're > storing gigabytes of > data per index and there is some churn to the data; in other words, the > overhead for this method might be too high. > > Thanks for any help you might have. > > > -Joseph Rose - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How do YOU detect corrupt indexes?
On Friday 03 August 2007 16:03:22 Doron Cohen wrote: > What is the anticipated cause of corruption? Malicious? > Hardware fault? This somewhat reminds me of discussions in > the list about encrypting the index. See LUCENE-737 > and a discussion pointed by it. One of the opinions > there was that encryption should be handled at a lower > level (OS/FS). Wouldn't that hold here as well? That's actually a good point. These days we have filesystems like ZFS which check for corruption automatically. This should remove a lot of the extra digesting work people would otherwise need to do to ensure consistency. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]