Re: What is the right query syntax for matching some field's substring?

2009-04-01 Thread Bon
Hi Matt, thanks for your answer. I'm new to Lucene, so I don't know what I should know about that. I found a reference that discusses substring searching and it works well for me. I'm not sure which analyzer we used; I'll check it out and find out why it works for us. Thank you ve

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
a code snippet is worth 1000 words :) private static final Term UID_TERM = new Term("uid_payload", "_UID"); private static class SinglePayloadTokenStream extends TokenStream { private Token token = new Token(UID_TERM.text(), 0, 0); private byte[] buffer = new byte[4]; private boolean

Help to determine why an optimized index is proportionaly too big.

2009-04-01 Thread Dan OConnor
All: We are using Java Lucene 2.3.2 to index a fairly large number of documents (roughly 400,000 per day). We have divided the time history into various depths. Our first stage covers 8 days and our next stage covers 22. The index directory for the first stage is approximately 20G when fully op

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
On Wed, Apr 1, 2009 at 5:22 PM, John Wang wrote: > Hi Michael: > > 1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term) > is doing under the covers. This part I understand :) > 2) We iterate the docid->uid mapping, for each docid, get the > corresponding uid and check that

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Hi Michael: 1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term) is doing under the covers. 2) We iterate the docid->uid mapping and, for each docid, get the corresponding uid and check whether it is in the deleted set. If so, we add the docid to the list. There is no ui
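The scan described in step 2 can be sketched roughly as follows. This is a Lucene-free illustration, not John's actual code: the class name, the `findDeletedDocIds` helper, and the array-based docid->uid layout are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the docid -> uid scan; no Lucene classes involved.
class DeletedDocIdScanSketch {

    // Assumed layout: uidByDocId[docId] holds the uid stored for that docid.
    static List<Integer> findDeletedDocIds(long[] uidByDocId, Set<Long> deletedUids) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            // Look up the uid for this docid and test it against the deleted set.
            if (deletedUids.contains(uidByDocId[docId])) {
                docIds.add(docId);
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        long[] uidByDocId = {10L, 42L, 7L, 99L};
        Set<Long> deletedUids = new HashSet<>(Arrays.asList(42L, 99L));
        System.out.println(findDeletedDocIds(uidByDocId, deletedUids)); // [1, 3]
    }
}
```

The scan costs one set lookup per docid, so its cost grows with the size of the index rather than with the size of the delete batch.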

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
On Wed, Apr 1, 2009 at 2:04 PM, John Wang wrote: > My test is essentially this. I took out the reader.deleteDocuments call from > both scenarios. I took an index of 5m docs. a batch of 1 randomly > generated uids. > > Compared the following scenarios: > 1) > * open index reader > * for each uid i

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Thanks Michael for the info. I do guarantee there are no modifications between when "MySpecialIndexReader" is loaded and when I iterate and find the deleted docids. I was not aware, however, that docids move when an IndexWriter is opened. I thought they moved only when docs are added and when it is committed.

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Jason Rutherglen
John, We looked at implementing delete by doc id for LUCENE-1516; however, it seemed to be something that, if enough people wanted it, we could implement as a later patch. The implementation involves maintaining a genealogy of SegmentReaders within IndexWriter so that deletes to a reader that has

Re: Search using MultiSearcher generates OOM on a 1GB total Partitioned indeces

2009-04-01 Thread Erick Erickson
Think about putting this query in Luke and doing an "explain" for details, but I'm surprised this is working at all without throwing TooManyClauses errors. Under the covers, Lucene expands your wildcards to all terms in the field that match. For instance, assume your document field has the fol
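A rough, Lucene-free sketch of the expansion described above: each prefix wildcard fans out into one clause per matching term in the field. The sample term dictionary, the class name, and the `expandPrefix` helper are hypothetical; the limit of 1024 is assumed here to mirror the default BooleanQuery maximum clause count, which is what triggers TooManyClauses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of how a prefix wildcard such as "g*" expands
// into one clause per matching term. Not Lucene's implementation.
class PrefixExpansionSketch {

    // Assumed limit, mirroring Lucene's default BooleanQuery max clause count.
    static final int MAX_CLAUSES = 1024;

    // Returns every term in the sorted dictionary that starts with the prefix.
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are
        // contiguous in sorted order, so stop at the first non-match.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("gamma", "gate", "green", "hello", "help", "sun"));
        List<String> expanded = expandPrefix(terms, "g");
        System.out.println("g* expands to " + expanded);
        if (expanded.size() > MAX_CLAUSES) {
            System.out.println("would exceed the clause limit");
        }
    }
}
```

On a real field with millions of distinct terms, a one-letter prefix can match far more terms than the clause limit allows, and ten such prefixes in one query multiply the memory needed to hold the expanded clauses.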

Search using MultiSearcher generates OOM on a 1GB total Partitioned indeces

2009-04-01 Thread Lebiram
Hi All, I have the following query on a 1GB index with about 12 million docs. As you can see, the terms consist of wildcards: query.toString()=+(+content:g* +content:h* +content:d* +content:s* +content:a* +content:w* +content:b* +content:c* +content:m* +content:e*) +((+sender:cpuser9 +viewer

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
> For me at least, IndexWriter.deleteDocument(int) would be useful. I completely agree: delete-by-docID in IndexWriter would be a great feature. Long ago I became convinced of that. Where this feature always gets stuck (search the lists -- it's gotten stuck a lot) is how to implement it. At any

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Yonik Seeley
On Wed, Apr 1, 2009 at 4:02 AM, Michael McCandless wrote: > I think this has the same problem as exposing delete by docID, ie, how > would you produce that docIdSet? Whoops, right. I was going by memory that there was a get(IndexReader) type method there... but that's on Filter of course. -Yon

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Hi Michael: Let me first share what I am doing w.r.t. deleting by docid: I have a customized index reader that stores a mapping of docid -> uid in the payload (something Michael Bush and Ning Li suggested a while back). That mapping is loaded at IndexReader load time and is shared by searche

Re: What is the right query syntax for matching some field's substring?

2009-04-01 Thread Matthew Hall
Which analyzer are you using here? Depending on your choice, the comma-separated values might be kept together in your index rather than tokenized as you expected. Secondly, you should get Luke and take a look at your index; this should give you a much better idea of what's going on
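To make the analyzer point concrete, here is a minimal, Lucene-free sketch. It assumes two hypothetical tokenization strategies (`keepWhole` and `splitOnCommas`) and a made-up field value; real analyzers differ in detail, but the effect on matching is the same: a term can only be found if it was indexed as its own token.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: whether a value is findable depends on which tokens
// the analyzer put into the index. No Lucene classes are used.
class TokenizationSketch {

    // Keeps the whole field value as a single token.
    static List<String> keepWhole(String value) {
        return Collections.singletonList(value);
    }

    // Splits the field value on commas, dropping whitespace around each piece.
    static List<String> splitOnCommas(String value) {
        return Arrays.asList(value.trim().split("\\s*,\\s*"));
    }

    // An exact term lookup succeeds only if the term equals an indexed token.
    static boolean termMatches(List<String> tokens, String term) {
        return tokens.contains(term);
    }

    public static void main(String[] args) {
        String value = "red, green, blue";
        System.out.println(termMatches(keepWhole(value), "green"));     // false
        System.out.println(termMatches(splitOnCommas(value), "green")); // true
    }
}
```

Looking at the indexed terms in Luke shows which of the two cases applies to a given field.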

Re: Unable to improve performance

2009-04-01 Thread Toke Eskildsen
On Fri, 2009-03-27 at 12:07 +0100, Paul Taylor wrote: [2Gb index, 7 million documents(?)] > I ran the test a number of times with 30 threads, and max memory of > 3500mb I was processing 10,000 records in about 43 seconds ( 233 > queries/second) , the index was stored on a solid state drive runn

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
John, I think this has the same problem as exposing delete by docID, i.e., how would you produce that docIdSet? We could consider delete by Filter instead, since that exposes the necessary getDocIdSet(IndexReader) method. Or, with near real-time search, we could enhance it to allow deletions via t