Sean, seriously: at a couple of hundred docs a second, don't bother; just use updateDocument. My benchmarks show only a smallish impact during indexing, especially with concurrent flushing in Lucene 4. I don't know how resource-intensive your analysis chain is, but on a decent machine you can easily go > 20k docs a second with updateDocument.
If you want to give deleteByDocid a try for kicks, I'd be curious how you solve some of the really tricky issues! :)

simon

On Thu, Jul 12, 2012 at 10:08 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi Sean,
>
> Without checking the performance in your case, it makes no sense to discuss
> this. Lucene 4.0 changed a lot, and there are several improvements. Please
> read the following:
>
> - Because of the new term dictionary, Term lookups on non-existing terms are
>   fail-fast; they don't do any disk IO in most cases. You can do tens of
>   thousands of those per second on a simple laptop.
> - DocumentsWriter uses internal Lucene DocIDs, but those are not global and
>   therefore not useful for you. They are only valid for one index segment and
>   only temporarily, until IndexWriter merges segments again (possibly in
>   another thread).
>
> So: always use updateDocument when you put your new documents into the index
> and give every document the unique ID from your pool. Lucene's document IDs
> are purely internal and, especially in 4.0's IndexWriter, no longer constant
> (they can easily change after reopening an index, depending on merge policy,
> or getting a new realtime reader). To uniquely identify documents later you
> *have* to use your own key field.
>
> Lucene 4.0 is different from previous versions; deleting by internal Lucene
> docId will not come back.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Sean Bridges [mailto:sean.brid...@gmail.com]
>> Sent: Thursday, July 12, 2012 9:51 PM
>> To: java-user@lucene.apache.org; simon.willna...@gmail.com
>> Subject: Re: delete by docid in lucene 4
>>
>> I never used updateDocument() due to ignorance.
>>
>> We are indexing several hundred documents per second, and most of the
>> analysis takes place on the non-indexer machines to reduce load on the
>> indexers.
>> For our use case, deleteDocument(int docId) will be faster, as there
>> are very few duplicates, but I don't know if the difference is significant.
>>
>> It would be nice to have a deleteDocument(int docId) in IndexWriter.
>> It seems like it would be easy to add, as DocumentsWriter already has a
>> deletedDocID. I can file a JIRA and submit a patch if this is something
>> that you guys would accept.
>>
>> Sean
>>
>> On Thu, Jul 12, 2012 at 11:53 AM, Simon Willnauer
>> <simon.willna...@gmail.com> wrote:
>> > On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> >> Thanks for the tip.
>> >>
>> >> Does using updateDocument instead of addDocument affect
>> >> indexing/search performance?
>> >
>> > It does affect indexing performance compared to addDocument, but that
>> > might be minor compared to your analysis chain. I wouldn't worry about
>> > updateDocument; it's the only sensible way to use Lucene, really. Why
>> > didn't you use this before, any reason? What is your ingest rate / doc
>> > throughput, and where would you get concerned?
>> >
>> > simon
>> >>
>> >> Sean
>> >>
>> >> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>> >>> The trick is to index not with addDocument(Document) but instead
>> >>> with updateDocument(Term, Document). Lucene then adds the document
>> >>> atomically while deleting any previous documents with the given term
>> >>> (which is your unique ID). If the key does not exist, it simply
>> >>> indexes without deleting anything.
>> >>> By this you always have only one document with the same Term (==your
>> >>> unique ID).
>> >>>
>> >>> Uwe
>> >>>
>> >>> -----
>> >>> Uwe Schindler
>> >>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >>> http://www.thetaphi.de
>> >>> eMail: u...@thetaphi.de
>> >>>
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Sean Bridges [mailto:sean.brid...@gmail.com]
>> >>>> Sent: Thursday, July 12, 2012 5:42 PM
>> >>>> To: java-user@lucene.apache.org; simon.willna...@gmail.com
>> >>>> Subject: Re: delete by docid in lucene 4
>> >>>>
>> >>>> We have indexer machines which are fed documents by other machines.
>> >>>> If an error occurs (machine crashing etc.), the same document may be
>> >>>> sent to an indexer multiple times. Serial ids are assigned before
>> >>>> documents reach the indexer, so a document may be in the index
>> >>>> multiple times, each time with the same serial id.
>> >>>>
>> >>>> When the index gets large enough, the indexer will stop writing to the
>> >>>> index and upload it to another machine, which keeps the index forever.
>> >>>> Before we upload the index, we forceMerge(1) on it and gather some
>> >>>> stats about the index, like max and min serial id and total documents.
>> >>>> While calculating max and min serial id, if we see a duplicate serial
>> >>>> id, we call IndexReader.deleteByDocId(...).
>> >>>>
>> >>>> We could check for duplicate serial ids while indexing, but that is
>> >>>> racy, and not as efficient.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Sean
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>> >>>> <simon.willna...@gmail.com> wrote:
>> >>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> >>>> >> Is it possible to delete by docId in lucene 4? I can delete by docid
>> >>>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>> >>>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>> >>>> >> Term or Query.
>> >>>> >
>> >>>> > that is correct.
>> >>>> > In Lucene 4, IndexReader is really just a reader!
>> >>>> >>
>> >>>> >> This is our use case - In our system, each document is identified by
>> >>>> >> a unique serial id. If an error occurs, we may index the same
>> >>>> >> message multiple times. When an index grows large enough, we stop
>> >>>> >> adding to it, and optimize the index. During optimization, if we see
>> >>>> >> multiple docs with the same serialid, we delete all but the first, as
>> >>>> >> all documents with the same serialid are the same.
>> >>>> >
>> >>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>> >>>> > method? Do you rely on multiple versions of the same doc? With Lucene
>> >>>> > 4, relying on the doc id can become very tricky. If you use multiple
>> >>>> > threads you create a lot of segments, which can be merged in any order.
>> >>>> > You can't tell if a document ID maintains happened-before semantics at
>> >>>> > all.
>> >>>> >
>> >>>> > Can you tell us more about your use case and why you are using
>> >>>> > deleteByDocID?
>> >>>> >
>> >>>> > simon
>> >>>> >
>> >>>> >>
>> >>>> >> Thanks,
>> >>>> >>
>> >>>> >> Sean
>> >>>> >>
>> >>>> >> ---------------------------------------------------------------------
>> >>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
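
Editorial sketch (not part of the original thread): the add-versus-update behaviour Uwe describes can be illustrated with a small in-memory model. This is deliberately not Lucene code; ToyIndex, ToyDoc and the sample serial ids are hypothetical names invented for the illustration, and the model only mimics the "delete any previous document with this key, then add" contract, not how Lucene implements it.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy in-memory model of the add-vs-update semantics discussed in the thread.
 * NOT Lucene: ToyIndex, ToyDoc and their methods are hypothetical names.
 */
final class ToyIndex {
    /** A stored document: its application-level serial id plus a body. */
    static final class ToyDoc {
        final String serialId;
        final String body;

        ToyDoc(String serialId, String body) {
            this.serialId = serialId;
            this.body = body;
        }
    }

    private final List<ToyDoc> docs = new ArrayList<>();

    /** Blind append, like addDocument: a redelivered document becomes a duplicate. */
    synchronized void addDocument(String serialId, String body) {
        docs.add(new ToyDoc(serialId, body));
    }

    /** Delete-then-add keyed on the serial id, like updateDocument(Term, Document). */
    synchronized void updateDocument(String serialId, String body) {
        docs.removeIf(d -> d.serialId.equals(serialId)); // no-op if the key is absent
        docs.add(new ToyDoc(serialId, body));
    }

    /** Number of stored documents carrying the given serial id. */
    synchronized long count(String serialId) {
        return docs.stream().filter(d -> d.serialId.equals(serialId)).count();
    }

    /** Total number of stored documents. */
    synchronized int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex added = new ToyIndex();
        ToyIndex updated = new ToyIndex();
        // Simulate a feeder crash: serial id "1001" is delivered twice.
        String[] deliveries = {"1000", "1001", "1001", "1002"};
        for (String id : deliveries) {
            added.addDocument(id, "message " + id);
            updated.updateDocument(id, "message " + id);
        }
        System.out.println("addDocument:    " + added.size() + " docs, "
                + added.count("1001") + " with id 1001");
        System.out.println("updateDocument: " + updated.size() + " docs, "
                + updated.count("1001") + " with id 1001");
    }
}
```

In the toy model the blind-append index ends up holding the redelivered document twice, while the keyed index holds it once, so no clean-up pass by internal doc id is needed before upload. With real Lucene the corresponding call is the IndexWriter.updateDocument(Term, Document) method named in the thread, with the serial id indexed as a single untokenized term; the field name and setup would be up to the application.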