Re: delete by docid in lucene 4

Simon Willnauer Thu, 12 Jul 2012 15:18:06 -0700

Sean, seriously: at a couple of hundred docs a second, don't bother, just
use updateDocument. My benchmarks show only a small impact on indexing,
especially with concurrent flushing in Lucene 4. I don't know how
resource-intensive your analysis chain is, but on a decent machine you
can easily go > 20k docs a second with updateDocument.
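
To make the semantics concrete, here is a minimal stand-alone sketch of
"delete any previous document with this key, then add". It is a toy
in-memory model, not Lucene code: the class UpdateByKeySketch and its
methods update, liveCount and get are made up for illustration, and the
serial ids are invented. It only mirrors the idea that documents are
appended and older copies with the same key are marked deleted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of update-by-key indexing. Hypothetical names; not Lucene's API. */
class UpdateByKeySketch {
    // Append-only document slots; a null slot stands for a deleted document.
    private final List<String> slots = new ArrayList<>();
    // Maps each unique key (e.g. a serial id) to the slot of its live document.
    private final Map<String, Integer> liveSlotByKey = new HashMap<>();

    /** Deletes any previous document stored under this key, then appends the new one. */
    void update(String key, String body) {
        Integer old = liveSlotByKey.get(key);
        if (old != null) {
            slots.set(old, null); // mark the earlier copy as deleted
        }
        slots.add(body);
        liveSlotByKey.put(key, slots.size() - 1);
    }

    /** Number of live (non-deleted) documents. */
    int liveCount() {
        return liveSlotByKey.size();
    }

    /** Returns the live document for this key, or null if there is none. */
    String get(String key) {
        Integer slot = liveSlotByKey.get(key);
        return slot == null ? null : slots.get(slot);
    }

    public static void main(String[] args) {
        UpdateByKeySketch index = new UpdateByKeySketch();
        index.update("1001", "first delivery");
        index.update("1002", "another message");
        index.update("1001", "redelivered after a crash"); // duplicate serial id
        System.out.println(index.liveCount()); // prints 2
        System.out.println(index.get("1001")); // prints "redelivered after a crash"
    }
}
```

A key that was never seen simply gets added, so the same call covers both
the first delivery and any redelivery.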


If you want to give deleteByDocid a try for kicks, I'd be curious how
you solve some of the really tricky issues! :)

simon

On Thu, Jul 12, 2012 at 10:08 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi Sean,
>
> Without measuring the performance in your case, there is no point discussing
> this. Lucene 4.0 changed a lot, and there are several improvements. Please
> read the following:
>
> - Because of the new term dictionary, Term lookups on non-existing terms are
> fail-fast; they don't do any disk I/O in most cases. You can do tens of
> thousands of those per second on a simple laptop.
> - DocumentsWriter uses internal Lucene DocIDs, but those are not global and
> therefore not useful for you. They are only valid for one index segment, and
> only temporarily, until IndexWriter merges segments again (possibly in
> another thread).
>
> So: always use updateDocument when you put your new documents into the index,
> and give every document the unique ID from your pool. Lucene's document IDs
> are purely internal and, especially in 4.0's IndexWriter, no longer constant
> (they can easily change after reopening an index, depending on the merge
> policy, or after getting a new realtime reader). To uniquely identify
> documents later you *have* to use your own key field.
>
> Lucene 4.0 is different from previous versions; deleting by internal Lucene
> docId will not come back.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Sean Bridges [mailto:sean.brid...@gmail.com]
>> Sent: Thursday, July 12, 2012 9:51 PM
>> To: java-user@lucene.apache.org; simon.willna...@gmail.com
>> Subject: Re: delete by docid in lucene 4
>>
>> I never used updateDocument() due to ignorance.
>>
>> We are indexing several hundred documents per second, and most of the
>> analysis takes place on the non-indexer machines to reduce load on the
>> indexers.  For our use case, deleteDocument(int docId) will be faster as
>> there are very few duplicates, but I don't know if the difference is
>> significant.
>>
>> It would be nice to have a deleteDocument(int docId) in IndexWriter.
>> It seems like it would be easy to add, as DocumentsWriter already has a
>> deletedDocID.  I can file a jira and submit a patch if this is something
>> that you guys would accept.
>>
>> Sean
>>
>> On Thu, Jul 12, 2012 at 11:53 AM, Simon Willnauer
>> <simon.willna...@gmail.com> wrote:
>> > On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> >> Thanks for the tip.
>> >>
>> >> Does using updateDocument instead of addDocument affect
>> >> indexing/search performance?
>> >
>> > It does affect indexing performance compared to addDocument, but that
>> > might be minor compared to your analysis chain. I wouldn't worry about
>> > updateDocument; it's the only sensible way to use Lucene, really. Why
>> > didn't you use this before, any reason? What is your ingest rate / doc
>> > throughput, and where would you get concerned?
>> >
>> > simon
>> >>
>> >> Sean
>> >>
>> >> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>> >>> The trick is to index not with addDocument(Document) but instead
>> >>> with updateDocument(Term, Document). Lucene then adds the document
>> >>> atomically while deleting any previous documents with the given term
>> >>> (which is your unique ID). If the key does not exist, it simply
>> >>> indexes without deleting anything.
>> >>> This way you always have only one document with the same Term (== your
>> >>> unique ID).
>> >>>
>> >>> Uwe
>> >>>
>> >>> -----
>> >>> Uwe Schindler
>> >>> H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
>> >>> eMail: u...@thetaphi.de
>> >>>
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Sean Bridges [mailto:sean.brid...@gmail.com]
>> >>>> Sent: Thursday, July 12, 2012 5:42 PM
>> >>>> To: java-user@lucene.apache.org; simon.willna...@gmail.com
>> >>>> Subject: Re: delete by docid in lucene 4
>> >>>>
>> >>>> We have indexer machines which are fed documents by other machines.
>> >>>> If an error occurs (machine crashing etc.) the same document may be
>> >>>> sent to an indexer multiple times.  Serial ids are assigned before
>> >>>> documents reach the indexer, so a document may be in the index
>> >>>> multiple times, each time with the same serial id.
>> >>>>
>> >>>> When the index gets large enough, the indexer will stop writing to
>> >>>> the index and upload it to another machine, which keeps the index
>> >>>> forever.  Before we upload the index, we forceMerge(1) on it and
>> >>>> gather some stats about the index, like max and min serial id and
>> >>>> total documents.  While calculating max and min serial id, if we see
>> >>>> a duplicate serial id, we call IndexReader.deleteByDocId(...).
>> >>>>
>> >>>> We could check for duplicate serial ids while indexing, but that is
>> >>>> racy and not as efficient.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Sean
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>> >>>> <simon.willna...@gmail.com> wrote:
>> >>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> >>>> >> Is it possible to delete by docId in Lucene 4?  I can delete by
>> >>>> >> docId in Lucene 3 using IndexReader.deleteDocument(int docId), but
>> >>>> >> that method is gone in Lucene 4, and IndexWriter only allows
>> >>>> >> deleting by Term or Query.
>> >>>> >
>> >>>> > that is correct. In lucene 4 IndexReader is really just a reader!
>> >>>> >>
>> >>>> >> This is our use case: in our system, each document is identified
>> >>>> >> by a unique serial id.  If an error occurs, we may index the same
>> >>>> >> message multiple times.  When an index grows large enough, we stop
>> >>>> >> adding to it and optimize the index.  During optimization, if we
>> >>>> >> see multiple docs with the same serial id, we delete all but the
>> >>>> >> first, as all documents with the same serial id are the same.
>> >>>> >
>> >>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>> >>>> > method? Do you rely on multiple versions of the same doc? With
>> >>>> > Lucene 4, relying on the doc ID can become very tricky. If you use
>> >>>> > multiple threads you create a lot of segments, which can be merged
>> >>>> > in any order. You can't tell if a document ID maintains
>> >>>> > happened-before semantics at all.
>> >>>> >
>> >>>> > Can you tell us more about your use case and why you are using
>> >>>> > deleteByDocID?
>> >>>> >
>> >>>> > simon
>> >>>> >
>> >>>> >
>> >>>> >>
>> >>>> >> Thanks,
>> >>>> >>
>> >>>> >> Sean
>> >>>> >>
>> >>>> >>
>> >>>> >> ---------------------------------------------------------------------
>> >>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>>> >>

