Re: a faster way to addDocument and get the ID just added?

Li Li Tue, 29 Mar 2011 23:15:22 -0700

A merge will also change doc IDs.
Every segment numbers its docs starting from 0.
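
As a rough sketch of that point (plain Java, no Lucene classes; SegmentDocIds, docBases, globalId and mergeRemap are made-up names for illustration): an index-wide doc ID can be thought of as the segment's base offset plus the segment-local ID, so anything that changes the segment layout can shift the IDs after it. This toy model only covers a merge that drops deleted docs. Whether a real merge renumbers anything depends on deletions and on the merge policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Lucene code: each segment is an array of "live"
// flags, and docs are numbered from 0 within their own segment.
final class SegmentDocIds {

    // Base offset of a segment = total doc count of the segments before it.
    static int[] docBases(List<boolean[]> segments) {
        int[] bases = new int[segments.size()];
        int base = 0;
        for (int i = 0; i < segments.size(); i++) {
            bases[i] = base;
            base += segments.get(i).length;
        }
        return bases;
    }

    // Index-wide ID = segment base + segment-local ID.
    static int globalId(int[] bases, int segment, int localId) {
        return bases[segment] + localId;
    }

    // Merge everything into one segment, dropping deleted docs.
    // Returns old global ID -> new global ID, or -1 for a deleted doc.
    static int[] mergeRemap(List<boolean[]> segments) {
        int total = 0;
        for (boolean[] live : segments) {
            total += live.length;
        }
        int[] remap = new int[total];
        int oldId = 0;
        int newId = 0;
        for (boolean[] live : segments) {
            for (boolean isLive : live) {
                remap[oldId++] = isLive ? newId++ : -1;
            }
        }
        return remap;
    }

    public static void main(String[] args) {
        List<boolean[]> segments = new ArrayList<>();
        segments.add(new boolean[] {true, false, true}); // local IDs 0..2, doc 1 deleted
        segments.add(new boolean[] {true, true});        // local IDs 0..1 again
        int[] bases = docBases(segments);
        System.out.println(globalId(bases, 1, 0));   // prints 3
        System.out.println(mergeRemap(segments)[3]); // prints 2
    }
}
```

In this model the doc that was globally 3 before the merge becomes 2 afterwards, which is the kind of shift a stored doc ID cannot survive.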

2011/3/30 Trejkaz <trej...@trypticon.org>:
> On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> I'm always skeptical of storing the doc IDs since they can
>> change out from underneath you (just delete even a single
>> document and optimize).
>
> We never delete documents.  Even when a feature request came in to
> update documents (i.e. delete the old one and add a new version), we
> ended up keeping the old version around, partially because we didn't
> want the IDs to shift (which is a bit of a recursive argument), but
> also because it's forensically sound to have the previous versions
> around so people can see what edits were made.
>
>> What is it you're doing with the doc ID that you couldn't do with
>> the guid? If your "guid list" were ordered, I can imagine building
>> filters quite quickly from it using TermDocs.skipTo, for instance.
>
> The main problem with filters is that DocIdBitSet's iterator has to
> return the doc IDs in order.
>
> Even if our GUIDs are in order (they would be, as it would be the
> primary key on tables using them), they won't be in the same order as
> the IDs of the docs they came from.  So for each row in the ResultSet,
> you need to do a TermDocs.seek(Term).  This not only costs the
> additional I/O (and it's a lot more than the original database query
> was), but you have to read every row in the ResultSet just to get the
> first doc ID.
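
The ordering problem described above can be sketched like this, assuming a plain Map stands in for the per-GUID index lookup (GuidFilterSketch and sortedDocIds are hypothetical names, and nothing here is Lucene API). Every row has to be resolved before the smallest doc ID is known, because GUID order says nothing about doc ID order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GuidFilterSketch {

    // guidToDocId is an assumption: it stands in for one index seek per GUID.
    static int[] sortedDocIds(List<String> guidsInDbOrder,
                              Map<String, Integer> guidToDocId) {
        int[] docIds = new int[guidsInDbOrder.size()];
        int n = 0;
        for (String guid : guidsInDbOrder) {
            Integer docId = guidToDocId.get(guid); // one lookup per result row
            if (docId != null) {
                docIds[n++] = docId;
            }
        }
        // Only after the whole result set is resolved can the IDs be put in
        // the ascending order an iterator has to return them in.
        int[] result = Arrays.copyOf(docIds, n);
        Arrays.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new HashMap<>();
        index.put("guid-a", 7);
        index.put("guid-b", 2);
        index.put("guid-c", 5);
        List<String> rows = Arrays.asList("guid-a", "guid-b", "guid-c");
        System.out.println(Arrays.toString(sortedDocIds(rows, index))); // prints [2, 5, 7]
    }
}
```
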
>
> Contrast this with using doc IDs for the database query.  You don't
> need to hit the index at all since you already have the result.  And
> the docs come back in order, so you don't even have to iterate the
> entire result set - you can read the first 100 rows and then read more
> rows if/when they are needed.  And if the caller is using skipTo then
> this can be incorporated into the database query to avoid returning
> rows which are only going to be discarded anyway.
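
A minimal sketch of the paged approach described above, under stated assumptions: a sorted int[] of unique doc IDs stands in for the database table, the private query method stands in for something like "WHERE doc_id >= ? ORDER BY doc_id LIMIT ?", and PagedDocIdIterator is a hypothetical class whose nextDoc/skipTo names are only loosely modeled on Lucene's iterators.

```java
import java.util.Arrays;

final class PagedDocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] table;   // stand-in for the DB table, sorted ascending, unique
    private final int pageSize;
    private int[] page = new int[0];
    private int pos = 0;         // next unread entry in the current page
    private int lastDoc = -1;
    int rowsFetched = 0;         // how many rows the "database" has returned so far

    PagedDocIdIterator(int[] sortedDocIds, int pageSize) {
        this.table = sortedDocIds;
        this.pageSize = pageSize;
    }

    // Stand-in for: SELECT doc_id ... WHERE doc_id >= ? ORDER BY doc_id LIMIT ?
    private int[] query(int minDocId, int limit) {
        int from = Arrays.binarySearch(table, minDocId);
        if (from < 0) {
            from = -from - 1; // insertion point = first entry >= minDocId
        }
        int to = Math.min(table.length, from + limit);
        rowsFetched += to - from;
        return Arrays.copyOfRange(table, from, to);
    }

    int nextDoc() {
        return skipTo(lastDoc == NO_MORE_DOCS ? NO_MORE_DOCS : lastDoc + 1);
    }

    // Returns the first doc ID >= target, never moving backwards.
    int skipTo(int target) {
        if (lastDoc == NO_MORE_DOCS) {
            return NO_MORE_DOCS;
        }
        int t = Math.max(target, lastDoc + 1);
        while (pos < page.length && page[pos] < t) {
            pos++; // discard already-fetched rows below the target
        }
        if (pos == page.length) {
            page = query(t, pageSize); // the skip target goes into the query
            pos = 0;
        }
        lastDoc = (pos == page.length) ? NO_MORE_DOCS : page[pos++];
        return lastDoc;
    }

    public static void main(String[] args) {
        PagedDocIdIterator it =
            new PagedDocIdIterator(new int[] {1, 4, 9, 16, 25, 36, 49}, 2);
        System.out.println(it.nextDoc());   // prints 1
        System.out.println(it.skipTo(20));  // prints 25
        System.out.println(it.rowsFetched); // prints 4
    }
}
```

Rows 9 and 16 are never fetched in that run, since the skip target is pushed into the next query instead of being filtered on the client side.
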
>
> Integer fields should have improved things a little in terms of the
> amount of I/O required to do the query (at least I would hope that
> this is the case - I haven't done any tests yet and we can't use them
> yet for backwards compatibility reasons) but they don't remove the
> problem of needing to iterate every document in the result set
> up-front.
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


