Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Michael McCandless
Daniel Noll wrote: On Monday 17 March 2008 19:38:46 Michael McCandless wrote: Well ... expungeDeletes() first forces a flush, at which point the deletions are flushed as a .del file against the just flushed segment. Still, if you call expungeDeletes after every flush (commit) then it's only 1

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Daniel Noll
On Monday 17 March 2008 19:38:46 Michael McCandless wrote: > Well ... expungeDeletes() first forces a flush, at which point the > deletions are flushed as a .del file against the just flushed > segment. Still, if you call expungeDeletes after every flush > (commit) then it's only 1 segment whose d

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Michael McCandless
Daniel Noll wrote: On Thursday 13 March 2008 19:46:20 Michael McCandless wrote: But, when a normal merge of segments with deletions completes, your docIDs will shift. In trunk we now explicitly compute the docID shifting that happens after a merge, because we don't always flush pending delete

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-16 Thread Daniel Noll
On Thursday 13 March 2008 19:46:20 Michael McCandless wrote: > But, when a normal merge of segments with deletions completes, your > docIDs will shift. In trunk we now explicitly compute the docID > shifting that happens after a merge, because we don't always flush > pending deletes when flushing

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Michael Busch
Daniel Noll wrote: For interest's sake I also timed fetching the document with no FieldSelector, that takes around 410ms for the same documents. So there is still a big benefit in using the field selector, it just isn't anywhere near enough to get it close to the time it takes to retrieve th

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
On Thu, Mar 13, 2008 at 9:30 PM, Doron Cohen <[EMAIL PROTECTED]> wrote: > Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). > I suspect this can be related to the problem you see though I am not sure. > Could you try with the patch there? > Thanks, > Doron Daniel, I was wrong about

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). I suspect this can be related to the problem you see though I am not sure. Could you try with the patch there? Thanks, Doron On Thu, Mar 13, 2008 at 10:46 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > Daniel Noll wrote: > >

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Michael McCandless
Daniel Noll wrote: On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote: OK, I think very likely this is the issue: when IndexWriter hits an exception while processing a document, the portion of the document already indexed is left in the index, and then its docID is marked for deletio

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Daniel Noll
On Thursday 13 March 2008 00:42:59 Erick Erickson wrote: > I certainly found that lazy loading changed my speed dramatically, but > that was on a particularly field-heavy index. > > I wonder if TermEnum/TermDocs would be fast enough on an indexed > (UN_TOKENIZED???) field for a unique id. > > Mostl

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Daniel Noll
On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote: > OK, I think very likely this is the issue: when IndexWriter hits an > exception while processing a document, the portion of the document > already indexed is left in the index, and then its docID is marked > for deletion. You can see

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Erick Erickson
I certainly found that lazy loading changed my speed dramatically, but that was on a particularly field-heavy index. I wonder if TermEnum/TermDocs would be fast enough on an indexed (UN_TOKENIZED???) field for a unique id. Mostly, I'm hoping you'll try this and tell me if it works so I don't have

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Michael McCandless
Daniel Noll wrote: I have filtered out lines in the log which indicated an exception adding the document; these occur when our Reader throws an IOException and there were so many that it bloated the file. OK, I think very likely this is the issue: when IndexWriter hits an exception whil

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Wednesday 12 March 2008 10:20:12 Michael McCandless wrote: > Oh, so you do not see the problem with SerialMergeScheduler but you > do with ConcurrentMergeScheduler? [...] > Oh, there are no deletions? Then this is very strange. Is it > optimize that messes up the docIDs? Or, is it when you

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote: > But to me, it always seems...er...fraught to even *think* about relying > on doc ids. I know you've been around the block with Lucene, but do you > have a compelling reason to use the doc ID and not your own unique ID? From memory it was

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Michael McCandless
Daniel Noll wrote: On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote: Hi Daniel, 2.3 should be no different from 2.2 in that docIDs only "shift" when a merge of segments with deletions completes. Could it be the ConcurrentMergeScheduler? Merges now run in the background by default

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Erick Erickson
But to me, it always seems...er...fraught to even *think* about relying on doc ids. I know you've been around the block with Lucene, but do you have a compelling reason to use the doc ID and not your own unique ID? Best Erick On Tue, Mar 11, 2008 at 5:39 PM, Daniel Noll <[EMAIL PROTECTED]> wrote:

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote: > Hi Daniel, > > 2.3 should be no different from 2.2 in that docIDs only "shift" when > a merge of segments with deletions completes. > > Could it be the ConcurrentMergeScheduler? Merges now run in the > background by default and commit w

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Michael McCandless
Hi Daniel, 2.3 should be no different from 2.2 in that docIDs only "shift" when a merge of segments with deletions completes. Could it be the ConcurrentMergeScheduler? Merges now run in the background by default and commit whenever they complete. You can get back to the previous (block

Re: document Id question, again

2008-01-31 Thread Michael McCandless
DocIDs change whenever segments that had deletes pending get merged. So if you have no deletions, docIDs won't ever change. Mike Cam Bazz wrote: Hello; If no document is ever deleted nor updated from an index, will the document id change? Under which circumstances will the document ids c
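
The rule stated in this message can be illustrated with a small toy model. The sketch below is not Lucene code: the class name DocIdShiftSketch and the method mergeRemap are hypothetical, and it assumes that a merge simply drops deleted documents and renumbers the survivors consecutively in segment order. Under that assumption, every document after a deleted one shifts down, and with no deletions the mapping is the identity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of docID renumbering on merge (hypothetical names; not Lucene code). */
public class DocIdShiftSketch {

    /**
     * Merges segments in order and returns, for each old global docID,
     * its new docID after the merge, or -1 if the document was deleted.
     * deleted.get(i)[j] is true when doc j of segment i is marked deleted.
     */
    public static int[] mergeRemap(List<boolean[]> deleted) {
        int total = 0;
        for (boolean[] seg : deleted) {
            total += seg.length;
        }
        int[] remap = new int[total];
        int oldId = 0;
        int newId = 0;
        for (boolean[] seg : deleted) {
            for (boolean isDeleted : seg) {
                // Assumption: deleted docs vanish, survivors are numbered consecutively.
                remap[oldId++] = isDeleted ? -1 : newId++;
            }
        }
        return remap;
    }

    public static void main(String[] args) {
        List<boolean[]> segments = new ArrayList<>();
        segments.add(new boolean[] {false, true, false}); // old docIDs 0..2, doc 1 deleted
        segments.add(new boolean[] {false, false});       // old docIDs 3..4, no deletions
        System.out.println(Arrays.toString(mergeRemap(segments))); // prints [0, -1, 1, 2, 3]
    }
}
```

In this model, old docIDs 2, 3 and 4 become 1, 2 and 3 once the deleted document is merged away, which is why the thread participants recommend looking documents up by an application-level ID field rather than by docID.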

RE: Document ID

2005-06-25 Thread Chris Hostetter
: The simple question - I have a document and I add it into the index with : TermVector support. : How can I simply retrieve the TermVector information for the document? : : TermFreqVector vector = reader.getTermFreqVector(document)? : reader.delete(document); : Etc.. Open an IndexReader,

RE: Document ID

2005-06-25 Thread Pasha Bizhan
Hi, > From: Erik Hatcher [mailto:[EMAIL PROTECTED] > For a domain-centric identifier, use a custom field to store > (and index perhaps?) it. Lucene's Document id's are internal > and not controllable. Unfortunately, Lucene contains APIs that are strongly attached to the internal id :( For example -

Re: Document ID

2005-06-24 Thread Mario Ivankovits
Hi! Is there any way to force the document id inside the lucene index, if I have my own internal numbering scheme, it would be nice to have that reflected inside the lucene index...anyway? Simply put your ID as an additional field in your document. You should never rely on Lucene's document id as

Re: Document ID

2005-06-24 Thread Erik Hatcher
On Jun 24, 2005, at 3:08 PM, Yousef Ourabi wrote: Hello: Is there any way to force the document id inside the lucene index, if I have my own internal numbering scheme, it would be nice to have that reflected inside the lucene index...anyway? For a domain-centric identifier, use a custom field