My application needs to simultaneously process record additions and updates with one pass through a database. That's not in itself a problem: I open an IndexReader on the existing index to mark the prior versions of updated records as deleted Documents, and an IndexWriter on a new empty index to accept the new and updated records as new Documents. Then it's simply a matter of merging the two indexes into one.
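To make that concrete, here is a minimal self-contained sketch of the one-pass scheme. It is written against a Lucene 2.x-style API (IndexReader.deleteDocuments etc.), not necessarily the exact trunk revision noted below, and the RAMDirectory indexes and the "key" field are illustrative stand-ins for my real setup:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class OnePassUpdate {
    // Build a Document keyed by a unique, untokenized "key" field.
    static Document doc(String key, String body) {
        Document d = new Document();
        d.add(new Field("key", key, Field.Store.YES, Field.Index.UN_TOKENIZED));
        d.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
        return d;
    }

    // Returns the number of live (undeleted) Documents left in the existing index.
    public static int run() {
        try {
            RAMDirectory existingDir = new RAMDirectory(); // stands in for the large on-disk index
            RAMDirectory updateDir   = new RAMDirectory(); // new, empty index for this batch

            // Seed the "existing" index with two prior records.
            IndexWriter seed = new IndexWriter(existingDir, new StandardAnalyzer(), true);
            seed.addDocument(doc("1", "original one"));
            seed.addDocument(doc("2", "original two"));
            seed.close();

            // One pass: mark the prior version of an updated record as deleted
            // in the existing index, while writing the new and updated records
            // as new Documents into the new index.
            IndexReader reader = IndexReader.open(existingDir);
            IndexWriter writer = new IndexWriter(updateDir, new StandardAnalyzer(), true);
            reader.deleteDocuments(new Term("key", "1")); // prior version of updated record
            writer.addDocument(doc("1", "updated one"));  // updated record
            writer.addDocument(doc("3", "brand new"));    // brand-new record
            reader.close();
            writer.close();

            IndexReader check = IndexReader.open(existingDir);
            int live = check.numDocs(); // numDocs() excludes deleted Documents
            check.close();
            return live;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

All that remains after this pass is merging updateDir into existingDir, which is where the trouble starts.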

One possibility is to directly merge from the IndexReader (existing index) into the IndexWriter (new index). The problem is that it takes a very long time, on the order of several hours, since it must replicate everything in the old index (2.7Gb, 2.2M Documents) in the new one. Since the update consists of a small number of records, typically under 1000 and sometimes just a handful, this is a waste of cycles.

Anticipating that, my code reverses the relationship, closing the index objects at the end of the loop then opening an IndexWriter on the existing (target) index and an IndexReader on the index which has the new records, which gets merged into the target. It functions as expected and desired, but is also far more time-consuming than reasonable.
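In code, the reversed merge looks roughly like this (same sketch conventions as above: a 2.x-style API with small illustrative RAMDirectory indexes). The addIndexes() call at the end is the step that proves so expensive:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ReversedMerge {
    static Document doc(String key) {
        Document d = new Document();
        d.add(new Field("key", key, Field.Store.YES, Field.Index.UN_TOKENIZED));
        return d;
    }

    // Merge a small update index into the existing target index; returns the
    // resulting Document count in the target.
    public static int run() {
        try {
            RAMDirectory existingDir = new RAMDirectory(); // the large target index
            RAMDirectory updateDir   = new RAMDirectory(); // the small index of new records

            IndexWriter seed = new IndexWriter(existingDir, new StandardAnalyzer(), true);
            seed.addDocument(doc("1"));
            seed.addDocument(doc("2"));
            seed.close();

            IndexWriter batch = new IndexWriter(updateDir, new StandardAnalyzer(), true);
            batch.addDocument(doc("3"));
            batch.close();

            // Reopen a writer on the existing (target) index with create=false
            // and merge the update index in. addIndexes() is where the implicit
            // optimize() calls make even a tiny merge rewrite the whole target.
            IndexWriter writer = new IndexWriter(existingDir, new StandardAnalyzer(), false);
            writer.addIndexes(new Directory[] { updateDir });
            writer.close();

            IndexReader check = IndexReader.open(existingDir);
            int n = check.numDocs();
            check.close();
            return n;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```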

For example, if I take an empty (no Document) or small (100 Documents) index and merge one or other via IndexWriter.addIndexes() into an already optimized single-segment index, what should in theory be a null or trivial operation ends up in practice unpacking the destination index into many segments, and then repacking them into a single segment. For that 2.7Gb index it takes over an hour.

In contrast, it takes only a few seconds to add those same 100 Documents directly to the index via addDocument(), as long as one doesn't optimize.
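For comparison, the fast path is simply appending through a writer opened on the existing index (create=false) and closing it without calling optimize(), again sketched against a 2.x-style API with an illustrative RAMDirectory:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class DirectAppend {
    // Returns the Document count after appending 100 Documents directly.
    public static int run() {
        try {
            RAMDirectory dir = new RAMDirectory(); // stands in for the existing index

            IndexWriter seed = new IndexWriter(dir, new StandardAnalyzer(), true);
            Document d0 = new Document();
            d0.add(new Field("key", "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
            seed.addDocument(d0);
            seed.close();

            // Open the existing index for appending and add the 100 new
            // Documents directly; closing without optimize() just leaves the
            // new segment(s) alongside the old ones, so the existing data is
            // never rewritten.
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            for (int i = 1; i <= 100; i++) {
                Document d = new Document();
                d.add(new Field("key", "new" + i, Field.Store.YES, Field.Index.UN_TOKENIZED));
                writer.addDocument(d);
            }
            writer.close(); // no optimize() call

            IndexReader check = IndexReader.open(dir);
            int n = check.numDocs();
            check.close();
            return n;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```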

So... I notice that both IndexWriter.addIndexes(...) merge methods start and end with calls to optimize() on the target index. I'm not sure whether that is what causes the unpacking and repacking I observe, but it does make me wonder whether those calls truly need to be there:

- Is starting off with zero or 1 segment (per a comment in the source code) truly a precondition for successful merging?

- Since one can open an unoptimized index and add millions of Documents (creating potentially hundreds of segments) via addDocument, with optimization left entirely optional and up to the user, why is optimization required as a postcondition for merging?

Any advice on this, or on the general application design, would be appreciated.

Thanks,
J.J. Larrea

PS: This was tested against SVN trunk revision 329490
