Antony Bowesman wrote:
I have a design where I will be using multiple index shards to hold approx 7.5 million documents per index per month over many years. These will be large static R/O indexes but the corresponding smaller parallel index will get many frequent changes.

I understand from previous replies by Hoss that the technique to handle this is to use parallel indexes where the parallel index gets rebuilt periodically with the changing data.

However, this 'periodically' needs to be quite frequent to try to provide responsive changes to the index, potentially several times a dat. One problem is that there can be updates to any of the data in almost any month, so an update by a user to 120 documents, one document per month for 10 years, requires a full rebuild of the 120 index shards of 7.5m docs each...

I was wondering what the technical reasons were why a 'delete+add' could not allow the original docId to be re-used, thus keeping the two parallel indexes in sync without requiring a rebuild.

If this could be overcome, this would make this parallel index pattern so much more useful for large volume data sets.

Any thoughts

I have a thought ;) Perhaps you could use a FilteredIndexReader to maintain a map between new IDs and old IDs, and remap on the fly. Although I think that some parts of Lucene depend on the fact that in a normal index the IDs are monotonically increasing ... this would complicate the issue.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to