david m wrote:

A couple of reasons that led to the merge approach:

- Source documents are written to archive media and retrieval is
 relatively slow. Add to that our processing pipeline (including
 text extraction)... Retrieving and merging minis is faster than
 re-processing and re-indexing from sources.

- In addition to index recovery, mini indexes may be combined into
 custom indexes based on policy.

 From a compliance viewpoint the mini indexes contain logically
 related documents. For example: based on a retention policy,
 documents of type x are to be kept for y years.

 One example for constructing a custom index would be for legal
 discovery.
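The merge-and-filter idea above can be sketched abstractly. The toy Python below is not Lucene's actual API and every name in it is made up; it just shows how several mini inverted indexes could be merged into one, with a policy predicate carving out a custom index (e.g. a legal-discovery set):

```python
# Toy sketch (not Lucene): each "mini" index maps term -> set of doc ids.
# A policy predicate decides which documents enter the merged index.

def merge_minis(minis, keep=lambda doc: True):
    """Merge mini indexes into one combined index.

    minis: list of dicts mapping term -> set of doc ids.
    keep:  policy predicate over doc ids, used to build a custom
           index (e.g. only documents in a discovery set).
    """
    combined = {}
    for mini in minis:
        for term, docs in mini.items():
            kept = {d for d in docs if keep(d)}
            if kept:
                combined.setdefault(term, set()).update(kept)
    return combined

# Two minis covering different document batches (hypothetical data):
mini_a = {"contract": {"doc1", "doc2"}, "invoice": {"doc2"}}
mini_b = {"contract": {"doc3"}, "memo": {"doc4"}}

# Full recovery: merge everything.
full = merge_minis([mini_a, mini_b])

# Custom index restricted to a discovery set of documents:
discovery = merge_minis([mini_a, mini_b],
                        keep={"doc2", "doc3"}.__contains__)
```

In real Lucene the heavy lifting would instead be done by the index-merging machinery (e.g. adding existing indexes into a target `IndexWriter`), but the shape of the operation is the same: union the postings, subject to policy.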

I see -- it sounds like the "minis" are there for several application-specific reasons besides backup and recovery. Your scheme sounds like it might be a clever leveraging of everything you did to meet all those other requirements.

For the Lucene projects I've been on, the aggregate size of the source data was about the same as that of the resulting indexes. In your case I'd guess that the aggregate size of the minis is somewhat larger than the final index, due to duplication of terms. Anyhow, in my projects, recovery is much faster from a backup of the (final) index than from a backup of upstream data followed by reprocessing. It sounds like you've already measured the relevant parameters, though, so maybe my projects' data sets have very different characteristics.

Good luck on your project!

--MDC
