david m wrote:

A couple of reasons that led to the merge approach:

- Source documents are written to archive media and retrieval is
 relatively slow. Add to that our processing pipeline (including
 text extraction)... Retrieving and merging minis is faster than
 re-processing and re-indexing from sources.

- In addition to index recovery, mini indexes may be combined into
 custom indexes based on policy.

 From a compliance viewpoint the mini indexes contain logically
 related documents. For example: based on a retention policy,
 documents of type x are to be kept for y years.

 One example for constructing a custom index would be for legal
 discovery.
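The merge-and-filter idea above can be sketched abstractly. The toy Python below is not Lucene's actual API and every name in it is made up; it just shows how several mini inverted indexes could be merged into one, with a policy predicate carving out a custom index (e.g. a legal-discovery set):

```python
# Toy sketch (not Lucene): each "mini" index maps term -> set of doc ids.
# A policy predicate decides which documents enter the merged index.

def merge_minis(minis, keep=lambda doc: True):
    """Merge mini indexes into one combined index.

    minis: list of dicts mapping term -> set of doc ids.
    keep:  policy predicate over doc ids, used to build a custom
           index (e.g. only documents in a discovery set).
    """
    combined = {}
    for mini in minis:
        for term, docs in mini.items():
            kept = {d for d in docs if keep(d)}
            if kept:
                combined.setdefault(term, set()).update(kept)
    return combined

# Two minis covering different document batches (hypothetical data):
mini_a = {"contract": {"doc1", "doc2"}, "invoice": {"doc2"}}
mini_b = {"contract": {"doc3"}, "memo": {"doc4"}}

# Full recovery: merge everything.
full = merge_minis([mini_a, mini_b])

# Custom index restricted to a discovery set of documents:
discovery = merge_minis([mini_a, mini_b],
                        keep={"doc2", "doc3"}.__contains__)
```

In real Lucene the heavy lifting would instead be done by the index-merging machinery (e.g. adding existing indexes into a target `IndexWriter`), but the shape of the operation is the same: union the postings, subject to policy.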

I see -- it sounds like the "minis" are there for several application-specific reasons besides backup and recovery. Your scheme sounds like it might be a clever leveraging of everything you did to meet all those other requirements.

For the Lucene projects I've been on, the aggregate size of the source data was about the same as that of the resulting indexes. In your case I'd guess that the aggregate size of the minis is somewhat larger than the final index, due to duplication of terms. Anyhow, in my projects, recovery is much faster from a backup of the (final) index than from a backup of upstream data followed by reprocessing. It sounds like you've already measured the relevant parameters, though, so maybe my projects' data sets have very different characteristics.

Good luck on your project!

--MDC
