david m wrote:
A couple of reasons led us to the merge approach:
- Source documents are written to archive media, so retrieval is
relatively slow. Add our processing pipeline (including text
extraction) on top of that, and retrieving and merging minis is
faster than re-processing and re-indexing from the sources.
- In addition to index recovery, mini indexes may be combined into
custom indexes based on policy.
From a compliance viewpoint, the mini indexes contain logically
related documents. For example, a retention policy might require
that documents of type x be kept for y years.
One example of constructing such a custom index would be for legal
discovery.
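The merge step described above can be sketched with Lucene's
IndexWriter.addIndexes API, which copies existing index segments into a
destination index. This is only a sketch: the directory paths, analyzer
choice, and selection policy below are illustrative assumptions, not
details from this thread.

```java
// Sketch: combining several "mini" indexes into one custom index.
// Assumes Lucene's IndexWriter.addIndexes(Directory...); the paths,
// analyzer, and selection policy are illustrative, not from the thread.
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeMinis {
    public static void main(String[] args) throws Exception {
        // Destination index, e.g. a custom index built for legal discovery.
        try (Directory dest = FSDirectory.open(Paths.get("custom-index"));
             IndexWriter writer = new IndexWriter(dest,
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            // Mini indexes selected by policy (hypothetical layout).
            Directory[] minis = {
                FSDirectory.open(Paths.get("minis/2004-q1")),
                FSDirectory.open(Paths.get("minis/2004-q2")),
            };

            // Copies the minis' segments into the destination index
            // without re-processing or re-indexing the source documents.
            writer.addIndexes(minis);
        }
    }
}
```

Because addIndexes works at the segment level, the expensive text
extraction and analysis done when the minis were first built is not
repeated, which is the speed advantage described above.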
I see -- it sounds like the "minis" exist for several
application-specific reasons besides backup and recovery. Your scheme
sounds like it may be a clever way of leveraging everything you did to
meet all those other requirements.
For the Lucene projects I've been on, the aggregate size of the source
data was about the same as the resulting indexes. In your case I'd
guess that the aggregate size of the minis is somewhat larger than the
final index, due to duplication of terms. Anyhow, in my projects,
recovery is much faster from a backup of the (final) index than from a
backup of upstream data followed by reprocessing. It sounds like you've
already measured the relevant parameters, though, so maybe my projects'
data sets have very different characteristics.
Good luck on your project!
--MDC