Thanks Uwe, this will help us to clarify details and make a decision.

> If you write your own data files into a given index directory, it is very
> likely that you may corrupt your index. In later Lucene versions (5.x) we
> are very strict with not allowing files in the index directory, which were
> not created by Lucene. So it is better to have your repository data files
> completely separated from the index. Alternatively implement your
> additional repository data as a Lucene Codec or better store it in
> docvalues or stored fields, then it is completely under control of Lucene
> and commits work as expected. Why do you need your own logic to write the
> repository files into the index directory? Lucene is a perfect datastore,
> too. If you use it, you also make sure commits on index and your repository
> data is in a consistent state after committing.
>

Sure, we keep the index and the repository in different directories, like this:

parent: bucket_N
     child: repository (it has 2 files: a compressed "main" and an
uncompressed "partial"; a background process merges "partial" into the
"main" file)
     child: index (it uses segment number + offset to refer to the document
in the repository; we don't use docvalues/stored fields for that)
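
To illustrate the "segment number + offset" reference, here is a minimal sketch in Python. It is not our actual code: the file naming (`segment_N.dat`), the length-prefixed record format and all function names are assumptions made up for this example, and compression is left out.

```python
import os
import struct
import tempfile

# Assumed record format: 4-byte big-endian length, then the raw payload.
_LEN = struct.Struct(">I")


def segment_path(repo_dir: str, segment: int) -> str:
    """Map a segment number to a repository file (hypothetical naming)."""
    return os.path.join(repo_dir, f"segment_{segment}.dat")


def append_record(path: str, payload: bytes) -> int:
    """Append one record and return the byte offset where it starts."""
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)
        f.write(_LEN.pack(len(payload)))
        f.write(payload)
    return offset


def read_record(path: str, offset: int) -> bytes:
    """Read back the record that starts at the given byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = _LEN.unpack(f.read(_LEN.size))
        return f.read(length)


if __name__ == "__main__":
    repo_dir = tempfile.mkdtemp()
    path = segment_path(repo_dir, 0)
    offset = append_record(path, b"raw document bytes")
    # The index would only need to keep the pair (0, offset) for this document.
    print(read_record(path, offset))
```

The point is that the index only has to carry a small (segment, offset) pair per document, while the raw bytes stay in the repository files.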

As I understand it, these are the reasons we have a repository instead of
"Lucene as a datastore":
- Our data is a stream, a sort of data feed. Having the raw data gives us the
opportunity to archive it without indexes, export it, etc. We have no other
store from which we could retrieve it again.
- Also, our application parses the incoming data stream and saves it into the
repository first, and only then calls Lucene to produce the indexes (this
reduces the risk of data loss).
- I am not sure, but maybe Lucene didn't provide good compression in old
versions and this is our legacy (we started with version 3). I wasn't
around when the decision was made.
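
The "repository first, index second" ordering from the second point could look roughly like this in Python. Again, this is only a sketch under assumptions: `ingest` and `index_fn` are hypothetical names, `index_fn` stands in for whatever code calls Lucene, and the newline-delimited record format is chosen just to keep the example short.

```python
import os
import tempfile


def ingest(repo_path: str, payload: bytes, index_fn) -> int:
    """Persist the raw record first, then hand it to the indexer.

    If index_fn raises, the record is already on disk in the repository,
    so it can be re-indexed later instead of being lost.
    """
    with open(repo_path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)
        f.write(payload + b"\n")
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk before indexing
    index_fn(offset, payload)  # hypothetical stand-in for the Lucene call
    return offset


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "partial")
    ingest(path, b"some raw record", lambda off, data: print("indexed at", off))
```

With this ordering a failure in the indexing step leaves the raw data intact, which is the risk reduction mentioned above.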
