Thanks Uwe, this will help us to clarify details and make a decision.

> If you write your own data files into a given index directory, it is very
> likely that you may corrupt your index. In later Lucene versions (5.x) we
> are very strict with not allowing files in the index directory, which were
> not created by Lucene. So it is better to have your repository data files
> completely separated from the index. Alternatively implement your
> additional repository data as a Lucene Codec or better store it in
> docvalues or stored fields, then it is completely under control of Lucene
> and commits work as expected. Why do you need your own logic to write the
> repository files into the index directory? Lucene is a perfect datastore,
> too. If you use it, you also make sure commits on index and your repository
> data is in a consistent state after committing.

Sure, we keep the index and the repository in different directories, like this:

parent: bucket_N
  child: repository (it has 2 files: a compressed "main" file and an uncompressed "partial" file; a background process merges "partial" into the main file)
  child: index (it uses segment number + offset to refer to a document in the repository; we don't use docvalues or stored fields for that)

As I understand it, these are the reasons we have a repository instead of "Lucene as a datastore":

- Our data is a stream, a sort of data feed. Having the raw data gives us the opportunity to archive it without indexes, export it, etc. We have no other store from which we could retrieve it again.
- Also, our application parses the incoming data stream and saves it into the repository first, and only then calls Lucene to produce the indexes (this reduces the risk of data loss).
- I am not sure, but maybe Lucene didn't provide good compression in old versions and this is our legacy (we started with version 3). I wasn't around when the decision was made.
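
To illustrate the "repository first, index second" flow described above, here is a minimal sketch in plain Java. It is an assumption-laden illustration, not the actual implementation: the class name `PartialRepository`, the length-prefixed record format, and the use of a bare byte offset (without a segment number) are all hypothetical. It only shows how a record could be appended to the uncompressed "partial" file and later fetched by the offset that the index would keep.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: append raw records to "partial" and read them back by offset. */
class PartialRepository {
    private final Path partialFile;

    /** bucketDir plays the role of bucket_N; the repository lives in its own child directory. */
    PartialRepository(Path bucketDir) throws IOException {
        Path repoDir = bucketDir.resolve("repository");
        Files.createDirectories(repoDir);
        this.partialFile = repoDir.resolve("partial");
    }

    /** Appends one length-prefixed record and returns the offset at which it starts. */
    long append(String record) throws IOException {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile(partialFile.toFile(), "rw")) {
            long offset = raf.length();
            raf.seek(offset);
            raf.writeInt(bytes.length); // 4-byte length prefix (assumed format)
            raf.write(bytes);
            raf.getFD().sync();         // flush to disk before the record is indexed
            return offset;
        }
    }

    /** Reads back the record that starts at the given offset. */
    String read(long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(partialFile.toFile(), "r")) {
            raf.seek(offset);
            int length = raf.readInt();
            byte[] bytes = new byte[length];
            raf.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

In this sketch the caller would invoke `append` first and pass the returned offset to the indexing step only after the write has succeeded, so a failure during indexing never loses the raw record.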