Hello,

Our application uses Lucene to index documents received from a back-end that supports storage of temporal data with branches, similar to revision control systems like SVN: when looking at a single object, one can retrieve the current state, go back to a previous point in time, or switch to an alternative timeline (branch) altogether. For indexing, we only consider the latest revision ("HEAD") of each object on these branches. Each branch's index is stored in a separate Directory, and the file system's directory layout mirrors the nesting of the branches.
From the indexes' point of view, creating a new branch ended up very similar to what SVN does as well: we used SnapshotDeletionPolicy to capture a snapshot of the parent writer, copied the files referenced in the resulting IndexCommit to the directory of the new branch, and released the snapshot. This method quickly became expensive in terms of disk space, because many branches are edited simultaneously while the number of changed documents per branch is usually small (the dataset has about 10 million documents and the index size is about 2.5 GB).

To share unchanged data between branches, we switched to another, customized FileDeletionPolicy for writers that keeps the branch creation points as IndexCommits in the parent index. For the branch index we use a custom Directory implementation, similar to FileSwitchDirectory, that supplies files either from the (writeable) branch directory or from the (read-only) IndexCommit of the parent. Attempts to sync or delete files that belong to the IndexCommit are treated as no-ops, and output files can only be created in the writeable part. This resulted in much better disk space utilization: after extensive editing, branch directories now typically grow from a few hundred kilobytes to a few megabytes each.

One issue appears when the parent IndexWriter's configured merge policy selects segments for merging from the part shared by two branches. The IndexFileDeleter cannot delete these segments after the merge, since another IndexCommit (the one representing the creation of the branch) still refers to them. This leaves both the merged and the unmerged copies of the same content in the same directory, which increases disk space usage over time.

Currently, the only way I see to prevent this is to create a filtering MergePolicy implementation that removes segments from the list of merge candidates if they come from these shared parts. Can you give me some pointers on what would be the best way to do so?
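To make the idea more concrete, here is a minimal sketch of the filtering step I have in mind, written in plain Java without any Lucene types. The class and method names (SharedSegmentFilter, protectedFiles, mergeCandidates) are hypothetical, and the sketch assumes that the file names of the retained branch-point commits (for example via IndexCommit.getFileNames()) and the file names belonging to each segment are already available as collections of strings. How this would be wired into an actual MergePolicy is exactly the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Lucene: decides which segments may be merged. */
final class SharedSegmentFilter {

    private SharedSegmentFilter() {}

    /**
     * Collects every file name referenced by the retained branch-point commits.
     * Each inner collection is assumed to hold the file names of one commit.
     */
    static Set<String> protectedFiles(Collection<? extends Collection<String>> branchPointCommits) {
        Set<String> files = new HashSet<>();
        for (Collection<String> commitFiles : branchPointCommits) {
            files.addAll(commitFiles);
        }
        return files;
    }

    /**
     * Returns the names of the segments that share no file with the protected set,
     * in the iteration order of the given map. Segments that are still referenced
     * by a branch-point commit are left out, so a merge would never rewrite them.
     */
    static List<String> mergeCandidates(Map<String, Set<String>> segmentFiles,
                                        Set<String> protectedFiles) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : segmentFiles.entrySet()) {
            if (Collections.disjoint(entry.getValue(), protectedFiles)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }
}
```

With a branch-point commit that references the files of segments _0 and _1, and a parent index that currently holds segments _0 to _3, only _2 and _3 would remain as merge candidates. The trade-off is that shared segments are never compacted in the parent for as long as any branch still depends on them.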
Thanks in advance,
András