On 10/18/2013 1:08 AM, Shai Erera wrote:
>> The codec intercepts merges in order to clean up files that are no longer
>> referenced
>
> What happens if a document is deleted while there's a reader open on the
> index, and the segments are merged? Maybe I misunderstand what you meant by
> this statement, but if the external file is deleted, since the document is
> "pruned" from the index, how will the reader be able to read the stored
> fields from it? How do you track references to the external files?
Right now you get a FileNotFoundException, or a missing field value, depending on how you configure the codec. I believe the tests pass only because they don't check for the missing field value; certainly I have a test (like the one you wrote, but one that checks the field value explicitly) that exposes this problem.

My reasoning was that this is similar to the NFS situation: the user has to be aware of it and deal with it by installing an IndexDeletionPolicy that keeps old commits alive. I don't see what else can be done without some (possibly heavyweight) additional tracking/garbage-collection mechanism.
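For the record, the "keep old commits alive" approach amounts to an IndexDeletionPolicy that refuses to delete the last N commits, so a point-in-time reader can still resolve its files. Here is a self-contained sketch of that pattern; the `Commit` interface and all class names are stand-ins I made up for illustration, not Lucene's actual types (Lucene's real hook is IndexDeletionPolicy.onCommit, which receives the commits oldest-first and lets you call delete() on each):

```java
import java.util.List;

// Hypothetical stand-in for Lucene's IndexCommit: just enough to show the pattern.
interface Commit {
    void delete();          // ask the directory to remove this commit's files
    boolean isDeleted();
}

// Trivial in-memory Commit used for demonstration.
class SimpleCommit implements Commit {
    private boolean deleted = false;
    public void delete() { deleted = true; }
    public boolean isDeleted() { return deleted; }
}

// Mirrors the shape of a deletion policy that keeps the newest n commits
// so point-in-time readers (e.g. over NFS) can still resolve their files.
class KeepLastNPolicy {
    private final int n;
    KeepLastNPolicy(int n) { this.n = n; }

    // Called on every commit with all live commits, oldest first;
    // everything but the newest n is marked for deletion.
    void onCommit(List<? extends Commit> commits) {
        for (int i = 0; i < commits.size() - n; i++) {
            commits.get(i).delete();
        }
    }
}
```

With n large enough (or commits kept by age), an external reader always finds the files of the commit it was opened on.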

In our case (a document archive) this behavior may be acceptable, but it's certainly one of the main areas that concerns me. It would be nice if it were possible to receive an event when all outstanding readers for a commit have closed: that way we could clean up then, instead of at commit time. But I don't think Lucene works that way; at least I couldn't see how to do it, and given the discussion of NFS in IndexDeletionPolicy, I assumed it isn't possible.
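If one were willing to track this outside Lucene, the "clean up when the last reader closes" idea boils down to reference-counting commits per open reader. A minimal sketch, assuming we can hook reader open/close in our own code (all names here are hypothetical, not a Lucene API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;

// Hypothetical registry: each open reader pins the commit generation it was
// opened on; when the last reader on a generation closes, a cleanup callback
// (e.g. "unlink the external stored-field files of that generation") fires.
class CommitRefCounter {
    private final Map<Long, Integer> refs = new HashMap<>();
    private final LongConsumer onLastClose;

    CommitRefCounter(LongConsumer onLastClose) { this.onLastClose = onLastClose; }

    synchronized void readerOpened(long generation) {
        refs.merge(generation, 1, Integer::sum);
    }

    synchronized void readerClosed(long generation) {
        int remaining = refs.merge(generation, -1, Integer::sum);
        if (remaining == 0) {
            refs.remove(generation);
            onLastClose.accept(generation);  // safe point to delete external files
        }
    }
}
```

The catch, of course, is the same one the NFS discussion raises: readers on other machines (or that crash without closing) never decrement the count, so some timeout or lease would still be needed.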

Another unsolved problem is how to clean up empty segments. Normally a merge drops them by simply not copying their documents, but in our case we have to actively delete the external files. I haven't looked at this carefully yet, but I have a few ideas:

- Use the Lucene docids as part of the filename: as docids are reassigned during merges, we would rename the files, unlinking the old ones with the same docid in the process. But I'm not totally clear on how the docid renumbering works, so I'm not sure this is feasible.
- Use filesystem hard links in some way as a reference-counting mechanism, though that would restrict us to Java 7+.
- Build some data structure that actively manages the file references.
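On the hard-link idea: the filesystem keeps a file's data alive until its last link is unlinked, which is effectively a free reference count, and Java 7's java.nio exposes it as Files.createLink. A rough sketch, assuming a filesystem that supports hard links (the class and method names are mine, for illustration only):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Each commit that references an external stored-fields file gets its own
// hard link to it. Deleting one commit's link leaves the data readable
// through the remaining links; the filesystem reclaims the blocks only
// when the last link is gone.
class HardLinkRefs {
    // Link externalFile into commitDir; the link shares the original's inode.
    static Path pin(Path externalFile, Path commitDir) throws Exception {
        Files.createDirectories(commitDir);
        return Files.createLink(commitDir.resolve(externalFile.getFileName()), externalFile);
    }

    // Dropping a commit just unlinks its copy of the name.
    static void unpin(Path link) throws Exception {
        Files.deleteIfExists(link);
    }
}
```

This would not work across filesystems (hard links can't span mount points), and some network filesystems don't support them at all, so it couldn't be the only mechanism.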

I guess my initial concern was with testing performance, to see whether these problems were even worth solving. Now I think they are, but they are not necessarily easy to solve.

-Mike
