On 10/18/2013 1:08 AM, Shai Erera wrote:
>> The codec intercepts merges in order to clean up files that are no longer
>> referenced
>
> What happens if a document is deleted while there's a reader open on the
> index, and the segments are merged? Maybe I misunderstand what you meant by
> this statement, but if the external file is deleted, since the document is
> "pruned" from the index, how will the reader be able to read the stored
> fields from it? How do you track references to the external files?
Right now you get a FileNotFoundException, or a missing field value,
depending on how you configure the codec. I believe the tests probably
pass only because they don't test for the missing field value.
I do have a test (like the one you wrote, but one that checks the
field value explicitly) that exposes this problem. My reasoning was
that this is similar to the NFS case: the user has to be aware of it
and deal with it by having an IndexDeletionPolicy that
maintains old commits. I don't see what else can be done without some
(possibly heavyweight) additional tracking/garbage collection mechanism.
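For what it's worth, the keep-old-commits policy logic is simple enough to sketch. The snippet below is self-contained (it uses a hypothetical Commit stand-in rather than Lucene's real IndexCommit, so it compiles without Lucene), but the shape mirrors IndexDeletionPolicy.onCommit(), which receives the commits sorted oldest-first and lets you call delete() on the ones to discard:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Lucene's IndexCommit: just enough to show the policy logic.
class Commit {
    final long generation;
    boolean deleted = false;
    Commit(long generation) { this.generation = generation; }
    void delete() { deleted = true; }
}

// Mirrors the shape of IndexDeletionPolicy.onCommit(List<? extends IndexCommit>):
// commits arrive sorted oldest-first; delete() releases a commit's files.
class KeepLastNPolicy {
    private final int n;
    KeepLastNPolicy(int n) { this.n = n; }

    void onCommit(List<Commit> commits) {
        // Delete everything except the newest n commits, so a reader opened
        // against any of those n commits can still resolve its external files.
        for (int i = 0; i < commits.size() - n; i++) {
            commits.get(i).delete();
        }
    }
}

public class PolicyDemo {
    public static void main(String[] args) {
        List<Commit> commits = new ArrayList<>();
        for (long gen = 1; gen <= 5; gen++) commits.add(new Commit(gen));
        new KeepLastNPolicy(2).onCommit(commits);
        for (Commit c : commits) {
            System.out.println(c.generation + " deleted=" + c.deleted);
        }
    }
}
```

Of course this only defers the problem: you still have to pick n (or snapshot commits explicitly) so that it covers your longest-lived reader.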
In our case (document archive), this behavior may be acceptable, but
it's certainly one of the main areas that concerns me. It would be nice
if it were possible to receive an event when all outstanding readers for
a commit were closed: that way we could clean up then instead of at the
time of the commit, but I don't think Lucene works that way. At least
I couldn't see how to do it, and given the discussion of NFS in the
IndexDeletionPolicy documentation, I assumed it wasn't possible.
Another unsolved problem is how to clean up empty segments. Normally
they're merged by simply not copying them, but in our case we have to
actively delete. I haven't looked at this carefully yet, but I have a
couple of ideas: one is to use the Lucene docids as part of the
filename: the idea being that as those are re-assigned, we would rename
the files, unlinking the old ones with the same docid in the process.
But I'm not totally clear on how the docid renumbering works, so not
sure if that would be feasible. Another idea is to use filesystem hard
linking in some way as a reference counting mechanism, but that would
restrict this to Java 7. Finally, I suppose it's possible to build some
data structure that actively manages the file references.
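To illustrate the hard-linking idea: each commit that still references an external file could hold its own link to it, so the filesystem's link count acts as the reference count, and the data is reclaimed only when the last link is removed. A minimal sketch with java.nio.file (Java 7+; the file names are made up for illustration, and hard links require a filesystem that supports them):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of hard-link reference counting for external stored-fields files.
// Each commit that references a file owns its own hard link; deleting one
// commit's link never breaks the others, and the data blocks are freed by
// the filesystem only when the last link is gone.
public class HardLinkRefCount {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("extfields");
        Path original = dir.resolve("doc42.bin");   // hypothetical external file
        Files.write(original, "stored field data".getBytes());

        // A later commit that still references doc42 takes its own link.
        Path commit2Link = dir.resolve("commit2_doc42.bin");
        Files.createLink(commit2Link, original);    // requires Java 7+

        // The old commit is cleaned up: its name goes away...
        Files.delete(original);

        // ...but the data survives through the remaining link.
        System.out.println(new String(Files.readAllBytes(commit2Link)));

        Files.delete(commit2Link);                  // last link: data reclaimed
        Files.delete(dir);
    }
}
```

The appeal over an explicit data structure is that there is no counter to persist or crash-recover; the downside is the filesystem dependency.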
I guess my initial concern was with testing performance to see if it was
even worth trying to solve these problems. Now I think it is, but they
are not necessarily easy to solve.
-Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org