On Jun 10, 2008, at 07:17, Sanne Grinovero wrote:

Hello Emmanuel,
as you asked how to use the SnapshotDeletionPolicy:
when you create the IndexWriter do:

  IndexDeletionPolicy policy = new KeepOnlyLastCommitDeletionPolicy();
  SnapshotDeletionPolicy snapshotter = new SnapshotDeletionPolicy(policy);
  IndexWriter writer = new IndexWriter(dir, autoCommit, analyzer, snapshotter);

then, when you want to make a copy, you "freeze" the current set of index segments, asking the IndexWriter to avoid removing or otherwise changing the files you are going to copy; the snapshot also tells you which files you should copy to get a clone of the index as it was at snapshot time:

try {
  IndexCommitPoint commit = snapshotter.snapshot();
  Collection fileNames = commit.getFileNames();
  <iterate over & copy files from fileNames>
} finally {
  snapshotter.release();
}
(credits to Lucene in Action, 2nd edition MEAP)
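The "<iterate over & copy files from fileNames>" step can be sketched as plain file copying, independent of Lucene. A minimal sketch, assuming `indexDir` and `backupDir` are paths you choose and `fileNames` is the collection returned by `IndexCommitPoint.getFileNames()`; it uses the modern java.nio API for brevity (the Java of the time would have used streams):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collection;

// Hypothetical helper name; not Hibernate Search or Lucene API.
final class IndexBackup {
    static void copySnapshot(Path indexDir, Path backupDir,
                             Collection<String> fileNames) throws IOException {
        Files.createDirectories(backupDir);
        for (String name : fileNames) {
            // the snapshot guarantees these files will not be deleted or
            // changed while we hold it, so a plain copy is safe
            Files.copy(indexDir.resolve(name), backupDir.resolve(name),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Call it inside the try block above, before `snapshotter.release()`, so the files stay pinned for the whole copy.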
Should I add this improvement to JIRA? I'll add the code examples. Also, you already opened HSEARCH-152:
is that meant for this same purpose?

Yes, that's the goal for HSEARCH-152.



>> Does it somehow involve not having cluster changes (ie intra VM policy rather than inter VM?)

I don't really understand what you're asking; I hope the previous example contains an answer. It just means you don't need to lock the index to make a "hot copy" to anywhere.

My question is: does it somehow involve interacting with the IndexWriter so that it does not do things it would otherwise do? I.e. if I run indexing on VM1 and VM2, will copying files "from VM1" be affected by IndexWriter operations from VM2?



>> The explanation is correct but a bit cataclysmic. We open the reader *only* if there is an actual deletion for the given Directory provider. So in some systems we might very well almost never open a rw reader.

Well yes, it doesn't happen always, but not only for deletions: all updates are also split into delete+insert work AFAIK. If you combine this with the open issue that the index is updated on every dirty entity (not checking whether only the indexed fields changed), this translates to lots of re-opening, twice for every transaction updating some entity... (still correct?)

Yes correct.
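The delete+insert split discussed above can be modeled without Lucene at all. A hedged illustration (all names here are made up, not Hibernate Search code): each entity update reaches the index as two separate operations, which is why a transaction touching one entity costs two index mutations.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an index: update = delete + insert, as described above.
final class MiniIndex {
    private final Map<String, String> docs = new HashMap<>();
    int deletions = 0;
    int insertions = 0;

    void delete(String id) { docs.remove(id); deletions++; }

    void insert(String id, String doc) { docs.put(id, doc); insertions++; }

    // Lucene of the time had no in-place update, so an entity update is
    // split into a deletion followed by a fresh insertion.
    void update(String id, String doc) { delete(id); insert(id, doc); }

    String get(String id) { return docs.get(id); }
}
```

One `update` call performs both a deletion and an insertion, even when the indexed fields did not actually change; that is the re-opening cost the paragraph above refers to.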



>> We need to check how efficiently it has been implemented. If it is a mechanism similar to IndexModifier, it's not worth it.

Agreed, I'll try to find out.

>> This is what happens by default already, one index directory by entity type. We could compute a flag at init time to know that (it's already in place actually) and use it to use a Term query
>> rather than the full query if only one entity is present in the index. Let's open a JIRA issue.

Well, my purpose is to avoid the need to switch from reader to writer; this would also simplify the reordering code where you split all work into two sequences (it would be unneeded). For deletion and eviction we could check whether the workspace has an IndexWriter or an IndexReader available and use whichever we have, but that would complicate the code and I don't like it, so I really hope we can avoid using an IndexReader for modification operations altogether. As we already found a good solution for mass eviction, and the flag tells us when it's ok to delete by id, there's just one case left we should think about. Should I open the JIRA for the partial solution as far as we got currently?

Yep



>> Not if it's updated in a cluster, right?
>> Plus seeing the contention lock we have experienced (on IndexReader) in the recent test case, I want to be sure it's actually faster than opening every time.
Why not? Don't you update by using delete+insert in a cluster too?

In a cluster, you open the IW, do what you have to do protected by the global lock, then close the IW (and release the global lock), which means the IW on another machine of the cluster can then have it. Does it work the same if you keep the IW open? Does it release the global lock? Does it cope with other IWs updating the file system?


I'll write performance tests just to be sure, but from past experience I really expect it to be much faster. Also note that all document analysis is done inside the IndexWriter, so not using it concurrently is a bottleneck when the analysis is expensive (think about PDFs in Blobs...); during analysis the files are locked to a single thread/transaction, but the time spent actually changing files is just a fraction, and Lucene has its own proper locks at those points.

cheers,
Sanne


_______________________________________________
hibernate-dev mailing list
hibernate-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev
