I'm running 3.4 code and have studied up on the API around the optimize() replacements, and I understand I needn't worry about deleted documents as such, but I still want to ask a few things about keeping the index in good shape and about merge policy.
I have an index with 421163 documents (including body text) after running a test index for a couple of months on 3.4 code with the default LogByteSizeMergePolicy (everything at defaults: merge factor 10, minMergeMB = 1.6, maxMergeMB = 2048). If I skip the soon-to-be-deprecated (in 3.5) expungeDeletes() (which does reduce the segment count by two) and only call maybeMerge() (which seems to leave everything in place), I end up with 33 segments with the following sizes in MB, grouped into 1s, 10s, 100s and 1000s (I've sketched these calls at the end of this mail):

  0.02, 0.03, 0.06, 1.91
  11.19, 12.76, 15.89, 15.98, 21.35, 24.67, 25.61, 25.63, 30.11, 30.90,
  31.55, 31.66, 32.52, 33.22, 36.11, 37.14, 43.37
  161.72, 162.25, 166.43, 224.10, 321.33
  2445.39, 2679.24, 2908.34, 3727.49, 3938.23, 4044.89, 5100.09

(Note: I got these values out of CheckIndex, with no fix, so I also have the document and deleted-document counts for every segment if we need to talk about those values. There's a sketch of that at the end too.)

At first glance that looks like a sensible distribution, but if the merge factor is 10, why do I have 17 segments in the 10-99 MB range? Shouldn't I just have 10?

The other problem is that I have two segments with lots of deleted documents, and I don't see what I could do to tidy them up:

  Docs     MB        Deleted Docs
  8158     321.33    3075
  210989   5100.09   158456

The 8158-doc segment is not really that interesting. The biggest one, I'm assuming, is my original (over-optimized) segment from months ago when I was running 3.0.1 code (even though it has since been upgraded to 3.4). It has lots of deleted documents, which I assume is taking up some space. If I understand the algorithm correctly, the first opportunity to clean up all that old stuff (even if it doesn't affect speed much) comes when either:

1. There are so many new documents that this largest segment gets picked up and combined into a larger ~10,000 MB segment. I'm not anticipating the end users generating 10-20x more files for a long time!

2. There are so many deletes in this large segment that it shrinks enough to take part in merging the 100 MB segments into a newly merged 1000 MB segment. I don't anticipate the end users replacing 90% of their original documents.

Am I missing some feature of this algorithm, or of segments in general, where a shrinking large segment (one with many, many deletes, as in this case) gets combined with the next smaller size tier? What I'm looking at here is that 1/3 of my index is deleted documents. Need I not worry about that at all? Is there no way to take the opportunity, at some point, to clean up that large segment of the oldest documents?

Speaking of taking the opportunity to clean up: what happens if I change something in my index, maybe a field's storage or a norm calculation, and I need to re-crawl everything? Then my index will have 100% replaces, so 50% of the index will be deleted documents. Is there something it would be nice to do to clean things up at that point? I'm willing to take the hit after I re-crawl, but it's not clear what that step would be given the new API. expungeDeletes() seemed like a reasonable candidate, but it goes away in 3.5.

Am I missing some simple APIs or settings I can use, given one big, old, over-optimized segment, and alternatively something to do once I've done a major re-crawl?

-Paul
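
P.S. For reference, this is roughly the merge-related code I'm describing above. It's only a sketch: the index path and class/variable names are illustrative, and the policy settings are spelled out even though they are just the defaults.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

public class MergeHousekeeping {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index")); // illustrative path

        // Everything at the LogByteSizeMergePolicy defaults, just made explicit here.
        LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
        policy.setMergeFactor(10);
        policy.setMinMergeMB(1.6);
        policy.setMaxMergeMB(2048);

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        config.setMergePolicy(policy);

        IndexWriter writer = new IndexWriter(dir, config);

        // Ask the merge policy whether any merges are needed now; in my case
        // this appears to leave all 33 segments as they are.
        writer.maybeMerge();

        // Deprecated as of 3.5; this is the call that drops my segment count by two.
        // writer.expungeDeletes();

        writer.close();
    }
}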
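
And this is roughly how I got the per-segment numbers. In practice I just read them off the CheckIndex report (run without -fix), but the programmatic equivalent, assuming the same illustrative path, looks something like this:

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;

public class SegmentReport {
    public static void main(String[] args) throws Exception {
        // Command-line equivalent: java org.apache.lucene.index.CheckIndex /path/to/index
        Directory dir = FSDirectory.open(new File("/path/to/index")); // illustrative path

        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out); // print the per-segment report (sizes, docs, deletions)

        // Read-only check: no fix, so nothing in the index is modified.
        CheckIndex.Status status = checker.checkIndex();

        System.out.println("clean: " + status.clean + ", segments: " + status.numSegments);
    }
}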