I'm running 3.4 code and have read up on the APIs that replace optimize(), so I 
understand I needn't worry about deleted documents in general, but I still want 
to ask a few things about keeping the index in good shape, and about merge policy.

I have an index of 421,163 documents (including body text), built over a couple 
of months of test indexing with 3.4 code and the default LogByteSizeMergePolicy 
at its default settings (merge factor 10, minMergeMB = 1.6, maxMergeMB = 2048).
If I skip the soon-to-be-deprecated (in 3.5) expungeDeletes() (which does reduce 
the segment count by two) and only call maybeMerge() (which seems to leave 
everything in place), I end up with 33 segments with the following sizes in MB 
(grouped into 1s, 10s, 100s and 1000s):
0.02,0.03,0.06,1.91,
11.19,12.76,15.89,15.98,21.35,24.67,25.61,25.63,30.11,30.90,31.55,31.66,32.52,33.22,36.11,37.14,43.37,
161.72,162.25,166.43,224.10,321.33,
2445.39,2679.24,2908.34,3727.49,3938.23,4044.89,5100.09
(Note: I got these values from CheckIndex (run without -fix), so I have the 
document and deleted-document counts for every segment if we need to talk about 
those.)
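For reference, my writer setup is essentially all defaults; here is a minimal 
sketch of what I'm running (the directory path and analyzer are placeholders, 
not my real ones):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterSetup {
    public static void main(String[] args) throws Exception {
        // Placeholder path; the real index lives elsewhere.
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));

        // The defaults spelled out explicitly: merge factor 10,
        // minMergeMB 1.6, maxMergeMB 2048.
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(10);
        mp.setMinMergeMB(1.6);
        mp.setMaxMergeMB(2048);

        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        cfg.setMergePolicy(mp);

        IndexWriter writer = new IndexWriter(dir, cfg);

        // What I've been trying after a crawl finishes:
        writer.maybeMerge();          // seems to leave everything in place
        // writer.expungeDeletes();   // does reduce segments by two,
                                      // but is deprecated in 3.5
        writer.close();
    }
}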

At first glance that looks like a sensible distribution, but if the merge factor 
is 10, why do I have 17 segments in the 10-99 MB range?  Shouldn't I have at 
most 10?
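For what it's worth, this is the back-of-the-envelope check I did on those 17 
segments, assuming (from my reading of the LogMergePolicy javadoc) that a 
segment's level is roughly log(byte size) / log(mergeFactor); the formula is my 
own approximation, not the real merge-selection code:

public class LevelSketch {
    public static void main(String[] args) {
        // The 17 segments in the 10-99 MB range, sizes in MB as reported above.
        double[] sizesMB = {
            11.19, 12.76, 15.89, 15.98, 21.35, 24.67, 25.61, 25.63, 30.11,
            30.90, 31.55, 31.66, 32.52, 33.22, 36.11, 37.14, 43.37
        };
        int mergeFactor = 10;
        for (double mb : sizesMB) {
            double bytes = mb * 1024 * 1024;
            // My approximation of the level: log of the byte size,
            // base mergeFactor.
            double level = Math.log(bytes) / Math.log(mergeFactor);
            System.out.printf("%8.2f MB -> level %.2f%n", mb, level);
        }
    }
}

By that reckoning they all land in roughly the same level, which is exactly why 
I'd have expected no more than 10 of them.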

The other problem is that I have two segments with lots of deleted documents, 
and I don't see what I can do to tidy them up:

  Docs     MB        Deleted Docs
    8158     321.33    3075
  210989    5100.09  158456

The 8,158-doc segment is not really that interesting.
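In case it matters, I've been pulling those per-segment numbers out of 
CheckIndex, but I assume a plain read-only reader over the same directory would 
show the same maxDoc / deleted counts per segment; a quick sketch, again with a 
placeholder path:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class SegmentDeletes {
    public static void main(String[] args) throws Exception {
        // Placeholder path; open read-only just to look at the counts.
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir, true);
        try {
            // In 3.x each sequential sub-reader corresponds to one segment.
            for (IndexReader sub : reader.getSequentialSubReaders()) {
                System.out.printf("maxDoc=%d deleted=%d%n",
                        sub.maxDoc(), sub.numDeletedDocs());
            }
        } finally {
            reader.close();
        }
    }
}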

I'm assuming the biggest one is my original (over-optimized) segment from months 
ago, built when I was running 3.0.1 code (even though it has since been upgraded 
to 3.4). It has lots of deleted documents, which I assume are taking up space.

If I understand the algorithm correctly, the first time there is an opportunity 
to clean up all that old stuff (even if it doesn't affect speed too much) is when:

1. There are so many new documents that this largest segment would be cleaned up 
and combined into a larger 10,000 MB segment. I'm not anticipating the end users 
generating 10-20x more files for a long time!

2. There are so many deletes in this large segment that it shrinks enough to take 
part in merging the 100 MB segments into a newly merged 1000 MB segment. I don't 
anticipate the end users replacing 90% of their original documents.


Am I missing some feature of this algorithm, or of segments in general, by which 
it takes a shrinking large segment (many, many deletes, as in this case) and 
combines it with the next smaller segments?
Right now roughly a third of my index is deleted documents.  Need I not worry 
about that at all?  Is there no way to take the opportunity at some point to 
clean up the large segment holding the oldest documents?

Speaking of taking the opportunity to clean up: what happens if I change 
something in my index, maybe a field's storage or a norm calculation, and I need 
to re-crawl everything?
Then my index will have 100% replaces, so 50% of the index will be deleted 
documents.  Is there something that would be sensible to do to clean things up 
at that point?
I think I'm willing to take the hit after I re-crawl, but it's not clear what 
that step should be given the new API.  expungeDeletes() seemed like a 
reasonable candidate, but it goes away in 3.5.
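For the record, what I had in mind running once after the 100%-replace re-crawl 
is roughly the sketch below; forceMergeDeletes() in the comment is only my 
assumption of what the 3.5 rename of expungeDeletes() is called -- I haven't run 
3.5 to confirm.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AfterRecrawl {
    public static void main(String[] args) throws Exception {
        // Placeholder path and analyzer, as before.
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),
                new IndexWriterConfig(Version.LUCENE_34,
                        new StandardAnalyzer(Version.LUCENE_34)));

        // On 3.4 this is the call that actually drops deleted docs by
        // merging the segments that contain them.
        writer.expungeDeletes(true);   // true = block until merges finish

        // My reading of the 3.5 deprecation is that this becomes
        // writer.forceMergeDeletes(true); -- not confirmed on my end.

        writer.close();
    }
}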
Am I missing some simple API or setting I can use for one big, old, 
over-optimized segment, and, alternatively, something to do once I've done a 
major re-crawl?

-Paul