I'm running 3.4 code and have studied up on the API around the optimize() replacements, and I understand I needn't worry about deleted documents as such, but I still want to ask a few things about keeping the index in good shape and about merge policy.
I have an index with 421163 documents (including body text) after running a test index for a couple of months on 3.4 code with the default LogByteSizeMergePolicy (everything at defaults: merge factor 10, minMergeMB = 1.6, maxMergeMB = 2048). If I skip the soon-to-be-deprecated (in 3.5) expungeDeletes() (which does reduce the segment count by two) and only call maybeMerge() (which seems to leave everything in place), I end up with 33 segments with the following sizes in MB, grouped into 1s, 10s, 100s and 1000s (I've sketched these calls at the end of this mail):

  0.02, 0.03, 0.06, 1.91
  11.19, 12.76, 15.89, 15.98, 21.35, 24.67, 25.61, 25.63, 30.11, 30.90,
  31.55, 31.66, 32.52, 33.22, 36.11, 37.14, 43.37
  161.72, 162.25, 166.43, 224.10, 321.33
  2445.39, 2679.24, 2908.34, 3727.49, 3938.23, 4044.89, 5100.09

(Note: I got these values out of CheckIndex, with no fix, so I also have the document and deleted-document counts for every segment if we need to talk about those values. There's a sketch of that at the end too.)

At first glance that looks like a sensible distribution, but if the merge factor is 10, why do I have 17 segments in the 10-99 MB range? Shouldn't I just have 10?

The other problem is that I have two segments with lots of deleted documents, and I don't see what I could do to tidy them up:

  Docs     MB        Deleted Docs
  8158     321.33    3075
  210989   5100.09   158456

The 8158-doc segment is not really that interesting. The biggest one, I'm assuming, is my original (over-optimized) segment from months ago when I was running 3.0.1 code (even though it has since been upgraded to 3.4). It has lots of deleted documents, which I assume is taking up some space. If I understand the algorithm correctly, the first opportunity to clean up all that old stuff (even if it doesn't affect speed much) comes when either:

1. There are so many new documents that this largest segment gets picked up and combined into a larger ~10,000 MB segment. I'm not anticipating the end users generating 10-20x more files for a long time!

2. There are so many deletes in this large segment that it shrinks enough to take part in merging the 100 MB segments into a newly merged 1000 MB segment. I don't anticipate the end users replacing 90% of their original documents.

Am I missing some feature of this algorithm, or of segments in general, where a shrinking large segment (one with many, many deletes, as in this case) gets combined with the next smaller size tier? What I'm looking at here is that 1/3 of my index is deleted documents. Need I not worry about that at all? Is there no way to take the opportunity, at some point, to clean up that large segment of the oldest documents?

Speaking of taking the opportunity to clean up: what happens if I change something in my index, maybe a field's storage or a norm calculation, and I need to re-crawl everything? Then my index will have 100% replaces, so 50% of the index will be deleted documents. Is there something it would be nice to do to clean things up at that point? I'm willing to take the hit after I re-crawl, but it's not clear what that step would be given the new API. expungeDeletes() seemed like a reasonable candidate, but it goes away in 3.5.

Am I missing some simple APIs or settings I can use, given one big, old, over-optimized segment, and alternatively something to do once I've done a major re-crawl?

-Paul
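
P.S. For reference, this is roughly the merge-related code I'm describing above. It's only a sketch: the index path and class/variable names are illustrative, and the policy settings are spelled out even though they are just the defaults.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

public class MergeHousekeeping {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index")); // illustrative path

        // Everything at the LogByteSizeMergePolicy defaults, just made explicit here.
        LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
        policy.setMergeFactor(10);
        policy.setMinMergeMB(1.6);
        policy.setMaxMergeMB(2048);

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        config.setMergePolicy(policy);

        IndexWriter writer = new IndexWriter(dir, config);

        // Ask the merge policy whether any merges are needed now; in my case
        // this appears to leave all 33 segments as they are.
        writer.maybeMerge();

        // Deprecated as of 3.5; this is the call that drops my segment count by two.
        // writer.expungeDeletes();

        writer.close();
    }
}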
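
And this is roughly how I got the per-segment numbers. In practice I just read them off the CheckIndex report (run without -fix), but the programmatic equivalent, assuming the same illustrative path, looks something like this:

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;

public class SegmentReport {
    public static void main(String[] args) throws Exception {
        // Command-line equivalent: java org.apache.lucene.index.CheckIndex /path/to/index
        Directory dir = FSDirectory.open(new File("/path/to/index")); // illustrative path

        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out); // print the per-segment report (sizes, docs, deletions)

        // Read-only check: no fix, so nothing in the index is modified.
        CheckIndex.Status status = checker.checkIndex();

        System.out.println("clean: " + status.clean + ", segments: " + status.numSegments);
    }
}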