On Tue, May 10, 2011 at 6:20 PM, Terje Marthinussen
<tmarthinus...@gmail.com> wrote:
>
>> Everyone may be well aware of that, but I'll still remark that a minor
>> compaction will try to merge "as many 20MB sstables as it can" up to the
>> max compaction threshold (which is configurable). So if you do accumulate
>> some newly created sstables at some point in time, the next minor
>> compaction will take all of them and thus not create a 40 MB sstable,
>> then 80MB etc... Sure there will be more steps than with a major
>> compaction, but let's keep in mind we don't merge sstables 2 by 2.
>
> Well, you do kind of merge them 2 by 2, as you look for at least 4 at a
> time ;)
> But yes, 20MB should become at least 80MB. Still quite a few hops to
> reach 100GB.
Not sure I follow you. 4 sstables is the minimum number of sstables a
compaction looks for (by default). If there are 30 sstables of ~20MB
sitting there because compaction is behind, you will compact those 30
sstables together (provided there is enough space for that, and assuming
you haven't changed the max compaction threshold, which is 32 by default).
And you can increase the max threshold.

Don't get me wrong, I'm not pretending this works better than it does, but
let's not pretend either that it's worse than it is.

>
>> I'm also not too much in favor of triggering major compactions, because
>> it mostly has a nasty side effect (creating one huge sstable). Now maybe
>> we could expose the difference factor for which we'll consider sstables
>> to be in the same bucket.
>
> The nasty side effect I am scared of is disk space, and to keep the disk
> space under control, I need to get down to 1 file.
>
> As an example:
> 2 days ago, I looked at a system that had gone idle from compaction with
> something like 24 sstables.
> Disk use was 370GB.
>
> After manually triggering full compaction, I was left with a single
> sstable which is 164 GB large.
>
> This means I may need more than 3x the full dataset to survive if certain
> nasty events such as repairs or anti-compactions should occur.
> Way more than the recommended 2x.
>
> In the same system, I see nodes reaching up towards 900GB during
> compaction and 5-600GB otherwise.
> This is with OPP, so distribution is not 100% perfect, but I expect these
> 5-600GB nodes to compact down to the <200GB area if a full compaction is
> triggered.
>
> That is way way beyond the recommendation to have 2x the disk space.
>
> You may disagree, but I think this is a problem.

I absolutely do not disagree. I was just arguing that I'm not sure
triggering a major compaction based on some fuzzy heuristic is a good
solution to the problem. And we do know that compaction could and should be
improved, both to make it have less impact on reads when it's behind:
  https://issues.apache.org/jira/browse/CASSANDRA-2498
to allow for easily testing different strategies:
  https://issues.apache.org/jira/browse/CASSANDRA-1610
as well as to redesign the mechanism itself:
  https://issues.apache.org/jira/browse/CASSANDRA-1608
You'll see in the comments of that last ticket in particular that
segmenting on token space has been suggested already, and there are
probably a handful of threads about vnodes in the mailing list archives.
And I personally think that yes, partitioning the sstables is a good idea.

> Either we need to recommend 3-5x the best-case disk usage, or we need to
> fix cassandra.
>
> A simple improvement initially may be to change the bucketing strategy if
> you cannot find suitable candidates.
> I believe Lucene, for instance, has a strategy where it can mix a set of
> small index fragments with one large one.
> This may be possible to consider as a fallback strategy, and just let
> cassandra compact down to 1 file whenever it can.
>
> Ultimately, I think segmenting on token space is the only way to fix this.
> That segmentation could be done by building histograms of your token
> distribution as you compact, and the compaction can further adjust the
> segments accordingly as full compactions take place.
>
> This would seem simpler to do than a full vnode-based infrastructure.
>
> Terje
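
For what it's worth, here is a very rough sketch of what the fallback you
describe could look like on top of size-tiered bucketing. This is
illustrative Java only, not the actual compaction code: the 0.5/1.5
"similar size" factors, the class and method names, and the fallback
branch are all made up for the example; only the 4/32 thresholds are the
real defaults.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FallbackBucketingSketch
{
    static final int MIN_THRESHOLD = 4;    // min sstables per minor compaction (default)
    static final int MAX_THRESHOLD = 32;   // max sstables per minor compaction (default)
    static final double BUCKET_LOW = 0.5;  // "similar size" factors (approximation)
    static final double BUCKET_HIGH = 1.5;

    // Group sstables (represented here just by their size in bytes) into
    // buckets of roughly similar size.
    static List<List<Long>> buckets(List<Long> sizes)
    {
        List<Long> sorted = new ArrayList<Long>(sizes);
        Collections.sort(sorted);

        List<List<Long>> result = new ArrayList<List<Long>>();
        List<Long> current = new ArrayList<Long>();
        double avg = 0;
        for (long size : sorted)
        {
            if (current.isEmpty() || (size > avg * BUCKET_LOW && size < avg * BUCKET_HIGH))
            {
                current.add(size);
            }
            else
            {
                result.add(current);
                current = new ArrayList<Long>();
                current.add(size);
            }
            avg = average(current);
        }
        result.add(current);
        return result;
    }

    // Pick the sstables to compact next.
    static List<Long> candidates(List<Long> sizes)
    {
        // Normal size-tiered behaviour: any bucket with at least
        // MIN_THRESHOLD members, capped at MAX_THRESHOLD.
        for (List<Long> bucket : buckets(sizes))
            if (bucket.size() >= MIN_THRESHOLD)
                return bucket.subList(0, Math.min(bucket.size(), MAX_THRESHOLD));

        // Fallback (the Lucene-like strategy suggested above): no bucket
        // qualifies, so merge the smallest sstables together with the
        // single largest one, so the node keeps converging toward one
        // sstable instead of going idle.
        if (sizes.size() < 2)
            return Collections.<Long>emptyList();
        List<Long> sorted = new ArrayList<Long>(sizes);
        Collections.sort(sorted);
        List<Long> picked = new ArrayList<Long>(
            sorted.subList(0, Math.min(sorted.size() - 1, MAX_THRESHOLD - 1)));
        picked.add(sorted.get(sorted.size() - 1)); // the one big sstable
        return picked;
    }

    static double average(List<Long> xs)
    {
        double sum = 0;
        for (long x : xs)
            sum += x;
        return sum / xs.size();
    }
}

The normal path stays exactly what minor compaction does today; only when
no bucket qualifies would we pay the cost of rewriting the big sstable to
absorb a few small ones. That rewrite cost is exactly the kind of
trade-off I'd rather see tried as a pluggable strategy (CASSANDRA-1610)
than hard-wired as a heuristic.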