>> When you say that it grows constantly, does that mean up to 30 or even
>> farther?
>
> My total data size is 2TB

For the entire cluster? (I realized I was being ambiguous, I meant per node.)

> Actually, I never see the count stable.  When it reached 30 I thought
> "I am reaching the default upper limit for a compaction, something
> went wrong" and I went back to 1GB memtables (also, I saw bigger read
> latencies).

Assuming 2 TB is for the entire cluster and you have 8 nodes, that's
250 GB per node. So yeah, that will take a fair amount of time to
compact when larger or major compactions take place.

> Well, I think you are right: I am CPU bound on compaction, because
> during compactions I see a single JVM thread which is in the running
> state almost all the time and the disk is not used beyond 50%.

Sounds like it.

> I think that a partial solution would help: if the compaction
> compacted to 'n' different new sstables (not one), the implementation
> would be easier. I mean, the compaction would compact, for instance,
> 10 sstables to 2 (2 being the level of parallelism). In this way, the
> sstable count would eventually remain stable (although higher). What
> do you think?

Yes, that's been my thinking. Even for a single huge column family, it
doesn't really matter that individual large compactions take longer as
long as smaller compactions keep happening concurrently.
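
As a rough sketch of that idea (using hypothetical SSTable and
compactBucket placeholders, not the actual Cassandra code), you could
split the input sstables into n buckets and merge each bucket on its
own thread, producing n output sstables instead of one:

import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch only: compact `inputs` into `parallelism` output
// sstables instead of a single one, so each bucket can be merged on its
// own thread and the total sstable count stays bounded.
class ParallelBucketCompaction {
    // SSTable and compactBucket() are placeholders, not Cassandra types.
    static List<SSTable> compact(List<SSTable> inputs, int parallelism)
            throws InterruptedException, ExecutionException {
        // Round-robin the input sstables into `parallelism` buckets.
        List<List<SSTable>> buckets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++)
            buckets.add(new ArrayList<>());
        for (int i = 0; i < inputs.size(); i++)
            buckets.get(i % parallelism).add(inputs.get(i));

        // Merge each bucket independently, one thread per bucket.
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<Future<SSTable>> results = new ArrayList<>();
        for (List<SSTable> bucket : buckets)
            results.add(pool.submit(() -> compactBucket(bucket)));

        List<SSTable> outputs = new ArrayList<>();
        for (Future<SSTable> f : results)
            outputs.add(f.get());
        pool.shutdown();
        return outputs;
    }

    static SSTable compactBucket(List<SSTable> bucket) {
        return null; // merge the bucket's rows into one new sstable
    }

    static class SSTable {}
}

The trade-off, as you say, is that rows for the same key may end up in
different output sstables (merging only happens within a bucket), which
is why the count stabilizes at a higher level than with a single-output
compaction.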

It's worth noting though that merely supporting concurrent compactions
in the sense of spawning more than one thread to do it is only a
partial solution; with sufficiently large column families you end up
having to keep several (not just a couple) compactions going in order
to ensure that large and medium compactions are run while at the same
time allowing the smallest compactions to take place. For one thing
you probably don't want to commit too many cores to compactions, and
in addition you can run into issues with I/O becoming seek bound if
you have too many processes trying to stream data concurrently.

I think the optimal solution for this would involve something along
the lines of having a fixed, configurable per-machine compaction
concurrency (number of threads), and then having those threads
interleave an arbitrary number of compactions (do 1 gig, switch to
another compaction, do 1 gig, etc.). That way you have direct control
over actual CPU and I/O concurrency, and can independently make sure
that compactions happen in proper prioritized order, constantly
keeping sstable counts within appropriate bounds.
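
Roughly, the scheme I have in mind could look like the following
sketch (CompactionTask and doChunk are made-up names, nothing that
exists in Cassandra today): a fixed number of worker threads
repeatedly take the highest-priority pending compaction, do about
1 GB of work on it, and put it back on the queue, so any number of
compactions can be interleaved over a bounded number of threads.

import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of interleaving many compactions over a fixed
// number of threads; not Cassandra's actual compaction manager.
class InterleavingCompactionManager {
    static final long CHUNK_BYTES = 1L << 30; // ~1 GB of work per slice

    // Placeholder for a resumable compaction with a priority ordering.
    interface CompactionTask extends Comparable<CompactionTask> {
        // Compact up to maxBytes more data; return true when finished.
        boolean doChunk(long maxBytes);
    }

    private final PriorityBlockingQueue<CompactionTask> pending =
            new PriorityBlockingQueue<>();

    InterleavingCompactionManager(int concurrency) {
        for (int i = 0; i < concurrency; i++) {
            Thread worker = new Thread(this::workLoop, "compaction-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    void submit(CompactionTask task) {
        pending.put(task);
    }

    private void workLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Always pick the currently highest-priority compaction.
                CompactionTask task = pending.take();
                if (!task.doChunk(CHUNK_BYTES)) {
                    // Not finished: re-queue it so small, urgent
                    // compactions can jump ahead of this large one.
                    pending.put(task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

The thread count caps CPU use and the number of concurrent I/O
streams directly, while the queue order decides which compactions
make progress first.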

(For comparison, the problem is very similar to PostgreSQL and
auto-vacuuming of databases containing tables with extreme differences
in size. Allowing multiple vacuum processes helps solve this, but it
has the equivalent scalability issues if you were to have a
particularly problematic distribution of table sizes and workload.)

-- 
/ Peter Schuller
