> Ah, I keep always assuming random partitions since it is a very common
> case (just to be sure: unless you specifically want the ordering
> despite the downsides, you generally want to default to the random
> partitioner).

Yes, I'm working on geographical data, so everything is keyed by a
derivation of the z-order curve. Reads are such that they read linear
ranges of the values. So although there's data in a lot of places, writes
& reads happen more often in only a subset, hence my need to rebalance not
by storage load but by query load. I'm going to have to think about how to
do that.
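(In case it helps to see what I mean by "a derivation of the z-order curve",
here is a minimal sketch of interleaving two quantized coordinates into a
Morton code. It is illustrative only, not our actual key derivation; it just
shows why nearby cells end up in contiguous key ranges.)

// Illustrative sketch only (not our actual key derivation): a plain
// Morton code, interleaving the bits of two 16-bit quantized coordinates.
public final class MortonSketch {

    // Spread the low 16 bits of v so that one zero bit separates each bit.
    private static int spread(int v) {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // z-order key: x bits in the even positions, y bits in the odd ones.
    public static int zOrder(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }

    public static void main(String[] args) {
        // Neighbouring cells produce nearby keys, which is why reads
        // are linear ranges over the key space.
        System.out.println(Integer.toBinaryString(zOrder(3, 5))); // 100111
    }
}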
> > When I get a log like these, there always is a "cluster-freeze" during
> > the preceding minute. By "cluster-freeze", I mean that a couple of
> > nodes go to 0% utilization (no cpu, no system, no io)
>
> A hypothesis here is that your workload is causing problems for a node
> (for example, sudden spikes in memory allocation causing full GC
> fallbacks that take time), and both the readers and the writers get
> "stuck" on requests to those nodes (once a sufficient number of requests
> happen to be destined to those). The result would be that all other
> nodes are no longer seeing traffic because the clients aren't making
> progress.

I did open 10 windows on my screen in order to view iostat & vmstat in
parallel on all nodes. It's hard to be definitive, but I did see moments
where the cluster was "freezing" and the CPU was at 100% for a couple of
seconds on one of the nodes. It didn't happen every time, but that may be
because my processes were backing off while failing over to another node.
So could sending a batch of 90 counter mutations produce such a spike? It
looks a little small to me but, yes, capping the batches at 30/40 elements
has eliminated the problem.

I looked at the system.log on the node on which this happened and I only
see a ParNew collection taking place at that time, and heap usage is low:

INFO [ScheduledTasks:1] 2011-12-11 15:48:14,641 GCInspector.java (line 122) GC for ParNew: 334 ms for 1 collections, 4729419512 used; max is 16838033408

> I would first eliminate or confirm any GC hypothesis by running all
> nodes with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps.

Is full GC not being logged through GCInspector with the defaults?

> If you can see this happen sufficiently often to manually/interactively
> "wait for it", I suggest something as simple as firing up a top + iostat
> for each host and having them on the screen at the same time, and look
> for what happens when you see this again. If the problem is fallback to
> full GC for example, the affected nodes should be churning 100% CPU (one
> core) for several seconds (assuming a large heap). If there is a sudden
> burst of disk I/O that is causing a hiccup (e.g. dirty buffer flushing
> by linux) this should be visibly correlated with 'iostat -x -k 1'.

Some CPU correlation in some cases (on one node).
No iostat correlation, ever.

Thanks
Philippe
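PS: in case it is useful to someone else hitting the same thing, the
"capping" above is nothing fancier than splitting the list of counter
mutations into chunks before submitting them. A minimal sketch follows;
sendBatch() is a placeholder for whatever call your client library uses to
submit one batch of mutations.

import java.util.ArrayList;
import java.util.List;

// Sketch of capping batch sizes: submit mutations in chunks of at most
// MAX_BATCH elements instead of one large batch.
public final class BatchCap {

    private static final int MAX_BATCH = 30; // the size that made the freezes go away here

    // Split a list into sublists of at most maxSize elements.
    public static <T> List<List<T>> chunk(List<T> mutations, int maxSize) {
        List<List<T>> chunks = new ArrayList<List<T>>();
        for (int i = 0; i < mutations.size(); i += maxSize) {
            chunks.add(mutations.subList(i, Math.min(i + maxSize, mutations.size())));
        }
        return chunks;
    }

    // Placeholder: replace with the client call that submits one batch
    // of counter mutations.
    private static <T> void sendBatch(List<T> batch) {
    }

    public static <T> void sendCapped(List<T> mutations) {
        for (List<T> batch : chunk(mutations, MAX_BATCH)) {
            sendBatch(batch);
        }
    }
}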