On 12/8/2010 7:30 AM, Jonathan Ellis wrote:
> On Tue, Dec 7, 2010 at 4:00 PM, Reverend Chip <rev.c...@gmail.com> wrote:
>> Full DEBUG level logs would be a space problem; I'm loading at least 1T
>> per node (after 3x replication), and these events are rare. Can the
>> DEBUG logs be limited to the specific modules helpful for this diagnosis
>> of the gossip problem and, secondarily, the failure to report
>> replication failure?
> The gossip problem is almost certainly due to a GC pause. You can
> check that by enabling verbose GC logging (uncomment the lines in
> cassandra-env.sh).

Makes sense. I thought full GC events were already logged; I'll see to
the change. Meanwhile, what's the possible remediation if we're up
against GC? Maybe just making gossip more forgiving? Perhaps farming
gossip out to a separate process, so that a frozen JVM can be
distinguished from a dead JVM or an offline machine. I wonder whether
we're alone in seeing this problem, or if no one else is noticing,
curious, or concerned.

> The replication failure is what we want DEBUG logs for, and
> restricting it to the right modules isn't going to help since when
> you're stress-testing writes, the write modules are going to be 99% of
> the log volume anyway.
>
> Maybe a script to constantly throw away all but the most recent log
> file until you see the WARN line would be sufficient workaround?

OK, that should be workable.
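
For the record, a minimal sketch of such a pruning script, under assumptions of my own: the rotated logs are taken to match the glob system.log* in a single directory, and the marker is the plain string WARN. Neither is confirmed anywhere in this thread, and the marker should be narrowed to the specific WARN message of interest. The function names are made up for illustration.

```python
import glob
import os
import time


def _mtime(path):
    """Modification time, or 0.0 if the file vanished during log rotation."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return 0.0


def contains_marker(path, marker):
    """Return True if any line of the file at `path` contains `marker`."""
    try:
        with open(path, "r", errors="replace") as handle:
            return any(marker in line for line in handle)
    except FileNotFoundError:
        return False  # rotated away between listing and reading


def prune_once(log_dir, pattern="system.log*", marker="WARN"):
    """Do one pruning pass over `log_dir`.

    If any matching file contains the marker, delete nothing and return True.
    Otherwise delete every matching file except the newest and return False.
    The pattern and marker defaults are assumptions, not Cassandra settings.
    """
    paths = sorted(glob.glob(os.path.join(log_dir, pattern)), key=_mtime)
    if any(contains_marker(p, marker) for p in paths):
        return True
    for old in paths[:-1]:
        try:
            os.remove(old)
        except FileNotFoundError:
            pass  # already gone; nothing to do
    return False


def watch(log_dir, interval_seconds=60, **kwargs):
    """Prune repeatedly until the marker shows up, then stop and keep all files."""
    while not prune_once(log_dir, **kwargs):
        time.sleep(interval_seconds)
```

Calling watch() on the log directory would loop until the marker appears and then exit, leaving every remaining file in place. Each pass rescans whole files, so the polling interval needs to be short enough that disk does not fill between passes, yet long enough that the scan of a large DEBUG log is affordable.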