Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

Reverend Chip Wed, 08 Dec 2010 12:57:29 -0800

On 12/8/2010 7:30 AM, Jonathan Ellis wrote:
> On Tue, Dec 7, 2010 at 4:00 PM, Reverend Chip <rev.c...@gmail.com> wrote:
>> Full DEBUG level logs would be a space problem; I'm loading at least 1T
>> per node (after 3x replication), and these events are rare.  Can the
>> DEBUG logs be limited to the specific modules helpful for this diagnosis
>> of the gossip problem and, secondarily, the failure to report
>> replication failure?
> The gossip problem is almost certainly due to a GC pause.  You can
> check that by enabling verbose GC logging (uncomment the lines in
> cassandra-env.sh).


Makes sense.  I thought full GC events were already logged; I'll see to
the change.  Meanwhile, what's the possible remediation if we're up
against GC?  Maybe just making gossip more forgiving?  Perhaps farming
gossip out to a separate process, so that a frozen JVM can be
distinguished from a dead JVM or an offline machine?

I wonder whether we're alone in seeing this problem, or if no one else
is noticing, curious, or concerned.

> The replication failure is what we want DEBUG logs for, and
> restricting it to the right modules isn't going to help since when
> you're stress-testing writes, the write modules are going to be 99% of
> the log volume anyway.
>
> Maybe a script to constantly throw away all but the most recent log
> file until you see the WARN line would be sufficient workaround?

OK, that should be workable.
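
For reference, a rough sketch of such a pruning script in Python.  The
file pattern (system.log*), the default marker string, the polling
interval, and the function names are all my own assumptions for
illustration, not anything Cassandra provides; the marker should be
replaced with a substring specific to the WARN line of interest,
because a bare "WARN" will match any warning.

```python
import time
from pathlib import Path


def contains_marker(path, marker):
    """Return True if any line of the file contains the marker string."""
    with open(path, "r", errors="replace") as fh:
        return any(marker in line for line in fh)


def prune_once(log_dir, pattern="system.log*", marker="WARN"):
    """Delete all but the most recently modified matching log file.

    If any matching file contains the marker, nothing is deleted and
    True is returned, so the logs around the event are preserved.
    The pattern and marker defaults are assumptions, not Cassandra settings.
    """
    files = sorted(Path(log_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    if any(contains_marker(p, marker) for p in files):
        return True
    for old in files[:-1]:  # everything except the newest file
        old.unlink()
    return False


def watch(log_dir, interval=60, **kwargs):
    """Prune every `interval` seconds until the marker shows up."""
    while not prune_once(log_dir, **kwargs):
        time.sleep(interval)
```

Something like watch("/var/log/cassandra", marker="...") would then run
until the event occurs (the directory is only an example path).  Note
that this rescans every remaining file on each pass, which may be slow
with DEBUG-sized logs; tracking a file offset would be the obvious
refinement.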
