Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

Reverend Chip Tue, 07 Dec 2010 14:00:52 -0800

On 12/7/2010 1:10 PM, Jonathan Ellis wrote:
> I'm inclined to think there's a bug in your client, then.


That doesn't pass the smell test.  The very same client has logged
timeout and unavailable exceptions on other occasions, e.g. when there
are too many clients or (in a previous configuration) when the JVMs had
insufficient memory.  It's too much of a coincidence to believe that the
client's exception reporting happens to fail only at the same time that
a server experiences unexplained and problematic gossip failures.

>   DEBUG-level
> logs could confirm or refute this by logging for each insert how many
> replicas are being blocked for, which nodes it got responses from, and
> whether a TimedOutException from not getting ALL replies was returned
> to the client.

Full DEBUG-level logs would be a space problem; I'm loading at least 1 TB
per node (after 3x replication), and these events are rare.  Can the
DEBUG logs be limited to the specific modules that would help diagnose
the gossip problem and, secondarily, the failure to report the
replication failure?
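
As a sketch of what limiting the logs could look like: assuming the stock
conf/log4j-server.properties is in use, and assuming the relevant classes
live in the packages below (inferred from the Gossiper.java and
MessagingService.java lines in the log excerpts, plus StorageProxy as a
guess at where the write path waits on replicas), something like this
would raise only those loggers to DEBUG:

```properties
# Assumption: log4j 1.x properties syntax, as in conf/log4j-server.properties.
# The root logger keeps its existing level; only the loggers below go to DEBUG.

# Gossip and failure detection (Gossiper.java appears in the log excerpts)
log4j.logger.org.apache.cassandra.gms=DEBUG

# Dropped-message accounting (MessagingService.java appears in the log excerpts)
log4j.logger.org.apache.cassandra.net.MessagingService=DEBUG

# Assumed location of the coordinator write path that blocks for replica replies
log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
```

Whether these three loggers are enough to show how many replicas each
insert blocks for is not something I have confirmed.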

> On Tue, Dec 7, 2010 at 2:37 PM, Reverend Chip <rev.c...@gmail.com> wrote:
>> No, I'm afraid that's not it:
>>  replica_placement_strategy: org.apache.cassandra.locator.SimpleStrategy
>>  replication_factor: 3
>>
>> On 12/7/2010 6:37 AM, Jonathan Ellis wrote:
>>> If you are using NetworkTopologyStrategy you are probably hitting
>>> https://issues.apache.org/jira/browse/CASSANDRA-1804 which is fixed in
>>> rc2.
>>>
>>> On Mon, Dec 6, 2010 at 6:58 PM, Reverend Chip <rev.c...@gmail.com> wrote:
>>>> I'm running a big test -- ten nodes with 3T disk each.  I'm using
>>>> 0.7.0rc1.  After some tuning help (thanks Tyler) lots of this is working
>>>> as it should.  However a serious event occurred as well -- the server
>>>> froze up -- and though mutations were dropped, no error was reported to
>>>> the client.  Here's what the log said on host X.19:
>>>>
>>>>  WARN [ScheduledTasks:1] 2010-12-06 14:04:11,125 MessagingService.java
>>>> (line 527) Dropped 76 MUTATION messages in the last 5000ms
>>>>
>>>> Meanwhile, on the OTHER nodes, gossip decided the node was not available
>>>> for a while:
>>>>
>>>>  INFO [ScheduledTasks:1] 2010-12-06 14:04:02,396 Gossiper.java (line
>>>> 195) InetAddress /X.19 is now dead.
>>>>  INFO [GossipStage:1] 2010-12-06 14:04:06,127 Gossiper.java (line 569)
>>>> InetAddress /X.19 is now UP
>>>>
>>>> And despite the fact that I was writing with consistency=ALL, none of my
>>>> clients reported any errors on their mutations.
>>>>
>>>> Tyler has this information but I would like to know if anyone has seen
>>>> this before, and/or has a diagnosis.
>>>>
>>>>
>>>
>>
>
>
