Sam Tunnicliffe created CASSANDRA-21025:
-------------------------------------------

             Summary: Failure detector max interval value is calculated incorrectly
                 Key: CASSANDRA-21025
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21025
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip
            Reporter: Sam Tunnicliffe


If the failure detector's max interval is not overridden via the
{{cassandra.fd_max_interval_ms}} system property
({{CassandraRelevantProperties.FD_MAX_INTERVAL_MS}}), it is seeded with the
value of {{FailureDetector.INITIAL_VALUE_NANOS}}.
However, a bug in {{FailureDetector$ArrivalWindow::getMaxInterval}} means that
in this case the value is converted between time units incorrectly.
{code:java}
public static long getMaxInterval()
{
    long newValue = FD_MAX_INTERVAL_MS.getLong(FailureDetector.INITIAL_VALUE_NANOS);
    if (newValue != FailureDetector.INITIAL_VALUE_NANOS)
        logger.info("Overriding {} from {}ms to {}ms", FD_MAX_INTERVAL_MS.getKey(), FailureDetector.INITIAL_VALUE_NANOS, newValue);
    return TimeUnit.NANOSECONDS.convert(newValue, TimeUnit.MILLISECONDS);
}
{code}
If {{FD_MAX_INTERVAL_MS}} is not set, the supplied default
{{INITIAL_VALUE_NANOS}} is used, but it is then converted as if it were a
value in milliseconds, inflating it by a factor of 1,000,000.
The effective max interval in this case should be 2 seconds, but instead
becomes 23 days, 3 hours, 33 minutes and 20 seconds.
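
The arithmetic can be checked with a small standalone sketch. It does not use any Cassandra classes: the class name {{MaxIntervalInflation}} is hypothetical, and the constant is assumed to be 2 seconds expressed in nanoseconds, as the description above implies.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class MaxIntervalInflation
{
    // Assumption: the default is 2 seconds, held as a nanosecond count.
    static final long INITIAL_VALUE_NANOS = TimeUnit.SECONDS.toNanos(2);

    // Mirrors the faulty step: a nanosecond count is treated as milliseconds
    // and converted to nanoseconds a second time.
    static long inflatedNanos()
    {
        return TimeUnit.NANOSECONDS.convert(INITIAL_VALUE_NANOS, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args)
    {
        Duration inflated = Duration.ofNanos(inflatedNanos());
        System.out.printf("%d days, %d hours, %d minutes, %d seconds%n",
                          inflated.toDays(),
                          inflated.toHours() % 24,
                          inflated.toMinutes() % 60,
                          inflated.getSeconds() % 60);
    }
}
```

The inflated value is 2 x 10^15 ns, which still fits comfortably in a {{long}}, so the conversion does not overflow or saturate and the error goes unnoticed.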
The net effect is that intervals far longer than expected can be recorded if
nodes are intermittently partitioned but not restarted (meaning they retain the
same gossip generation).
In turn, this can cause the phi calculation to react to those nodes much more
slowly, because the mean arrival interval is much larger than expected, leaving
them marked as {{UP}} when they should be {{DOWN}}.
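
To illustrate the slowdown, here is a simplified sketch, not the Cassandra implementation. It assumes phi is proportional to the time since the last heartbeat divided by the mean recorded interval, scaled by 1/ln(10), with a conviction threshold of 8. The class {{PhiSketch}}, its helpers, the window size and the sample values are all hypothetical. It compares a window of one-second intervals with the same window containing a single one-hour interval, which a two-second cap is intended to keep out.

```java
import java.util.Arrays;

public class PhiSketch
{
    // Assumptions for illustration: scaling factor and conviction threshold.
    static final double PHI_FACTOR = 1.0 / Math.log(10.0);
    static final double CONVICT_THRESHOLD = 8.0;

    static double mean(long[] intervalsNanos)
    {
        return Arrays.stream(intervalsNanos).average().orElse(0.0);
    }

    // Suspicion level for a node last heard from sinceLastNanos ago.
    static double phi(long sinceLastNanos, long[] intervalsNanos)
    {
        double mean = mean(intervalsNanos);
        return mean > 0 ? PHI_FACTOR * sinceLastNanos / mean : 0.0;
    }

    // Silence, in seconds, needed before phi reaches the threshold.
    static double secondsToConvict(long[] intervalsNanos)
    {
        return CONVICT_THRESHOLD * mean(intervalsNanos) / PHI_FACTOR / 1e9;
    }

    // A window of identical samples whose last entry is replaced by an outlier.
    static long[] window(int size, long typicalNanos, long outlierNanos)
    {
        long[] samples = new long[size];
        Arrays.fill(samples, typicalNanos);
        samples[size - 1] = outlierNanos;
        return samples;
    }

    public static void main(String[] args)
    {
        long second = 1_000_000_000L;
        long[] healthy = window(1000, second, second);
        long[] skewed = window(1000, second, 3600 * second); // one 1-hour gap

        System.out.printf("healthy window: convict after %.1f s%n", secondsToConvict(healthy));
        System.out.printf("skewed window:  convict after %.1f s%n", secondsToConvict(skewed));
    }
}
```

Under these assumptions, a single oversized sample among a thousand normal ones raises the mean several times over, and the time needed to convict an unresponsive node grows by the same factor.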

If {{FD_MAX_INTERVAL_MS}} is overridden, then the conversion, and so the
returned value, is correct (assuming an appropriately scaled value is
supplied; there is no guardrail to ensure that). Versions earlier than 5.0 are
not affected.
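
One possible shape for a correction, shown only as a standalone sketch and not as the actual patch, is to keep the default in nanoseconds and convert only a user-supplied override from milliseconds. The class {{MaxIntervalSketch}}, the constant name and the positivity check are hypothetical, and the property is read with plain {{Long.getLong}} rather than through {{CassandraRelevantProperties}}.

```java
import java.util.concurrent.TimeUnit;

public class MaxIntervalSketch
{
    static final String PROPERTY = "cassandra.fd_max_interval_ms";

    // Assumption: the intended default is 2 seconds, kept in nanoseconds.
    static final long DEFAULT_MAX_INTERVAL_NANOS = TimeUnit.SECONDS.toNanos(2);

    static long getMaxIntervalNanos()
    {
        // Long.getLong returns null when the property is unset or not numeric.
        Long overrideMillis = Long.getLong(PROPERTY);
        if (overrideMillis == null)
            return DEFAULT_MAX_INTERVAL_NANOS; // already in nanos, no conversion

        // Hypothetical guardrail: reject values that cannot be a valid interval.
        if (overrideMillis <= 0)
            throw new IllegalArgumentException(PROPERTY + " must be positive, got " + overrideMillis);

        // Only the override is in milliseconds, so only it is converted.
        return TimeUnit.MILLISECONDS.toNanos(overrideMillis);
    }

    public static void main(String[] args)
    {
        System.out.println("max interval (ns): " + getMaxIntervalNanos());
    }
}
```

Keeping the unit in each name makes it harder to pass a nanosecond value to a conversion that expects milliseconds.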


