[
https://issues.apache.org/jira/browse/CASSANDRA-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041235#comment-18041235
]
Stefan Miklosovic edited comment on CASSANDRA-21025 at 11/28/25 9:04 AM:
-------------------------------------------------------------------------
BTW, just an observation: in an ideal world this might be done like this:
FD_MAX_INTERVAL_MS("cassandra.fd_max_interval_ms",
Long.toString(FailureDetector.INITIAL_VALUE_NANOS))
The second argument is the default value used when the property is not
overridden (after resolving the visibility of that variable and converting it
to millis).
The problem is that referencing FailureDetector.INITIAL_VALUE_NANOS would also
initialize the other static variables in FailureDetector, which is not
desirable.
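
The initialization concern can be illustrated with a small standalone sketch. It assumes the referenced field is not a compile-time constant (it is computed by a method call), in which case the first read of the field triggers initialization of the whole class, including its other static fields. The names StaticInitSketch, Detector, OTHER, compute and sideEffect are hypothetical and are not taken from the Cassandra code base.

```java
public class StaticInitSketch
{
    // Records which static initializers have run.
    static final StringBuilder LOG = new StringBuilder();

    // Hypothetical stand-in for a class such as FailureDetector.
    static class Detector
    {
        // Not a compile-time constant, because it is computed by a method call.
        // Reading it from another class therefore forces Detector to initialize.
        static final long INITIAL_VALUE_NANOS = compute();

        // An unrelated static that is initialized as a side effect.
        static final Object OTHER = sideEffect();

        static long compute()
        {
            return 2_000_000_000L;
        }

        static Object sideEffect()
        {
            LOG.append("other-static-initialized");
            return new Object();
        }
    }

    // Mimics an enum constant reading the value as its default.
    static long readDefault()
    {
        return Detector.INITIAL_VALUE_NANOS;
    }

    public static void main(String[] args)
    {
        System.out.println("before: '" + LOG + "'");
        System.out.println("default: " + readDefault());
        System.out.println("after: '" + LOG + "'");
    }
}
```

If the field were a true compile-time constant (for example a literal), the compiler would inline it and no class initialization would be triggered, so whether the concern applies depends on how the field is declared.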
was (Author: smiklosovic):
BTW, just an observation: in an ideal world this might be done like this:
FD_MAX_INTERVAL_MS("cassandra.fd_max_interval_ms",
Long.toString(FailureDetector.INITIAL_VALUE_NANOS))
The second argument is the default value used when the property is not
overridden (after resolving the visibility of that variable etc).
The problem is that referencing FailureDetector.INITIAL_VALUE_NANOS would also
initialize the other static variables in FailureDetector, which is not
desirable.
> Failure detector max interval value is calculated incorrectly
> -------------------------------------------------------------
>
> Key: CASSANDRA-21025
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21025
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Sam Tunnicliffe
> Assignee: Sam Tunnicliffe
> Priority: Normal
> Fix For: 5.0.x
>
>
> If this setting is not overridden via the {{cassandra.fd_max_interval_ms}}
> system property ({{CassandraRelevantProperties.FD_MAX_INTERVAL_MS}}),
> then it is seeded with the value of
> {{FailureDetector.INITIAL_VALUE_NANOS}}.
> However, a bug in the logic of
> {{FailureDetector$ArrivalWindow::getMaxInterval}} means in this case there is
> an incorrect conversion between time units.
> {code:java}
> public static long getMaxInterval()
> {
>     long newValue = FD_MAX_INTERVAL_MS.getLong(FailureDetector.INITIAL_VALUE_NANOS);
>     if (newValue != FailureDetector.INITIAL_VALUE_NANOS)
>         logger.info("Overriding {} from {}ms to {}ms",
>                     FD_MAX_INTERVAL_MS.getKey(), FailureDetector.INITIAL_VALUE_NANOS, newValue);
>     return TimeUnit.NANOSECONDS.convert(newValue, TimeUnit.MILLISECONDS);
> }
> {code}
> If {{FD_MAX_INTERVAL_MS}} is not set, the supplied default
> {{INITIAL_VALUE_NANOS}} is used, but this is then converted as if it were a
> value in millis, inflating it by a factor of 1,000,000.
> The effective max interval in this case should be 2 seconds, but instead
> becomes 23 days, 3 hours, 33 minutes & 20 seconds.
> The net effect is that intervals far longer than expected can be recorded if
> nodes are intermittently partitioned but not restarted (meaning they retain
> the same gossip generation).
> In turn, this can cause the phi calculation to react to those nodes much more
> slowly, as the mean arrival time interval is much bigger than expected,
> leaving them marked as {{UP}} when they should be {{DOWN}}.
> If {{FD_MAX_INTERVAL_MS}} is overridden, then the conversion, and so the
> returned value, is correct (assuming an appropriately scaled value is
> supplied; there is no guardrail to ensure that). Versions earlier than 5.0
> are not affected.
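
The arithmetic in the description can be checked with a short standalone sketch. It assumes a 2-second default held in nanoseconds, as stated above. The class and method names are hypothetical, and the "fixed" variant is only one possible way to avoid the double conversion, not necessarily the patch applied for this ticket.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class MaxIntervalSketch
{
    // Assumed stand-in for FailureDetector.INITIAL_VALUE_NANOS (2 seconds in nanos).
    static final long INITIAL_VALUE_NANOS = TimeUnit.SECONDS.toNanos(2);

    // Mirrors the reported behaviour: when no override is given, the nanosecond
    // default is converted as if it were a value in milliseconds.
    static long buggyMaxIntervalNanos(Long overrideMillis)
    {
        long newValue = overrideMillis != null ? overrideMillis : INITIAL_VALUE_NANOS;
        return TimeUnit.NANOSECONDS.convert(newValue, TimeUnit.MILLISECONDS);
    }

    // One possible correction: convert only a supplied override (in millis)
    // and return the nanosecond default unchanged.
    static long fixedMaxIntervalNanos(Long overrideMillis)
    {
        return overrideMillis != null
               ? TimeUnit.MILLISECONDS.toNanos(overrideMillis)
               : INITIAL_VALUE_NANOS;
    }

    public static void main(String[] args)
    {
        Duration buggy = Duration.ofNanos(buggyMaxIntervalNanos(null));
        // 2e9 "millis" = 2,000,000 seconds = 23 days, 3 hours, 33 minutes, 20 seconds
        System.out.println(buggy.toDays() + "d " + buggy.toHoursPart() + "h "
                           + buggy.toMinutesPart() + "m " + buggy.toSecondsPart() + "s");
        System.out.println(Duration.ofNanos(fixedMaxIntervalNanos(null)).getSeconds() + "s");
    }
}
```

Both variants agree when an override in milliseconds is supplied, which matches the observation that only the default path is affected.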