[ https://issues.apache.org/jira/browse/CASSANDRA-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943238#comment-17943238 ]
Brandon Williams commented on CASSANDRA-19633:
----------------------------------------------

Feature flag LGTM, +1.

> Replaced node is stuck in a loop calculating ranges
> ---------------------------------------------------
>
>                 Key: CASSANDRA-19633
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19633
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jai Bheemsen Rao Dhanwada
>            Assignee: Marcus Eriksson
>            Priority: Normal
>              Labels: Bootstrap
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: result1.html
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Hello,
>
> I am running into an issue where a node that is replacing a dead (non-seed)
> node gets stuck calculating ranges for a very long time. It eventually
> succeeds, but the time taken to calculate the ranges is not constant; I
> sometimes see it take 24 hours to calculate ranges for each keyspace.
> Attached is the flame graph of the Cassandra process during this time, which
> points to the code below.
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> getRangeFetchMapForNonTrivialRanges()
> {
>     //Get the graph with edges between ranges and their source endpoints
>     MutableCapacityGraph<Vertex, Integer> graph = getGraph();
>     //Add source and destination vertex and edges
>     addSourceAndDestination(graph, getDestinationLinkCapacity(graph));
>     int flow = 0;
>     MaximumFlowAlgorithmResult<Integer, CapacityEdge<Vertex, Integer>> result = null;
>     //We might not be working on all ranges
>     while (flow < getTotalRangeVertices(graph))
>     {
>         if (flow > 0)
>         {
>             //We could not find a path with previous graph. Bump the capacity b/w endpoint vertices and destination by 1
>             incrementCapacity(graph, 1);
>         }
>         MaximumFlowAlgorithm fordFulkerson = FordFulkersonAlgorithm.getInstance(DFSPathFinder.getInstance());
>         result = fordFulkerson.calc(graph, sourceVertex, destinationVertex, IntegerNumberSystem.getInstance());
>         int newFlow = result.calcTotalFlow();
>         assert newFlow > flow; //We are not making progress which should not happen
>         flow = newFlow;
>     }
>     return getRangeFetchMapFromGraphResult(graph, result);
> }
> {code}
> Digging through the logs, I see the log line below for a given keyspace,
> `system_auth`:
> {code:java}
> INFO [main] 2024-05-10 17:35:02,489 RangeStreamer.java:330 - Bootstrap: range Full(/10.135.56.214:7000,(5080189126057290696,5081324396311791613]) exists on Full(/10.135.56.157:7000,(5080189126057290696,5081324396311791613]) for keyspace system_auth{code}
> The corresponding code:
> {code:java}
> for (Map.Entry<Replica, Replica> entry : fetchMap.flattenEntries())
>     logger.info("{}: range {} exists on {} for keyspace {}", description, entry.getKey(), entry.getValue(), keyspaceName);{code}
> However, I do not see the line below for the corresponding keyspace:
> {code:java}
> RangeStreamer.java:606 - Output from RangeFetchMapCalculator for keyspace{code}
> This means the code is stuck in `getRangeFetchMap();`
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> rangeFetchMapMap = calculator.getRangeFetchMap();
> logger.info("Output from RangeFetchMapCalculator for keyspace {}", keyspace);{code}
> Here is the cluster topology:
> * Cassandra version: 4.0.12
> * # of nodes: 190
> * Tokens (vnodes): 128
> * Profile attached
> My initial hypothesis was that the graph calculation was taking longer due to
> the combination of nodes + tokens + tables, but in the same cluster I saw one
> of the nodes join without any issues.
> I am wondering whether I am hitting a bug that lets it work sometimes but
> sends it into an infinite loop at other times.
> Please let me know if you need any other details. I would appreciate any
> pointers for debugging this further.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
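
The shape of the loop quoted in the issue description (run a maximum-flow computation, and if not every range found a source, raise the per-endpoint capacity by 1 and run it again from scratch) can be illustrated with a minimal, self-contained Java sketch. This is not Cassandra's RangeFetchMapCalculator and does not use its graph library: the class name CapacityBumpSketch, the array-based input, and the starting capacity of 1 are all assumptions made for illustration. It only shows that each capacity bump repeats the whole augmenting-path search, so total work grows with the number of rounds.

```java
import java.util.Arrays;

/** Hypothetical sketch: pick one source endpoint per range, bumping capacity until all fit. */
final class CapacityBumpSketch {
    /** Number of full max-flow computations performed by the last assign() call. */
    static int maxFlowRuns;

    /**
     * sources[r] lists the endpoints (0..endpoints-1) that hold range r.
     * Returns owner[r], the endpoint chosen as the source for range r.
     */
    static int[] assign(int[][] sources, int endpoints) {
        int ranges = sources.length;
        int capacity = 1; // assumption: start by allowing one range per endpoint
        int flow = 0;
        maxFlowRuns = 0;
        int[] owner = new int[ranges];
        while (true) {
            // As in the quoted loop, every round recomputes the flow from scratch.
            Arrays.fill(owner, -1);
            int[] load = new int[endpoints];
            int newFlow = 0;
            for (int r = 0; r < ranges; r++) {
                if (augment(r, sources, owner, load, capacity, new boolean[endpoints])) {
                    newFlow++;
                }
            }
            maxFlowRuns++;
            if (newFlow == ranges) {
                return owner; // every range has a source
            }
            if (newFlow <= flow) {
                // Mirrors the assert in the quoted code: no progress should not happen.
                throw new IllegalStateException("no progress; does some range have no source?");
            }
            flow = newFlow;
            capacity++; // bump the endpoint capacity by 1 and retry
        }
    }

    /** DFS augmenting path: place range r, moving already placed ranges if needed. */
    private static boolean augment(int r, int[][] sources, int[] owner, int[] load,
                                   int capacity, boolean[] visited) {
        for (int e : sources[r]) {
            if (visited[e]) {
                continue;
            }
            visited[e] = true;
            if (load[e] < capacity) {
                owner[r] = e;
                load[e]++;
                return true;
            }
            // Endpoint e is full: try to move one of its ranges to another endpoint.
            for (int other = 0; other < owner.length; other++) {
                if (owner[other] == e && augment(other, sources, owner, load, capacity, visited)) {
                    owner[r] = e; // r takes the slot that 'other' vacated; load[e] is unchanged
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Ranges 0 and 1 exist only on endpoint 0; range 2 exists on endpoints 0 and 1.
        int[] owner = assign(new int[][] {{0}, {0}, {0, 1}}, 2);
        System.out.println("owners=" + Arrays.toString(owner) + " maxFlowRuns=" + maxFlowRuns);
    }
}
```

In this toy input two ranges are available only from endpoint 0, so a capacity of 1 cannot place all three ranges and a second full round is needed. How many rounds a real cluster needs depends on how its ranges are spread over source endpoints, which this sketch does not attempt to model.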