David Capwell created CASSANDRA-20054:
-----------------------------------------
Summary: Get Harry working on top of Accord and fix various issues
found by TopologyMixupTestBase
Key: CASSANDRA-20054
URL: https://issues.apache.org/jira/browse/CASSANDRA-20054
Project: Cassandra
Issue Type: Bug
Components: Accord, Test/fuzz
Reporter: David Capwell
Assignee: David Capwell
TopologyMixupTestBase has been useful for finding a lot of unexpected issues,
and adding Harry on top of Accord at this layer should help validate Accord
correctness while also testing stability.
Running these tests uncovered several bugs:
1) The vtable showing which transactions are blocking the queried table threw
an error when a transaction wasn’t known, which is a valid state (report
historic transaction…)
2) AccordCommandStore submitted sync requests in a blocking manner, but did
this on a CommandStore… this led to a 5-minute deadlock
3) MajorityDepsFetcher could deadlock because it triggered waiting
notifications while holding its lock, and the notified callers then acquired
more locks, such as the config service lock
4) When restarting and learning about removed nodes, AccordService is not set
up yet, so this needs to be passed through to avoid startup issues
5) When Accord asks TCM for the epoch history, there were no retries, which
could cause stability issues during startup
6) When learning about the minimum epochs needed for startup, purge all empty
starting epochs, as they aren’t needed and only add cost to startup
7) When nodes left the cluster we did not start durability sync (this isn’t
working, but that’s a different issue… durability sync requires ALL, which
isn’t possible)
8) TCM’s getLogEntries method hit an edge case with snapshots where it assumed
the API was inclusive, but it’s exclusive; this caused a gap in epochs
9) JVM Dtest now supports startup timeouts, to avoid cases where startup never
completes (due to bugs), causing CI to throw away the logs.
10) Fixed a race condition in Harry where the TokenPlacementModel could see a
partial row, causing NPEs down the line
11) Fixed a bug in Harry where Accord timeouts would not be retried, as they
don’t have the expected message
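Item 1 treats an unknown transaction as a normal result rather than an error. A minimal sketch, assuming the view can look up a transaction's status in a map; `BlockingTxnView`, `describe`, and the `UNKNOWN` marker are hypothetical and not the actual vtable code:

```java
import java.util.Map;
import java.util.Optional;

class BlockingTxnView {
    // Hypothetical index of transactions this node currently tracks.
    private final Map<String, String> knownTxnStatus;

    BlockingTxnView(Map<String, String> knownTxnStatus) { this.knownTxnStatus = knownTxnStatus; }

    // Renders a row for a blocking txn id; an id the node no longer knows about
    // is reported with a placeholder status instead of throwing.
    String describe(String txnId) {
        return Optional.ofNullable(knownTxnStatus.get(txnId))
                       .map(status -> txnId + " " + status)
                       .orElse(txnId + " UNKNOWN");
    }
}
```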
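Item 2 is a classic self-deadlock: a task running on a single-threaded executor blocks waiting for work it submitted to that same executor. The sketch below is not Accord's code; a plain single-thread `ExecutorService` stands in for a CommandStore, and `SelfDeadlockSketch`, `syncBlocking`, and `syncAsync` are hypothetical names. A short timeout makes the deadlock observable instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SelfDeadlockSketch {
    // Hypothetical stand-in for a CommandStore: one thread runs all of its tasks.
    // Daemon thread so the sketch does not keep the JVM alive.
    static final ExecutorService store = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "command-store");
        t.setDaemon(true);
        return t;
    });

    // Anti-pattern: a store task submits a sync to the same store and blocks on it.
    // The sync cannot start until this task finishes, so the wait only ends by timeout.
    static String syncBlocking(long timeoutMillis) throws Exception {
        return store.submit(() -> {
            CompletableFuture<String> sync = CompletableFuture.supplyAsync(() -> "synced", store);
            try {
                return sync.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                return "deadlocked";
            }
        }).get();
    }

    // Fix: the store task schedules the sync and returns immediately; the result
    // is composed asynchronously, so the store thread is never blocked.
    static CompletableFuture<String> syncAsync() {
        return CompletableFuture
            .supplyAsync(() -> CompletableFuture.supplyAsync(() -> "synced", store), store)
            .thenCompose(inner -> inner);
    }
}
```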
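Item 3 is a lock-ordering hazard: completing waiters while holding a lock lets their callbacks acquire other locks (such as the config service lock) while the first lock is still held. A common remedy, sketched here under the assumption that waiters are plain `Runnable` callbacks, is to collect the callbacks under the lock and run them only after releasing it. `DepsWaiters` and its methods are hypothetical, not MajorityDepsFetcher's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class DepsWaiters {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<Runnable> waiters = new ArrayList<>();
    private boolean done;

    // Registers a callback; if already complete, runs it immediately outside the lock.
    void onDone(Runnable callback) {
        lock.lock();
        try {
            if (!done) {
                waiters.add(callback);
                return;
            }
        } finally {
            lock.unlock();
        }
        callback.run();
    }

    // Marks completion. Callbacks are copied under the lock but invoked after it is
    // released, so a callback that takes other locks cannot form a cycle with this one.
    void complete() {
        List<Runnable> toNotify;
        lock.lock();
        try {
            if (done) return;
            done = true;
            toNotify = new ArrayList<>(waiters);
            waiters.clear();
        } finally {
            lock.unlock();
        }
        for (Runnable r : toNotify) r.run();
    }

    boolean isLockHeldByCurrentThread() { return lock.isHeldByCurrentThread(); }
}
```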
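Item 5 adds retries around Accord's request to TCM for epoch history. A generic bounded retry with exponential backoff is sketched below; `RetrySketch.withRetries` and its parameters are illustrative assumptions, and the actual fix may use a different retry policy.

```java
import java.util.concurrent.Callable;

class RetrySketch {
    // Calls op up to maxAttempts times, sleeping with exponential backoff between
    // failures; rethrows the last failure if every attempt fails.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long initialBackoffMillis) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        long backoff = initialBackoffMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw last;
    }
}
```

Bounding the attempts keeps a persistently unavailable TCM from stalling startup forever, while still riding out transient failures.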
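Item 6 drops empty epochs from the start of what a node must learn at startup. A minimal sketch, reading "starting epochs" as the leading prefix and assuming an epoch is modelled as a number plus the ranges it owns; `EpochTrimSketch`, `Epoch`, and `trimLeadingEmpty` are hypothetical.

```java
import java.util.List;
import java.util.Set;

class EpochTrimSketch {
    // Hypothetical model: an epoch number and the token ranges it owns.
    static final class Epoch {
        final long epoch;
        final Set<String> ranges;
        Epoch(long epoch, Set<String> ranges) { this.epoch = epoch; this.ranges = ranges; }
    }

    // Skips the leading epochs that own no ranges. Empty epochs after the first
    // non-empty one are kept so the remaining history stays contiguous.
    static List<Epoch> trimLeadingEmpty(List<Epoch> epochs) {
        int first = 0;
        while (first < epochs.size() && epochs.get(first).ranges.isEmpty()) first++;
        return epochs.subList(first, epochs.size());
    }
}
```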
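Item 8 is an off-by-one at an API boundary. One way such a gap can arise, sketched with a hypothetical `entriesAfter(log, since)` that returns epochs strictly greater than `since`: a caller that believes the API is inclusive asks for `snapshotEpoch + 1` to skip the snapshot, and the exclusive semantics then skip one more epoch. The real `getLogEntries` signature and snapshot handling are not shown in this ticket; this is only an illustration of the failure mode.

```java
import java.util.List;
import java.util.stream.Collectors;

class EpochRangeSketch {
    // Hypothetical exclusive API: returns epochs strictly greater than since.
    static List<Long> entriesAfter(List<Long> log, long since) {
        return log.stream().filter(e -> e > since).collect(Collectors.toList());
    }

    // Bug: assumes inclusive semantics and adds 1 to skip the snapshot epoch,
    // so epoch snapshotEpoch + 1 is never replayed.
    static List<Long> replayAfterSnapshotBuggy(List<Long> log, long snapshotEpoch) {
        return entriesAfter(log, snapshotEpoch + 1);
    }

    // Fix: pass the snapshot epoch directly; exclusivity already skips it.
    static List<Long> replayAfterSnapshotFixed(List<Long> log, long snapshotEpoch) {
        return entriesAfter(log, snapshotEpoch);
    }
}
```

With a log of epochs 1 to 5 and a snapshot at epoch 2, the buggy call yields epochs 4 and 5 (epoch 3 is missing), while the fixed call yields 3, 4, and 5.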
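Item 9 bounds how long a test waits for node startup. A minimal sketch using only `java.util.concurrent`; `StartupTimeoutSketch.startWithTimeout` is a hypothetical helper, not the JVM Dtest API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class StartupTimeoutSketch {
    // Runs startup on a daemon thread and fails fast if it exceeds the timeout, so a
    // hung startup fails the test instead of running until CI gives up on the job.
    static void startWithTimeout(Runnable startup, long timeout, TimeUnit unit) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-startup");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<?> f = executor.submit(startup);
            try {
                f.get(timeout, unit);
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt the stuck startup
                throw new IllegalStateException("startup did not complete within " + timeout + " " + unit, e);
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```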
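Item 10 is a visibility race: a reader observed a row that was still being filled in. A common remedy, sketched under the assumption that the model can be rebuilt rather than mutated in place, is to build a complete immutable view and publish it with one atomic reference swap, so readers see either the old view or the new one, never a partial row. `PlacementSnapshot` is hypothetical and does not reflect Harry's TokenPlacementModel internals.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class PlacementSnapshot {
    // Readers only ever see a fully built, unmodifiable map.
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Collections.<String, String>emptyMap());

    // Builds the next view privately, then publishes it atomically. The update
    // function has no side effects, so it is safe if updateAndGet retries it.
    void update(String token, String node) {
        current.updateAndGet(old -> {
            Map<String, String> next = new HashMap<>(old);
            next.put(token, node);
            return Collections.unmodifiableMap(next);
        });
    }

    String ownerOf(String token) { return current.get().get(token); }

    Map<String, String> snapshot() { return current.get(); }
}
```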
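Item 11 shows why retry decisions keyed on message text are brittle: a timeout whose message differs from the expected string is not retried. The sketch below checks the exception type along the cause chain instead; `RetryClassifier`, the hypothetical message `"Operation timed out"`, and the use of `java.util.concurrent.TimeoutException` as the marker type are assumptions for illustration, not Harry's actual code.

```java
import java.util.concurrent.TimeoutException;

class RetryClassifier {
    // Brittle: depends on the exact wording of a (hypothetical) expected message.
    static boolean isRetryableByMessage(Throwable t) {
        String msg = t.getMessage();
        return msg != null && msg.contains("Operation timed out");
    }

    // Sturdier: looks for a timeout type anywhere in the cause chain, with a depth
    // bound in case of a cyclic chain.
    static boolean isRetryable(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof TimeoutException) return true;
        }
        return false;
    }
}
```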
--
This message was sent by Atlassian Jira
(v8.20.10#820010)