[jira] [Updated] (CASSANDRA-20054) Get Harry working on top of Accord and fix various issues found by TopologyMixupTestBase

David Capwell (Jira) Wed, 06 Nov 2024 10:56:07 -0800


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Capwell updated CASSANDRA-20054:
--------------------------------------
    Status: Ready to Commit  (was: Review In Progress)

+1 from Alex in GH and Slack

> Get Harry working on top of Accord and fix various issues found by 
> TopologyMixupTestBase
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20054
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20054
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord, Test/fuzz
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 5.x
>
>
> TopologyMixupTestBase has been useful at finding a lot of unexpected issues, 
> and adding Harry on top of Accord at this layer should help validate Accord 
> correctness while also testing stability.
> Running these tests uncovered several bugs:
> 1) vtable showing which txns are blocking the queried table would throw an 
> error when a txn isn’t known, which is a valid state (report historic 
> transaction…)
> 2) AccordCommandStore submitted sync requests in a blocking manner, but did 
> this on a CommandStore… this led to a 5 minute deadlock
> 3) MajorityDepsFetcher would have a deadlock as it triggers waiting 
> notifications while holding the lock, and the waiting callers then access 
> more locks, such as the config service lock
> 4) when restarting and learning about removed nodes, AccordService is not 
> set up yet, so we need to pass this through to avoid startup issues
> 5) When Accord asks TCM for the epoch history, there were no retries, which 
> would cause stability issues during startup
> 6) when learning about min epochs needed for startup, purge all starting 
> epochs that are empty as they aren’t needed and only add cost to startup
> 7) when nodes leave the cluster we did not start durability sync (this isn’t 
> working, but that’s a different issue… durability sync requires ALL which 
> isn’t possible)
> 8) TCM’s getLogEntries method hit an edge case with snapshots where it 
> assumed the API was inclusive, but it’s exclusive; this caused a gap in epochs
> 9) JVM Dtest now supports startup timeouts; this avoids cases where startup 
> never completes (due to bugs), causing CI to throw away the logs.
> 10) fixed a race condition bug in Harry where the TokenPlacementModel could 
> see a partial row causing NPEs down the line
> 11) Fixed a bug in Harry where Accord timeouts would not retry as they don’t 
> have the expected message
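
For illustration only: item 3 describes a common deadlock shape, where waiters are notified while a lock is held and those waiters then try to take other locks. A minimal Java sketch of one usual remedy follows: snapshot the waiters under the lock, then invoke them after releasing it. The class WaiterRegistry and its methods are hypothetical names invented for this sketch; this is not the actual MajorityDepsFetcher code, and the real fix in the patch may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not taken from the Cassandra or Accord code base.
final class WaiterRegistry {
    private final Object lock = new Object();
    private final List<Runnable> waiters = new ArrayList<>();
    private boolean done = false;

    // Registers a callback. If the work already completed, the callback runs
    // immediately, but only after the lock has been released.
    void onDone(Runnable callback) {
        synchronized (lock) {
            if (!done) {
                waiters.add(callback);
                return;
            }
        }
        callback.run();
    }

    // Marks the work complete and notifies waiters. The waiter list is copied
    // while holding the lock; the callbacks themselves run outside it, so a
    // callback that takes another lock cannot deadlock against this one.
    void complete() {
        List<Runnable> toNotify;
        synchronized (lock) {
            if (done) {
                return; // already completed; waiters were notified earlier
            }
            done = true;
            toNotify = new ArrayList<>(waiters);
            waiters.clear();
        }
        for (Runnable waiter : toNotify) {
            waiter.run();
        }
    }
}
```

The trade-off in this sketch is that callbacks may observe state that changed after the lock was released, so they should not assume they run atomically with complete().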



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
