[ https://issues.apache.org/jira/browse/CAUSEWAY-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Keir Haywood resolved CAUSEWAY-3775.
-------------------------------------------
    Resolution: Fixed

> Improve background commands job failure handling
> ------------------------------------------------
>
> Key: CAUSEWAY-3775
> URL: https://issues.apache.org/jira/browse/CAUSEWAY-3775
> Project: Causeway
> Issue Type: Bug
> Components: Core, Ext Core CommandLog
> Affects Versions: 2.0.0
> Reporter: Daniel Keir Haywood
> Assignee: Daniel Keir Haywood
> Priority: Minor
> Fix For: 2.1.0
>
>
> was: Suspect that failure of background command results in locks being held
> as attempt to sync exception.
>
> having looked at the code, can't find a root cause for this per se, but there
> were several bugs:
> 1) we call transactionService.callInTran(...) 3 times. If there's a failure
> then the two innermost calls correctly detect it and set the transaction to
> MUST_ABORT. The outermost one didn't, and attempted to commit ... but then
> the MUST_ABORT is noticed and so it is effectively a rollback after all.
> 2) the startedAt on the CommandLogEntry is initially populated when the
> execution starts, but ends up being rolled back to null if there's a failure.
> Meanwhile, the recovery processing was setting the completedAt and the
> exception stack trace, but the startedAt was left untouched, ie null.
> Fortuitously, the query to find non-started commands looks at startedAt
> rather than completedAt, and so the command is re-executed.
> 3) the logic to execute each command in its own interaction was using
> REQUIRES_NEW, which suspends the top-level transaction that's started
> automatically when interactionService.call..() is called. A failure within
> the new transaction marks that new transaction as MUST_ABORT, but once it has
> finished, the original transaction is resumed and is still healthy,
> resulting in the onCompletion publishing being performed as normal. This
> would include the audit trail (EntityChangeTrackerDefault).
>
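
Point 3 can be illustrated with a toy model. This is only a sketch of the reasoning, not the Causeway or Spring implementation: the class PropagationSketch and everything in it (Tx, Work, runRequiresNew, runJoined, publishesOnCompletion) are hypothetical names invented for the illustration.

```java
// Toy model only: these types are hypothetical and are not Spring or Causeway APIs.
final class PropagationSketch {

    /** Minimal stand-in for a transaction that can be flagged as must-abort. */
    static final class Tx {
        boolean mustAbort;
    }

    /** Stand-in for the execution of one background command. */
    interface Work {
        void run();
    }

    /**
     * Models REQUIRES_NEW: the work runs in a fresh transaction while the
     * outer one is suspended, so a failure never reaches the outer one.
     */
    static void runRequiresNew(Tx outer, Work work) {
        Tx inner = new Tx();
        try {
            work.run();
        } catch (RuntimeException e) {
            inner.mustAbort = true; // only the inner transaction is marked
        }
        // The inner transaction ends here; the outer one is resumed unchanged.
    }

    /** Models joining the existing transaction: a failure marks the outer one. */
    static void runJoined(Tx outer, Work work) {
        try {
            work.run();
        } catch (RuntimeException e) {
            outer.mustAbort = true;
        }
    }

    /** Completion publishing (auditing etc.) runs only if the outer transaction is healthy. */
    static boolean publishesOnCompletion(Tx outer) {
        return !outer.mustAbort;
    }
}
```

In this model a failing command run through runRequiresNew leaves the outer transaction healthy, so completion publishing still happens; run through runJoined, the same failure marks the outer transaction and publishing is skipped.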
> the fixes are:
> 1) make sure that rollback is called for the outermost transaction.
> 2) provide a new config option
> (causeway.extensions.command-log.run-background-commands.on-failure-policy)
> to decide whether to STOP_THE_LINE (this was the accidental but safest
> original behaviour) or to CONTINUE_WITH_NEXT. If the latter, then we do
> correctly log the exception and move on.
> 3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the
> failure is correctly captured. This means that no command completion
> listeners (eg auditing) are invoked if there was a failure.
>
> to summarise, the new behaviour is (or is intended to be):
> * if onFailurePolicy = STOP_THE_LINE (the safe, fail-fast mode)
> ** execute command (CommandLogEntry)
> *** if succeeds, then run publish on interaction completion (auditing etc)
> *** if deadlocks, attempt up to 3x, else treat as failure
> *** if fails, then do nothing.
> ** the quartz job will pick up the _same_ command again in 10 seconds.
> * if onFailurePolicy = CONTINUE_WITH_NEXT
> ** execute command
> *** if succeeds, then run publish on interaction completion (auditing etc)
> *** if deadlocks, attempt up to 3x, else treat as failure
> *** if fails, then update CommandLogEntry as completed and set its exception
> prop to hold the stack trace.
> ** the quartz job will pick up the next command in 10 seconds.
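
The summarised behaviour can be sketched as a small decision routine. This is a minimal illustration under stated assumptions, not the actual job code: BackgroundCommandSketch, Outcome, DeadlockException, CommandExecution and runOnce are hypothetical names, and "attempt up to 3x" is read here as at most three attempts in total. Only the policy values STOP_THE_LINE and CONTINUE_WITH_NEXT come from the issue text.

```java
// Hypothetical sketch; none of these types are Causeway APIs.
final class BackgroundCommandSketch {

    /** The two policy values named in the issue. */
    enum OnFailurePolicy { STOP_THE_LINE, CONTINUE_WITH_NEXT }

    /** What one job run does with one command (names are illustrative). */
    enum Outcome {
        SUCCEEDED,                    // completion publishing (auditing etc) runs
        RETRY_SAME_COMMAND_NEXT_RUN,  // nothing recorded; same command is picked up again
        MARKED_FAILED_MOVE_ON         // entry completed with stack trace; next command is picked up
    }

    /** Stand-in for a deadlock reported by the database. */
    static final class DeadlockException extends RuntimeException {
        DeadlockException(String message) { super(message); }
    }

    /** Stand-in for executing one CommandLogEntry. */
    interface CommandExecution {
        void run();
    }

    /** Assumption: "up to 3x" means three attempts in total. */
    static final int MAX_ATTEMPTS = 3;

    static Outcome runOnce(CommandExecution execution, OnFailurePolicy policy) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                execution.run();
                return Outcome.SUCCEEDED;
            } catch (DeadlockException e) {
                // Deadlock: loop round and retry while attempts remain.
            } catch (RuntimeException e) {
                break; // any other failure is not retried
            }
        }
        // Failure (or deadlock retries exhausted): no completion publishing.
        return policy == OnFailurePolicy.STOP_THE_LINE
                ? Outcome.RETRY_SAME_COMMAND_NEXT_RUN
                : Outcome.MARKED_FAILED_MOVE_ON;
    }
}
```

The policy only matters once a command has failed; the success and deadlock-retry paths are the same in both modes.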