[ 
https://issues.apache.org/jira/browse/CAUSEWAY-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Keir Haywood resolved CAUSEWAY-3775.
-------------------------------------------
    Resolution: Fixed

> Improve background commands job failure handling
> ------------------------------------------------
>
>                 Key: CAUSEWAY-3775
>                 URL: https://issues.apache.org/jira/browse/CAUSEWAY-3775
>             Project: Causeway
>          Issue Type: Bug
>          Components: Core, Ext Core CommandLog
>    Affects Versions: 2.0.0
>            Reporter: Daniel Keir Haywood
>            Assignee: Daniel Keir Haywood
>            Priority: Minor
>             Fix For: 2.1.0
>
>
> was: Suspect that a failure of a background command results in locks being 
> held as it attempts to sync the exception.
>  
> having looked at the code, can't find a root cause for this per se, but there 
> were several bugs:
> 1) we call transactionService.callInTran(...) 3 times.  If there's a failure 
> then the two innermost calls correctly detect it and set the xactn to 
> MUST_ABORT; the outermost one didn't and attempted to commit ... but then 
> the MUST_ABORT is noticed and so it is effectively a rollback after all.
> 2) if there is a failure then the startedAt on the CommandLogEntry, which 
> is initially populated when the execution starts, ends up being rolled 
> back to null.  Meanwhile, the recovery processing was setting the 
> completedAt and the exception stacktrace, but leaving the startedAt 
> untouched, ie null.
> Fortuitously, the query to find non-started commands looks at the startedAt 
> rather than completedAt, and so re-executes the command.
> 3) the logic to execute each command in its own interaction was using 
> REQUIRES_NEW, which suspends the top-level transaction that's started 
> automatically when the interactionService.call..() is called.  A failure 
> within the nested transaction marks that new transaction as MUST_ABORT, but 
> once it has finished, the original transaction is resumed and that is still 
> healthy, resulting in the onCompletion publishing being performed as normal.  
> This would include the audit trail (EntityChangeTrackerDefault).
> the fixes are:
> 1) make sure that rollback is called for the outermost.
> 2) provide a new config option 
> (causeway.extensions.command-log.run-background-commands.on-failure-policy) 
> to decide whether to STOP_THE_LINE (this was the accidental but safest 
> original behaviour) or to CONTINUE_WITH_NEXT.  If the latter, then we 
> correctly log the exception and move on.
> 3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the 
> failure is correctly captured.  This means that no command completion 
> listeners (eg auditing) are run if there was a failure. 
>  
> to summarise, the new behaviour is (or is intended to be):
>  * if onFailurePolicy = STOP_THE_LINE   (the safe, fail-fast mode)
>  ** execute command (CommandLogEntry)
>  *** if succeeds, then run publish on interaction completion (auditing etc)
>  *** if deadlocks, attempt up to 3x else treat as failure
>  *** if fails, then do nothing.  
>  ** The quartz job will pick up the _same_ command again in 10 seconds.  
>  * if onFailurePolicy = CONTINUE_WITH_NEXT
>  ** execute command
>  *** if succeeds, then run publish on interaction completion (auditing etc)
>  *** if deadlocks, attempt up to 3x else treat as failure
>  *** if fails, then update CommandLogEntry as completed and its exception 
> prop to hold the stack trace.  
>  ** the quartz job will pick up the next command in 10 seconds.
>  
>  
>  
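The two policies summarised above can be illustrated with a minimal, self-contained sketch. This is not the Causeway implementation: the class OnFailurePolicySketch, the Entry type (a stand-in for CommandLogEntry), the tick() method and the "completed" flag are all hypothetical names chosen for illustration; only the two policy values come from the issue. Deadlock retries and transaction handling are deliberately left out.

```java
import java.util.List;
import java.util.function.Consumer;

public class OnFailurePolicySketch {

    // The two policy values named in the issue.
    enum OnFailurePolicy { STOP_THE_LINE, CONTINUE_WITH_NEXT }

    // Hypothetical stand-in for a CommandLogEntry.
    static final class Entry {
        final String name;
        boolean completed;   // stands in for completedAt being set
        String exception;    // stands in for the stored stack trace
        Entry(String name) { this.name = name; }
    }

    /**
     * One run of the (assumed) scheduled job: attempts the first entry that is
     * not yet completed and returns its name, or null if nothing is pending.
     */
    static String tick(List<Entry> entries, Consumer<Entry> executor, OnFailurePolicy policy) {
        for (Entry entry : entries) {
            if (entry.completed) {
                continue;
            }
            try {
                executor.accept(entry);
                entry.completed = true;
            } catch (RuntimeException ex) {
                if (policy == OnFailurePolicy.CONTINUE_WITH_NEXT) {
                    // Record the failure so that the next run moves on.
                    entry.completed = true;
                    entry.exception = ex.toString();
                }
                // STOP_THE_LINE: record nothing, so the same entry is selected again.
            }
            return entry.name;
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative executor: the entry named "bad" always fails.
        Consumer<Entry> executor = e -> {
            if (e.name.equals("bad")) {
                throw new IllegalStateException("boom");
            }
        };

        List<Entry> stop = List.of(new Entry("bad"), new Entry("good"));
        System.out.println(tick(stop, executor, OnFailurePolicy.STOP_THE_LINE));
        System.out.println(tick(stop, executor, OnFailurePolicy.STOP_THE_LINE));

        List<Entry> cont = List.of(new Entry("bad"), new Entry("good"));
        System.out.println(tick(cont, executor, OnFailurePolicy.CONTINUE_WITH_NEXT));
        System.out.println(tick(cont, executor, OnFailurePolicy.CONTINUE_WITH_NEXT));
    }
}
```

Under STOP_THE_LINE the failing entry blocks everything queued behind it until it is fixed, which is why the issue calls it the safe, fail-fast mode; under CONTINUE_WITH_NEXT the failure is recorded on the entry and later commands still run.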



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
