weizhouapache opened a new pull request, #10514: URL: https://github.com/apache/cloudstack/pull/10514
### Description

This PR aims to improve the handling of some agent commands and answers.

#### Current process

Many CloudStack operations require communication between the management server and the CloudStack agent.
The normal process is

management server --> send commands to agents --> agents process the commands --> agents send the answers to management server --> management server processes the answers

Each operation might involve one or more of the round trips above.

#### Issues in some scenarios

Normally the process works fine. However, there are issues in some scenarios
- agent loses connectivity to management server
- agent crashes
- agent is stopped or restarted
- management server is restarted (**not in scope of FR doc**)

Consider the following examples
- The agent has processed the command completely and sent the answer to the management server, but the answer is not received on the management server (connectivity issue, or worker threads are all in use), so the command times out on the management server.
- The agent has asked a 3rd party (for example libvirt or ScaleIO) to process the command, but the agent crashes, which leaves resources in an inconsistent state. For example, the vm has been migrated or the volume has been copied to the destination pool, but the management server does not get the results.
- The management server is restarted while a vm/volume is being migrated, which leaves the vm/volume stuck in the Migrating state.

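The first example above (the answer is sent but never received) can be illustrated with a small simulation. This is only a sketch under stated assumptions: `Agent`, `send_and_wait` and the string answers are hypothetical names invented for illustration and are not part of the CloudStack code base, which is written in Java.

```python
import queue
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for an agent; not the real CloudStack agent API."""
    deliver_answer: bool  # False simulates an answer lost on the way back
    resource_state: str = "on source host"

    def process(self, command: str, answers: "queue.Queue[str]") -> None:
        # The work itself always completes on the hypervisor/storage side.
        self.resource_state = "on destination host"
        # The answer only reaches the management server if delivery works.
        if self.deliver_answer:
            answers.put(f"{command}: success")


def send_and_wait(agent: Agent, command: str, timeout_s: float) -> str:
    """Management-server side: send a command, then wait for its answer."""
    answers: "queue.Queue[str]" = queue.Queue()
    agent.process(command, answers)
    try:
        return answers.get(timeout=timeout_s)
    except queue.Empty:
        # The server cannot tell "never executed" from "executed, answer lost".
        return f"{command}: timed out"


healthy = Agent(deliver_answer=True)
print(send_and_wait(healthy, "MigrateCommand", 0.1))  # MigrateCommand: success

lossy = Agent(deliver_answer=False)
print(send_and_wait(lossy, "MigrateCommand", 0.1))    # MigrateCommand: timed out
print(lossy.resource_state)                           # on destination host
```

In the second run the management server sees a timeout although the resource has already moved, which is the kind of inconsistency the examples above describe.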
#### Operations to address

This FR focuses on the following operations
- migrate vm
- migrate vm with volumes
- migrate volume to another pool

The backend processes can be found at https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337678693#AsyncAgentCommandReconciliation-4.1BackendcommandsofVMandvolumemigrations

### Main changes

Design doc: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Async+Agent+Command+Reconciliation

#### Global settings

* reconcile.commands.enabled (default: true)
* reconcile.commands.interval (default: 60)
* reconcile.commands.max.attempts (default: 30)
* reconcile.commands.workers (default: 100)

#### New terminology: Reconcile commands

- Add command state: CommandInfo.State
  * Management server: CREATED -> COMPLETED/FAILED or INTERRUPTED/TIMED_OUT -> RECONCILING -> RECONCILED/RECONCILE_RETRY/RECONCILE_FAILED
  * Agent: STARTED -> PROCESSING -> COMPLETED/FAILED/INTERRUPTED
- Add property to Command: isReconcile (false by default). It is true for 3 commands
  * CopyCommand
  * MigrateCommand
  * MigrateVolumeCommand

How it works
* the management server creates a record in the reconcile_commands table when it creates the command
* the agent updates the command/answer in a JSON file while it processes the command
* the agent syncs with the management server every 60 seconds (ping.interval)
* when the management server receives the update from the agent, it updates the database
* when the agent receives the PingAnswer from the management server, it removes the JSON files with state=COMPLETED/FAILED

For reconcile commands, during stop/start of mgmt server and agent
* when the agent is stopped/started, it updates the state (by agent) to INTERRUPTED in the JSON files and syncs with the mgmt server
* when the mgmt server is stopped/started, it updates the state (by management) to INTERRUPTED in the database
* every minute, the management server loads the reconcile commands and reconciles those in INTERRUPTED or TIMED_OUT state

#### Improvement on management server when waiting for the answer of reconcile commands

* while waiting, every 60 seconds, it reads the state and answer from the database
* if the state is INTERRUPTED or FAILED, it terminates the process
* if the state is COMPLETED and answer is null, it processes the answer as normal
* <b>This fixes the intermittent connection failure between agent and management server</b>
* when the wait times out, it updates the state (by mgmt) to TIMED_OUT in the database for further reconciliation

#### Improvement on VM migration with/without volumes

* If the operation fails with an operation timeout exception, it might be because of a connection failure between the source host and the management server
  * check whether the vm is Running on the destination host
  * if it is, consider the migration a success, and commit the network and volume changes
  * if not, destroy the vm on the destination host and roll back the network/volume changes

#### Fixes after Volume migration

* on NFS, fix new volumes left in the Creating state after a failed migration
* on PowerFlex, fix the read-only VM issue by checking via the ScaleIO gateway whether the volume has been migrated

#### Improvement on Agent

* Add MigrationCancelHook to terminate vm migration jobs when the agent is disconnected or restarted
* Add VolumeMigrationCancelHook to terminate block copy jobs when the agent is disconnected or restarted

### Test results

It has been tested by dev on NFS and PowerFlex.
Refer to https://cwiki.apache.org/confluence/display/CLOUDSTACK/Async+Agent+Command+Reconciliation#AsyncAgentCommandReconciliation-4.3Summaryoftestresults

<!--- Describe your changes in DETAIL - And how has behaviour functionally changed. -->
<!-- For new features, provide link to FS, dev ML discussion etc. -->
<!-- In case of bug fix, the expected and actual behaviours, steps to reproduce.
-->
<!-- When "Fixes: #<id>" is specified, the issue/PR will automatically be closed when this PR gets merged -->
<!-- For addressing multiple issues/PRs, use multiple "Fixes: #<id>" -->
<!-- Fixes: # -->

<!--- ******************************************************************************* -->
<!--- NOTE: AUTOMATION USES THE DESCRIPTIONS TO SET LABELS AND PRODUCE DOCUMENTATION. -->
<!--- PLEASE PUT AN 'X' in only **ONE** box -->
<!--- ******************************************************************************* -->

### Types of changes

- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
- [ ] test (unit or integration test code)

### Feature/Enhancement Scale or Bug Severity

#### Feature/Enhancement Scale

- [ ] Major
- [ ] Minor

#### Bug Severity

- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [ ] Minor
- [ ] Trivial

### Screenshots (if appropriate):

### How Has This Been Tested?

<!-- Please describe in detail how you tested your changes. -->
<!-- Include details of your testing environment, and the tests you ran to -->

#### How did you try to break this feature and the system with this change?

<!-- see how your change affects other areas of the code, etc. -->

<!-- Please read the [CONTRIBUTING](https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md) document -->

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org