weizhouapache opened a new pull request, #10514:
URL: https://github.com/apache/cloudstack/pull/10514

   ### Description
   
   This PR improves how some agent commands and their answers are processed.
   
   #### Current process
   
   Many CloudStack operations require communication between the management server and the CloudStack agent.
   The normal process is:
   
   management server --> sends commands to agents --> agents process the commands -->
   agents send the answers to management server --> management server processes the answers
   
   Each operation may go through this process one or more times.
   
   #### Issues in some scenarios
   
   Normally the process works fine. However, problems arise in the following scenarios:
   - agent loses connectivity to management server
   - agent crashes
   - agent is stopped or restarted
   - management server is restarted (**not in scope of FR doc**)
   
   Consider the following examples:
   - the agent has processed the command completely and sent the answer, but the management server never receives it (connectivity issue, or all worker threads are in use), so the command times out on the management server
   - the agent has asked a 3rd party (for example libvirt or ScaleIO) to process the command, but then the agent crashes, which leaves resources in an inconsistent state. For example, the vm has been migrated or the volume has been copied to the destination pool, but the management server does not get the results.
   - the management server is restarted while a vm/volume is being migrated, which leaves the vm/volume stuck in Migrating state.
   
   #### Operations to address
   
   This FR focuses on the following operations
   - migrate vm
   - migrate vm with volumes
   - migrate volume to another pool
   
   The backend processes can be found at
   
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337678693#AsyncAgentCommandReconciliation-4.1BackendcommandsofVMandvolumemigrations
   
   ### Main changes
   
   Design doc: 
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Async+Agent+Command+Reconciliation
   
   #### Global settings
   
      * reconcile.commands.enabled (default: true)
      * reconcile.commands.interval (default: 60)
      * reconcile.commands.max.attempts (default: 30)
      * reconcile.commands.workers (default: 100)
   
   #### New terminology: Reconcile commands
   
   - Add command state: CommandInfo.State
       * Management server: CREATED -> COMPLETED/FAILED or 
INTERRUPTED/TIMED_OUT -> RECONCILING -> 
RECONCILED/RECONCILE_RETRY/RECONCILE_FAILED
       * Agent: STARTED -> PROCESSING -> COMPLETED/FAILED/INTERRUPTED
   - Add property to Command: isReconcile (false by default). It is true for 3 commands:
       * CopyCommand
       * MigrateCommand
       * MigrateVolumeCommand
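As a rough illustration of the terminology above, the states and the isReconcile flag could be modelled like this. It is a minimal Python sketch, not the actual Java code of this PR: the enum names `MgmtState` and `AgentState`, the set `RECONCILE_COMMANDS` and the helper `is_reconcile` are hypothetical, and only the state names and the three command names come from the description.

```python
from enum import Enum


class MgmtState(Enum):
    """Command states as tracked by the management server (names from the PR)."""
    CREATED = "CREATED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    INTERRUPTED = "INTERRUPTED"
    TIMED_OUT = "TIMED_OUT"
    RECONCILING = "RECONCILING"
    RECONCILED = "RECONCILED"
    RECONCILE_RETRY = "RECONCILE_RETRY"
    RECONCILE_FAILED = "RECONCILE_FAILED"


class AgentState(Enum):
    """Command states as tracked by the agent (names from the PR)."""
    STARTED = "STARTED"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    INTERRUPTED = "INTERRUPTED"


# The three commands for which isReconcile is true; everything else is false.
RECONCILE_COMMANDS = frozenset({"CopyCommand", "MigrateCommand", "MigrateVolumeCommand"})


def is_reconcile(command_name: str) -> bool:
    """Hypothetical helper mirroring the isReconcile property (false by default)."""
    return command_name in RECONCILE_COMMANDS
```

For example, `is_reconcile("MigrateCommand")` is true, while any command name outside the set falls back to the default of false.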
   
   How it works
   
     * management server creates a record in the reconcile_commands table when it creates the command
     * agent updates the command/answer in a JSON file while it processes the command
     * agent syncs with management server every 60 seconds (ping.interval)
     * when management server receives the update from agent, it updates the database
     * when agent receives the PingAnswer from management server, it removes the JSON files with state=COMPLETED/FAILED
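The agent-side part of these steps can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the agent's real implementation: the one-file-per-command layout, the JSON field names and the functions `write_command_state`, `load_command_states` and `on_ping_answer` are all hypothetical.

```python
import json
import os
import tempfile


def write_command_state(directory, command_id, state, answer=None):
    """Persist the latest state/answer of a command as <directory>/<command_id>.json."""
    path = os.path.join(directory, f"{command_id}.json")
    with open(path, "w") as f:
        json.dump({"id": command_id, "state": state, "answer": answer}, f)
    return path


def load_command_states(directory):
    """Read every command record; this is what a periodic sync would report."""
    records = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".json"):
            with open(os.path.join(directory, name)) as f:
                records.append(json.load(f))
    return records


def on_ping_answer(directory):
    """Once the management server has answered the ping, drop finished commands."""
    removed = []
    for record in load_command_states(directory):
        if record["state"] in ("COMPLETED", "FAILED"):
            os.remove(os.path.join(directory, f"{record['id']}.json"))
            removed.append(record["id"])
    return removed


# Usage: one command still running, one finished.
workdir = tempfile.mkdtemp()
write_command_state(workdir, "cmd-1", "PROCESSING")
write_command_state(workdir, "cmd-2", "COMPLETED", answer={"result": True})
removed = on_ping_answer(workdir)
remaining = [r["id"] for r in load_command_states(workdir)]
```

Only the finished command is removed after the ping is answered, so an unfinished command survives an agent restart and can still be reported on the next sync.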
   
   For reconcile commands, during stop/start of mgmt server and agent
   
     * when agent is stopped/started, it updates the state (by agent) to INTERRUPTED in the JSON files and syncs with mgmt server
     * when mgmt server is stopped/started, it updates the state (by management) to INTERRUPTED in database
     * every minute, management server loads the reconcile commands and reconciles those in INTERRUPTED or TIMED_OUT state
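The stop/start handling and the periodic selection can be sketched like this in Python. The record layout (`state_by_agent`, `state_by_management`) and both function names are hypothetical, and the sketch assumes that only commands still in progress are flagged as INTERRUPTED, which the description does not state explicitly.

```python
# Assumption: only commands that are still in progress get flagged on stop/start.
IN_PROGRESS = {"CREATED", "STARTED", "PROCESSING"}
NEEDS_RECONCILE = {"INTERRUPTED", "TIMED_OUT"}


def mark_interrupted(records, side):
    """Flag in-progress commands as INTERRUPTED for one side ("agent" or "management")."""
    key = f"state_by_{side}"
    for record in records:
        if record.get(key) in IN_PROGRESS:
            record[key] = "INTERRUPTED"


def select_for_reconciliation(records):
    """Return ids of commands whose agent or management state calls for reconciliation."""
    return [
        record["id"]
        for record in records
        if record.get("state_by_agent") in NEEDS_RECONCILE
        or record.get("state_by_management") in NEEDS_RECONCILE
    ]


# Usage: the management server restarts while command 1 is still running.
records = [
    {"id": 1, "state_by_management": "CREATED", "state_by_agent": "PROCESSING"},
    {"id": 2, "state_by_management": "COMPLETED", "state_by_agent": "COMPLETED"},
    {"id": 3, "state_by_management": "TIMED_OUT", "state_by_agent": "PROCESSING"},
]
mark_interrupted(records, "management")
to_reconcile = select_for_reconciliation(records)
```

Here commands 1 and 3 are picked up by the next reconciliation run, while the completed command 2 is left alone.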
   
   #### Improvement on management server while waiting for the answer of reconcile commands
   
     * while waiting, every 60 seconds, it reads the state and answer from database
     * if state is INTERRUPTED or FAILED, it terminates the process
     * if state is COMPLETED and answer is null, it processes the answer as normal
     * **This fixes the intermittent connection failure between agent and management server**
     * when it times out, it updates the state (by mgmt) to TIMED_OUT in database for further reconciliation
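The polling behaviour above can be sketched as a small loop. This is a Python illustration only: `wait_for_answer`, `read_record` and `write_state` are hypothetical names standing in for the database access, the sketch covers just the database polling (not the normal path where the answer arrives over the agent connection), and it does not model the "answer is null" condition.

```python
import time


def wait_for_answer(read_record, write_state, timeout_s, poll_interval_s=60, sleep=time.sleep):
    """Poll the stored command record until it finishes or the wait times out.

    read_record() returns a dict with "state" and "answer";
    write_state(state) persists the management-side state.
    """
    waited = 0
    while waited < timeout_s:
        sleep(poll_interval_s)
        waited += poll_interval_s
        record = read_record()
        if record["state"] in ("INTERRUPTED", "FAILED"):
            return ("terminated", None)           # stop waiting, the command will not finish
        if record["state"] == "COMPLETED":
            return ("completed", record["answer"])  # process the stored answer as normal
    write_state("TIMED_OUT")                        # leave it for later reconciliation
    return ("timed_out", None)


# Usage with an in-memory stand-in for the database and a no-op sleep.
polls = iter([
    {"state": "PROCESSING", "answer": None},
    {"state": "COMPLETED", "answer": "ok"},
])
written = []
outcome = wait_for_answer(lambda: next(polls), written.append, timeout_s=300, sleep=lambda s: None)
stuck = wait_for_answer(
    lambda: {"state": "PROCESSING", "answer": None},
    written.append,
    timeout_s=120,
    sleep=lambda s: None,
)
```

In the first call the answer is picked up from the stored record even though it never arrived over the connection; in the second the wait expires and the state is written as TIMED_OUT.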
   
   
   #### Improvement on VM migration with/without volumes
   
     * If the operation fails with an operation timeout exception, it might be because of a connection failure between the source host and management server
     * check if the vm is Running on the destination host
     * if yes, consider the migration successful, and commit the network and volume changes
     * if not, destroy the vm on destination host and roll back the network/volume changes
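The decision taken after a migration timeout can be summarised as a small function. This is a hedged Python sketch: `resolve_timed_out_migration` and the action names it returns are hypothetical labels for the steps listed above, not identifiers from the PR.

```python
def resolve_timed_out_migration(vm_state_on_destination):
    """Decide the follow-up steps after a migration timed out on the management server.

    vm_state_on_destination is the VM state reported by the destination host.
    """
    if vm_state_on_destination == "Running":
        # The migration actually succeeded; only the answer was lost.
        return ["commit_network_changes", "commit_volume_changes"]
    # The migration did not complete: clean up the destination and undo the changes.
    return [
        "destroy_vm_on_destination",
        "rollback_network_changes",
        "rollback_volume_changes",
    ]
```

The point of the check is that a timeout alone no longer implies failure: the state on the destination host decides whether to commit or roll back.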
   
   #### Fixes after Volume migration
   
     * on NFS, fix new volumes left in Creating state after a failed migration
     * on PowerFlex, fix the read-only VM issue by checking via the ScaleIO gateway whether the volume has been migrated
   
   #### Improvement on Agent
   
     * Add MigrationCancelHook to terminate vm migration jobs when agent is 
disconnected or restarted
     * Add VolumeMigrationCancelHook to terminate block copy jobs when agent is 
disconnected or restarted
   
   ### Test results
   
   It has been tested by dev on NFS and PowerFlex.
   
   Refer to 
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Async+Agent+Command+Reconciliation#AsyncAgentCommandReconciliation-4.3Summaryoftestresults
   
   ### Types of changes
   
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [x] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [x] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   - [ ] build/CI
   - [ ] test (unit or integration test code)
   
   ### Feature/Enhancement Scale or Bug Severity
   
   #### Feature/Enhancement Scale
   
   - [ ] Major
   - [ ] Minor
   
   #### Bug Severity
   
   - [ ] BLOCKER
   - [ ] Critical
   - [ ] Major
   - [ ] Minor
   - [ ] Trivial
   
   ### Screenshots (if appropriate):
   
   ### How Has This Been Tested?
   
   #### How did you try to break this feature and the system with this change?
   
   


