I discovered the race condition bug related to CLOUDSTACK-8848 while
testing in our lab, and Daan started a PR
https://github.com/apache/cloudstack/pull/829 for discussion.
But that discussion turned out to be a dead end. Daan and I started a
debug session on Friday a week ago and discovered the real problem,
but it was unclear how to solve it. Daan was off from the next day on.
After another discussion with @anshul1886, which started at
https://github.com/apache/cloudstack/pull/829#issuecomment-141613687, he
led me to the solution I implemented in
https://github.com/apache/cloudstack/pull/885.
The related comment from anshul:
>From code it seems to be getting updated and DB also suggests that.
>It will not be updated if there is no power change for
>MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT. But that is to reduce DB
>transactions and will not create issues as it is updated if there is
>change in power state.
This means all the calculation of how to handle a missing power state is
based on a DB record that is outdated due to the DB transaction optimization.
My change makes sure that if we detect an outdated record, we reset the
counter so that we get new state updates.
In the worst case (if the VM is really missing), the handling of missing
state updates is postponed to the next missingStateReport. So to me,
this is really a safe way to fix this issue.
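
To illustrate the idea, here is a minimal sketch in Java. It is not the
code from the PR: the class, method and field names are hypothetical, and
the value of MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT and the grace period
are assumptions. It only models the two behaviours described above: DB
writes are skipped once the same state has been reported often enough,
and a missing report resets the counter instead of acting on a record
that may be stale only because of that optimization.

```java
import java.time.Duration;
import java.time.Instant;

public class MissingReportSketch {
    // Constant name taken from the quoted comment; the value 3 is an assumption.
    static final int MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT = 3;

    // Hypothetical stand-in for the power state columns of a VM's DB record.
    static class PowerRecord {
        String powerState;
        int sameStateCount = 0;
        Instant lastUpdate;

        PowerRecord(String powerState, Instant lastUpdate) {
            this.powerState = powerState;
            this.lastUpdate = lastUpdate;
        }
    }

    // A power report for the VM arrives. Returns true if the record was written.
    static boolean onPowerReport(PowerRecord r, String state, Instant now) {
        if (!state.equals(r.powerState)) {
            // A state change is always persisted.
            r.powerState = state;
            r.sameStateCount = 1;
            r.lastUpdate = now;
            return true;
        }
        if (r.sameStateCount < MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT) {
            r.sameStateCount++;
            r.lastUpdate = now;
            return true;
        }
        // Same state reported too often: skip the write to save a DB
        // transaction. From here on, lastUpdate goes stale.
        return false;
    }

    // The VM is absent from a report. Returns true if it should be handled
    // as missing now, false if the decision is postponed.
    static boolean onMissingReport(PowerRecord r, Instant now, Duration grace) {
        boolean looksOld = Duration.between(r.lastUpdate, now).compareTo(grace) > 0;
        if (!looksOld) {
            return false; // seen recently, still within the grace period
        }
        if (r.sameStateCount >= MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT) {
            // The timestamp may be old only because writes were skipped.
            // Reset the counter so the next power report refreshes the
            // record, and wait for the next missing report.
            r.sameStateCount = 0;
            return false;
        }
        return true; // record is trustworthy and old: the VM is really missing
    }
}
```

In this sketch, a VM that is still running sends another power report
after the reset, which refreshes the record, so the next missing report
finds it within the grace period. A VM that is really gone sends nothing,
so the next missing report finds an old record with a low counter and
handles it as missing.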
I patched our lab environment, where we discovered the race condition in
the first place, and we have not seen the bug happen again.
You can find the logs attached to the PR:
https://github.com/apache/cloudstack/pull/885
It isn't easy to test: I had to learn exactly when to start a VR
migration to hit the race condition. That is why I am writing this
message, to show you that I tested it under real-world conditions.
Yours
resmo