[BLOCKER] CLOUDSTACK-8848

Rene Moser Sat, 26 Sep 2015 05:12:45 -0700

I discovered the race condition bug related to CLOUDSTACK-8848 while testing in our lab, and Daan started a PR, https://github.com/apache/cloudstack/pull/829, for discussion.

But it turned out to be a dead-end discussion. Daan and I started a debug session on Friday a week ago and discovered the real problem, but it was unclear how it could be solved. Daan was off from the next day on.

After another discussion with @anshul1886, started at https://github.com/apache/cloudstack/pull/829#issuecomment-141613687, he brought me to the solution I created in https://github.com/apache/cloudstack/pull/885.


The related comment from anshul:

> From code it seems to be getting updated and DB also suggests that.

> It will not be updated if there is no power change for MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT. But that is to reduce DB transactions and will not create issues as it is updated if there is change in power state.

This means all the calculation of how to handle a missing power state relies on a DB record that is outdated due to the DB transaction optimization.

My change makes sure that if we detect an outdated record, we reset the counter so that we get new state updates.

In the worst case (if the VM is really missing), the handling of missing state updates is postponed to the next missingStateReport. So to me, this is a really safe way to fix this issue.
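
To make the idea concrete, here is a minimal sketch in plain Java. It is not the code from the PR: the class, the method names, the in-memory map standing in for the DB, and the value 3 for the constant are all assumptions for illustration. It only models the behaviour described above: writes are skipped after too many identical reports, so the record goes stale, and the missing-report handler resets the counter once and postpones its decision to the next report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the power state tracking; not CloudStack code.
class PowerStateSketch {
    // Assumed value for illustration only; the real constant is defined in CloudStack.
    static final int MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT = 3;

    static final class Record {
        String powerState;
        int sameStateCount;
        long lastUpdateMillis;
    }

    // Stand-in for the DB table holding the power state per VM.
    private final Map<Long, Record> db = new HashMap<>();

    /** Handles a power state report; returns true if the record was written. */
    boolean updatePowerState(long vmId, String state, long now) {
        Record r = db.computeIfAbsent(vmId, id -> new Record());
        if (state.equals(r.powerState)) {
            if (r.sameStateCount >= MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT) {
                return false; // write skipped to save a transaction: timestamp goes stale
            }
            r.sameStateCount++;
        } else {
            r.powerState = state;
            r.sameStateCount = 1;
        }
        r.lastUpdateMillis = now;
        return true;
    }

    /** Handles a missing report; returns true if the VM should be treated as missing. */
    boolean handleMissingReport(long vmId, long now, long graceMillis) {
        Record r = db.get(vmId);
        if (r == null) {
            return false; // nothing tracked for this VM
        }
        boolean outdated = now - r.lastUpdateMillis > graceMillis;
        if (outdated && r.sameStateCount >= MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT) {
            // The record may be stale only because writes were skipped.
            // Reset the counter so the next report is written, and postpone.
            r.sameStateCount = 0;
            return false;
        }
        return outdated; // still outdated after a reset: the VM is really missing
    }
}
```

In this model a running VM produces a fresh record right after the reset, so the next missing report finds nothing outdated. A VM that is really gone produces no update, and the next missing report acts on it, one cycle later than before.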

I patched our lab environment, where we discovered the race condition in the first place, and we haven't seen the bug happen again.

You can find the logs here https://github.com/apache/cloudstack/pull/885 attached to the PR.

It isn't easy to test; I had to learn when to start a VR migration to hit the race condition. That is why I am writing this message: to show you that I tested it under real-world conditions.


Yours
resmo
