Hi All,

I was alerted to this problem recently, and it's something that affects 
developers, so I want to bring it up.  It is a design principle in CloudStack 
that we do not make agent calls within database transactions.  The reason is 
that when you make a call to an external system, there's no guarantee of how 
long the call takes or even whether the call returns.  When a call takes a long 
time, several bad things can happen:
        - The MySQL DB connection held open by the DB transaction goes idle. 
Eventually, a timeout in MySQL hits, the connection is severed, and the 
transaction is rolled back.  By default, this timeout is 45 seconds, but it 
can be changed via a parameter in my.cnf.  The result is that the agent call 
can complete just fine while the DB transaction rolls back and its changes are 
undone.
        - The rows locked in that transaction before the remote agent call 
could be holding up foreign key checks that other transactions run against the 
table.  MySQL runs foreign key checks inside transactions to make sure the data 
modification and the checks are done atomically, so these checks must wait for 
the transactions holding the locks to complete.  Hence, an agent call that 
takes some time can severely slow down the system, particularly at scale.

We have two solutions to this:
        - Drive agent interactions with states.  There are many examples of 
this in VM, Volume, etc.
        - When the above cannot be done, acquire a lock in the lock table via a 
DAO method call.  Locks do not hold a DB transaction open and therefore will 
not run into this problem.  However, you are responsible for releasing locks. 
It used to be that if you forgot to release a lock, the @DB annotation 
automatically released it once the method went out of scope and asserted to 
alert the developer.  However, the @DB annotation has been removed in the 
Spring work, so I'm not sure if this is still done.
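
A rough sketch of the two approaches, for illustration only.  Every type here 
(VolumeState, VolumeStore, LockTable, Agent, VolumeOrchestrator) is 
hypothetical and not a CloudStack class; in-memory maps stand in for the 
database and the lock table, and each store method models one short, 
already-committed transaction.  The point is where the agent call sits: after 
the state change has committed, or under a lock-table lock that is always 
released in a finally block.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// All types below are hypothetical stand-ins, not CloudStack classes.
enum VolumeState { ALLOCATED, CREATING, READY, ERROR }

// Toy "database": each method models one short transaction that commits immediately.
class VolumeStore {
    private final Map<Long, VolumeState> states = new ConcurrentHashMap<>();

    void insert(long id) { states.put(id, VolumeState.ALLOCATED); }

    // Compare-and-set transition; fails if the row is not currently in 'from'.
    boolean transition(long id, VolumeState from, VolumeState to) {
        return states.replace(id, from, to);
    }

    VolumeState get(long id) { return states.get(id); }
}

// Toy lock table: holding a lock here keeps no DB transaction open.
class LockTable {
    private final Map<String, Boolean> held = new ConcurrentHashMap<>();

    boolean acquire(String key) { return held.putIfAbsent(key, Boolean.TRUE) == null; }
    void release(String key) { held.remove(key); }
    boolean isHeld(String key) { return held.containsKey(key); }
}

// Hypothetical agent; a real call may be slow or may never return.
interface Agent { boolean createVolume(long id); }

class VolumeOrchestrator {
    private final VolumeStore store;
    private final LockTable locks;
    private final Agent agent;

    VolumeOrchestrator(VolumeStore store, LockTable locks, Agent agent) {
        this.store = store;
        this.locks = locks;
        this.agent = agent;
    }

    // Solution 1: drive the agent interaction with states.
    boolean createWithStates(long id) {
        // Short txn #1: record the intent and commit before talking to the agent.
        if (!store.transition(id, VolumeState.ALLOCATED, VolumeState.CREATING)) {
            return false; // not in a state that allows creation
        }
        boolean ok = false;
        try {
            // No DB transaction is open here, so a slow agent holds no connection or row locks.
            ok = agent.createVolume(id);
        } finally {
            // Short txn #2: record the outcome, even if the agent call threw.
            store.transition(id, VolumeState.CREATING, ok ? VolumeState.READY : VolumeState.ERROR);
        }
        return ok;
    }

    // Solution 2: serialize work with a lock-table lock instead of a DB transaction.
    boolean createWithLock(long id) {
        String key = "volume-" + id;
        if (!locks.acquire(key)) {
            return false; // someone else is working on this volume
        }
        try {
            return agent.createVolume(id);
        } finally {
            locks.release(key); // the caller, not the framework, must release the lock
        }
    }
}
```

If the agent call hangs in the state-driven version, the volume is left in 
CREATING rather than holding a connection, which something like a cleanup 
task could detect and retry.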

This is a tough problem to solve because:
        1. It usually works fine during functional testing.  The problem 
surfaces during scale testing, often in unexpected places because of the 
foreign key check problem.
        2. It is difficult for developers to know whether a method they call 
within a transaction ends up making an agent call.

There is an assert in AgentManager to ensure that no DB transaction is open 
before making an agent call.  Apparently, since the conversion to Maven, no one 
actually runs with asserts enabled anymore.  Because of that, this design 
principle has been lost in CloudStack, and we're finding more and more calls 
being made in DB transactions.  To counter that, I decided to add a global 
parameter that turns the assert into an actual exception.  All developers are 
advised to set this global parameter, check.txn.before.sending.agent.commands, 
during their own testing to make sure their code doesn't make agent calls 
within transactions.
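
Conceptually, the parameter changes a check like the one below.  This is not 
the actual AgentManager code: TxnTracker, AgentCallGuard and the boolean that 
stands in for check.txn.before.sending.agent.commands are hypothetical names 
used only to show why an assert is easy to miss and an exception is not.

```java
import java.util.function.Supplier;

// Hypothetical per-thread counter of open DB transactions (not CloudStack's real tracking).
final class TxnTracker {
    private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

    static void begin() { DEPTH.set(DEPTH.get() + 1); }
    static void end() { DEPTH.set(Math.max(0, DEPTH.get() - 1)); }
    static boolean inTransaction() { return DEPTH.get() > 0; }
}

// Hypothetical wrapper around sending an agent command.
final class AgentCallGuard {
    // Stand-in for the value of check.txn.before.sending.agent.commands.
    private final boolean checkTxnBeforeSending;

    AgentCallGuard(boolean checkTxnBeforeSending) {
        this.checkTxnBeforeSending = checkTxnBeforeSending;
    }

    <T> T send(Supplier<T> agentCall) {
        if (TxnTracker.inTransaction()) {
            if (checkTxnBeforeSending) {
                // Fails every time, whether or not the JVM runs with -ea.
                throw new IllegalStateException("Agent command sent inside a DB transaction");
            }
            // Only fires when assertions are enabled (-ea), so it is silently skipped otherwise.
            assert false : "Agent command sent inside a DB transaction";
        }
        return agentCall.get();
    }
}
```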

--Alex
