On 18/06/18 13:39, Jay Pipes wrote:
+openstack-dev since I believe this is an issue with the Heat source code.

On 06/18/2018 11:19 AM, Spyros Trigazis wrote:
Hello list,

I'm quite easily hitting this exception [1] with Heat. The DB server is configured with 1000 max_connections and 1000 max_user_connections, and in the [database] section of heat.conf I have these values set:
max_pool_size = 22
max_overflow = 0
Full config attached.

I ended up with this configuration based on this formula:
num_heat_hosts=4
heat_api_workers=2
heat_api_cfn_workers=2
num_engine_workers=4
max_pool_size=22
max_overflow=0
num_heat_hosts * (max_pool_size + max_overflow) * (heat_api_workers + num_engine_workers + heat_api_cfn_workers) = 704
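
Or, spelling the same arithmetic out as a throwaway Python snippet (nothing Heat-specific, just the values from my config):

    num_heat_hosts = 4
    heat_api_workers = 2
    heat_api_cfn_workers = 2
    num_engine_workers = 4
    max_pool_size = 22
    max_overflow = 0

    workers_per_host = heat_api_workers + heat_api_cfn_workers + num_engine_workers
    conns_per_host = workers_per_host * (max_pool_size + max_overflow)
    print(conns_per_host)                   # 8 * 22 = 176
    print(num_heat_hosts * conns_per_host)  # 4 * 176 = 704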

What I have noticed is that the number of connections I expected from the above formula is not respected. Based on this formula, each node (every node runs heat-api, heat-api-cfn and heat-engine) should use at most 176 connections, but I see them reach even 400 connections.

Has anyone noticed a similar behavior?

Looking through the Heat code, I see that there are many methods in the /heat/db/sqlalchemy/api.py module that use a SQLAlchemy session but never actually call session.close() [1], which means the session will not be released back to the connection pool. That might be the reason why connections keep piling up.
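
To make the pattern concrete, here is a simplified sketch (plain SQLAlchemy against a throwaway sqlite DB; not an excerpt from heat/db/sqlalchemy/api.py):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = 'resource'
        id = Column(Integer, primary_key=True)
        status = Column(String(16))

    engine = create_engine('sqlite:///example.db')   # stand-in for the real DB
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    # The pattern I'm describing: a session is created and used, but close()
    # is never called, so its connection is never checked back into the pool.
    def resource_get_leaky(resource_id):
        session = Session()
        return session.query(Resource).filter_by(id=resource_id).first()

    # What I'd expect: give the connection back as soon as the work is done.
    def resource_get(resource_id):
        session = Session()
        try:
            return session.query(Resource).filter_by(id=resource_id).first()
        finally:
            session.close()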

Thanks for looking at this Jay! Maybe I can try to explain our strategy (such as it is) here and you can tell us what we should be doing instead :)

Essentially we have one session per 'task', that is used for the duration of the task. Back in the day a 'task' was the processing of an entire stack from start to finish, but with our new distributed architecture it's much more granular - either it's just the initial setup of a change to a stack, or it's the processing of a single resource. (This was a major design change, and it's quite possible that the assumptions we made at the beginning - and tbh I don't think we really knew what we were doing then either - are no longer valid.)

So, for example, when Heat sees an RPC request come in to update a resource, it starts a greenthread to handle it, which creates a database session that is stored in the request context. At the beginning of the request we load the data needed and update the status of the resource in the DB to IN_PROGRESS. Then we do whatever we need to do to update the resource (mostly this doesn't involve writing to the DB, but there are exceptions). Then we update the status to COMPLETE/FAILED, do some housekeeping stuff in the DB and send out RPC messages for any other work that needs to be done. IIUC that all uses the same session, although I don't know if it gets opened and closed multiple times in the process, and presumably the same object cache.
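
Very roughly, the shape of it is something like this (heavily simplified, and all of the names are invented for the sketch rather than being Heat's actual ones):

    import eventlet

    pool = eventlet.GreenPool(size=1000)

    class RequestContext(object):
        def __init__(self, session):
            # one session, created here, used for the whole 'task'
            self.session = session

    def handle_resource_update(ctxt, resource_id):
        session = ctxt.session
        print('load %s, set status IN_PROGRESS' % resource_id)   # DB write
        # ... do the actual work on the resource (usually no DB writes) ...
        print('set status COMPLETE/FAILED, housekeeping, follow-up RPC casts')

    def on_rpc_message(resource_id):
        # one greenthread per incoming RPC request
        ctxt = RequestContext(session=None)   # stand-in for a real DB session
        pool.spawn_n(handle_resource_update, ctxt, resource_id)

    on_rpc_message('my-resource')
    pool.waitall()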

Crucially, we *don't* have a way to retry if we're unable to connect to the database in any of those operations. If we can't connect at the beginning that'd be manageable, because we could (but currently don't) just send out a copy of the incoming RPC message to try again later. But once we've changed something about the resource, we *must* record that in the DB or Bad Stuff(TM) will happen.

The way we handled that, as Spyros pointed out, was to adjust the size of the overflow pool to match the size of the greenthread pool. This ensures that every 'task' is able to connect to the DB, because we won't take the message out of the RPC queue until there is a greenthread, and by extension a DB connection, available. This is infinitely preferable to finding out there are no connections available after you've already accepted the message (and oslo_messaging has an annoying 'feature' of acknowledging the message before it has even passed it to the application). It means stuff that we aren't able to handle yet queues up in the message queue, where it belongs, instead of in memory.
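
Concretely, the invariant we rely on is pool_size + max_overflow >= size of the greenthread pool, so that a free greenthread always implies a free DB connection. A toy sketch of that relationship (illustrative numbers and a sqlite stand-in; the pool_size/max_overflow kwargs are what heat.conf's [database] max_pool_size/max_overflow feed into via oslo.db):

    import eventlet
    import sqlalchemy
    from sqlalchemy.pool import QueuePool

    GREENPOOL_SIZE = 22   # illustrative; match it to your executor pool size

    engine = sqlalchemy.create_engine(
        'sqlite:///example.db',        # stand-in for the real DB URL
        poolclass=QueuePool,
        pool_size=GREENPOOL_SIZE,      # [database]/max_pool_size
        max_overflow=0,                # [database]/max_overflow
    )

    workers = eventlet.GreenPool(size=GREENPOOL_SIZE)

    def task(i):
        # with the sizes matched, a free greenthread implies a free connection
        with engine.connect() as conn:
            conn.execute(sqlalchemy.text('SELECT 1'))

    for i in range(100):
        workers.spawn_n(task, i)
    workers.waitall()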

History: https://bugs.launchpad.net/heat/+bug/1491185

Unfortunately you now have to tune the size of the threadpool to trade off making full use of your CPU against not opening too many DB connections. Nobody knows what the 'correct' tradeoff is, and even if we did, Heat can't really tune it automatically by default because at startup it only knows the number of worker processes on the local node; it can't tell how many other nodes are [going to be] running and opening connections to the same database. Plus the number of allowed DB connections becomes the bottleneck to how far you can scale the service out horizontally.

What is the canonical way of handling this kind of situation? Retry any DB operation where we can't get a connection, and close the session after every transaction?
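
For concreteness, this is roughly the kind of thing I'm asking about (a sketch using oslo.db's wrap_db_retry decorator, which retries on oslo.db's connection/deadlock exceptions; the table and function names are made up):

    from oslo_db import api as oslo_db_api
    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import sessionmaker

    engine = create_engine('sqlite:///example.db')   # stand-in for the real DB
    Session = sessionmaker(bind=engine)

    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True,
                               retry_on_deadlock=True)
    def set_resource_status(resource_id, status):
        session = Session()
        try:
            session.execute(text('UPDATE resource SET status = :s WHERE id = :i'),
                            {'s': status, 'i': resource_id})
            session.commit()
        finally:
            session.close()   # connection goes straight back to the pool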

Not sure if there's any setting in Heat that will fix this problem. Disabling connection pooling will likely not help since connections are not properly being closed and returned to the connection pool to begin with.

Best,
-jay

[1] Heat apparently doesn't use the oslo.db enginefacade transaction context managers either, which would help with this problem since the transaction context manager would take responsibility for calling session.flush()/close() appropriately.

https://github.com/openstack/oslo.db/blob/43af1cf08372006aa46d836ec45482dd4b5b5349/oslo_db/sqlalchemy/enginefacade.py#L626
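
For reference, the usage pattern looks roughly like this (a minimal sketch of the enginefacade API with a made-up function and a stand-in sqlite URL; not Heat code):

    import sqlalchemy
    from oslo_db.sqlalchemy import enginefacade

    # The context has to be "transaction-aware"; oslo.context's RequestContext
    # already is, this local class just keeps the sketch self-contained.
    @enginefacade.transaction_context_provider
    class Context(object):
        pass

    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection='sqlite:///example.db')  # stand-in URL

    @context_manager.writer
    def set_resource_status(context, resource_id, status):
        # context.session exists only for this transaction; flush/commit and
        # returning the connection to the pool happen when the function exits.
        context.session.execute(
            sqlalchemy.text('UPDATE resource SET status = :s WHERE id = :i'),
            {'s': status, 'i': resource_id})

    # usage: set_resource_status(Context(), 'my-resource', 'COMPLETE')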

Oh, I thought we did: https://review.openstack.org/330800

cheers,
Zane.

