On 09/25/2014 03:06 PM, Ben Nemec wrote:
> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>>
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
>
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
>
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 . That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired). The reason we cared about that is we
> thought there might be a bad interaction between our code and eventlet,
> so we wanted to know whether we were in fact locking the same semaphore
> twice or mistakenly creating two separate ones (as it turns out, neither
> - it was just the bad logging I mentioned earlier).
>
> So, given that I think everyone involved agrees that the double-locking
> thing was a cosmetic issue and not a functional one, we could probably
> just remove the created/using messages here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
> which would eliminate one message per lock operation without
> significantly impacting the debuggability of our code. Those messages
> only exist because we didn't trust what was going on at a lower level.
>
> It would also be nice to reduce the duplication of the messages. For
> example, we log that we acquired a semaphore here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L229
> and then log essentially the same information here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L270
>
> I don't think we should completely remove the logging in either place,
> because depending on how the code is called both can serve a useful,
> non-overlapping purpose, but we could add a parameter to lock() that
> tells it whether it needs to log the acquire/release, and then when it's
> called from synchronized() we just set that to False since we know that
> information is already going to be logged.
>
> Oh, I also see the logging is hosed for external_lock too. We need to
> fix that to work like the internal locks do, possibly with additional
> log message de-duplication (although we're actually dealing with two
> separate locks there, so that might not be wise...).
>
> In summary, my initial suggestions would be:
> 1) Remove the logging in internal_lock.
> 2) Eliminate the duplicate logging when using the synchronized decorator.
> 3) Fix the external_lock logging, since it's useless and misleading
> right now.
>
> I don't know that those cleanups would completely address your issues,
> but I think that, along with the already-merged timing patch I linked
> above, they would get us a long way toward less sucky lockutils logging,
> and would do so with minimal performance impact. If that proves to still
> be too much log spam we could look into more drastic measures later, of
> course.
>
> Thoughts?
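Just to check I'm reading the lock() parameter idea the same way you are,
it would be something roughly like this untested sketch, right? (The
log_acquire name and the stripped-down bodies below are made up for
illustration, not the actual oslo-incubator code.)

    import contextlib
    import logging
    import threading

    LOG = logging.getLogger(__name__)

    _semaphores = {}

    @contextlib.contextmanager
    def lock(name, log_acquire=True):
        # log_acquire is the hypothetical new parameter: callers that
        # already log the acquire/release themselves can pass False so the
        # same event does not show up in the log twice.
        sem = _semaphores.setdefault(name, threading.Semaphore())
        sem.acquire()
        if log_acquire:
            LOG.debug('Acquired semaphore "%s"', name)
        try:
            yield sem
        finally:
            if log_acquire:
                LOG.debug('Releasing semaphore "%s"', name)
            sem.release()

    def synchronized(name):
        def wrap(f):
            def inner(*args, **kwargs):
                # synchronized() emits its own got/released messages, so it
                # asks lock() to stay quiet and the duplicate DEBUG lines
                # disappear.
                with lock(name, log_acquire=False):
                    LOG.debug('Got semaphore / lock "%s"', f.__name__)
                    result = f(*args, **kwargs)
                LOG.debug('Semaphore / lock released "%s"', f.__name__)
                return result
            return inner
        return wrap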
Let me take a look. Because of where we are in the freeze I won't be able
to see much of the real impact of this on Nova yet, but it sounds at
least partially promising. I do really want the live warnings on long
locks if we can get them. Only reporting at the end of a situation means
you can't actually figure out that you are deadlocked.

	-Sean

--
Sean Dague
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
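As a purely illustrative aside, a "live warning on a long-held lock"
could look roughly like the sketch below. None of this is actual oslo
code; the WarnWhileHeld name, the threading.Timer approach, and the
30-second default are all invented for the example.

    import logging
    import threading
    import time

    LOG = logging.getLogger(__name__)

    class WarnWhileHeld(object):
        """Wrap a lock and complain *while* it is still held too long,
        instead of only reporting the hold time after release."""

        def __init__(self, name, lock, warn_after=30.0):
            self.name = name
            self.lock = lock
            self.warn_after = warn_after
            self._timer = None
            self._acquired_at = None

        def _warn(self):
            held = time.time() - self._acquired_at
            LOG.warning('Lock "%s" still held after %.1fs', self.name, held)

        def __enter__(self):
            self.lock.acquire()
            self._acquired_at = time.time()
            # A daemon timer fires while the lock is held, so a deadlocked
            # process still gets a warning into the log.
            self._timer = threading.Timer(self.warn_after, self._warn)
            self._timer.daemon = True
            self._timer.start()
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self._timer.cancel()
            self.lock.release()
            return False

    # usage:
    #   with WarnWhileHeld("iptables", threading.Lock()):
    #       do_something_slow()

A timer that re-arms until release would be needed for truly stuck locks,
but even a single warning fires while the problem is happening rather
than after it clears.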