On 09/25/2014 03:06 PM, Ben Nemec wrote:
> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>>
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
>
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
>
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 . That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired). The reason we cared about that is we
> thought there might be a bad interaction between our code and eventlet,
> so we wanted to know whether we were in fact locking the same semaphore
> twice or mistakenly creating two separate ones (as it turns out, neither
> - it was just the bad logging I mentioned earlier).
>
> So, given that I think everyone involved agrees that the double-locking
> thing was a cosmetic issue and not a functional one, we could probably
> just remove the created/using messages here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
> which would eliminate one message per lock operation without
> significantly impacting the debuggability of our code. Those messages
> only exist because we didn't trust what was going on at a lower level.
>
> It would also be nice to reduce the duplication of the messages. For
> example, we log that we acquired a semaphore here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L229
> and then log essentially the same information here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L270
>
> I don't think we should completely remove the logging in either place,
> because depending on how the code is called both can serve a useful,
> non-overlapping purpose, but we could add a parameter to lock() that
> tells it whether it needs to log the acquire/release, and then when it's
> called from synchronized() we just set that to False since we know that
> information is already going to be logged.
>
> Oh, I also see the logging is hosed for external_lock too. We need to
> fix that to work like the internal locks do, possibly with additional
> log message de-duplication (although we're actually dealing with two
> separate locks there, so that might not be wise...).
>
> In summary, my initial suggestions would be:
> 1) Remove the logging in internal_lock.
> 2) Eliminate the duplicate logging when using the synchronized decorator.
> 3) Fix the external_lock logging, since it's useless and misleading
> right now.
>
> I don't know that those cleanups would completely address your issues,
> but I think that, along with the already-merged timing patch I linked
> above, they would get us a long way toward less sucky lockutils logging,
> and would do so with minimal performance impact. If that proves to still
> be too much log spam we could look into more drastic measures later, of
> course.
>
> Thoughts?
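Just to check I'm reading the lock() parameter idea the same way you are,
it would be something roughly like this untested sketch, right? (The
log_acquire name and the stripped-down bodies below are made up for
illustration, not the actual oslo-incubator code.)

    import contextlib
    import logging
    import threading

    LOG = logging.getLogger(__name__)

    _semaphores = {}

    @contextlib.contextmanager
    def lock(name, log_acquire=True):
        # log_acquire is the hypothetical new parameter: callers that
        # already log the acquire/release themselves can pass False so the
        # same event does not show up in the log twice.
        sem = _semaphores.setdefault(name, threading.Semaphore())
        sem.acquire()
        if log_acquire:
            LOG.debug('Acquired semaphore "%s"', name)
        try:
            yield sem
        finally:
            if log_acquire:
                LOG.debug('Releasing semaphore "%s"', name)
            sem.release()

    def synchronized(name):
        def wrap(f):
            def inner(*args, **kwargs):
                # synchronized() emits its own got/released messages, so it
                # asks lock() to stay quiet and the duplicate DEBUG lines
                # disappear.
                with lock(name, log_acquire=False):
                    LOG.debug('Got semaphore / lock "%s"', f.__name__)
                    result = f(*args, **kwargs)
                LOG.debug('Semaphore / lock released "%s"', f.__name__)
                return result
            return inner
        return wrap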
Let me take a look. Because of where we are in the freeze I won't be able
to see much of the real impact of this on Nova yet, but it sounds at
least partially promising. I do really want the live warnings on long
locks if we can get them. Only reporting at the end of a situation means
you can't actually figure out that you are deadlocked.

	-Sean

--
Sean Dague
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
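As a purely illustrative aside, a "live warning on a long-held lock"
could look roughly like the sketch below. None of this is actual oslo
code; the WarnWhileHeld name, the threading.Timer approach, and the
30-second default are all invented for the example.

    import logging
    import threading
    import time

    LOG = logging.getLogger(__name__)

    class WarnWhileHeld(object):
        """Wrap a lock and complain *while* it is still held too long,
        instead of only reporting the hold time after release."""

        def __init__(self, name, lock, warn_after=30.0):
            self.name = name
            self.lock = lock
            self.warn_after = warn_after
            self._timer = None
            self._acquired_at = None

        def _warn(self):
            held = time.time() - self._acquired_at
            LOG.warning('Lock "%s" still held after %.1fs', self.name, held)

        def __enter__(self):
            self.lock.acquire()
            self._acquired_at = time.time()
            # A daemon timer fires while the lock is held, so a deadlocked
            # process still gets a warning into the log.
            self._timer = threading.Timer(self.warn_after, self._warn)
            self._timer.daemon = True
            self._timer.start()
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self._timer.cancel()
            self.lock.release()
            return False

    # usage:
    #   with WarnWhileHeld("iptables", threading.Lock()):
    #       do_something_slow()

A timer that re-arms until release would be needed for truly stuck locks,
but even a single warning fires while the problem is happening rather
than after it clears.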