Package: cfengine2
Version: 2.2.10-5
Severity: normal

Hi,

So, first let me preface this with the understanding that cfengine2 is very old code and not being maintained anymore, so I'll understand if you don't accept the patches I'm proposing. However, since it isn't changing much, cfengine2 is more more or less stable and works as expected. Also, we have *a lot* cfinputs rules and it would be far more costly to switch to another system right now than to simply fix this bug. The fact that Debian still has a package for cfengine2 makes me think that enough other folks are in this boat that I thought it might be worth posting a bug on our experiences.


Ok, so the bug:

1) A long series of preceeding actionsequences.

In our case, just a long set of copy rules.

For instance:

control:
        actionsequence = ( copy links )
copy:
        somehost::
# stuff that takes at least 1 minute

Turns out that the main reason they're slow is due to poor bdb based lock implementation that causes an fdatasync() for every GetLock()/ReleaseLock(), which may be in the thousands.

There's also some time spent waiting for TCP buffers to fill before exchanging small stat info for already up to date files.


2) A colliding lock name for certain actions.

In our case it was for several long symlink rules:

links:
        somehost::
                /usr/share/wordpress/wp-content/plugins/advanced-custom-fields 
-> /home/vhosts-wp/plugins.dev/advanced-custom-fields
                /usr/share/wordpress/wp-content/plugins/advanced-custom-fields-pro 
-> /home/vhosts-wp/plugins.dev/advanced-custom-fields-pro
...

These all get the following lock name:
_usr_share_wordpress_wp_content_plugins_advanced_c__home_vhosts_wp_plugins_dev_advanced_custom_fields

Turns out that MakeLinks() in do.c, when attempting to compose a unique lock name for the rule, truncates both the LHS and RHS and combines them, before handing it off to GetLock() in locks.c.

Well, GetLock() does it's own quasi checksum based lock name deduplication, so the initial lock name truncation in MakeLinks() actually prevents GetLock() from doing the right thing.

Turns out that the copy action processes is also affected by this same lock name truncation issue.


3) Since the copy action (in our example) takes more than 1 minute, the check to GetLastLock() in GetLock() fails due to comparing the timestamp of the first completed colliding link action with the start of the full DoTree() cfagent run, rather than the first time any link action actually took place.



To address these things (and document them further), I've made a github repo here:
https://github.com/bpkroth/cfengine2/issues/1 (correctness)
https://github.com/bpkroth/cfengine2/issues/2 (performance optimizations)


Addressing the performance of the locking by batching up fdatasync() calls to the cfengine_lock_db [1] helps with all of these issues (in our situation) and reduces the runtime of cfagent by 50% (in our situation), however it's ultimately just a workaround.

Adding the TCP_NODELAY flag also helps a bit with the stat delays, but it's a relatively minor optimization compared with the fdatasyncs [2].

Changing the timestamp checking to be based on the action sequence also mitigates the issue, though the changes were too invasive to be complete [3].

Simply not truncating the lock name (at least as much) and filling the available lock name buffer size to allow GetLock() to do its own checksum uniquifying also solves this particular issue more directly [4].

[1] 
https://github.com/bpkroth/cfengine2/commit/5543c87f5fb102cecbe60b3d2d6b258c9d8f0baa
[2] 
https://github.com/bpkroth/cfengine2/commit/a2807337e37ecad20f6cf65a4b3bdc0ac1b29651
[3] 
https://github.com/bpkroth/cfengine2/commit/f346e8153bbe325924d751bd3ccdd830b88f2671
[4] 
https://github.com/bpkroth/cfengine2/commit/f6658d41c7d0df76088adace1760aa9d9743403f


Anyways, here are two other branches that also turn these patches into quilt patches for Debian packaging:
https://github.com/bpkroth/cfengine2/tree/debian/wheezy/cfengine2_2.2.10-5+patches
https://github.com/bpkroth/cfengine2/tree/debian/jessie/cfengine2_2.2.10-5+patches
(they're actually the same given that jessie and wheezy have identical source packages for cfengine2)


We've been running with them widespread for about a week without any issues.


Let me know if you have any questions or comment.

Thanks for your time.

Cheers,
Brian


*** Reporter, please consider answering these questions, where appropriate ***

  * What led up to the situation?
  * What exactly did you do (or not do) that was effective (or
    ineffective)?
  * What was the outcome of this action?
  * What outcome did you expect instead?

*** End of the template - remove these template lines ***


-- System Information:
Debian Release: 8.5
 APT prefers stable
 APT policy: (500, 'stable'), (120, 'testing'), (110, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 3.16.0-4-686-pae (SMP w/1 CPU core)
Locale: LANG=en_US.utf8, LC_CTYPE=en_US.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages cfengine2 depends on:
ii  debconf [debconf-2.0]  1.5.56
ii  libc6                  2.19-18+deb8u4
ii  libdb5.3               5.3.28-9
ii  libssl1.0.0            1.0.1t-1+deb8u2

cfengine2 recommends no packages.

cfengine2 suggests no packages.

-- Configuration Files:
/etc/default/cfengine2 changed [not included]

-- debconf information excluded

Reply via email to