Could it be this bug? https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
Cheers,
Kenny.

On Tue, 2015-05-12 at 16:44 -0400, Leonard J. Peirce wrote:
> For several years we have run Kerberos (built from source) on Solaris 9/10.
> We often dump our production master, copy the file over to our test master,
> and reload it with kdb5_util. In this environment everything has always
> worked as expected. A full resync of our test slave is immediately initiated,
> the database on the test master is dumped, kprop'd to the slave, and reloaded
> on the slave. This takes just a few minutes to complete.
>
> We're in the process of moving our Kerberos environment from Solaris
> to CentOS Linux. We're running two CentOS 6.6 Linux VMs with Kerberos
> RPMs (the latest available from standard repositories) of
>
>     krb5-libs-1.10.3-37.el6_6.x86_64
>     krb5-workstation-1.10.3-37.el6_6.x86_64
>     krb5-server-1.10.3-37.el6_6.x86_64
>
> For reference:
>
>     h.test.admin.private is the master (172.30.110.45)
>     r.test.admin.private is the slave (172.30.110.46)
>
> And a few other items of note:
>
>     Both hosts are running NTP so the clocks are in sync
>     iptables is disabled at boot
>     MTUs on the network interfaces are default at 1500
>
> After configuring things I initialize the master (from a dump of our
> production master) using kdb5_util, start krb5kdc and kadmin on the master
> and then start kpropd on the slave, expecting a full resync to start and
> the slave to be initialized. But I'm seeing strange and erratic behaviour.
>
> In syslog, I often see this from kadmind on the master:
>
>     starting
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>     Unspecified GSS failure. Minor code may provide more information
>     Clock skew too great
>     GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>     Unspecified GSS failure. Minor code may provide more information
>     Key version number for principal in key table is incorrect
>     GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>     Unspecified GSS failure. Minor code may provide more information
>     Clock skew too great
>     GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>     Unspecified GSS failure. Minor code may provide more information
>     Key version number for principal in key table is incorrect
>     GSS-API error strings complete.
>     closing down fd 32
>     closing down fd 33
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>     Unspecified GSS failure. Minor code may provide more information
>     Clock skew too great
>     GSS-API error strings complete.
>     Authentication attempt failed: 172.30.110.46, GSS-API error strings are:
>     Unspecified GSS failure. Minor code may provide more information
>     Key version number for principal in key table is incorrect
>     GSS-API error strings complete.
>     closing down fd 32
>     Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming
>       SerialNo=0; Outgoing SerialNo=N/A, success,
>       client=kiprop/r.test.admin.priv...@wmich.edu,
>       service=kiprop/h.test.admin.priv...@wmich.edu, addr=172.30.110.46
>     Request: iprop_full_resync_1, spawned resync process 2944,
>       client=kiprop/r.test.admin.priv...@wmich.edu,
>       service=kiprop/h.test.admin.priv...@wmich.edu, addr=172.30.110.46
>
> Sometimes I just see the Request: iprop_get_updates line, and no complaints
> about authentication failures or clock skew.
>
> On the slave I see syslog entries showing repeated problems with kpropd
> connecting to the master:
>
>     /usr/sbin/kpropd: GSS-API (or Kerberos) error while initializing
>       /usr/sbin/kpropd interface, retrying
>
> I start kpropd with -d -S and use strace on it and I see that it repeatedly
> opens /dev/urandom and reads from it just before I see the above error.
>
> When running kpropd in standalone mode and with debug enabled I see the
> syslog entry above and eventually I see
>
>     do_standalone: setting resync alarm to 10
>     waiting for a kprop connection
>     doit: setting resync alarm to 5s
>     Connection from h.test.admin.private
>     krb5_recvauth(6, kprop5_01, host/r.test.admin.priv...@wmich.edu, ...)
>     authenticated client: host/h.test.admin.priv...@wmich.edu (etype == DES
>       cbc mode with CRC-32)
>     /usr/sbin/kpropd: Connection reset by peer while reading database block
>       starting at offset 92340224
>     Full resync was unsuccessful
>
> Unfortunately, the resync was not successful. Often (but not always), when
> kprop -f starts on the master, the slave_datatrans file will *partially*
> copy to the slave, often 60-90% of the data, before the connection hangs
> and then times out. I have run strace on both the kprop and kpropd processes
> while they are connected. The kprop on the master hangs during a write()
> for several minutes and then eventually times out:
>
>     Process 3183 attached - interrupt to quit
>     writev(4,
>       [{"\240\37\26+[\16\247\tC\21\6/\243\217\340\0231f\362\245\3\214$\246\227\231N\265\351\366\1\233"...,
>       22106}], 1) = -1 ETIMEDOUT (Connection timed out)
>     write(2, "kprop", 5) = 5
>     write(2, ": ", 2) = 2
>     write(2, "Connection timed out", 20) = 20
>     write(2, " ", 1) = 1
>     write(2, "while sending database block sta"..., 49) = 49
>     ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffadfee150) = -1 ENOTTY
>       (Inappropriate ioctl for device)
>     write(2, "\n", 1) = 1
>     exit_group(1) = ?
>     Process 3183 detached
>
> Occasionally, when left alone, the slave will be successfully sync'd.
> Sometimes this happens in minutes, even hours, and, on a few occasions,
> overnight. When successful, the syslog entries on both the master and
> slave look nominal. Once sync'd, incremental propagation works fine.
>
> In my debugging attempts, I tried starting kpropd with
>
>     kpropd -S -d -P NNN
>
> and then attempted to run
>
>     kprop -f slave_datatrans -P NNN r.test.admin.private
>
> on the master but kpropd on the slave doesn't appear to be listening
> on port NNN. Am I misunderstanding something?
>
> In syslog on the slave, on an aborted resync I sometimes see
>
>     May 6 17:18:50 r.test.admin.private kpropd[7078]: /usr/sbin/kpropd:
>       Connection reset by peer while reading database block
>       starting at offset 0
>     May 6 17:18:50 r.test.admin.private kpropd[7032]: kpropd: Full resync,
>       invalid return.
>
> Thinking that perhaps this was a problem with the Kerberos that is available
> in the repositories for CentOS, I decided to download the latest version
> (1.13.2) and build it from source. It built cleanly and passed the tests but
> when I tried it I noticed the same behaviour as the Kerberos from the RPMs.
>
> Has anyone else seen anything like this?
>
> TIA...
> __
> Leonard J. Peirce
> Western Michigan University
> Office of Information Technology
> Kalamazoo, MI 49008
> ________________________________________________
> Kerberos mailing list Kerberos@mit.edu
> https://mailman.mit.edu/mailman/listinfo/kerberos
>

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.