On Mon, Mar 15, 2010 at 12:57:08PM -0600, Rich Megginson wrote: > Christopher Wood wrote: > > On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote: > > > >> Christopher Wood wrote: > >> > >>> On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote: > >>> > >>> > >>>> Christopher Wood wrote: > >>>> > >>>> > >>>>> I'm just getting started with 389 Directory Server (at work), and I've > >>>>> run into an issue that I'm not certain how to troubleshoot. I would > >>>>> greatly appreciate any assistance or tips you could offer, especially > >>>>> on where to look to see what's failing. > >>>>> > >>>>> Also, I apologize in advance for changing strings related to my > >>>>> employer's directory names and such, as I'm not comfortable with > >>>>> leaking that level information to a public list. > >>>>> > >>>>> > >>>>> > >>>> As well you should be - you should always obscure sensitive information > >>>> like this. > >>>> > >>>> > >>>>> Overview: > >>>>> > >>>>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other > >>>>> subtrees are fine. > >>>>> > >>>>> > >>>>> Top-Level Questions: > >>>>> > >>>>> 1) How do I stop ns-slapd from crashing? > >>>>> > >>>>> > >>>>> > >>>> Good question. > >>>> > >>>> > >>>>> 2) How do I figure out what precisely is causing the crash? (With > >>>>> various levels of debug logging I get the same log entry.) > >>>>> > >>>>> > >>>>> > >>>> You've already used the TRACE level (1) for logging - that's as verbose > >>>> as it gets for this particular operation. Next step would be to try to > >>>> get a core file. > >>>> > >>>> > >>>>> 3) Is it possible to simply import my initialization ldif without > >>>>> duplication checks? > >>>>> > >>>>> > >>>>> > >>>> No. > >>>> > >>>> > >>>>> Background: > >>>>> > >>>>> At work we have NDS 6.2 (single master on a physical server, virtual > >>>>> machine slaves), and would like to move our directories intact to a 389 > >>>>> 2.6 installation via replication. > >>>>> > >>>>> > >>>>> > >>>> What platform/OS? 32-bit or 64-bit? By NDS 6.2 I'm assuming you mean > >>>> Netscape Directory Server - by 2.6 I'm assuming you mean 1.2.6.a1 (a2 > >>>> should be hitting the mirrors tomorrow). > >>>> > >>>> > >>> 32 bit > >>> > >>> > >>> > >>>>> I already have replicated several of our NDS 6.2 subtrees to 389 2.6 > >>>>> with no difficulties. > >>>>> > >>>>> I compiled our 389 installation from the source packages downloaded > >>>>> from http://directory.fedoraproject.org/wiki/Source. > >>>>> > >>>>> > >>>> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2? > >>>> > >>>> > >>> I used 1.2.6.a1 to compile originally and produce core files to answer > >>> your questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the > >>> same version when trying to initially reproduce something. > >>> > >>> > >>> > >>>> What compiler flags did you use? > >>>> > >>>> > >>> The makefile that came out of ./configure had these: > >>> > >>> CCASFLAGS = -g -O2 > >>> CFLAGS = -g -O2 > >>> CXXFLAGS = -g -O2 > >>> > >>> For the plain debug build I edited that to insert these and rebuilt with > >>> make, make install: > >>> > >>> CCASFLAGS = -g > >>> CFLAGS = -g > >>> CXXFLAGS = -g > >>> > >>> (Fair warning that I'm not a programmer, so I'm not entirely sure doing > >>> that was right.) > >>> > >>> > >>> > >> Note that you don't have to edit the Makefile - you can do a make > >> distclean, then run configure like this: > >> > CFLAGS="-g" /path/to/configure ... > >> > make > >> > >> But that looks right, anyway. Note that if you change the flags like > >> this by editing the makefile, you will have to do a make clean to remove > >> the old object files, so that they will be rebuilt with the new flags. > >> > > > > I did a make clean before rebuilding in this case. For my later debug > > builds I've used your step to re-run configure. > > > > > >>>> Do you have a core file? If so, try using gdb > >>>> gdb /path/to/ns-slapd /path/to/core.pid > >>>> once in gdb, type the "where" command > >>>> (gdb) where > >>>> > >>>> > >>> The original crash didn't produce a core file, but I could get one by > >>> attaching gdb later, to both the original build and a debug build. > >>> > >>> > >>> > >>>>> The underlying platform is: > >>>>> > >>>>> $ uname -a > >>>>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 > >>>>> EDT 2009 i686 i686 i386 GNU/Linux > >>>>> $ cat /etc/redhat-release > >>>>> CentOS release 5.4 (Final) > >>>>> > >>>>> $ free > >>>>> total used free shared buffers > >>>>> cached > >>>>> Mem: 3894000 1336012 2557988 0 144944 > >>>>> 1004716 > >>>>> -/+ buffers/cache: 186352 3707648 > >>>>> Swap: 2031608 0 2031608 > >>>>> > >>>>> > >>>>> Procedure To Crash 389's ns-slapd: > >>>>> > >>>>> a) In the NDS 6.2 admin console, create a new replication agreement for > >>>>> the "o=This Big Net" subtree, and choose to "Create consumer > >>>>> initialization file". > >>>>> > >>>>> b) Copy the file to the 389 server. > >>>>> > >>>>> c) In the 389 2.6 admin console for the Directory Server, in the > >>>>> Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and > >>>>> choose "Initialize Database". Use the ldif file copied over. > >>>>> > >>>>> The ns-slapd process crashes, and I always get this in > >>>>> /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines: > >>>>> > >>>>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file > >>>>> "/home/cwood/tbn.ldif" > >>>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck > >>>>> > >>>>> > >>>>> Other Details: > >>>>> > >>>>> > >>>>> I found two bugs with the str2entry_dupcheck string in it, but they > >>>>> don't seem pertinent: > >>>>> > >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=548115 > >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=243488 > >>>>> > >>>>> > >>>>> This says that str2entry_dupcheck could be about two things: > >>>>> > >>>>> http://docs.sun.com/source/816-6699-10/ax_errcd.html > >>>>> > >>>>> "While attempting to convert a string entry to an LDAP entry, the > >>>>> server found that the entry has no DN." > >>>>> > >>>>> "The server failed to add a value to the value tree." > >>>>> > >>>>> (But this is an exported database from NDS 6.2, and I'm fairly sure, > >>>>> without reading them all, that every entry will have a DN.) > >>>>> > >>>>> > >>>>> > >>>> The log message > >>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck > >>>> > >>>> is just trace information, not a report of a problem or error. > >>>> > >>>> Does the crash happen almost immediately? Or does it take a while? If > >>>> the problem happens quickly, it would be worthwhile to scan the first > >>>> couple of dozen entries looking for things like - entries without a DN - > >>>> attributes without a value > >>>> > >>>> > >>> I checked, and I couldn't see any data errors of this type. > >>> > >>> > >>> > >>>>> If 389 is trying to check for duplicate entries, perhaps there are > >>>>> simply too many DNs? > >>>>> > >>>>> $ grep '^dn:' tbn.ldif | wc -l > >>>>> 636985 > >>>>> $ ls -lh acc.ldif > >>>>> -rw-r--r-- 1 cwood cwood 755M Mar 3 11:24 tbn.ldif > >>>>> > >>>>> > >>>>> > >>>> No. The server should be able to handle this much data easily. And it > >>>> must check for duplicate entries. > >>>> > >>>> > >>>>> Per the instructions here: > >>>>> > >>>>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting > >>>>> > >>>>> I set my debug logging first to 24579: > >>>>> > >>>>> 1 Trace function calls > >>>>> 2 Debug packet handling > >>>>> 8192 Replication debugging > >>>>> 16384 Critical messages > >>>>> > >>>>> Then for the next try at reading logs I set it to 90115, the above plus: > >>>>> > >>>>> 65536 Plug-in debugging > >>>>> > >>>>> However, every time the log ended with the same set of lines noted > >>>>> above. > >>>>> > >>>>> > >>>>> > >>>> 1 Trace is really the best for this particular problem, and as you have > >>>> found it is limited for this particular problem. > >>>> > >>>> I think the next step would be to build the server with full debugging > >>>> information (use -g and omit -O2 or any other -Ox) and get a stack trace > >>>> with full debug information. > >>>> > >>>> > >>> After recompiling I got different results. Still a crash, but after a > >>> different sequence of actions. This time I didn't create the other > >>> subtrees, I went straight for the "TBN Net" subtree. I couldn't reproduce > >>> the immediate crash, but it did crash after I tried to stop the import > >>> (rather than watch the errors noted below). > >>> > >>> I went back to my original compile (-g -O2), and also couldn't reproduce > >>> the instant crash under gdb. Here's the "where" output for that: > >>> > >>> (gdb) where > >>> #0 0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0 > >>> #1 0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6 > >>> #2 0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so > >>> #3 0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at > >>> ldap/servers/slapd/back-ldbm/cache.c:585 > >>> #4 0x00b46d9d in import_main_offline (arg=0xab704268) at > >>> ldap/servers/slapd/back-ldbm/import.c:1300 > >>> #5 0x00b47e4d in import_main (arg=0xab704268) at > >>> ldap/servers/slapd/back-ldbm/import.c:1386 > >>> #6 0x001fe6ed in ?? () from /usr/lib/libnspr4.so > >>> #7 0x0022f5ab in start_thread () from /lib/libpthread.so.0 > >>> #8 0x00d84cfe in clone () from /lib/libc.so.6 > >>> > >>> Everything below is for my new compile, with -g but no -O2. > >>> > >>> I used this to attach gdb to the processes: > >>> > >>> gdb /opt/dirsrv/sbin/ns-slapd 12799 > >>> > >>> I got a large number of these in the error log. Possibly this is one > >>> source of my problem, needing to move the schema from Netscape Directory > >>> Server 6.2 before I move the subtrees. > >>> > >>> [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, > >>> ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown > >>> object class "ldapauthvirtuseroc" > >>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping > >>> entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, > >>> ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899 of > >>> file "/home/cwood/tbn.ldif" > >>> > >>> > >> This is definitely a source of your problems. You must update the > >> schema before you can import the data. Although the server should not > >> crash either. > >> > > > > I was having some trouble updating just the schema, so I grossly hacked up > > migrate-ds.pl, DSMigration::fixup99user and DSMigration::migrateSchema. > > With that I managed to get the schema from the old Netscape Directory > > Server 6.2 99user.ldif to a new 389 1.2.5 98mycompany.ldif. The new schema > > showed through in the admin gui, so I take it that I was successful in > > getting my schema updated. > > > > I've been tearing my hair out trying to reproduce the crash and/or get my > > data imported, even with the schema added. (I eyeballed the schema and a > > failing directory entry, and they seemed to match to the naked eye.) > Even though the server should not crash, I believe that if we could get > the syntax problem taken care of, you could import your data without a > crash (according to the stack trace below). > > The message appears to be entry \"ldapAuthLogin=username, > ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=This Realm\" which > violates attribute syntax, ending line 4701511 of file \"/home/cwood/dump.ld > > Would it be possible for you to post excerpts from that file at or > around that line/entry, along with the definition of the attributes used?
Here's the entry, data munged: # entry-id: 1156136 dn: ldapAuthLogin=cwood, ldapAuthRealmName=1.mycompany.com, ou=UsersByRealm, o=maintree modifyTimestamp: 20090611221103Z modifiersName: cn=ldapauth provisioner,cn=config,o=acc global net ldapAuthLogin: cwood ldapAuthRealmName: 1.mycompany.com objectClass: top objectClass: ldapAuthVirtUserOC ldapAuthMailHost: mail.mycompany.com ldapAuthRadiusDNSProfile: dns one ldapAuthBandwidthLimit: x503 ldapAuthForwarding: ldapAuthControlCodePtr: 1234567 creatorsName: cn=ldapauth provisioner,cn=config,o=maintree createTimestamp: 20070716131008Z nsUniqueId: c5760388-1dd111b2-8022c31d-45260000 I twigged to your point about definition of attributes, and tried to add this via ldapmodify (now that I have the right schema). I got this error from 389: adding new entry "ldapAuthLogin=cwood, ldapAuthRealmName=1.mycompany.com, ou=UsersByRealm, o=maintree" ldapmodify: Invalid syntax (21) additional info: ldapAuthControlCodePtr: value #0 invalid per syntax The interesting thing about ldapAuthControlCodePtr is that the value is alway an integer, but in our Netscape Directory Server 6.2 installation it's in the schema as a DN. This has of course worked fine for a number of years, but 389 seems to want a DN to actually be a DN in an ldapmodify (which is just fine by me). I changed the ldapAuthControlCodePtr attribute to be an integer in 389, and I was able to ldapmodify the entry into 389 and find it via ldapsearch. > > Just when I gave up on trying to get it to crash, it did so once more (and > > not since then). However, I did get this, not sure if it's of any use: > > > > Program received signal SIGABRT, Aborted. > > [Switching to Thread 0x9bdd9b90 (LWP 10174)] > > 0x003cc402 in __kernel_vsyscall () > > (gdb) where > > #0 0x003cc402 in __kernel_vsyscall () > > #1 0x001a3df0 in raise () from /lib/libc.so.6 > > #2 0x001a5701 in abort () from /lib/libc.so.6 > > #3 0x001dc28b in __libc_message () from /lib/libc.so.6 > > #4 0x001e4595 in _int_free () from /lib/libc.so.6 > > #5 0x001e66ac in _int_realloc () from /lib/libc.so.6 > > #6 0x001e7276 in realloc () from /lib/libc.so.6 > > #7 0x00da781e in slapi_ch_realloc ( > > block=0x8b886170 "p\003��p\003��ha\210\213ha\210\213g entry > > \"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, > > o=This Realm\" which violates attribute syntax, ending line 4701511 of file > > \"/home/cwood/dump.ldi"..., > > size=6752) at ldap/servers/slapd/ch_malloc.c:199 > > #8 0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e > > "%s") > > at ldap/servers/slapd/task.c:254 > > #9 0x00bd7cce in import_log_notice (job=0xb3a28420, > > format=0xc15934 "WARNING: skipping entry \"%s\" which violates > > attribute syntax, ending line %d of file \"%s\"") at > > ldap/servers/slapd/back-ldbm/import.c:193 > > #10 0x00bdc7b9 in import_producer (param=0x8f977d0) > > at ldap/servers/slapd/back-ldbm/import-threads.c:601 > > #11 0x001686ed in ?? () from /usr/lib/libnspr4.so > > #12 0x002c55ab in start_thread () from /lib/libpthread.so.0 > > #13 0x0024ccfe in clone () from /lib/libc.so.6 > > > This is definitely a bug - please file a bug at > https://bugzilla.redhat.com/enter_bug.cgi?product=389 and please add the > above stack trace and any other information. Will do. > This looks like a double free, or a free/realloc of un-malloc'd memory. > Would it be possible for you to run the server using valgrind? > I have a script here http://github.com/richm/scripts called ns-slapd.sh > that you can use to replace ns-slapd (moving it to ns-slapd.orig) and > run the server/import/whatever using valgrind. I'll give it a go. > > > > > >> If you run the import job in gdb, and it crashes, gdb should report a > >> SIGSEGV. Since the import is multi-threaded, the crash may not be in > >> the current thread. In this case, you should dump the stack of all > >> threads like this: > >> (gdb) thread apply all bt > >> > > > > Sadly I only remembered gdb "where", but if it crashes again I'll get this > > part right. > > > > > >>> When I tried to cancel my import the log ended here, and ns-slapd crashed. > >>> > >>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import > >>> threads... > >>> > >>> It took me a bit to figure out how to get a core file in gdb, but I > >>> eventually got one. > >>> > >>> This is the result of "back" when I didn't have a core file: > >>> > >>> (gdb) back > >>> #0 0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 > >>> "%s") at ldap/servers/slapd/task.c:231 > >>> #1 0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff > >>> "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193 > >>> #2 0x00a9dad7 in import_main_offline (arg=0x8a334f0) at > >>> ldap/servers/slapd/back-ldbm/import.c:1196 > >>> #3 0x00a9e01f in import_main (arg=0x8a334f0) at > >>> ldap/servers/slapd/back-ldbm/import.c:1386 > >>> #4 0x001fe6ed in ?? () from /usr/lib/libnspr4.so > >>> #5 0x002165ab in start_thread () from /lib/libpthread.so.0 > >>> #6 0x0081dcfe in clone () from /lib/libc.so.6 > >>> > >>> This is what gdb printed to the terminal when the SIGSEGVs came in: > >>> > >>> Program received signal SIGSEGV, Segmentation fault. > >>> [Switching to Thread 0xb26fdb90 (LWP 13323)] > >>> 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 > >>> "%s") at ldap/servers/slapd/task.c:231 > >>> 231 len = 2 + strlen(buffer) + (task->task_log ? > >>> strlen(task->task_log) : 0); > >>> > >>> Program received signal SIGSEGV, Segmentation fault. > >>> [Switching to Thread 0xb26fdb90 (LWP 14197)] > >>> 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 > >>> "%s") at ldap/servers/slapd/task.c:231 > >>> 231 len = 2 + strlen(buffer) + (task->task_log ? > >>> strlen(task->task_log) : 0); > >>> > >>> When I figured out getting a core dump with my debug build, this is > >>> "where": > >>> > >>> (gdb) where > >>> #0 0x00d2309b in strlen () from /lib/libc.so.6 > >>> #1 0x00d22e05 in strdup () from /lib/libc.so.6 > >>> #2 0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) > >>> at ldap/servers/slapd/ch_malloc.c:277 > >>> #3 0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at > >>> ldap/servers/slapd/dn.c:1229 > >>> #4 0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, > >>> old_pw=0x0) at ldap/servers/slapd/modify.c:582 > >>> #5 0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at > >>> ldap/servers/slapd/modify.c:526 > >>> #6 0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at > >>> ldap/servers/slapd/modify.c:416 > >>> #7 0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of > >>> bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626 > >>> #8 0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at > >>> ldap/servers/slapd/task.c:299 > >>> #9 0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 > >>> "%s") at ldap/servers/slapd/task.c:264 > >>> #10 0x00b80dd6 in import_log_notice (job=0xb0009318, format=0xbc8cff > >>> "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193 > >>> #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at > >>> ldap/servers/slapd/back-ldbm/import.c:1196 > >>> #12 0x00b8301f in import_main (arg=0xb0009318) at > >>> ldap/servers/slapd/back-ldbm/import.c:1386 > >>> #13 0x002516ed in ?? () from /usr/lib/libnspr4.so > >>> #14 0x001e65ab in start_thread () from /lib/libpthread.so.0 > >>> #15 0x00d84cfe in clone () from /lib/libc.so.6 > >>> > >>> > >> This is a known problem > >> *Bug 515805* <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop > >> "initialize Database" crashes the server > >> > >>> -- > >>> 389 users mailing list > >>> 389-us...@lists.fedoraproject.org > >>> https://admin.fedoraproject.org/mailman/listinfo/389-users > >>> > >>> > >> -- > >> 389 users mailing list > >> 389-us...@lists.fedoraproject.org > >> https://admin.fedoraproject.org/mailman/listinfo/389-users > >> > > -- > > 389 users mailing list > > 389-us...@lists.fedoraproject.org > > https://admin.fedoraproject.org/mailman/listinfo/389-users > > -- > 389 users mailing list > 389-us...@lists.fedoraproject.org > https://admin.fedoraproject.org/mailman/listinfo/389-users -- 389 users mailing list 389-us...@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users