On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote: > Christopher Wood wrote: > > On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote: > > > >> Christopher Wood wrote: > >> > >>> I'm just getting started with 389 Directory Server (at work), and I've > >>> run into an issue that I'm not certain how to troubleshoot. I would > >>> greatly appreciate any assistance or tips you could offer, especially on > >>> where to look to see what's failing. > >>> > >>> Also, I apologize in advance for changing strings related to my > >>> employer's directory names and such, as I'm not comfortable with leaking > >>> that level information to a public list. > >>> > >>> > >> As well you should be - you should always obscure sensitive information > >> like this. > >> > >>> Overview: > >>> > >>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other > >>> subtrees are fine. > >>> > >>> > >>> Top-Level Questions: > >>> > >>> 1) How do I stop ns-slapd from crashing? > >>> > >>> > >> Good question. > >> > >>> 2) How do I figure out what precisely is causing the crash? (With various > >>> levels of debug logging I get the same log entry.) > >>> > >>> > >> You've already used the TRACE level (1) for logging - that's as verbose > >> as it gets for this particular operation. Next step would be to try to > >> get a core file. > >> > >>> 3) Is it possible to simply import my initialization ldif without > >>> duplication checks? > >>> > >>> > >> No. > >> > >>> Background: > >>> > >>> At work we have NDS 6.2 (single master on a physical server, virtual > >>> machine slaves), and would like to move our directories intact to a 389 > >>> 2.6 installation via replication. > >>> > >>> > >> What platform/OS? 32-bit or 64-bit? By NDS 6.2 I'm assuming you mean > >> Netscape Directory Server - by 2.6 I'm assuming you mean 1.2.6.a1 (a2 > >> should be hitting the mirrors tomorrow). > >> > > > > 32 bit > > > > > >>> I already have replicated several of our NDS 6.2 subtrees to 389 2.6 with > >>> no difficulties. > >>> > >>> I compiled our 389 installation from the source packages downloaded from > >>> http://directory.fedoraproject.org/wiki/Source. > >>> > >> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2? > >> > > > > I used 1.2.6.a1 to compile originally and produce core files to answer your > > questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the same > > version when trying to initially reproduce something. > > > > > >> What compiler flags did you use? > >> > > > > The makefile that came out of ./configure had these: > > > > CCASFLAGS = -g -O2 > > CFLAGS = -g -O2 > > CXXFLAGS = -g -O2 > > > > For the plain debug build I edited that to insert these and rebuilt with > > make, make install: > > > > CCASFLAGS = -g > > CFLAGS = -g > > CXXFLAGS = -g > > > > (Fair warning that I'm not a programmer, so I'm not entirely sure doing > > that was right.) > > > > > Note that you don't have to edit the Makefile - you can do a make > distclean, then run configure like this: > > CFLAGS="-g" /path/to/configure ... > > make > > But that looks right, anyway. Note that if you change the flags like > this by editing the makefile, you will have to do a make clean to remove > the old object files, so that they will be rebuilt with the new flags.
I did a make clean before rebuilding in this case. For my later debug builds I've used your step to re-run configure. > >> Do you have a core file? If so, try using gdb > >> gdb /path/to/ns-slapd /path/to/core.pid > >> once in gdb, type the "where" command > >> (gdb) where > >> > > > > The original crash didn't produce a core file, but I could get one by > > attaching gdb later, to both the original build and a debug build. > > > > > >>> The underlying platform is: > >>> > >>> $ uname -a > >>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 EDT > >>> 2009 i686 i686 i386 GNU/Linux > >>> $ cat /etc/redhat-release > >>> CentOS release 5.4 (Final) > >>> > >>> $ free > >>> total used free shared buffers cached > >>> Mem: 3894000 1336012 2557988 0 144944 1004716 > >>> -/+ buffers/cache: 186352 3707648 > >>> Swap: 2031608 0 2031608 > >>> > >>> > >>> Procedure To Crash 389's ns-slapd: > >>> > >>> a) In the NDS 6.2 admin console, create a new replication agreement for > >>> the "o=This Big Net" subtree, and choose to "Create consumer > >>> initialization file". > >>> > >>> b) Copy the file to the 389 server. > >>> > >>> c) In the 389 2.6 admin console for the Directory Server, in the > >>> Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and > >>> choose "Initialize Database". Use the ldif file copied over. > >>> > >>> The ns-slapd process crashes, and I always get this in > >>> /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines: > >>> > >>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file > >>> "/home/cwood/tbn.ldif" > >>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck > >>> > >>> > >>> Other Details: > >>> > >>> > >>> I found two bugs with the str2entry_dupcheck string in it, but they don't > >>> seem pertinent: > >>> > >>> https://bugzilla.redhat.com/show_bug.cgi?id=548115 > >>> https://bugzilla.redhat.com/show_bug.cgi?id=243488 > >>> > >>> > >>> This says that str2entry_dupcheck could be about two things: > >>> > >>> http://docs.sun.com/source/816-6699-10/ax_errcd.html > >>> > >>> "While attempting to convert a string entry to an LDAP entry, the server > >>> found that the entry has no DN." > >>> > >>> "The server failed to add a value to the value tree." > >>> > >>> (But this is an exported database from NDS 6.2, and I'm fairly sure, > >>> without reading them all, that every entry will have a DN.) > >>> > >>> > >> The log message > >> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck > >> > >> is just trace information, not a report of a problem or error. > >> > >> Does the crash happen almost immediately? Or does it take a while? If > >> the problem happens quickly, it would be worthwhile to scan the first > >> couple of dozen entries looking for things like - entries without a DN - > >> attributes without a value > >> > > > > I checked, and I couldn't see any data errors of this type. > > > > > >>> If 389 is trying to check for duplicate entries, perhaps there are simply > >>> too many DNs? > >>> > >>> $ grep '^dn:' tbn.ldif | wc -l > >>> 636985 > >>> $ ls -lh acc.ldif > >>> -rw-r--r-- 1 cwood cwood 755M Mar 3 11:24 tbn.ldif > >>> > >>> > >> No. The server should be able to handle this much data easily. And it > >> must check for duplicate entries. > >> > >>> Per the instructions here: > >>> > >>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting > >>> > >>> I set my debug logging first to 24579: > >>> > >>> 1 Trace function calls > >>> 2 Debug packet handling > >>> 8192 Replication debugging > >>> 16384 Critical messages > >>> > >>> Then for the next try at reading logs I set it to 90115, the above plus: > >>> > >>> 65536 Plug-in debugging > >>> > >>> However, every time the log ended with the same set of lines noted above. > >>> > >>> > >> 1 Trace is really the best for this particular problem, and as you have > >> found it is limited for this particular problem. > >> > >> I think the next step would be to build the server with full debugging > >> information (use -g and omit -O2 or any other -Ox) and get a stack trace > >> with full debug information. > >> > > > > After recompiling I got different results. Still a crash, but after a > > different sequence of actions. This time I didn't create the other > > subtrees, I went straight for the "TBN Net" subtree. I couldn't reproduce > > the immediate crash, but it did crash after I tried to stop the import > > (rather than watch the errors noted below). > > > > I went back to my original compile (-g -O2), and also couldn't reproduce > > the instant crash under gdb. Here's the "where" output for that: > > > > (gdb) where > > #0 0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0 > > #1 0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6 > > #2 0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so > > #3 0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at > > ldap/servers/slapd/back-ldbm/cache.c:585 > > #4 0x00b46d9d in import_main_offline (arg=0xab704268) at > > ldap/servers/slapd/back-ldbm/import.c:1300 > > #5 0x00b47e4d in import_main (arg=0xab704268) at > > ldap/servers/slapd/back-ldbm/import.c:1386 > > #6 0x001fe6ed in ?? () from /usr/lib/libnspr4.so > > #7 0x0022f5ab in start_thread () from /lib/libpthread.so.0 > > #8 0x00d84cfe in clone () from /lib/libc.so.6 > > > > Everything below is for my new compile, with -g but no -O2. > > > > I used this to attach gdb to the processes: > > > > gdb /opt/dirsrv/sbin/ns-slapd 12799 > > > > I got a large number of these in the error log. Possibly this is one source > > of my problem, needing to move the schema from Netscape Directory Server > > 6.2 before I move the subtrees. > > > > [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, > > ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown > > object class "ldapauthvirtuseroc" > > [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping entry > > "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN > > Net" which violates schema, ending line 2182899 of file > > "/home/cwood/tbn.ldif" > > > This is definitely a source of your problems. You must update the > schema before you can import the data. Although the server should not > crash either. I was having some trouble updating just the schema, so I grossly hacked up migrate-ds.pl, DSMigration::fixup99user and DSMigration::migrateSchema. With that I managed to get the schema from the old Netscape Directory Server 6.2 99user.ldif to a new 389 1.2.5 98mycompany.ldif. The new schema showed through in the admin gui, so I take it that I was successful in getting my schema updated. I've been tearing my hair out trying to reproduce the crash and/or get my data imported, even with the schema added. (I eyeballed the schema and a failing directory entry, and they seemed to match to the naked eye.) Just when I gave up on trying to get it to crash, it did so once more (and not since then). However, I did get this, not sure if it's of any use: Program received signal SIGABRT, Aborted. [Switching to Thread 0x9bdd9b90 (LWP 10174)] 0x003cc402 in __kernel_vsyscall () (gdb) where #0 0x003cc402 in __kernel_vsyscall () #1 0x001a3df0 in raise () from /lib/libc.so.6 #2 0x001a5701 in abort () from /lib/libc.so.6 #3 0x001dc28b in __libc_message () from /lib/libc.so.6 #4 0x001e4595 in _int_free () from /lib/libc.so.6 #5 0x001e66ac in _int_realloc () from /lib/libc.so.6 #6 0x001e7276 in realloc () from /lib/libc.so.6 #7 0x00da781e in slapi_ch_realloc ( block=0x8b886170 "p\003��p\003��ha\210\213ha\210\213g entry \"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=This Realm\" which violates attribute syntax, ending line 4701511 of file \"/home/cwood/dump.ldi"..., size=6752) at ldap/servers/slapd/ch_malloc.c:199 #8 0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e "%s") at ldap/servers/slapd/task.c:254 #9 0x00bd7cce in import_log_notice (job=0xb3a28420, format=0xc15934 "WARNING: skipping entry \"%s\" which violates attribute syntax, ending line %d of file \"%s\"") at ldap/servers/slapd/back-ldbm/import.c:193 #10 0x00bdc7b9 in import_producer (param=0x8f977d0) at ldap/servers/slapd/back-ldbm/import-threads.c:601 #11 0x001686ed in ?? () from /usr/lib/libnspr4.so #12 0x002c55ab in start_thread () from /lib/libpthread.so.0 #13 0x0024ccfe in clone () from /lib/libc.so.6 > If you run the import job in gdb, and it crashes, gdb should report a > SIGSEGV. Since the import is multi-threaded, the crash may not be in > the current thread. In this case, you should dump the stack of all > threads like this: > (gdb) thread apply all bt Sadly I only remembered gdb "where", but if it crashes again I'll get this part right. > > When I tried to cancel my import the log ended here, and ns-slapd crashed. > > > > [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import > > threads... > > > > It took me a bit to figure out how to get a core file in gdb, but I > > eventually got one. > > > > This is the result of "back" when I didn't have a core file: > > > > (gdb) back > > #0 0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 > > "%s") at ldap/servers/slapd/task.c:231 > > #1 0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff "Import > > threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193 > > #2 0x00a9dad7 in import_main_offline (arg=0x8a334f0) at > > ldap/servers/slapd/back-ldbm/import.c:1196 > > #3 0x00a9e01f in import_main (arg=0x8a334f0) at > > ldap/servers/slapd/back-ldbm/import.c:1386 > > #4 0x001fe6ed in ?? () from /usr/lib/libnspr4.so > > #5 0x002165ab in start_thread () from /lib/libpthread.so.0 > > #6 0x0081dcfe in clone () from /lib/libc.so.6 > > > > This is what gdb printed to the terminal when the SIGSEGVs came in: > > > > Program received signal SIGSEGV, Segmentation fault. > > [Switching to Thread 0xb26fdb90 (LWP 13323)] > > 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 "%s") > > at ldap/servers/slapd/task.c:231 > > 231 len = 2 + strlen(buffer) + (task->task_log ? > > strlen(task->task_log) : 0); > > > > Program received signal SIGSEGV, Segmentation fault. > > [Switching to Thread 0xb26fdb90 (LWP 14197)] > > 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 "%s") > > at ldap/servers/slapd/task.c:231 > > 231 len = 2 + strlen(buffer) + (task->task_log ? > > strlen(task->task_log) : 0); > > > > When I figured out getting a core dump with my debug build, this is "where": > > > > (gdb) where > > #0 0x00d2309b in strlen () from /lib/libc.so.6 > > #1 0x00d22e05 in strdup () from /lib/libc.so.6 > > #2 0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) at > > ldap/servers/slapd/ch_malloc.c:277 > > #3 0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at > > ldap/servers/slapd/dn.c:1229 > > #4 0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, old_pw=0x0) > > at ldap/servers/slapd/modify.c:582 > > #5 0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at > > ldap/servers/slapd/modify.c:526 > > #6 0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at > > ldap/servers/slapd/modify.c:416 > > #7 0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of > > bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626 > > #8 0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at > > ldap/servers/slapd/task.c:299 > > #9 0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 > > "%s") at ldap/servers/slapd/task.c:264 > > #10 0x00b80dd6 in import_log_notice (job=0xb0009318, format=0xbc8cff > > "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193 > > #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at > > ldap/servers/slapd/back-ldbm/import.c:1196 > > #12 0x00b8301f in import_main (arg=0xb0009318) at > > ldap/servers/slapd/back-ldbm/import.c:1386 > > #13 0x002516ed in ?? () from /usr/lib/libnspr4.so > > #14 0x001e65ab in start_thread () from /lib/libpthread.so.0 > > #15 0x00d84cfe in clone () from /lib/libc.so.6 > > > This is a known problem > *Bug 515805* <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop > "initialize Database" crashes the server > > -- > > 389 users mailing list > > [email protected] > > https://admin.fedoraproject.org/mailman/listinfo/389-users > > > > -- > 389 users mailing list > [email protected] > https://admin.fedoraproject.org/mailman/listinfo/389-users -- 389 users mailing list [email protected] https://admin.fedoraproject.org/mailman/listinfo/389-users
