Christopher Wood wrote:
> On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote:
>> Christopher Wood wrote:
>>> On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote:
>>>> Christopher Wood wrote:
>>>>> I'm just getting started with 389 Directory Server (at work), and I've
>>>>> run into an issue that I'm not certain how to troubleshoot. I would
>>>>> greatly appreciate any assistance or tips you could offer, especially on
>>>>> where to look to see what's failing.
>>>>>
>>>>> Also, I apologize in advance for changing strings related to my
>>>>> employer's directory names and such, as I'm not comfortable with leaking
>>>>> that level of information to a public list.
>>>>
>>>> As well you should be - you should always obscure sensitive information
>>>> like this.
>>>>
>>>>> Overview:
>>>>>
>>>>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other
>>>>> subtrees are fine.
>>>>>
>>>>> Top-Level Questions:
>>>>>
>>>>> 1) How do I stop ns-slapd from crashing?
>>>>
>>>> Good question.
>>>>
>>>>> 2) How do I figure out what precisely is causing the crash? (With
>>>>> various levels of debug logging I get the same log entry.)
>>>>
>>>> You've already used the TRACE level (1) for logging - that's as verbose
>>>> as it gets for this particular operation. The next step would be to try
>>>> to get a core file.
>>>>
>>>>> 3) Is it possible to simply import my initialization ldif without
>>>>> duplication checks?
>>>>
>>>> No.
>>>>
>>>>> Background:
>>>>>
>>>>> At work we have NDS 6.2 (single master on a physical server, virtual
>>>>> machine slaves), and would like to move our directories intact to a 389
>>>>> 2.6 installation via replication.
>>>>
>>>> What platform/OS? 32-bit or 64-bit?
>>>> By NDS 6.2 I'm assuming you mean Netscape Directory Server - by 2.6 I'm
>>>> assuming you mean 1.2.6.a1 (a2 should be hitting the mirrors tomorrow).
>>>
>>> 32 bit
>>>
>>>>> I have already replicated several of our NDS 6.2 subtrees to 389 2.6
>>>>> with no difficulties.
>>>>>
>>>>> I compiled our 389 installation from the source packages downloaded
>>>>> from http://directory.fedoraproject.org/wiki/Source.
>>>>
>>>> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2?
>>>
>>> I used 1.2.6.a1 to compile originally and produce core files to answer
>>> your questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the
>>> same version when trying to initially reproduce something.
>>>
>>>> What compiler flags did you use?
>>>
>>> The Makefile that came out of ./configure had these:
>>>
>>>   CCASFLAGS = -g -O2
>>>   CFLAGS = -g -O2
>>>   CXXFLAGS = -g -O2
>>>
>>> For the plain debug build I edited that to insert these and rebuilt with
>>> make, make install:
>>>
>>>   CCASFLAGS = -g
>>>   CFLAGS = -g
>>>   CXXFLAGS = -g
>>>
>>> (Fair warning that I'm not a programmer, so I'm not entirely sure doing
>>> that was right.)
>>
>> Note that you don't have to edit the Makefile - you can do a make
>> distclean, then run configure like this:
>>
>>   CFLAGS="-g" /path/to/configure ...
>>   make
>>
>> But that looks right, anyway. Note that if you change the flags like
>> this by editing the Makefile, you will have to do a make clean to remove
>> the old object files, so that they will be rebuilt with the new flags.
>
> I did a make clean before rebuilding in this case. For my later debug
> builds I've used your step to re-run configure.
>
>>>> Do you have a core file?
>>>> If so, try using gdb:
>>>>
>>>>   gdb /path/to/ns-slapd /path/to/core.pid
>>>>
>>>> Once in gdb, type the "where" command:
>>>>
>>>>   (gdb) where
>>>
>>> The original crash didn't produce a core file, but I could get one by
>>> attaching gdb later, to both the original build and a debug build.
>>>
>>>>> The underlying platform is:
>>>>>
>>>>> $ uname -a
>>>>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56
>>>>> EDT 2009 i686 i686 i386 GNU/Linux
>>>>> $ cat /etc/redhat-release
>>>>> CentOS release 5.4 (Final)
>>>>>
>>>>> $ free
>>>>>              total       used       free     shared    buffers     cached
>>>>> Mem:       3894000    1336012    2557988          0     144944    1004716
>>>>> -/+ buffers/cache:     186352    3707648
>>>>> Swap:      2031608          0    2031608
>>>>>
>>>>> Procedure To Crash 389's ns-slapd:
>>>>>
>>>>> a) In the NDS 6.2 admin console, create a new replication agreement for
>>>>> the "o=This Big Net" subtree, and choose to "Create consumer
>>>>> initialization file".
>>>>>
>>>>> b) Copy the file to the 389 server.
>>>>>
>>>>> c) In the 389 2.6 admin console for the Directory Server, in the
>>>>> Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and
>>>>> choose "Initialize Database". Use the ldif file copied over.
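Since the original crash reportedly produced no core file, it is worth checking, before re-running the reproduction steps above, that the environment launching ns-slapd allows core dumps at all. A minimal sketch (the sysctl line and paths are illustrative, not from this thread):

```shell
# Sketch: lift the core-size limit in the shell that will launch ns-slapd,
# so a crash can actually leave a core file behind for gdb.
ulimit -c unlimited    # applies to this shell and its children
ulimit -c              # confirm; prints "unlimited"
# As root, cores can be named core.<pid> so they are easy to match to a process:
#   echo 1 > /proc/sys/kernel/core_uses_pid
# After a crash, examine the core with something like:
#   gdb /opt/dirsrv/sbin/ns-slapd /path/to/core.<pid>
```

Note the limit is per-process and inherited, so it must be set in the environment that starts the server (e.g. before the init script runs), not in an unrelated login shell.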
>>>>> The ns-slapd process crashes, and I always get this in
>>>>> /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines:
>>>>>
>>>>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file
>>>>> "/home/cwood/tbn.ldif"
>>>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
>>>>>
>>>>> Other Details:
>>>>>
>>>>> I found two bugs with the str2entry_dupcheck string in them, but they
>>>>> don't seem pertinent:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=548115
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=243488
>>>>>
>>>>> This says that str2entry_dupcheck could be about two things:
>>>>>
>>>>> http://docs.sun.com/source/816-6699-10/ax_errcd.html
>>>>>
>>>>> "While attempting to convert a string entry to an LDAP entry, the
>>>>> server found that the entry has no DN."
>>>>>
>>>>> "The server failed to add a value to the value tree."
>>>>>
>>>>> (But this is an exported database from NDS 6.2, and I'm fairly sure,
>>>>> without reading them all, that every entry will have a DN.)
>>>>
>>>> The log message
>>>>   [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
>>>> is just trace information, not a report of a problem or error.
>>>>
>>>> Does the crash happen almost immediately? Or does it take a while? If
>>>> the problem happens quickly, it would be worthwhile to scan the first
>>>> couple of dozen entries looking for things like:
>>>> - entries without a DN
>>>> - attributes without a value
>>>
>>> I checked, and I couldn't see any data errors of this type.
>>>
>>>>> If 389 is trying to check for duplicate entries, perhaps there are
>>>>> simply too many DNs?
>>>>>
>>>>> $ grep '^dn:' tbn.ldif | wc -l
>>>>> 636985
>>>>> $ ls -lh tbn.ldif
>>>>> -rw-r--r-- 1 cwood cwood 755M Mar 3 11:24 tbn.ldif
>>>>
>>>> No. The server should be able to handle this much data easily. And it
>>>> must check for duplicate entries.
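The "entries without a DN / attributes without a value" scan suggested above can be mechanized. A hedged sketch using awk over blank-line-separated LDIF records - the sample file here is invented for illustration (run the awk against tbn.ldif instead), and it ignores LDIF line-folding, so treat it as a rough first pass:

```shell
# Write a tiny made-up LDIF so the check below has something to chew on.
cat > /tmp/sample.ldif <<'EOF'
dn: ou=People, o=Example
objectClass: organizationalUnit
ou: People

cn: record-with-no-dn
sn:
EOF

awk '
  BEGIN { RS = ""; FS = "\n" }    # one record per blank-line-separated block
  $1 !~ /^dn:/ { print "record " NR " has no dn" }
  {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[A-Za-z][^:]*:[ \t]*$/)   # "attr:" with nothing after it
        print "record " NR " has an empty value: " $i
  }
' /tmp/sample.ldif
# -> record 2 has no dn
# -> record 2 has an empty value: sn:
```

To look at only the first couple of dozen entries of a 755M file, pipe `head -200 tbn.ldif` into the same awk script rather than scanning the whole thing.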
>>>>> Per the instructions here:
>>>>>
>>>>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting
>>>>>
>>>>> I set my debug logging first to 24579:
>>>>>
>>>>>     1  Trace function calls
>>>>>     2  Debug packet handling
>>>>>  8192  Replication debugging
>>>>> 16384  Critical messages
>>>>>
>>>>> Then for the next try at reading logs I set it to 90115, the above plus:
>>>>>
>>>>> 65536  Plug-in debugging
>>>>>
>>>>> However, every time the log ended with the same set of lines noted above.
>>>>
>>>> 1 (Trace) is really the best level for this particular problem, and as
>>>> you have found, its output is limited here.
>>>>
>>>> I think the next step would be to build the server with full debugging
>>>> information (use -g and omit -O2 or any other -Ox) and get a stack trace
>>>> with full debug information.
>>>
>>> After recompiling I got different results. Still a crash, but after a
>>> different sequence of actions. This time I didn't create the other
>>> subtrees; I went straight for the "TBN Net" subtree. I couldn't reproduce
>>> the immediate crash, but it did crash after I tried to stop the import
>>> (rather than watch the errors noted below).
>>>
>>> I went back to my original compile (-g -O2), and also couldn't reproduce
>>> the instant crash under gdb. Here's the "where" output for that:
>>>
>>> (gdb) where
>>> #0  0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0
>>> #1  0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6
>>> #2  0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so
>>> #3  0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at ldap/servers/slapd/back-ldbm/cache.c:585
>>> #4  0x00b46d9d in import_main_offline (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1300
>>> #5  0x00b47e4d in import_main (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1386
>>> #6  0x001fe6ed in ??
>>> () from /usr/lib/libnspr4.so
>>> #7  0x0022f5ab in start_thread () from /lib/libpthread.so.0
>>> #8  0x00d84cfe in clone () from /lib/libc.so.6
>>>
>>> Everything below is for my new compile, with -g but no -O2.
>>>
>>> I used this to attach gdb to the processes:
>>>
>>>   gdb /opt/dirsrv/sbin/ns-slapd 12799
>>>
>>> I got a large number of these in the error log. Possibly this is one
>>> source of my problem: needing to move the schema from Netscape Directory
>>> Server 6.2 before I move the subtrees.
>>>
>>> [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob,
>>> ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown
>>> object class "ldapauthvirtuseroc"
>>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping
>>> entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com,
>>> ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899
>>> of file "/home/cwood/tbn.ldif"
>>
>> This is definitely a source of your problems. You must update the
>> schema before you can import the data. That said, the server should not
>> crash either.
>
> I was having some trouble updating just the schema, so I grossly hacked up
> migrate-ds.pl, DSMigration::fixup99user, and DSMigration::migrateSchema.
> With that I managed to get the schema from the old Netscape Directory
> Server 6.2 99user.ldif into a new 389 1.2.5 98mycompany.ldif. The new
> schema showed through in the admin GUI, so I take it that I was successful
> in getting my schema updated.
>
> I've been tearing my hair out trying to reproduce the crash and/or get my
> data imported, even with the schema added. (I eyeballed the schema and a
> failing directory entry, and they seemed to match to the naked eye.)

Even though the server should not crash, I believe that if we could get the
syntax problem taken care of, you could import your data without a crash
(according to the stack trace below).
The message appears to be:

  entry "ldapAuthLogin=username, ldapAuthRealmName=mycompany.com,
  ou=UsersByRealm, o=This Realm" which violates attribute syntax, ending
  line 4701511 of file "/home/cwood/dump.ld

Would it be possible for you to post excerpts from that file at or around
that line/entry, along with the definitions of the attributes used?

> Just when I gave up on trying to get it to crash, it did so once more (and
> not since then). However, I did get this; not sure if it's of any use:
>
> Program received signal SIGABRT, Aborted.
> [Switching to Thread 0x9bdd9b90 (LWP 10174)]
> 0x003cc402 in __kernel_vsyscall ()
> (gdb) where
> #0  0x003cc402 in __kernel_vsyscall ()
> #1  0x001a3df0 in raise () from /lib/libc.so.6
> #2  0x001a5701 in abort () from /lib/libc.so.6
> #3  0x001dc28b in __libc_message () from /lib/libc.so.6
> #4  0x001e4595 in _int_free () from /lib/libc.so.6
> #5  0x001e66ac in _int_realloc () from /lib/libc.so.6
> #6  0x001e7276 in realloc () from /lib/libc.so.6
> #7  0x00da781e in slapi_ch_realloc (
>     block=0x8b886170 "p\003��p\003��ha\210\213ha\210\213g entry
>     \"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com,
>     ou=UsersByRealm, o=This Realm\" which violates attribute syntax,
>     ending line 4701511 of file \"/home/cwood/dump.ldi"...,
>     size=6752) at ldap/servers/slapd/ch_malloc.c:199
> #8  0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e "%s") at ldap/servers/slapd/task.c:254
> #9  0x00bd7cce in import_log_notice (job=0xb3a28420, format=0xc15934 "WARNING: skipping entry \"%s\" which violates attribute syntax, ending line %d of file \"%s\"") at ldap/servers/slapd/back-ldbm/import.c:193
> #10 0x00bdc7b9 in import_producer (param=0x8f977d0) at ldap/servers/slapd/back-ldbm/import-threads.c:601
> #11 0x001686ed in ??
> () from /usr/lib/libnspr4.so
> #12 0x002c55ab in start_thread () from /lib/libpthread.so.0
> #13 0x0024ccfe in clone () from /lib/libc.so.6

This is definitely a bug - please file one at
https://bugzilla.redhat.com/enter_bug.cgi?product=389 and add the above
stack trace and any other information. This looks like a double free, or a
free/realloc of un-malloc'd memory.

Would it be possible for you to run the server using valgrind? I have a
script at http://github.com/richm/scripts called ns-slapd.sh that you can
use to replace ns-slapd (moving it to ns-slapd.orig) and run the
server/import/whatever using valgrind.

>> If you run the import job in gdb, and it crashes, gdb should report a
>> SIGSEGV. Since the import is multi-threaded, the crash may not be in
>> the current thread. In this case, you should dump the stacks of all
>> threads like this:
>>
>>   (gdb) thread apply all bt
>
> Sadly I only remembered gdb "where", but if it crashes again I'll get this
> part right.
>
>>> When I tried to cancel my import, the log ended here and ns-slapd
>>> crashed:
>>>
>>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import
>>> threads...
>>>
>>> It took me a bit to figure out how to get a core file in gdb, but I
>>> eventually got one.
>>>
>>> This is the result of "back" when I didn't have a core file:
>>>
>>> (gdb) back
>>> #0  0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 "%s") at ldap/servers/slapd/task.c:231
>>> #1  0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
>>> #2  0x00a9dad7 in import_main_offline (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1196
>>> #3  0x00a9e01f in import_main (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1386
>>> #4  0x001fe6ed in ??
>>> () from /usr/lib/libnspr4.so
>>> #5  0x002165ab in start_thread () from /lib/libpthread.so.0
>>> #6  0x0081dcfe in clone () from /lib/libc.so.6
>>>
>>> This is what gdb printed to the terminal when the SIGSEGVs came in:
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0xb26fdb90 (LWP 13323)]
>>> 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 "%s") at ldap/servers/slapd/task.c:231
>>> 231         len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0);
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0xb26fdb90 (LWP 14197)]
>>> 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 "%s") at ldap/servers/slapd/task.c:231
>>> 231         len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0);
>>>
>>> When I figured out getting a core dump with my debug build, this is
>>> "where":
>>>
>>> (gdb) where
>>> #0  0x00d2309b in strlen () from /lib/libc.so.6
>>> #1  0x00d22e05 in strdup () from /lib/libc.so.6
>>> #2  0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) at ldap/servers/slapd/ch_malloc.c:277
>>> #3  0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at ldap/servers/slapd/dn.c:1229
>>> #4  0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, old_pw=0x0) at ldap/servers/slapd/modify.c:582
>>> #5  0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:526
>>> #6  0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:416
>>> #7  0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626
>>> #8  0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at ldap/servers/slapd/task.c:299
>>> #9  0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 "%s") at ldap/servers/slapd/task.c:264
>>> #10 0x00b80dd6 in import_log_notice (job=0xb0009318,
>>> format=0xbc8cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
>>> #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1196
>>> #12 0x00b8301f in import_main (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1386
>>> #13 0x002516ed in ?? () from /usr/lib/libnspr4.so
>>> #14 0x001e65ab in start_thread () from /lib/libpthread.so.0
>>> #15 0x00d84cfe in clone () from /lib/libc.so.6
>>
>> This is a known problem:
>> Bug 515805 <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop
>> "Initialize Database" crashes the server
-- 
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users