Re: [389-users] importing large subtree crashes ns-slapd

Christopher Wood Mon, 15 Mar 2010 11:41:01 -0700

On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote:
> Christopher Wood wrote:
> > On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote:
> >   
> >> Christopher Wood wrote:
> >>     
> >>> I'm just getting started with 389 Directory Server (at work), and I've 
> >>> run into an issue that I'm not certain how to troubleshoot. I would 
> >>> greatly appreciate any assistance or tips you could offer, especially on 
> >>> where to look to see what's failing.
> >>>
> >>> Also, I apologize in advance for changing strings related to my 
> >>> employer's directory names and such, as I'm not comfortable with leaking 
> >>> that level information to a public list.
> >>>   
> >>>       
> >> As well you should be - you should always obscure sensitive information 
> >> like this.
> >>     
> >>> Overview:
> >>>
> >>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other 
> >>> subtrees are fine.
> >>>
> >>>
> >>> Top-Level Questions:
> >>>
> >>> 1) How do I stop ns-slapd from crashing?
> >>>   
> >>>       
> >> Good question.
> >>     
> >>> 2) How do I figure out what precisely is causing the crash? (With various 
> >>> levels of debug logging I get the same log entry.)
> >>>   
> >>>       
> >> You've already used the TRACE level (1) for logging - that's as verbose 
> >> as it gets for this particular operation.  Next step would be to try to 
> >> get a core file.
> >>     
> >>> 3) Is it possible to simply import my initialization ldif without 
> >>> duplication checks?
> >>>   
> >>>       
> >> No.
> >>     
> >>> Background:
> >>>
> >>> At work we have NDS 6.2 (single master on a physical server, virtual 
> >>> machine slaves), and would like to move our directories intact to a 389 
> >>> 2.6 installation via replication.
> >>>   
> >>>       
> >> What platform/OS?  32-bit or 64-bit?  By NDS 6.2 I'm assuming you mean 
> >> Netscape Directory Server - by 2.6 I'm assuming you mean 1.2.6.a1 (a2 
> >> should be hitting the mirrors tomorrow).
> >>     
> >
> > 32 bit
> >
> >   
> >>> I already have replicated several of our NDS 6.2 subtrees to 389 2.6 with 
> >>> no difficulties.
> >>>
> >>> I compiled our 389 installation from the source packages downloaded from 
> >>> http://directory.fedoraproject.org/wiki/Source.
> >>>       
> >> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2?
> >>     
> >
> > I used 1.2.6.a1 to compile originally and produce core files to answer your 
> > questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the same 
> > version when trying to initially reproduce something.
> >
> >   
> >> What compiler flags did you use?
> >>     
> >
> > The makefile that came out of ./configure had these:
> >
> > CCASFLAGS = -g -O2
> > CFLAGS = -g -O2
> > CXXFLAGS = -g -O2
> >
> > For the plain debug build I edited that to insert these and rebuilt with 
> > make, make install:
> >
> > CCASFLAGS = -g
> > CFLAGS = -g
> > CXXFLAGS = -g
> >
> > (Fair warning that I'm not a programmer, so I'm not entirely sure doing 
> > that was right.)
> >
> >   
> Note that you don't have to edit the Makefile - you can do a make 
> distclean, then run configure like this:
>  > CFLAGS="-g" /path/to/configure ...
>  > make
> 
> But that looks right, anyway.  Note that if you change the flags like 
> this by editing the makefile, you will have to do a make clean to remove 
> the old object files, so that they will be rebuilt with the new flags.


I did a make clean before rebuilding in this case. For my later debug builds 
I've used your step to re-run configure.

> >> Do you have a core file?  If so, try using gdb
> >> gdb /path/to/ns-slapd /path/to/core.pid
> >> once in gdb, type the "where" command
> >> (gdb) where
> >>     
> >
> > The original crash didn't produce a core file, but I could get one by 
> > attaching gdb later, to both the original build and a debug build.
> >
> >   
> >>> The underlying platform is:
> >>>
> >>> $ uname -a
> >>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 EDT 
> >>> 2009 i686 i686 i386 GNU/Linux
> >>> $ cat /etc/redhat-release 
> >>> CentOS release 5.4 (Final)
> >>>
> >>> $ free
> >>>              total       used       free     shared    buffers     cached
> >>> Mem:       3894000    1336012    2557988          0     144944    1004716
> >>> -/+ buffers/cache:     186352    3707648
> >>> Swap:      2031608          0    2031608
> >>>
> >>>
> >>> Procedure To Crash 389's ns-slapd:
> >>>
> >>> a) In the NDS 6.2 admin console, create a new replication agreement for 
> >>> the "o=This Big Net" subtree, and choose to "Create consumer 
> >>> initialization file".
> >>>
> >>> b) Copy the file to the 389 server.
> >>>
> >>> c) In the 389 2.6 admin console for the Directory Server, in the 
> >>> Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and 
> >>> choose "Initialize Database". Use the ldif file copied over.
> >>>
> >>> The ns-slapd process crashes, and I always get this in 
> >>> /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines:
> >>>
> >>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file 
> >>> "/home/cwood/tbn.ldif"
> >>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
> >>>
> >>>
> >>> Other Details:
> >>>
> >>>
> >>> I found two bugs with the str2entry_dupcheck string in it, but they don't 
> >>> seem pertinent:
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=548115
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=243488
> >>>
> >>>
> >>> This says that str2entry_dupcheck could be about two things:
> >>>
> >>> http://docs.sun.com/source/816-6699-10/ax_errcd.html
> >>>
> >>> "While attempting to convert a string entry to an LDAP entry, the server 
> >>> found that the entry has no DN."
> >>>
> >>> "The server failed to add a value to the value tree."
> >>>
> >>> (But this is an exported database from NDS 6.2, and I'm fairly sure, 
> >>> without reading them all, that every entry will have a DN.)
> >>>   
> >>>       
> >> The log message
> >> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
> >>
> >> is just trace information, not a report of a problem or error.
> >>
> >> Does the crash happen almost immediately?  Or does it take a while?  If 
> >> the problem happens quickly, it would be worthwhile to scan the first 
> >> couple of dozen entries looking for things like - entries without a DN - 
> >> attributes without a value
> >>     
> >
> > I checked, and I couldn't see any data errors of this type.
> >  
> >   
> >>> If 389 is trying to check for duplicate entries, perhaps there are simply 
> >>> too many DNs?
> >>>
> >>> $ grep '^dn:' tbn.ldif | wc -l
> >>> 636985
> >>> $ ls -lh acc.ldif 
> >>> -rw-r--r-- 1 cwood cwood 755M Mar  3 11:24 tbn.ldif
> >>>   
> >>>       
> >> No.  The server should be able to handle this much data easily.  And it 
> >> must check for duplicate entries.
> >>     
> >>> Per the instructions here:
> >>>
> >>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting
> >>>
> >>> I set my debug logging first to 24579:
> >>>
> >>> 1          Trace function calls 
> >>> 2          Debug packet handling 
> >>> 8192       Replication debugging 
> >>> 16384      Critical messages
> >>>
> >>> Then for the next try at reading logs I set it to 90115, the above plus:
> >>>
> >>> 65536      Plug-in debugging
> >>>
> >>> However, every time the log ended with the same set of lines noted above.
> >>>   
> >>>       
> >> 1 Trace is really the best for this particular problem, and as you have 
> >> found it is limited for this particular problem.
> >>
> >> I think the next step would be to build the server with full debugging 
> >> information (use -g and omit -O2 or any other -Ox) and get a stack trace 
> >> with full debug information.
> >>     
> >
> > After recompiling I got different results. Still a crash, but after a 
> > different sequence of actions. This time I didn't create the other 
> > subtrees, I went straight for the "TBN Net" subtree. I couldn't reproduce 
> > the immediate crash, but it did crash after I tried to stop the import 
> > (rather than watch the errors noted below).
> >
> > I went back to my original compile (-g -O2), and also couldn't reproduce 
> > the instant crash under gdb. Here's the "where" output for that:
> >
> > (gdb) where
> > #0  0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0
> > #1  0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6
> > #2  0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so
> > #3  0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at 
> > ldap/servers/slapd/back-ldbm/cache.c:585
> > #4  0x00b46d9d in import_main_offline (arg=0xab704268) at 
> > ldap/servers/slapd/back-ldbm/import.c:1300
> > #5  0x00b47e4d in import_main (arg=0xab704268) at 
> > ldap/servers/slapd/back-ldbm/import.c:1386
> > #6  0x001fe6ed in ?? () from /usr/lib/libnspr4.so
> > #7  0x0022f5ab in start_thread () from /lib/libpthread.so.0
> > #8  0x00d84cfe in clone () from /lib/libc.so.6
> >
> > Everything below is for my new compile, with -g but no -O2.
> >
> > I used this to attach gdb to the processes:
> >
> > gdb /opt/dirsrv/sbin/ns-slapd 12799
> >
> > I got a large number of these in the error log. Possibly this is one source 
> > of my problem, needing to move the schema from Netscape Directory Server 
> > 6.2 before I move the subtrees.
> >
> > [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, 
> > ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown 
> > object class "ldapauthvirtuseroc"
> > [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping entry 
> > "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN 
> > Net" which violates schema, ending line 2182899 of file 
> > "/home/cwood/tbn.ldif"
> >   
> This is definitely a source of your problems.  You must update the 
> schema before you can import the data.  Although the server should not 
> crash either.

I was having some trouble updating just the schema, so I grossly hacked up 
migrate-ds.pl, DSMigration::fixup99user and DSMigration::migrateSchema. With 
that I managed to get the schema from the old Netscape Directory Server 6.2 
99user.ldif to a new 389 1.2.5 98mycompany.ldif. The new schema showed through 
in the admin gui, so I take it that I was successful in getting my schema 
updated.

I've been tearing my hair out trying to reproduce the crash and/or get my data 
imported, even with the schema added. (I eyeballed the schema and a failing 
directory entry, and they seemed to match to the naked eye.) Just when I gave 
up on trying to get it to crash, it did so once more (and not since then). 
However, I did get this, not sure if it's of any use:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x9bdd9b90 (LWP 10174)]
0x003cc402 in __kernel_vsyscall ()
(gdb) where
#0  0x003cc402 in __kernel_vsyscall ()
#1  0x001a3df0 in raise () from /lib/libc.so.6
#2  0x001a5701 in abort () from /lib/libc.so.6
#3  0x001dc28b in __libc_message () from /lib/libc.so.6
#4  0x001e4595 in _int_free () from /lib/libc.so.6
#5  0x001e66ac in _int_realloc () from /lib/libc.so.6
#6  0x001e7276 in realloc () from /lib/libc.so.6
#7  0x00da781e in slapi_ch_realloc (
    block=0x8b886170 "p\003��p\003��ha\210\213ha\210\213g entry 
\"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, 
o=This Realm\" which violates attribute syntax, ending line 4701511 of file 
\"/home/cwood/dump.ldi"...,
    size=6752) at ldap/servers/slapd/ch_malloc.c:199
#8  0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e "%s")
    at ldap/servers/slapd/task.c:254
#9  0x00bd7cce in import_log_notice (job=0xb3a28420,
    format=0xc15934 "WARNING: skipping entry \"%s\" which violates attribute 
syntax, ending line %d of file \"%s\"") at 
ldap/servers/slapd/back-ldbm/import.c:193
#10 0x00bdc7b9 in import_producer (param=0x8f977d0)
    at ldap/servers/slapd/back-ldbm/import-threads.c:601
#11 0x001686ed in ?? () from /usr/lib/libnspr4.so
#12 0x002c55ab in start_thread () from /lib/libpthread.so.0
#13 0x0024ccfe in clone () from /lib/libc.so.6
 
> If you run the import job in gdb, and it crashes, gdb should report a 
> SIGSEGV.  Since the import is multi-threaded, the crash may not be in 
> the current thread.  In this case, you should dump the stack of all 
> threads like this:
> (gdb) thread apply all bt

Sadly I only remembered gdb "where", but if it crashes again I'll get this part 
right.

> > When I tried to cancel my import the log ended here, and ns-slapd crashed.
> >
> > [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import 
> > threads...
> >
> > It took me a bit to figure out how to get a core file in gdb, but I 
> > eventually got one.
> >
> > This is the result of "back" when I didn't have a core file:
> >
> > (gdb) back
> > #0  0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 
> > "%s") at ldap/servers/slapd/task.c:231
> > #1  0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff "Import 
> > threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
> > #2  0x00a9dad7 in import_main_offline (arg=0x8a334f0) at 
> > ldap/servers/slapd/back-ldbm/import.c:1196
> > #3  0x00a9e01f in import_main (arg=0x8a334f0) at 
> > ldap/servers/slapd/back-ldbm/import.c:1386
> > #4  0x001fe6ed in ?? () from /usr/lib/libnspr4.so
> > #5  0x002165ab in start_thread () from /lib/libpthread.so.0
> > #6  0x0081dcfe in clone () from /lib/libc.so.6
> >
> > This is what gdb printed to the terminal when the SIGSEGVs came in:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0xb26fdb90 (LWP 13323)]
> > 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 "%s") 
> > at ldap/servers/slapd/task.c:231
> > 231         len = 2 + strlen(buffer) + (task->task_log ? 
> > strlen(task->task_log) : 0);
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0xb26fdb90 (LWP 14197)]
> > 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 "%s") 
> > at ldap/servers/slapd/task.c:231
> > 231         len = 2 + strlen(buffer) + (task->task_log ? 
> > strlen(task->task_log) : 0);
> >
> > When I figured out getting a core dump with my debug build, this is "where":
> >
> > (gdb) where
> > #0  0x00d2309b in strlen () from /lib/libc.so.6
> > #1  0x00d22e05 in strdup () from /lib/libc.so.6
> > #2  0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) at 
> > ldap/servers/slapd/ch_malloc.c:277
> > #3  0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at 
> > ldap/servers/slapd/dn.c:1229
> > #4  0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, old_pw=0x0) 
> > at ldap/servers/slapd/modify.c:582
> > #5  0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at 
> > ldap/servers/slapd/modify.c:526
> > #6  0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at 
> > ldap/servers/slapd/modify.c:416
> > #7  0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of 
> > bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626
> > #8  0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at 
> > ldap/servers/slapd/task.c:299
> > #9  0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 
> > "%s") at ldap/servers/slapd/task.c:264
> > #10 0x00b80dd6 in import_log_notice (job=0xb0009318, format=0xbc8cff 
> > "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
> > #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at 
> > ldap/servers/slapd/back-ldbm/import.c:1196
> > #12 0x00b8301f in import_main (arg=0xb0009318) at 
> > ldap/servers/slapd/back-ldbm/import.c:1386
> > #13 0x002516ed in ?? () from /usr/lib/libnspr4.so
> > #14 0x001e65ab in start_thread () from /lib/libpthread.so.0
> > #15 0x00d84cfe in clone () from /lib/libc.so.6
> >   
> This is a known problem
> *Bug 515805* <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop 
> "initialize Database" crashes the server
> > --
> > 389 users mailing list
> > [email protected]
> > https://admin.fedoraproject.org/mailman/listinfo/389-users
> >   
> 
> --
> 389 users mailing list
> [email protected]
> https://admin.fedoraproject.org/mailman/listinfo/389-users
--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] importing large subtree crashes ns-slapd

Reply via email to