Christopher Wood wrote:
> On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote:
>> Christopher Wood wrote:
>>> On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote:
>>>> Christopher Wood wrote:
>>>>> I'm just getting started with 389 Directory Server (at work), and I've
>>>>> run into an issue that I'm not certain how to troubleshoot. I would
>>>>> greatly appreciate any assistance or tips you could offer, especially on
>>>>> where to look to see what's failing.
>>>>>
>>>>> Also, I apologize in advance for changing strings related to my
>>>>> employer's directory names and such, as I'm not comfortable with leaking
>>>>> that level of information to a public list.
>>>>
>>>> As well you should be - you should always obscure sensitive information
>>>> like this.
>>>>
>>>>> Overview:
>>>>>
>>>>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other
>>>>> subtrees are fine.
>>>>>
>>>>> Top-Level Questions:
>>>>>
>>>>> 1) How do I stop ns-slapd from crashing?
>>>>
>>>> Good question.
>>>>
>>>>> 2) How do I figure out what precisely is causing the crash? (With
>>>>> various levels of debug logging I get the same log entry.)
>>>>
>>>> You've already used the TRACE level (1) for logging - that's as verbose
>>>> as it gets for this particular operation. The next step would be to try
>>>> to get a core file.
>>>>
>>>>> 3) Is it possible to simply import my initialization ldif without
>>>>> duplication checks?
>>>>
>>>> No.
>>>>
>>>>> Background:
>>>>>
>>>>> At work we have NDS 6.2 (single master on a physical server, virtual
>>>>> machine slaves), and would like to move our directories intact to a 389
>>>>> 2.6 installation via replication.
>>>>
>>>> What platform/OS? 32-bit or 64-bit?
>>>> By NDS 6.2 I'm assuming you mean Netscape Directory Server - by 2.6 I'm
>>>> assuming you mean 1.2.6.a1 (a2 should be hitting the mirrors tomorrow).
>>>
>>> 32 bit
>>>
>>>>> I have already replicated several of our NDS 6.2 subtrees to 389 2.6
>>>>> with no difficulties.
>>>>>
>>>>> I compiled our 389 installation from the source packages downloaded
>>>>> from http://directory.fedoraproject.org/wiki/Source.
>>>>
>>>> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2?
>>>
>>> I used 1.2.6.a1 to compile originally and produce core files to answer
>>> your questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the
>>> same version when trying to initially reproduce something.
>>>
>>>> What compiler flags did you use?
>>>
>>> The Makefile that came out of ./configure had these:
>>>
>>>   CCASFLAGS = -g -O2
>>>   CFLAGS = -g -O2
>>>   CXXFLAGS = -g -O2
>>>
>>> For the plain debug build I edited that to insert these and rebuilt with
>>> make, make install:
>>>
>>>   CCASFLAGS = -g
>>>   CFLAGS = -g
>>>   CXXFLAGS = -g
>>>
>>> (Fair warning that I'm not a programmer, so I'm not entirely sure doing
>>> that was right.)
>>
>> Note that you don't have to edit the Makefile - you can do a make
>> distclean, then run configure like this:
>>
>>   CFLAGS="-g" /path/to/configure ...
>>   make
>>
>> But that looks right, anyway. Note that if you change the flags like
>> this by editing the Makefile, you will have to do a make clean to remove
>> the old object files, so that they will be rebuilt with the new flags.
>
> I did a make clean before rebuilding in this case. For my later debug
> builds I've used your step to re-run configure.
>
>>>> Do you have a core file?
>>>> If so, try using gdb:
>>>>
>>>>   gdb /path/to/ns-slapd /path/to/core.pid
>>>>
>>>> Once in gdb, type the "where" command:
>>>>
>>>>   (gdb) where
>>>
>>> The original crash didn't produce a core file, but I could get one by
>>> attaching gdb later, to both the original build and a debug build.
>>>
>>>>> The underlying platform is:
>>>>>
>>>>> $ uname -a
>>>>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56
>>>>> EDT 2009 i686 i686 i386 GNU/Linux
>>>>> $ cat /etc/redhat-release
>>>>> CentOS release 5.4 (Final)
>>>>>
>>>>> $ free
>>>>>              total       used       free     shared    buffers     cached
>>>>> Mem:       3894000    1336012    2557988          0     144944    1004716
>>>>> -/+ buffers/cache:     186352    3707648
>>>>> Swap:      2031608          0    2031608
>>>>>
>>>>> Procedure To Crash 389's ns-slapd:
>>>>>
>>>>> a) In the NDS 6.2 admin console, create a new replication agreement for
>>>>> the "o=This Big Net" subtree, and choose to "Create consumer
>>>>> initialization file".
>>>>>
>>>>> b) Copy the file to the 389 server.
>>>>>
>>>>> c) In the 389 2.6 admin console for the Directory Server, in the
>>>>> Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and
>>>>> choose "Initialize Database". Use the ldif file copied over.
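Since the original crash reportedly produced no core file, it is worth checking, before re-running the reproduction steps above, that the environment launching ns-slapd allows core dumps at all. A minimal sketch (the sysctl line and paths are illustrative, not from this thread):

```shell
# Sketch: lift the core-size limit in the shell that will launch ns-slapd,
# so a crash can actually leave a core file behind for gdb.
ulimit -c unlimited    # applies to this shell and its children
ulimit -c              # confirm; prints "unlimited"
# As root, cores can be named core.<pid> so they are easy to match to a process:
#   echo 1 > /proc/sys/kernel/core_uses_pid
# After a crash, examine the core with something like:
#   gdb /opt/dirsrv/sbin/ns-slapd /path/to/core.<pid>
```

Note the limit is per-process and inherited, so it must be set in the environment that starts the server (e.g. before the init script runs), not in an unrelated login shell.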
>>>>> The ns-slapd process crashes, and I always get this in
>>>>> /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines:
>>>>>
>>>>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file
>>>>> "/home/cwood/tbn.ldif"
>>>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
>>>>>
>>>>> Other Details:
>>>>>
>>>>> I found two bugs with the str2entry_dupcheck string in them, but they
>>>>> don't seem pertinent:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=548115
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=243488
>>>>>
>>>>> This says that str2entry_dupcheck could be about two things:
>>>>>
>>>>> http://docs.sun.com/source/816-6699-10/ax_errcd.html
>>>>>
>>>>> "While attempting to convert a string entry to an LDAP entry, the
>>>>> server found that the entry has no DN."
>>>>>
>>>>> "The server failed to add a value to the value tree."
>>>>>
>>>>> (But this is an exported database from NDS 6.2, and I'm fairly sure,
>>>>> without reading them all, that every entry will have a DN.)
>>>>
>>>> The log message
>>>>   [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
>>>> is just trace information, not a report of a problem or error.
>>>>
>>>> Does the crash happen almost immediately? Or does it take a while? If
>>>> the problem happens quickly, it would be worthwhile to scan the first
>>>> couple of dozen entries looking for things like:
>>>> - entries without a DN
>>>> - attributes without a value
>>>
>>> I checked, and I couldn't see any data errors of this type.
>>>
>>>>> If 389 is trying to check for duplicate entries, perhaps there are
>>>>> simply too many DNs?
>>>>>
>>>>> $ grep '^dn:' tbn.ldif | wc -l
>>>>> 636985
>>>>> $ ls -lh tbn.ldif
>>>>> -rw-r--r-- 1 cwood cwood 755M Mar 3 11:24 tbn.ldif
>>>>
>>>> No. The server should be able to handle this much data easily. And it
>>>> must check for duplicate entries.
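The "entries without a DN / attributes without a value" scan suggested above can be mechanized. A hedged sketch using awk over blank-line-separated LDIF records - the sample file here is invented for illustration (run the awk against tbn.ldif instead), and it ignores LDIF line-folding, so treat it as a rough first pass:

```shell
# Write a tiny made-up LDIF so the check below has something to chew on.
cat > /tmp/sample.ldif <<'EOF'
dn: ou=People, o=Example
objectClass: organizationalUnit
ou: People

cn: record-with-no-dn
sn:
EOF

awk '
  BEGIN { RS = ""; FS = "\n" }    # one record per blank-line-separated block
  $1 !~ /^dn:/ { print "record " NR " has no dn" }
  {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[A-Za-z][^:]*:[ \t]*$/)   # "attr:" with nothing after it
        print "record " NR " has an empty value: " $i
  }
' /tmp/sample.ldif
# -> record 2 has no dn
# -> record 2 has an empty value: sn:
```

To look at only the first couple of dozen entries of a 755M file, pipe `head -200 tbn.ldif` into the same awk script rather than scanning the whole thing.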
>>>>> Per the instructions here:
>>>>>
>>>>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting
>>>>>
>>>>> I set my debug logging first to 24579:
>>>>>
>>>>>     1  Trace function calls
>>>>>     2  Debug packet handling
>>>>>  8192  Replication debugging
>>>>> 16384  Critical messages
>>>>>
>>>>> Then for the next try at reading logs I set it to 90115, the above plus:
>>>>>
>>>>> 65536  Plug-in debugging
>>>>>
>>>>> However, every time the log ended with the same set of lines noted above.
>>>>
>>>> 1 (Trace) is really the best level for this particular problem, and as
>>>> you have found, its output is limited here.
>>>>
>>>> I think the next step would be to build the server with full debugging
>>>> information (use -g and omit -O2 or any other -Ox) and get a stack trace
>>>> with full debug information.
>>>
>>> After recompiling I got different results. Still a crash, but after a
>>> different sequence of actions. This time I didn't create the other
>>> subtrees; I went straight for the "TBN Net" subtree. I couldn't reproduce
>>> the immediate crash, but it did crash after I tried to stop the import
>>> (rather than watch the errors noted below).
>>>
>>> I went back to my original compile (-g -O2), and also couldn't reproduce
>>> the instant crash under gdb. Here's the "where" output for that:
>>>
>>> (gdb) where
>>> #0  0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0
>>> #1  0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6
>>> #2  0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so
>>> #3  0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at ldap/servers/slapd/back-ldbm/cache.c:585
>>> #4  0x00b46d9d in import_main_offline (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1300
>>> #5  0x00b47e4d in import_main (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1386
>>> #6  0x001fe6ed in ??
>>> () from /usr/lib/libnspr4.so
>>> #7  0x0022f5ab in start_thread () from /lib/libpthread.so.0
>>> #8  0x00d84cfe in clone () from /lib/libc.so.6
>>>
>>> Everything below is for my new compile, with -g but no -O2.
>>>
>>> I used this to attach gdb to the processes:
>>>
>>>   gdb /opt/dirsrv/sbin/ns-slapd 12799
>>>
>>> I got a large number of these in the error log. Possibly this is one
>>> source of my problem: needing to move the schema from Netscape Directory
>>> Server 6.2 before I move the subtrees.
>>>
>>> [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob,
>>> ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown
>>> object class "ldapauthvirtuseroc"
>>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping
>>> entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com,
>>> ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899
>>> of file "/home/cwood/tbn.ldif"
>>
>> This is definitely a source of your problems. You must update the
>> schema before you can import the data. That said, the server should not
>> crash either.
>
> I was having some trouble updating just the schema, so I grossly hacked up
> migrate-ds.pl, DSMigration::fixup99user, and DSMigration::migrateSchema.
> With that I managed to get the schema from the old Netscape Directory
> Server 6.2 99user.ldif into a new 389 1.2.5 98mycompany.ldif. The new
> schema showed through in the admin GUI, so I take it that I was successful
> in getting my schema updated.
>
> I've been tearing my hair out trying to reproduce the crash and/or get my
> data imported, even with the schema added. (I eyeballed the schema and a
> failing directory entry, and they seemed to match to the naked eye.)

Even though the server should not crash, I believe that if we could get the
syntax problem taken care of, you could import your data without a crash
(according to the stack trace below).
The message appears to be:

  entry "ldapAuthLogin=username, ldapAuthRealmName=mycompany.com,
  ou=UsersByRealm, o=This Realm" which violates attribute syntax, ending
  line 4701511 of file "/home/cwood/dump.ld

Would it be possible for you to post excerpts from that file at or around
that line/entry, along with the definitions of the attributes used?

> Just when I gave up on trying to get it to crash, it did so once more (and
> not since then). However, I did get this; not sure if it's of any use:
>
> Program received signal SIGABRT, Aborted.
> [Switching to Thread 0x9bdd9b90 (LWP 10174)]
> 0x003cc402 in __kernel_vsyscall ()
> (gdb) where
> #0  0x003cc402 in __kernel_vsyscall ()
> #1  0x001a3df0 in raise () from /lib/libc.so.6
> #2  0x001a5701 in abort () from /lib/libc.so.6
> #3  0x001dc28b in __libc_message () from /lib/libc.so.6
> #4  0x001e4595 in _int_free () from /lib/libc.so.6
> #5  0x001e66ac in _int_realloc () from /lib/libc.so.6
> #6  0x001e7276 in realloc () from /lib/libc.so.6
> #7  0x00da781e in slapi_ch_realloc (
>     block=0x8b886170 "p\003��p\003��ha\210\213ha\210\213g entry
>     \"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com,
>     ou=UsersByRealm, o=This Realm\" which violates attribute syntax,
>     ending line 4701511 of file \"/home/cwood/dump.ldi"...,
>     size=6752) at ldap/servers/slapd/ch_malloc.c:199
> #8  0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e "%s") at ldap/servers/slapd/task.c:254
> #9  0x00bd7cce in import_log_notice (job=0xb3a28420, format=0xc15934 "WARNING: skipping entry \"%s\" which violates attribute syntax, ending line %d of file \"%s\"") at ldap/servers/slapd/back-ldbm/import.c:193
> #10 0x00bdc7b9 in import_producer (param=0x8f977d0) at ldap/servers/slapd/back-ldbm/import-threads.c:601
> #11 0x001686ed in ??
> () from /usr/lib/libnspr4.so
> #12 0x002c55ab in start_thread () from /lib/libpthread.so.0
> #13 0x0024ccfe in clone () from /lib/libc.so.6

This is definitely a bug - please file one at
https://bugzilla.redhat.com/enter_bug.cgi?product=389 and add the above
stack trace and any other information. This looks like a double free, or a
free/realloc of un-malloc'd memory.

Would it be possible for you to run the server using valgrind? I have a
script at http://github.com/richm/scripts called ns-slapd.sh that you can
use to replace ns-slapd (moving it to ns-slapd.orig) and run the
server/import/whatever using valgrind.

>> If you run the import job in gdb, and it crashes, gdb should report a
>> SIGSEGV. Since the import is multi-threaded, the crash may not be in
>> the current thread. In this case, you should dump the stacks of all
>> threads like this:
>>
>>   (gdb) thread apply all bt
>
> Sadly I only remembered gdb "where", but if it crashes again I'll get this
> part right.
>
>>> When I tried to cancel my import, the log ended here and ns-slapd
>>> crashed:
>>>
>>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import
>>> threads...
>>>
>>> It took me a bit to figure out how to get a core file in gdb, but I
>>> eventually got one.
>>>
>>> This is the result of "back" when I didn't have a core file:
>>>
>>> (gdb) back
>>> #0  0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 "%s") at ldap/servers/slapd/task.c:231
>>> #1  0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
>>> #2  0x00a9dad7 in import_main_offline (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1196
>>> #3  0x00a9e01f in import_main (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1386
>>> #4  0x001fe6ed in ??
>>> () from /usr/lib/libnspr4.so
>>> #5  0x002165ab in start_thread () from /lib/libpthread.so.0
>>> #6  0x0081dcfe in clone () from /lib/libc.so.6
>>>
>>> This is what gdb printed to the terminal when the SIGSEGVs came in:
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0xb26fdb90 (LWP 13323)]
>>> 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 "%s") at ldap/servers/slapd/task.c:231
>>> 231         len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0);
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0xb26fdb90 (LWP 14197)]
>>> 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 "%s") at ldap/servers/slapd/task.c:231
>>> 231         len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0);
>>>
>>> When I figured out getting a core dump with my debug build, this is
>>> "where":
>>>
>>> (gdb) where
>>> #0  0x00d2309b in strlen () from /lib/libc.so.6
>>> #1  0x00d22e05 in strdup () from /lib/libc.so.6
>>> #2  0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) at ldap/servers/slapd/ch_malloc.c:277
>>> #3  0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at ldap/servers/slapd/dn.c:1229
>>> #4  0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, old_pw=0x0) at ldap/servers/slapd/modify.c:582
>>> #5  0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:526
>>> #6  0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:416
>>> #7  0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626
>>> #8  0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at ldap/servers/slapd/task.c:299
>>> #9  0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 "%s") at ldap/servers/slapd/task.c:264
>>> #10 0x00b80dd6 in import_log_notice (job=0xb0009318,
>>> format=0xbc8cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
>>> #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1196
>>> #12 0x00b8301f in import_main (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1386
>>> #13 0x002516ed in ?? () from /usr/lib/libnspr4.so
>>> #14 0x001e65ab in start_thread () from /lib/libpthread.so.0
>>> #15 0x00d84cfe in clone () from /lib/libc.so.6
>>
>> This is a known problem:
>> Bug 515805 <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop
>> "Initialize Database" crashes the server
-- 
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users