On Mon, Mar 15, 2010 at 12:57:08PM -0600, Rich Megginson wrote:
> Christopher Wood wrote:
> > On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote:
> >   
> >> Christopher Wood wrote:
> >>     
> >>> On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote:
> >>>   
> >>>       
> >>>> Christopher Wood wrote:
> >>>>     
> >>>>         
> >>>>> I'm just getting started with 389 Directory Server (at work), and I've 
> >>>>> run into an issue that I'm not certain how to troubleshoot. I would 
> >>>>> greatly appreciate any assistance or tips you could offer, especially 
> >>>>> on where to look to see what's failing.
> >>>>>
> >>>>> Also, I apologize in advance for changing strings related to my 
> >>>>> employer's directory names and such, as I'm not comfortable with 
> >>>>> leaking that level information to a public list.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> As well you should be - you should always obscure sensitive information 
> >>>> like this.
> >>>>     
> >>>>         
> >>>>> Overview:
> >>>>>
> >>>>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other 
> >>>>> subtrees are fine.
> >>>>>
> >>>>>
> >>>>> Top-Level Questions:
> >>>>>
> >>>>> 1) How do I stop ns-slapd from crashing?
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> Good question.
> >>>>     
> >>>>         
> >>>>> 2) How do I figure out what precisely is causing the crash? (With 
> >>>>> various levels of debug logging I get the same log entry.)
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> You've already used the TRACE level (1) for logging - that's as verbose 
> >>>> as it gets for this particular operation.  Next step would be to try to 
> >>>> get a core file.
> >>>>     
> >>>>         
> >>>>> 3) Is it possible to simply import my initialization ldif without 
> >>>>> duplication checks?
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> No.
> >>>>     
> >>>>         
> >>>>> Background:
> >>>>>
> >>>>> At work we have NDS 6.2 (single master on a physical server, virtual 
> >>>>> machine slaves), and would like to move our directories intact to a 389 
> >>>>> 2.6 installation via replication.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> What platform/OS?  32-bit or 64-bit?  By NDS 6.2 I'm assuming you mean 
> >>>> Netscape Directory Server - by 2.6 I'm assuming you mean 1.2.6.a1 (a2 
> >>>> should be hitting the mirrors tomorrow).
> >>>>     
> >>>>         
> >>> 32 bit
> >>>
> >>>   
> >>>       
> >>>>> I already have replicated several of our NDS 6.2 subtrees to 389 2.6 
> >>>>> with no difficulties.
> >>>>>
> >>>>> I compiled our 389 installation from the source packages downloaded 
> >>>>> from http://directory.fedoraproject.org/wiki/Source.
> >>>>>       
> >>>>>           
> >>>> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2?
> >>>>     
> >>>>         
> >>> I used 1.2.6.a1 to compile originally and produce core files to answer 
> >>> your questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the 
> >>> same version when trying to initially reproduce something.
> >>>
> >>>   
> >>>       
> >>>> What compiler flags did you use?
> >>>>     
> >>>>         
> >>> The makefile that came out of ./configure had these:
> >>>
> >>> CCASFLAGS = -g -O2
> >>> CFLAGS = -g -O2
> >>> CXXFLAGS = -g -O2
> >>>
> >>> For the plain debug build I edited that to insert these and rebuilt with 
> >>> make, make install:
> >>>
> >>> CCASFLAGS = -g
> >>> CFLAGS = -g
> >>> CXXFLAGS = -g
> >>>
> >>> (Fair warning that I'm not a programmer, so I'm not entirely sure doing 
> >>> that was right.)
> >>>
> >>>   
> >>>       
> >> Note that you don't have to edit the Makefile - you can do a make 
> >> distclean, then run configure like this:
> >>  > CFLAGS="-g" /path/to/configure ...
> >>  > make
> >>
> >> But that looks right, anyway.  Note that if you change the flags like 
> >> this by editing the makefile, you will have to do a make clean to remove 
> >> the old object files, so that they will be rebuilt with the new flags.
> >>     
> >
> > I did a make clean before rebuilding in this case. For my later debug 
> > builds I've used your step to re-run configure.
> >
> >   
> >>>> Do you have a core file?  If so, try using gdb
> >>>> gdb /path/to/ns-slapd /path/to/core.pid
> >>>> once in gdb, type the "where" command
> >>>> (gdb) where
> >>>>     
> >>>>         
> >>> The original crash didn't produce a core file, but I could get one by 
> >>> attaching gdb later, to both the original build and a debug build.
> >>>
> >>>   
> >>>       
> >>>>> The underlying platform is:
> >>>>>
> >>>>> $ uname -a
> >>>>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 
> >>>>> EDT 2009 i686 i686 i386 GNU/Linux
> >>>>> $ cat /etc/redhat-release 
> >>>>> CentOS release 5.4 (Final)
> >>>>>
> >>>>> $ free
> >>>>>              total       used       free     shared    buffers     
> >>>>> cached
> >>>>> Mem:       3894000    1336012    2557988          0     144944    
> >>>>> 1004716
> >>>>> -/+ buffers/cache:     186352    3707648
> >>>>> Swap:      2031608          0    2031608
> >>>>>
> >>>>>
> >>>>> Procedure To Crash 389's ns-slapd:
> >>>>>
> >>>>> a) In the NDS 6.2 admin console, create a new replication agreement for 
> >>>>> the "o=This Big Net" subtree, and choose to "Create consumer 
> >>>>> initialization file".
> >>>>>
> >>>>> b) Copy the file to the 389 server.
> >>>>>
> >>>>> c) In the 389 2.6 admin console for the Directory Server, in the 
> >>>>> Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and 
> >>>>> choose "Initialize Database". Use the ldif file copied over.
> >>>>>
> >>>>> The ns-slapd process crashes, and I always get this in 
> >>>>> /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines:
> >>>>>
> >>>>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file 
> >>>>> "/home/cwood/tbn.ldif"
> >>>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
> >>>>>
> >>>>>
> >>>>> Other Details:
> >>>>>
> >>>>>
> >>>>> I found two bugs with the str2entry_dupcheck string in it, but they 
> >>>>> don't seem pertinent:
> >>>>>
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=548115
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=243488
> >>>>>
> >>>>>
> >>>>> This says that str2entry_dupcheck could be about two things:
> >>>>>
> >>>>> http://docs.sun.com/source/816-6699-10/ax_errcd.html
> >>>>>
> >>>>> "While attempting to convert a string entry to an LDAP entry, the 
> >>>>> server found that the entry has no DN."
> >>>>>
> >>>>> "The server failed to add a value to the value tree."
> >>>>>
> >>>>> (But this is an exported database from NDS 6.2, and I'm fairly sure, 
> >>>>> without reading them all, that every entry will have a DN.)
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> The log message
> >>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
> >>>>
> >>>> is just trace information, not a report of a problem or error.
> >>>>
> >>>> Does the crash happen almost immediately?  Or does it take a while?  If 
> >>>> the problem happens quickly, it would be worthwhile to scan the first 
> >>>> couple of dozen entries looking for things like - entries without a DN - 
> >>>> attributes without a value
> >>>>     
> >>>>         
> >>> I checked, and I couldn't see any data errors of this type.
> >>>  
> >>>   
> >>>       
> >>>>> If 389 is trying to check for duplicate entries, perhaps there are 
> >>>>> simply too many DNs?
> >>>>>
> >>>>> $ grep '^dn:' tbn.ldif | wc -l
> >>>>> 636985
> >>>>> $ ls -lh acc.ldif 
> >>>>> -rw-r--r-- 1 cwood cwood 755M Mar  3 11:24 tbn.ldif
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> No.  The server should be able to handle this much data easily.  And it 
> >>>> must check for duplicate entries.
> >>>>     
> >>>>         
> >>>>> Per the instructions here:
> >>>>>
> >>>>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting
> >>>>>
> >>>>> I set my debug logging first to 24579:
> >>>>>
> >>>>> 1        Trace function calls 
> >>>>> 2        Debug packet handling 
> >>>>> 8192     Replication debugging 
> >>>>> 16384    Critical messages
> >>>>>
> >>>>> Then for the next try at reading logs I set it to 90115, the above plus:
> >>>>>
> >>>>> 65536    Plug-in debugging
> >>>>>
> >>>>> However, every time the log ended with the same set of lines noted 
> >>>>> above.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> 1 Trace is really the best for this particular problem, and as you have 
> >>>> found it is limited for this particular problem.
> >>>>
> >>>> I think the next step would be to build the server with full debugging 
> >>>> information (use -g and omit -O2 or any other -Ox) and get a stack trace 
> >>>> with full debug information.
> >>>>     
> >>>>         
> >>> After recompiling I got different results. Still a crash, but after a 
> >>> different sequence of actions. This time I didn't create the other 
> >>> subtrees, I went straight for the "TBN Net" subtree. I couldn't reproduce 
> >>> the immediate crash, but it did crash after I tried to stop the import 
> >>> (rather than watch the errors noted below).
> >>>
> >>> I went back to my original compile (-g -O2), and also couldn't reproduce 
> >>> the instant crash under gdb. Here's the "where" output for that:
> >>>
> >>> (gdb) where
> >>> #0  0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0
> >>> #1  0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6
> >>> #2  0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so
> >>> #3  0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at 
> >>> ldap/servers/slapd/back-ldbm/cache.c:585
> >>> #4  0x00b46d9d in import_main_offline (arg=0xab704268) at 
> >>> ldap/servers/slapd/back-ldbm/import.c:1300
> >>> #5  0x00b47e4d in import_main (arg=0xab704268) at 
> >>> ldap/servers/slapd/back-ldbm/import.c:1386
> >>> #6  0x001fe6ed in ?? () from /usr/lib/libnspr4.so
> >>> #7  0x0022f5ab in start_thread () from /lib/libpthread.so.0
> >>> #8  0x00d84cfe in clone () from /lib/libc.so.6
> >>>
> >>> Everything below is for my new compile, with -g but no -O2.
> >>>
> >>> I used this to attach gdb to the processes:
> >>>
> >>> gdb /opt/dirsrv/sbin/ns-slapd 12799
> >>>
> >>> I got a large number of these in the error log. Possibly this is one 
> >>> source of my problem, needing to move the schema from Netscape Directory 
> >>> Server 6.2 before I move the subtrees.
> >>>
> >>> [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, 
> >>> ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown 
> >>> object class "ldapauthvirtuseroc"
> >>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping 
> >>> entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, 
> >>> ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899 of 
> >>> file "/home/cwood/tbn.ldif"
> >>>   
> >>>       
> >> This is definitely a source of your problems.  You must update the 
> >> schema before you can import the data.  Although the server should not 
> >> crash either.
> >>     
> >
> > I was having some trouble updating just the schema, so I grossly hacked up 
> > migrate-ds.pl, DSMigration::fixup99user and DSMigration::migrateSchema. 
> > With that I managed to get the schema from the old Netscape Directory 
> > Server 6.2 99user.ldif to a new 389 1.2.5 98mycompany.ldif. The new schema 
> > showed through in the admin gui, so I take it that I was successful in 
> > getting my schema updated.
> >
> > I've been tearing my hair out trying to reproduce the crash and/or get my 
> > data imported, even with the schema added. (I eyeballed the schema and a 
> > failing directory entry, and they seemed to match to the naked eye.)
> Even though the server should not crash, I believe that if we could get 
> the syntax problem taken care of, you could import your data without a 
> crash (according to the stack trace below).
> 
> The message appears to be entry \"ldapAuthLogin=username, 
> ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=This Realm\" which 
> violates attribute syntax, ending line 4701511 of file \"/home/cwood/dump.ld
> 
> Would it be possible for you to post excerpts from that file at or 
> around that line/entry, along with the definition of the attributes used?

Here's the entry, data munged:

# entry-id: 1156136
dn: ldapAuthLogin=cwood, ldapAuthRealmName=1.mycompany.com, ou=UsersByRealm, 
o=maintree
modifyTimestamp: 20090611221103Z
modifiersName: cn=ldapauth provisioner,cn=config,o=acc global net
ldapAuthLogin: cwood
ldapAuthRealmName: 1.mycompany.com
objectClass: top
objectClass: ldapAuthVirtUserOC
ldapAuthMailHost: mail.mycompany.com
ldapAuthRadiusDNSProfile: dns one
ldapAuthBandwidthLimit: x503
ldapAuthForwarding:
ldapAuthControlCodePtr: 1234567
creatorsName: cn=ldapauth provisioner,cn=config,o=maintree
createTimestamp: 20070716131008Z
nsUniqueId: c5760388-1dd111b2-8022c31d-45260000

I twigged to your point about definition of attributes, and tried to add this 
via ldapmodify (now that I have the right schema).

I got this error from 389:

adding new entry "ldapAuthLogin=cwood, ldapAuthRealmName=1.mycompany.com, 
ou=UsersByRealm, o=maintree"
ldapmodify: Invalid syntax (21)
        additional info: ldapAuthControlCodePtr: value #0 invalid per syntax

The interesting thing about ldapAuthControlCodePtr is that the value is alway 
an integer, but in our Netscape Directory Server 6.2 installation it's in the 
schema as a DN. This has of course worked fine for a number of years, but 389 
seems to want a DN to actually be a DN in an ldapmodify (which is just fine by 
me).

I changed the ldapAuthControlCodePtr attribute to be an integer in 389, and I 
was able to ldapmodify the entry into 389 and find it via ldapsearch.

> > Just when I gave up on trying to get it to crash, it did so once more (and 
> > not since then). However, I did get this, not sure if it's of any use:
> >
> > Program received signal SIGABRT, Aborted.
> > [Switching to Thread 0x9bdd9b90 (LWP 10174)]
> > 0x003cc402 in __kernel_vsyscall ()
> > (gdb) where
> > #0  0x003cc402 in __kernel_vsyscall ()
> > #1  0x001a3df0 in raise () from /lib/libc.so.6
> > #2  0x001a5701 in abort () from /lib/libc.so.6
> > #3  0x001dc28b in __libc_message () from /lib/libc.so.6
> > #4  0x001e4595 in _int_free () from /lib/libc.so.6
> > #5  0x001e66ac in _int_realloc () from /lib/libc.so.6
> > #6  0x001e7276 in realloc () from /lib/libc.so.6
> > #7  0x00da781e in slapi_ch_realloc (
> >     block=0x8b886170 "p\003��p\003��ha\210\213ha\210\213g entry 
> > \"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, 
> > o=This Realm\" which violates attribute syntax, ending line 4701511 of file 
> > \"/home/cwood/dump.ldi"...,
> >     size=6752) at ldap/servers/slapd/ch_malloc.c:199
> > #8  0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e 
> > "%s")
> >     at ldap/servers/slapd/task.c:254
> > #9  0x00bd7cce in import_log_notice (job=0xb3a28420,
> >     format=0xc15934 "WARNING: skipping entry \"%s\" which violates 
> > attribute syntax, ending line %d of file \"%s\"") at 
> > ldap/servers/slapd/back-ldbm/import.c:193
> > #10 0x00bdc7b9 in import_producer (param=0x8f977d0)
> >     at ldap/servers/slapd/back-ldbm/import-threads.c:601
> > #11 0x001686ed in ?? () from /usr/lib/libnspr4.so
> > #12 0x002c55ab in start_thread () from /lib/libpthread.so.0
> > #13 0x0024ccfe in clone () from /lib/libc.so.6
> >   
> This is definitely a bug - please file a bug at 
> https://bugzilla.redhat.com/enter_bug.cgi?product=389 and please add the 
> above stack trace and any other information.

Will do.

> This looks like a double free, or a free/realloc of un-malloc'd memory.  
> Would it be possible for you to run the server using valgrind?
> I have a script here http://github.com/richm/scripts called ns-slapd.sh 
> that you can use to replace ns-slapd (moving it to ns-slapd.orig) and 
> run the server/import/whatever using valgrind.

I'll give it a go.

> >  
> >   
> >> If you run the import job in gdb, and it crashes, gdb should report a 
> >> SIGSEGV.  Since the import is multi-threaded, the crash may not be in 
> >> the current thread.  In this case, you should dump the stack of all 
> >> threads like this:
> >> (gdb) thread apply all bt
> >>     
> >
> > Sadly I only remembered gdb "where", but if it crashes again I'll get this 
> > part right.
> >
> >   
> >>> When I tried to cancel my import the log ended here, and ns-slapd crashed.
> >>>
> >>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import 
> >>> threads...
> >>>
> >>> It took me a bit to figure out how to get a core file in gdb, but I 
> >>> eventually got one.
> >>>
> >>> This is the result of "back" when I didn't have a core file:
> >>>
> >>> (gdb) back
> >>> #0  0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 
> >>> "%s") at ldap/servers/slapd/task.c:231
> >>> #1  0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff 
> >>> "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
> >>> #2  0x00a9dad7 in import_main_offline (arg=0x8a334f0) at 
> >>> ldap/servers/slapd/back-ldbm/import.c:1196
> >>> #3  0x00a9e01f in import_main (arg=0x8a334f0) at 
> >>> ldap/servers/slapd/back-ldbm/import.c:1386
> >>> #4  0x001fe6ed in ?? () from /usr/lib/libnspr4.so
> >>> #5  0x002165ab in start_thread () from /lib/libpthread.so.0
> >>> #6  0x0081dcfe in clone () from /lib/libc.so.6
> >>>
> >>> This is what gdb printed to the terminal when the SIGSEGVs came in:
> >>>
> >>> Program received signal SIGSEGV, Segmentation fault.
> >>> [Switching to Thread 0xb26fdb90 (LWP 13323)]
> >>> 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 
> >>> "%s") at ldap/servers/slapd/task.c:231
> >>> 231         len = 2 + strlen(buffer) + (task->task_log ? 
> >>> strlen(task->task_log) : 0);
> >>>
> >>> Program received signal SIGSEGV, Segmentation fault.
> >>> [Switching to Thread 0xb26fdb90 (LWP 14197)]
> >>> 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 
> >>> "%s") at ldap/servers/slapd/task.c:231
> >>> 231         len = 2 + strlen(buffer) + (task->task_log ? 
> >>> strlen(task->task_log) : 0);
> >>>
> >>> When I figured out getting a core dump with my debug build, this is 
> >>> "where":
> >>>
> >>> (gdb) where
> >>> #0  0x00d2309b in strlen () from /lib/libc.so.6
> >>> #1  0x00d22e05 in strdup () from /lib/libc.so.6
> >>> #2  0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) 
> >>> at ldap/servers/slapd/ch_malloc.c:277
> >>> #3  0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at 
> >>> ldap/servers/slapd/dn.c:1229
> >>> #4  0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, 
> >>> old_pw=0x0) at ldap/servers/slapd/modify.c:582
> >>> #5  0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at 
> >>> ldap/servers/slapd/modify.c:526
> >>> #6  0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at 
> >>> ldap/servers/slapd/modify.c:416
> >>> #7  0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of 
> >>> bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626
> >>> #8  0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at 
> >>> ldap/servers/slapd/task.c:299
> >>> #9  0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 
> >>> "%s") at ldap/servers/slapd/task.c:264
> >>> #10 0x00b80dd6 in import_log_notice (job=0xb0009318, format=0xbc8cff 
> >>> "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
> >>> #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at 
> >>> ldap/servers/slapd/back-ldbm/import.c:1196
> >>> #12 0x00b8301f in import_main (arg=0xb0009318) at 
> >>> ldap/servers/slapd/back-ldbm/import.c:1386
> >>> #13 0x002516ed in ?? () from /usr/lib/libnspr4.so
> >>> #14 0x001e65ab in start_thread () from /lib/libpthread.so.0
> >>> #15 0x00d84cfe in clone () from /lib/libc.so.6
> >>>   
> >>>       
> >> This is a known problem
> >> *Bug 515805* <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop 
> >> "initialize Database" crashes the server
> >>     
> >>> --
> >>> 389 users mailing list
> >>> 389-us...@lists.fedoraproject.org
> >>> https://admin.fedoraproject.org/mailman/listinfo/389-users
> >>>   
> >>>       
> >> --
> >> 389 users mailing list
> >> 389-us...@lists.fedoraproject.org
> >> https://admin.fedoraproject.org/mailman/listinfo/389-users
> >>     
> > --
> > 389 users mailing list
> > 389-us...@lists.fedoraproject.org
> > https://admin.fedoraproject.org/mailman/listinfo/389-users
> 
> --
> 389 users mailing list
> 389-us...@lists.fedoraproject.org
> https://admin.fedoraproject.org/mailman/listinfo/389-users
--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Reply via email to