If your application is single-threaded, you could try the bsdmalloc library. It is a fast malloc, but it is not multi-thread safe, and it also tends to use more memory than the default malloc. For a comparison of the different malloc libraries, see the NOTES section at the end of umem_alloc(3MALLOC).
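A quick way to A/B the allocators before committing to one at link time
is LD_PRELOAD. The library paths below are from memory, so double-check
them on your release (some builds install only the .so.1 names):

$ LD_PRELOAD=/usr/lib/libbsdmalloc.so ./xml    # single-threaded only
$ LD_PRELOAD=/usr/lib/libumem.so ./xml         # slab allocator
$ LD_PRELOAD=/usr/lib/libmtmalloc.so ./xml     # multi-threaded malloc

Each run interposes that library's malloc/free over the default libc
one, so you can compare timings without rebuilding.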
I got the following result with your example code:

$ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
$ ./xml
100000 iter in 21.445672 sec
$
$ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c -lbsdmalloc
$ ./xml
100000 iter in 12.761969 sec
$

I got similar results using Sun Studio 12. Again, bsdmalloc is not
multi-thread safe, so use it with caution.

HTH,
David

----- Original Message -----
From: Matty <[EMAIL PROTECTED]>
Date: Wednesday, April 30, 2008 2:22 pm
Subject: Re: [perf-discuss] Application runs almost 2x slower on Nevada than Linux
To: perf-discuss@opensolaris.org, [EMAIL PROTECTED]

> On Mon, Apr 28, 2008 at 5:01 PM, <[EMAIL PROTECTED]> wrote:
> > Hey Dude,
> >
> > I pulled down a copy of your test program and ran a few experiments.
> >
> > $ time ./xml
> > 100000 iter in 22.715982 sec
> >
> > real    0m22.721s
> > user    0m22.694s
> > sys     0m0.007s
> >
> > This seems to indicate that all of our time is being spent in
> > usermode, so whatever it is in Solaris that is slower than Linux,
> > it's not the operating system.
> >
> > Just to double-check, I used the following DTrace script to look at
> > the time this application spends in syscalls:
> >
> > # dtrace -n 'syscall:::entry /execname == "xml"/ {
> >       self->traced = timestamp; self->func = probefunc; }
> >   syscall:::return /self->traced/ {
> >       @a[self->func] = sum(timestamp - self->traced);
> >       @b[self->func] = count();
> >       self->func = (char *)0; self->traced = 0; }'
> >
> > (This returns the time xml spends in syscalls as well as the number
> > of syscalls made.)
> >
> > Time (nanoseconds):
> >
> >   getpid           1978
> >   sysconfig        2229
> >   sigpending       3071
> >   sysi86           3529
> >   getrlimit        4234
> >   setcontext       5763
> >   fstat64          7042
> >   close           16606
> >   write           19898
> >   getcwd          21302
> >   ioctl           23668
> >   munmap          25757
> >   brk             28445
> >   open            40712
> >   resolvepath     51777
> >   xstat           57616
> >   mmap           159275
> >   memcntl        267070
> >
> > Number of invocations:
> >
> >   fstat64       1
> >   getcwd        1
> >   getpid        1
> >   getrlimit     1
> >   ioctl         1
> >   sigpending    1
> >   sysconfig     1
> >   sysi86        1
> >   write         1
> >   setcontext    2
> >   memcntl       6
> >   close         7
> >   munmap        8
> >   open          8
> >   resolvepath   9
> >   xstat         9
> >   brk          10
> >   mmap         32
> >
> > This agrees with the output from time. You're spending a minuscule
> > amount of system time performing memory operations.
> >
> > The next place to look would be libxml2. What library versions do
> > you have installed on the different machines? If the library versions
> > don't match, or you've compiled with a different set of options, or
> > used a different compiler to build the library, you'll get different
> > results for the benchmark.
> >
> > I ran my test on snv_84. It has version 2.6.23 installed.
> >
> > Another approach would be to use DTrace's profile provider. I
> > obtained usable results by running the following:
> >
> > # dtrace -n 'profile-1001 /execname == "xml"/ { @a[ustack()] = count(); }
> >       END { trunc(@a, 20); }' -c /tmp/xml
> >
> > This returns the 20 most frequent stack traces (and the number of
> > times they occurred) when DTrace fires a profile probe at 1001 Hz.
> >
> > xmlDictLookup shows up as a popular function, but that may be a red
> > herring.
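> >
> > (For context if you haven't seen the test program: judging from the
> > stacks, it's essentially a tight xmlReadMemory()/xmlFreeDoc() loop.
> > A rough sketch follows; this is my reconstruction, not the actual
> > xml.c, and the input document is made up:
> >
> >   #include <stdio.h>
> >   #include <stdlib.h>
> >   #include <string.h>
> >   #include <sys/time.h>
> >   #include <libxml/parser.h>
> >
> >   int
> >   main(int argc, char **argv)
> >   {
> >           /* Made-up input; the real test presumably parses a
> >            * larger document. */
> >           const char *buf =
> >               "<root><a attr=\"1\">text</a><b>more text</b></root>";
> >           int i, iters = (argc > 1) ? atoi(argv[1]) : 100000;
> >           struct timeval start, end;
> >
> >           gettimeofday(&start, NULL);
> >           for (i = 0; i < iters; i++) {
> >                   /* Parse from memory, then tear the tree down. */
> >                   xmlDocPtr doc = xmlReadMemory(buf, strlen(buf),
> >                       "test.xml", NULL, 0);
> >                   xmlFreeDoc(doc);
> >           }
> >           gettimeofday(&end, NULL);
> >
> >           printf("%d iter in %f sec\n", iters,
> >               (end.tv_sec - start.tv_sec) +
> >               (end.tv_usec - start.tv_usec) / 1e6);
> >           xmlCleanupParser();
> >           return (0);
> >   }
> >
> > That's the shape of the work being measured below.)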
> > I'll include the top 5 here, but you can run this command, or adjust
> > the truncation, to view as many as you'd like:
> >
> >   libc.so.1`_free_unlocked+0x4c
> >   libc.so.1`free+0x2b
> >   libxml2.so.2`xmlFreeNodeList+0x19e
> >   libxml2.so.2`xmlFreeNodeList+0xb6
> >   libxml2.so.2`xmlFreeNodeList+0xb6
> >   libxml2.so.2`xmlFreeDoc+0xac
> >   xml`main+0x55
> >   xml`_start+0x80
> >    61
> >
> >   libxml2.so.2`xmlDictLookup+0x60
> >   libxml2.so.2`xmlParseNCName+0xab
> >   libxml2.so.2`xmlParseQName+0x44
> >   libxml2.so.2`xmlParseAttribute2+0x54
> >   libxml2.so.2`xmlParseStartTag2+0x246
> >   libxml2.so.2`xmlParseElement+0x99
> >   libxml2.so.2`xmlParseContent+0x12c
> >   libxml2.so.2`xmlParseElement+0x134
> >   libxml2.so.2`xmlParseContent+0x12c
> >   libxml2.so.2`xmlParseElement+0x134
> >   libxml2.so.2`xmlParseDocument+0x33b
> >   libxml2.so.2`xmlDoRead+0x6b
> >   libxml2.so.2`xmlReadMemory+0x36
> >   xml`main+0x4c
> >   xml`_start+0x80
> >    66
> >
> >   libxml2.so.2`xmlParseStartTag2+0xb7e
> >   libxml2.so.2`xmlParseElement+0x99
> >   libxml2.so.2`xmlParseContent+0x12c
> >   libxml2.so.2`xmlParseElement+0x134
> >   libxml2.so.2`xmlParseContent+0x12c
> >   libxml2.so.2`xmlParseElement+0x134
> >   libxml2.so.2`xmlParseDocument+0x33b
> >   libxml2.so.2`xmlDoRead+0x6b
> >   libxml2.so.2`xmlReadMemory+0x36
> >   xml`main+0x4c
> >   xml`_start+0x80
> >    66
> >
> >   libc.so.1`_free_unlocked+0x53
> >   libc.so.1`free+0x2b
> >   libxml2.so.2`xmlFreeNodeList+0x19e
> >   libxml2.so.2`xmlFreeNodeList+0xb6
> >   libxml2.so.2`xmlFreeNodeList+0xb6
> >   libxml2.so.2`xmlFreeDoc+0xac
> >   xml`main+0x55
> >   xml`_start+0x80
> >    72
> >
> >   libxml2.so.2`xmlDictLookup+0x60
> >   libxml2.so.2`xmlParseNCName+0xab
> >   libxml2.so.2`xmlParseQName+0x44
> >   libxml2.so.2`xmlParseStartTag2+0x14c
> >   libxml2.so.2`xmlParseElement+0x99
> >   libxml2.so.2`xmlParseContent+0x12c
> >   libxml2.so.2`xmlParseElement+0x134
> >   libxml2.so.2`xmlParseContent+0x12c
> >   libxml2.so.2`xmlParseElement+0x134
> >   libxml2.so.2`xmlParseDocument+0x33b
> >   libxml2.so.2`xmlDoRead+0x6b
> >   libxml2.so.2`xmlReadMemory+0x36
> >   xml`main+0x4c
> >   xml`_start+0x80
> >    80
> >
> > Hope this helps. It may be that you've got a newer / more optimized
> > version of libxml2 on your Linux box.
>
> Thanks for the reply. When I mentioned memory operations, I probably
> should have been more specific. The libxml test case I sent makes
> heavy use of malloc, and according to the Sun Studio collect utility,
> the calls to malloc take up a significant percentage of the overall
> execution time. Here is a snippet from the collect output:
>
>   Excl.      Incl.      Name
>   User CPU   User CPU
>    sec.       sec.
>   26.789     26.789     <Total>
>    2.292     10.497     umem_cache_alloc
>    1.391      3.062     xmlDictLookup
>    1.381      1.381     mutex_unlock
>    1.301      5.084     umem_slab_alloc
>    1.241      2.742     umem_slab_create
>    1.091     15.671     xmlParseStartTag2
>    0.931      0.931     mutex_lock_impl
>    0.801      1.771     umem_depot_alloc
>    0.801      3.052     xmlParseNCName
>    0.771      1.701     mutex_lock
>    0.751      9.407     xmlSAX2StartElementNs
>    0.680     11.768     malloc
>    0.680     25.488     xmlParseElement
>    0.640      8.506     xmlSAX2TextNode
>    0.630      6.935     xmlParseCharData
>    0.590     11.088     umem_alloc
>    0.580      0.580     memcpy
>    0.490      0.490     xmlDictComputeKey
>    0.480      0.480     mutex_trylock
>    0.420      6.344     xmlSAX2Characters
>    0.420      0.420     xmlSkipBlankChars
>    0.380     24.517     xmlParseContent
>    0.380      0.380     xmlStrEqual
>    0.360      1.771     xmlNewPropInternal
>    0.330      0.330     xmlNextChar
>    0.320      0.330     xmlParseAttValueInternal
>
> Has anyone seen this behavior before? Has anyone found any compiler
> flags or lightweight memory allocators that might help?
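>
> One option I've been considering is routing libxml2's allocations
> through a different allocator with xmlMemSetup(), which libxml2
> exposes for exactly this purpose. Here's a minimal sketch; the
> counting wrappers and names are just illustrative and are not part
> of the test case I posted:
>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <string.h>
>   #include <libxml/xmlmemory.h>
>   #include <libxml/parser.h>
>
>   /* Illustrative wrappers: they count calls and defer to the libc
>    * allocator, but they could just as easily defer to bsdmalloc,
>    * mtmalloc, or a caching allocator. */
>   static unsigned long nallocs;
>
>   static void *my_malloc(size_t sz) { nallocs++; return (malloc(sz)); }
>   static void *my_realloc(void *p, size_t sz) { return (realloc(p, sz)); }
>   static void my_free(void *p) { free(p); }
>   static char *my_strdup(const char *s) { nallocs++; return (strdup(s)); }
>
>   int
>   main(void)
>   {
>           const char *buf = "<root><a>text</a></root>";
>           xmlDocPtr doc;
>
>           /* xmlMemSetup() must run before any other libxml2 call. */
>           if (xmlMemSetup(my_free, my_malloc, my_realloc, my_strdup) != 0)
>                   return (1);
>
>           doc = xmlReadMemory(buf, strlen(buf), "test.xml", NULL, 0);
>           xmlFreeDoc(doc);
>           xmlCleanupParser();
>
>           printf("%lu mallocs for one parse/free cycle\n", nallocs);
>           return (0);
>   }
>
> That would let the rest of the process keep the default malloc while
> libxml2 alone uses something faster.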
>
> Thanks,
> - Ryan

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org