On Sun, 2010-10-10, Stefan Fuhrmann wrote:
> On 07.10.2010 16:07, Julian Foad wrote:
> >> New Revision: 997203
> >>
> >> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
> >> Log:
> >> Merge r985037, r985046, r995507 and r995603 from the performance branch.
> >>
> >> These changes introduce the svn_stringbuf_appendbyte() function, which has
> >> significantly less overhead than svn_stringbuf_appendbytes(), and can be
> >> used in a number of places within our codebase.
> > Hi Stefan2.
> >
> > I have been wondering about the actual benefit of such tightly
> > hand-optimized code. Today I got around to giving it a quick spin.
> >
> > In my first test, it made no difference whatsoever, because the
> > optimized code is never executed. The recent merge r1002898 of r1001413
> > from the performance branch introduced a bug:
> >
> > - if (str->blocksize> old_len + 1)
> > + if (str->blocksize< old_len + 1)
> WTF how did that happen?
> Well, that's what reviews are for ;)
> > which totally disables the optimized code path.
> >
> > Fixed in r1005437.
> Thanks.
> > That solved, a loop doing a million simple appendbyte calls runs more
> > than twice as fast as calling appendbytes(..., 1). That's fantastic.
> >
> > But what about that hand-optimization? I wrote a simple version of the
> > function, and called it 'appendbyte0':
> >
> > svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
> > {
> > if (str->blocksize> str->len + 1)
> > {
> > str->data[str->len++] = byte;
> > str->data[str->len] = '\0';
> > }
> > else
> > svn_stringbuf_appendbytes(str,&byte, 1);
> > }
> >
> > Here are the results (see full patch attached):
> >
> > Times: appendbytes appendbyte0 appendbyte (in ms)
> > run: 89 31 34
> > run: 88 30 35
> > run: 88 31 34
> > run: 88 30 34
> > run: 88 31 34
> > min: 88 30 34
> >
> > This tells me that the hand-optimization is actually harmful and the
> > compiler does a 10% better job by itself.
> >
> > Have I made a mistake?
> My guess is that you might have made two mistakes actually.
Heh, OK.
> First, the number of operations was relatively low - everything
> in the low ms range could be due to secondary effects (and
> still be repeatable).
I can't think of any reason why. I ran the whole benchmark lots of
times. Occasionally some of the times were a big chunk longer due to
other system activity. Normally they were stable. I also ran it
several times with 10 million ops instead of 1 million, and the results
were exactly 10x longer with the same degree of variability.
> The most important problem would be compiler optimizations
> or lack thereof. Since the numbers are very large (50..100
> ticks per byte, depending on your CPU clock), I would assume
> that you used a normal debug build for testing. In that case,
You're right. I'll try again, with "--disable-debug -O2".
> the actual number of C statements has a large impact on
> the execution time. See my results below for details.
> > What are the results for your system?
> >
> > (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
> >
> Test code used (10^10 calls, newer re-allocate):
>
> int i;
> unsigned char c;
>
> svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);
OK so you're avoiding any calls to the "re-alloc". My tests included
re-allocs, but were long enough strings that this wasn't much overhead.
Nevertheless I'll avoid re-allocs so we can compare results fairly.
> for (i = 0; i < 50000000; ++i)
> {
> s->len = 0;
> for (c = 0; c < 200; ++c)
> svn_stringbuf_appendbyte (s, c);
> }
>
>
> XEON 55xx / Core i7, hyper-threading on, 3GHz peak
> 64 bits Linux, GCC 4.3.3; ./configure --disable-debug
>
> (1) 10,7s; IPC = 2.1
> (2) 8,11s; IPC = 1.4
> (3) 2,64s; IPC = 2.4
> (4) 2,43s; IPC = 2.3
> (1) use appendbytes gives 9 instructions in main, 59 in appbytes
> (2) handle count==1 directly in in appendbytes; 9 inst. in main, 26 in
> appbytes
> (3) appendbyte0 (same compiler output as the the non-handtuned code);
> 13 inst. in appbyte, 6 in main
> (4) tr...@head appendbyte; 11 inst. in appbyte, 6 in main
>
> Core2 2.4GHz, Win32, VS2010
> (not using valgrind to count instructions here; also not testing (2))
>
> (1) 17,0s release, 20,2s debug
> (3) 10,6s release, 12,2s debug
> (4) 9,7s release, 13,0s debug
With a test harness more similar to yours, and built with
"--disable-debug -O2", here are my relative numbers (but only 1 million
outer loops, compared to your 50 million):
$ svn-c-test subr string 24
Times for 1000000 x 200 bytes, in seconds
(1)appendbytes (3)appendbyte0 (4)tr...@head (5)macro
run: 7.03 2.06 1.37 0.53
run: 6.62 1.76 1.55 0.53
run: 6.53 1.67 1.44 0.53
run: 6.54 1.60 1.44 0.53
run: 6.52 1.84 1.37 0.53
min: 6.52 1.60 1.37 0.53
This agrees with what you found: the hand-tuned code gives a 10%
improvement.
This also shows another variant, writing the function as a macro,
reported in the column "(5)macro". This gives a further two-and-a-half
times speed-up in this test.
[...]
> * your numbers seem very large in comparison
> -> probably something else going on
Can you use my test code or show me yours? Mine has the harness in
"string-test.c" and the code under test in libsvn_subr - see attachment
- and the default build on my platform loads libsvn_subr as a shared
library. Maybe yours is completely different.
Other observations:
- Some callers that use very tight loops are in fact simply copying a
string, up as far as some particular end marker. They may benefit from
being re-written to first determine how much they want to copy and then
call svn_stringbuf_appendbytyes().
- The stringbuf type initially allocates a 1-byte space, and doubles
this every time it re-allocates. Could we not could allocate at least a
few bytes (4? 8?) as a minimum, without actually consuming more memory
in practice? That would avoid two or three re-allocs in some usage
patterns.
- Julian
A comparative performance test of svn_stringbuf_appendbyte().
* subversion/libsvn_subr/svn_string.c
(svn_stringbuf_appendbyte0): New function, for comparative testing.
* subversion/tests/libsvn_subr/string-test.c
(svn_stringbuf_appendbyte_macro): New macro, for comparative testing.
(test24): New test, measuring and printing a speed comparison.
(test_funcs): Add it.
--This line, and those below, will be ignored--
Index: subversion/libsvn_subr/svn_string.c
===================================================================
--- subversion/libsvn_subr/svn_string.c (revision 1006000)
+++ subversion/libsvn_subr/svn_string.c (working copy)
@@ -392,6 +392,29 @@ svn_stringbuf_ensure(svn_stringbuf_t *st
}
+/* Unoptimized version of svn_stringbuf_appendbyte(), for comparison. */
+void
+svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
+{
+ /* In most cases, there will be pre-allocated memory left
+ * to just write the new byte at the end of the used section
+ * and terminate the string properly.
+ */
+ if (str->blocksize > str->len + 1)
+ {
+ str->data[str->len++] = byte;
+ str->data[str->len] = '\0';
+ }
+ else
+ {
+ /* we need to re-allocate the string buffer
+ * -> let the more generic implementation take care of that part
+ */
+ svn_stringbuf_appendbytes(str, &byte, 1);
+ }
+}
+
+
/* WARNING - Optimized code ahead!
* This function has been hand-tuned for performance. Please read
* the comments below before modifying the code.
Index: subversion/tests/libsvn_subr/string-test.c
===================================================================
--- subversion/tests/libsvn_subr/string-test.c (revision 1006000)
+++ subversion/tests/libsvn_subr/string-test.c (working copy)
@@ -511,6 +511,97 @@ test23(apr_pool_t *pool)
return test_stringbuf_unequal("abc", "abb", pool);
}
+void
+svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte);
+
+#define svn_stringbuf_appendbyte_macro(str, byte) \
+ do { \
+ svn_stringbuf_t *svn__s = (str); \
+ if (svn__s->blocksize > svn__s->len + 1) { \
+ svn__s->data[svn__s->len++] = (byte); \
+ svn__s->data[svn__s->len] = '\0'; \
+ } else { \
+ char svn__byte = (byte); \
+ svn_stringbuf_appendbytes(svn__s, &svn__byte, 1); \
+ } \
+ } while (0)
+
+static svn_error_t *
+test24(apr_pool_t *pool)
+{
+ int i, ITERS = 1000000;
+ int f, FUNCTIONS = 4;
+ int r, RUNS = 5;
+ apr_time_t t[FUNCTIONS], min_t[FUNCTIONS];
+
+ for (f = 0; f < FUNCTIONS; f++)
+ min_t[f] = 1.0e12;
+
+ printf("Times for %d x 200 bytes, in seconds\n", ITERS);
+ printf(" (1)appendbytes (3)appendbyte0 (4)tr...@head (5)macro\n");
+
+ for (r = 0; r < RUNS; r++)
+ {
+ apr_time_t start;
+ svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);
+
+ start = apr_time_now();
+ for (i = 0; i < ITERS; ++i)
+ {
+ unsigned char c;
+ s->len = 0;
+ for (c = 0; c < 200; ++c)
+ svn_stringbuf_appendbytes(s, &c, 1);
+ }
+ t[0] = apr_time_now() - start;
+
+ start = apr_time_now();
+ for (i = 0; i < ITERS; ++i)
+ {
+ unsigned char c;
+ s->len = 0;
+ for (c = 0; c < 200; ++c)
+ svn_stringbuf_appendbyte0(s, c);
+ }
+ t[1] = apr_time_now() - start;
+
+ start = apr_time_now();
+ for (i = 0; i < ITERS; ++i)
+ {
+ unsigned char c;
+ s->len = 0;
+ for (c = 0; c < 200; ++c)
+ svn_stringbuf_appendbyte(s, c);
+ }
+ t[2] = apr_time_now() - start;
+
+ start = apr_time_now();
+ for (i = 0; i < ITERS; ++i)
+ {
+ unsigned char c;
+ s->len = 0;
+ for (c = 0; c < 200; ++c)
+ svn_stringbuf_appendbyte_macro(s, c);
+ }
+ t[3] = apr_time_now() - start;
+
+ printf("run:");
+ for (f = 0; f < FUNCTIONS; f++)
+ printf(" %10.2f", (float)t[f] / 1000000);
+ printf("\n");
+
+ for (f = 0; f < FUNCTIONS; f++)
+ if (t[f] < min_t[f]) min_t[f] = t[f];
+ }
+
+ printf("min:");
+ for (f = 0; f < FUNCTIONS; f++)
+ printf(" %10.2f", (float)min_t[f] / 1000000);
+ printf("\n");
+
+ return SVN_NO_ERROR;
+}
+
/*
====================================================================
If you add a new test to this file, update this array.
@@ -568,5 +659,7 @@ struct svn_test_descriptor_t test_funcs[
"compare stringbufs; different lengths"),
SVN_TEST_PASS2(test23,
"compare stringbufs; same length, different content"),
+ SVN_TEST_PASS2(test24,
+ "benchmark svn_stringbuf_appendbyte"),
SVN_TEST_NULL
};