This is such a long posting that I've broken it out into sections. Note that while developing this patch I discovered a Subtle Bug in CPython, which I have discussed in its own section below.
____________
THE OVERVIEW

I don't remember where I picked it up, but I remember reading years ago
that the simple, obvious Python approach for string concatenation:

    x = "a" + "b"

is slow.  Obviously it's fine for the simple case above, but if you're
adding together twenty-five things:

    x = a + b + c + d + ... + y

the performance is terrible, as the interpreter creates temporary objects
for each intermediate result.  (a + b, then (a + b) + c, then
((a + b) + c) + d, and so on.)

The suggested idiom for fast string concatenation was this:

    x = "".join([a, b, c, d, ... y])

This works fine.  But I don't like this approach, not only because it's
kind of ugly and non-obvious, but because it violates the "one obvious
way to do it" principle.

I mentioned all this to a friend of mine (Mike Thornburgh), and told him
matter-of-factly that I'd like to fix this, but the design of Python
simply precluded it.  He said "no it doesn't!" and proceeded to describe
in great detail an approach to permit fast string concatenation using +.
I have implemented what he suggested--with only slight changes--and
present it to you below.

I only had to touch four files, and only did major surgery on one.  The
resulting Python passes the test suite as well as my unmodified
locally-built Python 2.5.  (I didn't install the external source trees
for stuff like SQLite, so it fails those tests.)

Note that times have changed; using Python 2.5, a + b + c isn't *that*
much slower than "".join([]).  But it is still slower.  Or, was--with my
patch, a + b + c becomes *the fastest method* for string concatenation.
Hooray!  (It didn't pay off as *much* as I'd hoped, but it's still an
improvement.)

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object.  Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate them... yet.
The strings are concatenated only when someone requests the string's
value, at which point it allocates all the space it needs and renders
the concatenated string all at once.  More to the point, if you add
multiple strings together (a + b + c), it *doesn't* compute the
intermediate strings (a + b).

Upsides to this approach:

* String concatenation using + is now the fastest way to concatenate
  strings (that I know of).

* Throw off the shackles of "".join([]); you don't need it anymore.

* Did I mention it was faster?

Downsides to this approach:

* Changes how PyStringObjects are stored internally; ob_sval is no
  longer a char[1], but a char *.  This makes each StringObject four
  bytes larger.

* Adds another memory dereference in order to get the value of a
  string, which is a teensy-weensy slowdown.

* Would force a recompile of all C modules that deal directly with
  string objects (which I imagine is most of them).

* Also, *requires* that C modules use the PyString_AS_STRING() macro,
  rather than casting the object and grabbing ob_sval directly.  (I was
  pleased to see that the Python source was very good about using this
  macro; if all Python C modules are this well-behaved, this point is
  happily moot.)

* On a related note, the file Mac/Modules/MacOS.c implies that there
  are Mac-specific Python scripts that peer directly into string
  objects.  These would have to be changed to understand the new
  semantics.

* String concatenation objects are 36 bytes larger than string objects,
  and this space will often go unreclaimed after the string is
  rendered.

* When rendered, string concatenation objects storing long strings will
  allocate a second buffer from the heap to store the string.  So this
  adds some minor allocation overhead (though this is offset by the
  speed gain from the approach overall).

* Will definitely need some heavy review before it could go in; in
  particular, I worry I got the semantics surrounding "interned"
  strings wrong.
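To make the scheme concrete, here's a rough pure-Python model of the
lazy-concatenation idea.  This is an illustrative sketch only, not the
actual C code from the patch: the class name, the `slots`/`rendered`
attributes, and the eight-slot fan-out are all invented stand-ins for
the real PyStringObject machinery (and the C code can only append in
place when the refcount is 1, which this sketch doesn't check).

```python
class LazyConcat:
    """Toy model of a 'string concatenation' object."""

    MAX_SLOTS = 8  # stand-in for the fixed-size array of references

    def __init__(self, left, right):
        # Just hold references to the operands; no copying yet.
        self.slots = [left, right]
        self.rendered = None

    def __add__(self, other):
        # If this node hasn't been rendered and isn't full, append the
        # new string in place rather than building a new node.
        if self.rendered is None and len(self.slots) < self.MAX_SLOTS:
            self.slots.append(other)
            return self
        # Otherwise start a new node whose first slot is the old tree.
        return LazyConcat(self, other)

    def render(self):
        # Build the final string in a single pass, iteratively.
        if self.rendered is None:
            parts = []
            stack = [self]
            while stack:
                node = stack.pop()
                if isinstance(node, LazyConcat):
                    # Push children in reverse so they pop in order.
                    stack.extend(reversed(node.slots))
                else:
                    parts.append(node)
            self.rendered = "".join(parts)
        return self.rendered

# Usage: the + operations are cheap; only render() pays the cost.
s = LazyConcat("a", "b")
for ch in "cdefghij":
    s = s + ch
print(s.render())  # abcdefghij
```

Note how nothing is copied until render() is called, no matter how many
+ operations pile up, which is the whole point of the approach.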
Internally, a "string concatenation" object is the same as a "string"
object with two extra fields: an array of references to string objects
(and string concatenation objects), and a count.  Also, ob_sstate now
stores a third flag, indicating whether or not the string is a string
concatenation object.  External users don't need to worry about these
details; they just use PyString_AS_STRING() and all is well with their
world.

I've already added some optimizations to the approach:

* If you're adding a string to an existing string concatenation object
  whose reference count is 1, which isn't full, and which hasn't been
  rendered yet, it adds the string directly to the existing
  concatenation object (rather than creating a new one).  This works
  whether the existing object is on the right or the left side.

* The general case is a + b + c ..., where you're only adding strings
  to the *end*.  This will build a degenerate tree where the first slot
  points to the child.  The tree for concatenating the first twenty-two
  letters of the alphabet would look like this:

    [0 1 2 3 4 5 6 7]
     | | | | | | | |
     | v v v v v v v
     | p q r s t u v
     |
     v
    [0 1 2 3 4 5 6 7]
     | | | | | | | |
     | v v v v v v v
     | i j k l m n o
     |
     v
    [0 1 2 3 4 5 6 7]
     | | | | | | | |
     v v v v v v v v
     a b c d e f g h

  The rendering function is by necessity recursive.  However, it is
  optimized for this shape; when the tree looks like this, it will be
  rendered purely iteratively--no recursion.

* If, after rendering, the resulting string will be small enough to fit
  inside the extra space normally used by the concatenation object, the
  final string will be copied on top of this otherwise now-wasted
  space.

______________
THE BENCHMARKS

Benchmark 1:
    def add(a, b, c, ... t):
        return a + b + c + ... + t
    for i in range(10000000):
        add("aaa", "bbb", "ccc", ..., "ttt")

Benchmark 2:
    for i in range(10000000):
        x = "aaa" + "bbb" + "ccc" + ...
+ "ttt" Benchmark 3: for i in range(10000000): x = "".join(["aaa", "bbb", "ccc", ..., "ttt"]) Benchmark 4: for i in range(10000000): x = "aaa" x += "bbb" x += "ccc" ... x += "ttt" Benchmark 5: for i in range(10000000): array = [] array.append("aaa") array.append("bbb") array.append("ccc") ... array.append("ttt") x = "".join(array) Benchmark 6: for i in range(10000000): x = cStringIO.StringIO() x.write("aaa") x.write("bbb") x.write("ccc") ... x.write("ttt") Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | % improvement 1 | 32.1s | 32.8s | 26.7s | 22.8% 2 | 18.7s | 18.7s | 15.1s | 23.8% 3 | 16.1s | 16.2s | 16.6s | -1.1% 4 | 64.6s | 64.0s | 57.1s | 12.1% 5 | 74.1s | 79.4s | 76.7s | 3.5% 6 | 126.0s | too slow, didn't bother! ______________ THE SUBTLE BUG While developing this patch, I discovered a subtle bug in Python. It happened only on one gruesome test in the middle of test_descr.py. The problem was: '' == '' was returning False. Starting on line 1116 of stringobject.c we have this: if (op == Py_EQ) { /* Supporting Py_NE here as well does not save much time, since Py_NE is rarely used. */ if (a->ob_size == b->ob_size && (a->ob_sval[0] == b->ob_sval[0] && memcmp(a->ob_sval, b->ob_sval, a->ob_size) == 0)) { result = Py_True; } else { result = Py_False; } goto out; } The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is apparently a speed optimization; check that the first byte is the same before proceeding with the memcmp(). But it does this even when *both* strings are *zero length*! I doubt the Python source tree has an official position on what the first byte of a zero-length string should be, but I assert that it should be undefined. In practice, it's seemingly always set to the same value in the released Python, but with my changes it's now possible for them to be different. I think it's sinful to compare the first bytes of zero-length strings. 
Therefore I suggest this bugfix:

    if (a->ob_size == b->ob_size
/* >> change >> */  && ( (!a->ob_size || a->ob_sval[0] == b->ob_sval[0])
        && memcmp(a->ob_sval, b->ob_sval,

I've already incorporated this fix in my code.

__________
THE FUTURE

If this patch is accepted, it paves the way for another cheap
optimization: string slices could be done without copying.  Add another
field into the PyString structure: a reference to another PyString *.
If non-NULL, it means the local ob_sval actually points into the
ob_sval owned by that other PyString *.  It'd be another four bytes,
but we'd only need to incur that when creating a slice; we could set a
bit in ob_sstate indicating "this is a slice", and if so we'd have to
copy-on-write ob_sval, and free the reference when the object was
destroyed.

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere?  Put it up on a web page?  Post some diff output?  (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again.  I would be happy to submit that separately.  (What can I
say, old habits die hard.  I can't stand the newer Microsoft IDEs.)

Cheers,

/larry/

p.s. No, I haven't looked at a Unicode port yet.  If there's interest
in this patch at all, I might take a stab at it.
--
http://mail.python.org/mailman/listinfo/python-list