Congratulations on the clear way in which you have set out your proposal. I hope that it will make its way to PEPdom.
Colin W.

Larry Hastings wrote:
> This is such a long posting that I've broken it out into sections.
> Note that while developing this patch I discovered a Subtle Bug
> in CPython, which I have discussed in its own section below.
>
> ____________
> THE OVERVIEW
>
> I don't remember where I picked it up, but I remember reading years
> ago that the simple, obvious Python approach for string concatenation:
>     x = "a" + "b"
> is slow.  Obviously it's fine for the simple case above, but if you're
> adding together twenty-five things:
>     x = a + b + c + d + ... + y
> the performance is terrible, as the interpreter creates temporary
> objects for each intermediate result.  (a + b, then (a + b) + c,
> then ((a + b) + c) + d, and so on.)
>
> The suggested idiom for fast string concatenation was this:
>     x = "".join([a, b, c, d, ... y])
> This works fine.  But I don't like this approach, not only because
> it's kind of ugly and unobvious, but because it violates the "one
> obvious way to do it" principle.
>
> I mentioned all this to a friend of mine (Mike Thornburgh), and told
> him matter-of-factly that I'd like to fix this, but the design of
> Python simply precluded it.  He said "no it doesn't!" and proceeded
> to describe in great detail an approach to permit fast string
> concatenation using +.  I have implemented what he suggested--with
> only slight changes--and present it to you below.  I only had to
> touch four files, and only did major surgery on one.  The resulting
> Python passes the test suite as well as my unmodified locally-built
> Python 2.5.  (I didn't install the external source trees for stuff
> like SQLite, so it fails those tests.)
>
> Note that times have changed; using Python 2.5, a + b + c isn't
> *that* much slower than "".join([]).  But it is still slower.  Or,
> was--with my patch, a + b + c becomes *the fastest method* for
> string concatenation.  Hooray!
>
> (It didn't pay off as *much* as I'd hoped, but it's still an
> improvement.)
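The two idioms can be compared directly; a minimal sketch of the behavior described above (plain modern Python, not the 2.5 internals under discussion):

```python
# Quadratic idiom: each + builds a new temporary string, so n
# concatenations copy O(n^2) characters in total.
def concat_plus(parts):
    x = ""
    for p in parts:
        x = x + p          # allocates and copies a fresh string each time
    return x

# Linear idiom: join computes the total length once, allocates once,
# and copies each part exactly once.
def concat_join(parts):
    return "".join(parts)

parts = ["aaa", "bbb", "ccc", "ddd"]
assert concat_plus(parts) == concat_join(parts) == "aaabbbcccddd"
```

Both produce the same string; the difference is purely in how many intermediate copies are made along the way.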
> _________
> THE PATCH
>
> The core concept: adding two strings together no longer returns a
> pure "string" object.  Instead, it returns a "string concatenation"
> object which holds references to the two strings but does not
> actually concatenate them... yet.  The strings are concatenated only
> when someone requests the string's value, at which point it
> allocates all the space it needs and renders the concatenated string
> all at once.
>
> More to the point, if you add multiple strings together (a + b + c),
> it *doesn't* compute the intermediate strings (a + b).
>
> Upsides to this approach:
>
>   * String concatenation using + is now the fastest way to
>     concatenate strings (that I know of).
>
>   * Throw off the shackles of "".join([]); you don't need it
>     anymore.
>
>   * Did I mention it was faster?
>
> Downsides to this approach:
>
>   * Changes how PyStringObjects are stored internally; ob_sval is
>     no longer a char[1], but a char *.  This makes each
>     PyStringObject four bytes larger.
>
>   * Adds another memory dereference in order to get the value of
>     a string, which is a teensy-weensy slowdown.
>
>   * Would force a recompile of all C modules that deal directly
>     with string objects (which I imagine is most of them).
>
>   * Also, *requires* that C modules use the PyString_AS_STRING()
>     macro, rather than casting the object and grabbing ob_sval
>     directly.  (I was pleased to see that the Python source
>     was very good about using this macro; if all Python C
>     modules are this well-behaved, this point is happily moot.)
>
>   * On a related note, the file Mac/Modules/MacOS.c implies
>     that there are Mac-specific Python scripts that peer
>     directly into string objects.  These would have to be
>     changed to understand the new semantics.
>
>   * String concatenation objects are 36 bytes larger than
>     string objects, and this space will often go unreclaimed
>     after the string is rendered.
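The core idea — return an object that records its operands and defers the real concatenation until the value is needed — can be sketched as a toy model in pure Python (illustrative only; the actual patch does this inside the C string type, invisibly to users):

```python
class LazyConcat:
    """Toy model of the patch's 'string concatenation' object: it holds
    references to its operands and renders only on demand."""

    def __init__(self, left, right):
        self.parts = [left, right]
        self.rendered = None        # cache of the final string

    def __add__(self, other):
        # a + b + c extends the same object instead of building
        # the intermediate string (a + b).
        if self.rendered is None:
            self.parts.append(other)
            return self
        return LazyConcat(self.rendered, other)

    def render(self):
        # One allocation, one pass -- no intermediate strings.
        if self.rendered is None:
            self.rendered = "".join(self.parts)
        return self.rendered

    __str__ = render

x = LazyConcat("a", "b") + "c" + "d"   # no concatenation happens here...
assert str(x) == "abcd"                # ...only here, all at once
```

The key property is visible even in the toy: `a + b + c + d` performs a single `join`-style render rather than three pairwise copies.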
> * When rendered, string concatenation objects storing long
>   strings will allocate a second buffer from the heap to
>   store the string.  So this adds some minor allocation
>   overhead (though this is offset by the speed gain from
>   the approach overall).
>
> * Will definitely need some heavy review before it could
>   go in; in particular I worry I got the semantics surrounding
>   "interned" strings wrong.
>
> Internally, a "string concatenation" object is the same as a
> "string" object with two extra fields: an array of references to
> string objects (and string concatenation objects), and a count.
> Also, ob_sstate now stores a third flag, indicating whether or not
> the string is a string concatenation object.  External users don't
> need to worry about these details; they just use
> PyString_AS_STRING() and all is well with their world.
>
> I've already added some optimizations to the approach:
>
>   * If you're adding a string to an existing string concatenation
>     object whose reference count is 1, which isn't full, and which
>     hasn't been rendered yet, the string is added directly to the
>     existing concatenation object (rather than creating a new
>     one).  This works whether the existing object is on the right
>     or the left side.
>
>   * The general case is a + b + c ..., where you're only adding
>     strings to the *end*.  This will build a degenerate tree where
>     the first slot points to the child.  The tree for
>     concatenating the first twenty-two letters of the alphabet
>     would look like this:
>
>       [0 1 2 3 4 5 6 7]
>        | | | | | | | |
>        | v v v v v v v
>        | p q r s t u v
>        v
>       [0 1 2 3 4 5 6 7]
>        | | | | | | | |
>        | v v v v v v v
>        | i j k l m n o
>        v
>       [0 1 2 3 4 5 6 7]
>        | | | | | | | |
>        v v v v v v v v
>        a b c d e f g h
>
>     The rendering function is by necessity recursive.  However,
>     it is optimized for this shape; when the tree looks like
>     this, it will be rendered purely iteratively--no recursion.
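The degenerate tree above can indeed be rendered without recursion by walking the chain of first slots. A hypothetical Python sketch of that shape and the iterative walk (toy names, not the patch's C code):

```python
class Node:
    """Toy concatenation node: slot 0 may point at a child node
    (the degenerate case); the remaining slots hold plain strings."""
    def __init__(self, child=None, strings=()):
        self.child = child            # corresponds to slot 0 in the diagram
        self.strings = list(strings)

# Build the tree from the diagram: a..h at the bottom, i..o above, p..v on top.
bottom = Node(None, "abcdefgh")
middle = Node(bottom, "ijklmno")
top    = Node(middle, "pqrstuv")

def render(node):
    # Walk down the chain of children iteratively, collecting each
    # level's strings, then emit them bottom-up -- no recursion needed.
    levels = []
    while node is not None:
        levels.append(node.strings)
        node = node.child
    out = []
    for strings in reversed(levels):
        out.extend(strings)
    return "".join(out)

assert render(top) == "abcdefghijklmnopqrstuv"
```

A general tree (children in arbitrary slots) would still need recursion; the optimization pays off because + chains almost always produce exactly this left-leaning shape.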
> * If, after rendering, the resulting string will be small enough
>   to fit inside the extra space normally used by the concatenation
>   object, the final string will be copied on top of this otherwise
>   now-wasted space.
>
> ______________
> THE BENCHMARKS
>
> Benchmark 1:
>     def add(a, b, c, ... t): return a + b + c + ... + t
>     for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
>
> Benchmark 2:
>     for i in range(10000000): x = "aaa" + "bbb" + "ccc" + ... + "ttt"
>
> Benchmark 3:
>     for i in range(10000000): x = "".join(["aaa", "bbb", "ccc", ..., "ttt"])
>
> Benchmark 4:
>     for i in range(10000000):
>         x = "aaa"
>         x += "bbb"
>         x += "ccc"
>         ...
>         x += "ttt"
>
> Benchmark 5:
>     for i in range(10000000):
>         array = []
>         array.append("aaa")
>         array.append("bbb")
>         array.append("ccc")
>         ...
>         array.append("ttt")
>         x = "".join(array)
>
> Benchmark 6:
>     for i in range(10000000):
>         x = cStringIO.StringIO()
>         x.write("aaa")
>         x.write("bbb")
>         x.write("ccc")
>         ...
>         x.write("ttt")
>
> Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | % improvement
>     1     |    32.1s    |       32.8s       |    26.7s     |     22.8%
>     2     |    18.7s    |       18.7s       |    15.1s     |     23.8%
>     3     |    16.1s    |       16.2s       |    16.6s     |     -1.1%
>     4     |    64.6s    |       64.0s       |    57.1s     |     12.1%
>     5     |    74.1s    |       79.4s       |    76.7s     |      3.5%
>     6     |   126.0s    |      too slow, didn't bother!
>
> ______________
> THE SUBTLE BUG
>
> While developing this patch, I discovered a subtle bug in Python.
> It happened only on one gruesome test in the middle of test_descr.py.
> The problem was: '' == '' was returning False.
>
> Starting on line 1116 of stringobject.c we have this:
>
>     if (op == Py_EQ) {
>         /* Supporting Py_NE here as well does not save
>            much time, since Py_NE is rarely used.  */
>         if (a->ob_size == b->ob_size
>             && (a->ob_sval[0] == b->ob_sval[0]
>             && memcmp(a->ob_sval, b->ob_sval,
>                       a->ob_size) == 0)) {
>             result = Py_True;
>         } else {
>             result = Py_False;
>         }
>         goto out;
>     }
>
> The bug is with the "a->ob_sval[0] == b->ob_sval[0]".
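Why the first-byte shortcut misfires on empty strings can be modeled outside of C (a Python model of the comparison logic, not the C code itself):

```python
def buggy_eq(a_buf, b_buf, a_len, b_len):
    # Models the C code: compare lengths, then peek at byte 0
    # *before* the memcmp -- even when both lengths are zero.
    return (a_len == b_len
            and a_buf[0] == b_buf[0]
            and a_buf[:a_len] == b_buf[:b_len])

def fixed_eq(a_buf, b_buf, a_len, b_len):
    # Models the suggested fix: skip the first-byte peek for
    # zero-length strings.
    return (a_len == b_len
            and (a_len == 0 or a_buf[0] == b_buf[0])
            and a_buf[:a_len] == b_buf[:b_len])

# Two zero-length strings whose (undefined) first buffer bytes differ:
a, b = b"X", b"Y"                    # the logical strings have length 0
assert buggy_eq(a, b, 0, 0) is False # '' == '' wrongly returns False
assert fixed_eq(a, b, 0, 0) is True
```

Before the patch, the byte past the end of a zero-length string always happened to hold the same value, which is why the bug stayed hidden; the patch makes those bytes genuinely arbitrary.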
> This is apparently a speed optimization: check that the first byte
> is the same before proceeding with the memcmp().  But it does this
> even when *both* strings are *zero length*!  I doubt the Python
> source tree has an official position on what the first byte of a
> zero-length string should be, but I assert that it should be
> undefined.  In practice, it's seemingly always set to the same value
> in the released Python, but with my changes it's now possible for
> them to be different.
>
> I think it's sinful to compare the first bytes of zero-length
> strings.  Therefore I suggest this bugfix:
>
>     if (a->ob_size == b->ob_size
> /* >> change >> */
>         && ( (!a->ob_size || a->ob_sval[0] == b->ob_sval[0])
>         && memcmp(a->ob_sval, b->ob_sval,
>
> I've already incorporated this fix in my code.
>
> __________
> THE FUTURE
>
> If this patch is accepted, it paves the way for another cheap
> optimization: string slices could be done without copying.
> Add another field into the PyString structure: a reference
> to another PyString *.  If non-NULL, it means the local ob_sval
> actually points into the ob_sval owned by that other PyString *.
> It'd be another four bytes, but we'd only need to incur that
> when creating a slice; we could set a bit in ob_sstate indicating
> "this is a slice", and if so we'd have to copy-on-write ob_sval,
> and free the reference when the object was destroyed.
>
> ______________
> THE SUBMISSION
>
> I don't know the protocol from this point on; should I email the
> patch somewhere?  Put it up on a web page?  Post some diff output?
> (Or just forget about it entirely?)
>
> Also, for what it's worth: I fixed the .dsp so pythoncore builds
> under VC6 again.  I would be happy to submit that separately.  (What
> can I say, old habits die hard.  I can't stand the newer Microsoft
> IDEs.)
>
> Cheers,
>
> /larry/
>
> p.s. No, I haven't looked at a Unicode port yet.  If there's
> interest in this patch at all, I might take a stab at it.
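The zero-copy slice idea — a string that points into another string's buffer and only makes its own copy when one is actually required — might look like this as a toy Python model (hypothetical names, not part of the patch):

```python
class SliceStr:
    """Toy model of the proposed zero-copy slice: keeps a reference to
    the owning string plus an offset/length, and only copies
    ('renders') when its own buffer is actually required."""

    def __init__(self, owner, start, stop):
        self.owner = owner       # reference keeps the base string alive
        self.start = start
        self.stop = stop
        self.copy = None         # filled in lazily, on demand

    def value(self):
        if self.copy is None:
            self.copy = self.owner[self.start:self.stop]  # the one copy
        return self.copy

base = "hello world"
s = SliceStr(base, 6, 11)
assert s.copy is None            # no copying has happened yet
assert s.value() == "world"      # now rendered; the copy is cached
```

One consequence the C version would have to weigh: a tiny slice keeps its large owner string alive until the slice is rendered or destroyed, which is a memory trade-off rather than a pure win.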
> -- http://mail.python.org/mailman/listinfo/python-list