On 2012-06-15 04:00, Ángel González wrote:
On 13/06/12 05:26, Morgan L. Owens wrote:
After reading the performance improvements RFC about interned strings,
and its passing mention of a "special data structure (e.g.
zend_string) instead of char*", I've been thinking a little bit about
this and what such a structure could be.

But rather than interned strings, I thought that _implicit_
concatenation would be a bigger win in the long term. Like interning,
it relies on strings being immutable.

This zend_string is a composite type. Leaves are _almost_ identical to
existing string zvals - char* val, int len - but also an additional
"child_count" field. For leaves, child_count is zero (not incidentally
indicating that it _is_ a leaf). For internal nodes, "val" is a list
of zend_strings (child_count of them). "len" still refers to the total
string length (the sum of the len fields of its children).

So a string that has been built up through concatenation is
represented by a tree (actually a dag) of zend_strings. The edges in
this dag are all properly reference-counted; discarding a string
decrements the reference counts of its children.
How do you list then? As a single-linked list?
That would avoid reuse of the component strings in different
superstrings except from matching ends...

I was thinking just in terms of an array (the composite would be pointing either to an array of characters or an array of strings). Mainly just because that's how I pictured it (and haven't thought of a reason not to, since the number of children is known when the concatenated string is created, and fixed due to immutability).

Component strings aren't copied as such, only referenced. In that sense the choice of array vs. list comes down to where that reference is kept - in the parent string or the elder sibling. Sharing common suffixes would save a number of references, but when concatenating two existing strings, the list of component references in the _prefix_ would need to be copied for the sake of whatever else is using it at the time (otherwise they would end up with the concatenated string as well).

Speaking of concatenation, unless potentially scary stuff is done, concatenating three strings is done by concatenating two of them, then concatenating the result with the third, giving a binary tree; so why am I suggesting an array of arbitrary length? Think of an implementation of PHP's join()/implode() that exploits this structure.


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Reply via email to