Defeating variable-width encodings

Brent Dax Fri, 07 Sep 2001 13:23:01 -0700
A thought I had: if variable-width encodings are so difficult because
it's hard to index into them by character, why don't we break them up
ourselves?

        +PV-------+   +strchunk---------------------+-+   +strchunk---------------------+-+
        |string   |-->|the quick brown fox jumped ov|>+-->|er the lazy dog              |/|
        |...      |   +-----------------------------+-+   +-----------------------------+-+

Now, if we want to substr($str, 40, 1), we can skip the first chunk
without scanning it.  (I have in mind chunks of 32 characters, a number
I picked out of the air; other sizes may be better.)

This avoids the possible huge overheads of other linked-list approaches
while also avoiding some of the linear scanning that would otherwise be
required to index into the string.
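
To make the idea concrete, here is a minimal Python sketch of the fixed-size
variant, assuming UTF-8 as the variable-width encoding.  The names
make_chunks, char_offset and char_at are invented for this illustration, and
a Python list stands in for the linked list of strchunks.

```python
CHUNK_CHARS = 32  # characters per chunk; the arbitrary figure from the post


def make_chunks(text, chunk_chars=CHUNK_CHARS):
    """Split text into UTF-8 byte chunks of chunk_chars characters each."""
    return [
        text[i:i + chunk_chars].encode("utf-8")
        for i in range(0, len(text), chunk_chars)
    ]


def char_offset(chunk, n):
    """Byte offset of the n-th character in a UTF-8 chunk (linear scan)."""
    offset = 0
    for _ in range(n):
        offset += 1
        # Step over UTF-8 continuation bytes, which look like 0b10xxxxxx.
        while offset < len(chunk) and (chunk[offset] & 0xC0) == 0x80:
            offset += 1
    return offset


def char_at(chunks, index, chunk_chars=CHUNK_CHARS):
    """Return the character at index, scanning only the chunk that holds it."""
    chunk = chunks[index // chunk_chars]   # whole chunks are skipped here
    local = index % chunk_chars
    start = char_offset(chunk, local)
    end = char_offset(chunk, local + 1)
    return chunk[start:end].decode("utf-8")
```

With every chunk holding exactly 32 characters, finding the right chunk is a
single division, and the linear scan never covers more than one chunk.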

As for lvalue substr(), we could fudge that number a bit and allow
strchunks to hold a little more or less than 32 characters, as long as
each chunk knows its size.  Then, when you scan, you just add up the
number of characters in each chunk until you overshoot.  That makes
scanning a bit slower, but not much.  (We'd probably also want the
string to rebalance itself periodically, but that's a different story.)
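
A rough Python sketch of the variable-size variant follows.  It is only an
illustration under assumptions: Chunk, locate and replace_char are
hypothetical names, UTF-8 is assumed as the encoding, and the edit decodes
the affected chunk for brevity rather than splicing bytes in place.

```python
class Chunk:
    """A run of UTF-8 bytes that remembers how many characters it holds."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.nchars = len(text)


def locate(chunks, index):
    """Return (chunk_number, index_within_chunk) for a character index."""
    seen = 0
    for number, chunk in enumerate(chunks):
        # Add up per-chunk character counts until we overshoot the index.
        if index < seen + chunk.nchars:
            return number, index - seen
        seen += chunk.nchars
    raise IndexError("character index out of range")


def replace_char(chunks, index, replacement):
    """Replace one character, like a one-character lvalue substr().

    Only the chunk holding the character is rebuilt; it may grow or
    shrink, and the other chunks are left untouched.
    """
    number, local = locate(chunks, index)
    text = chunks[number].data.decode("utf-8")
    chunks[number] = Chunk(text[:local] + replacement + text[local + 1:])
```

After an edit that changes a chunk's character count, later lookups still
work because locate() trusts the stored counts rather than a fixed size.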

An alternate approach would be to remember how far into the string you
have to index to get to certain points in the string.  (For the purpose
of this part of the document, a 'byte' is a codepoint and a 'character'
is an abstract character.)  For example:

        +PV-------+
        |string   |-->"the quick brown fox jumped over the lazy dog"
        |length 44|
        |bytes  44|
        |half   22|
        |quar   11|
        |threeq 33|
        |...      |

Although in this example the string is normal ASCII, consider what we
would have if we replaced the 'o' in 'brown' and the 'a' in 'lazy' with
two-byte characters (represented by a doubled letter):

        +PV-------+
        |string   |-->"the quick broown fox jumped over the laazy dog"
        |length 44|
        |bytes  46|
        |half   23|
        |quar   11|
        |threeq 34|
        |...      |

Now, on a call like substr($str, 36, 1) we can skip all the way to byte
34--which we know represents character number 33--and count from there.
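
The checkpoint lookup can be sketched in Python as below, again assuming
UTF-8.  The function names and the (character, byte) tuple layout are
made up for this example; a real PV would presumably store the offsets as
plain fields like the half/quar/threeq slots drawn above.

```python
def build_checkpoints(data, nchars, parts=4):
    """Record (char_index, byte_offset) at each 1/parts point of the string."""
    targets = sorted({nchars * k // parts for k in range(parts)})
    checkpoints = []
    char_index = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) == 0x80:
            continue  # continuation byte: not the start of a character
        if (len(checkpoints) < len(targets)
                and char_index == targets[len(checkpoints)]):
            checkpoints.append((char_index, offset))
        char_index += 1
    return checkpoints


def byte_offset(data, checkpoints, index):
    """Byte offset of character index, counting from the nearest checkpoint."""
    # Pick the last checkpoint at or before the requested character.
    char_index, offset = max(cp for cp in checkpoints if cp[0] <= index)
    while char_index < index:
        offset += 1
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1
        char_index += 1
    return offset
```

For a 44-character string with a two-byte character in place of the 'o' in
'brown' and another in place of the 'a' in 'lazy', build_checkpoints yields
byte offsets 0, 11, 23 and 34, matching the quar, half and threeq values in
the second diagram, and a lookup of character 36 starts counting at byte 34.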

--Brent Dax
[EMAIL PROTECTED]

"...and if the answers are inadequate, the pumpqueen will be overthrown
in a bloody coup by programmers flinging dead Java programs over the
walls with a trebuchet."