On Fri, Jun 3, 2011 at 1:44 PM, Roy Smith <r...@panix.com> wrote:
> In article <is9ikg0...@news1.newsguy.com>,
>  Chris Torek <nos...@torek.net> wrote:
>
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
>
> I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't
> unicode inherently 32 bit?  Or at least 20-something bit?  Things like
> UTF-16 are just one way to encode it.

The size of a Unicode character is like the size of a number. It's not
defined in terms of a bit width. However, Unicode planes 0-2 have all
the defined printable characters, and there are only 17 planes in total
(0 through 16), so (since each plane is 2^16 characters) that kinda
makes Unicode 18-bit or 21-bit. UTF-16, therefore, uses two 16-bit
numbers to store a 20-bit number (the code point minus 0x10000) for
anything outside plane 0; UCS-2 has no such trick and stops at plane 0.
Why do I get the feeling I've met that before...

Chris Angelico
136E:0100 CD 20         INT 20
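
A rough sketch of the arithmetic in this thread, assuming Python 3. The helper names (plane_of, to_surrogates, from_surrogates) are made up for illustration and are not stdlib functions. Unicode as currently specified stops at U+10FFFF, which gives 17 planes and 21 bits for a raw code point; the 20-bit figure is what is left once UTF-16 subtracts 0x10000 before splitting the value across two 16-bit units.

```python
# Illustrative only: helper names are hypothetical, not part of any library.
MAX_CODE_POINT = 0x10FFFF   # highest code point Unicode defines
PLANE_SIZE = 1 << 16        # 65536 code points per plane


def plane_of(cp):
    """Return the plane number (0..16) that code point cp falls in."""
    return cp >> 16


def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("only code points outside plane 0 need a pair")
    offset = cp - 0x10000              # always fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(MAX_CODE_POINT.bit_length())              # bits for a raw code point
print(plane_of(MAX_CODE_POINT) + 1)             # number of planes
print((MAX_CODE_POINT - 0x10000).bit_length())  # bits carried by a pair

# U+1D11E MUSICAL SYMBOL G CLEF lives in plane 1.
print([hex(unit) for unit in to_surrogates(0x1D11E)])
```

The pair for U+1D11E can be compared against Python's own encoder with chr(0x1D11E).encode("utf-16-be"), which should yield the same two 16-bit units.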