In article <is9ikg0...@news1.newsguy.com>, Chris Torek <nos...@torek.net> wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).

I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't Unicode inherently 32-bit, or at least 21-bit?  Things like UTF-16 are just one way to encode it.

In any case, while I could imagine building a 2^16-entry jump table, it's clearly infeasible (with today's hardware) to build a 2^32-entry table.

But there's nothing that really requires you to build a table at all.  If I understand the algorithm right, all that's really required is that you can map a character to a shift value.  For an 8-bit character set, an indexed jump table makes sense.  For a larger character set, I would imagine you could do some heuristic pre-processing to see whether your search string consists only of characters in one Unicode plane, and use that fact to build a table which only indexes that plane.  Or maybe use a hash table instead of a regular indexed table.  Not as fast, but only slower by a small constant factor, which is not a horrendous price to pay in a fully i18n world :-)
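To make the "map a character to a shift value" idea concrete, here's a minimal sketch of the Horspool simplification of Boyer-Moore (bad-character rule only), using a plain Python dict as the "hash table instead of a regular indexed table".  Only characters that actually occur in the pattern are stored, so the table stays tiny no matter how large the alphabet is.  The names (bad_char_table, horspool_find) and the sample strings are just illustrative, not anything from Python's own string implementation.

def bad_char_table(pattern):
    """Horspool bad-character table: character -> shift distance.

    Characters absent from the pattern are simply not stored; the
    search falls back to shifting by the full pattern length, so no
    2**16 or 2**32 array is ever needed.
    """
    m = len(pattern)
    table = {}
    # The last pattern character is excluded; a mismatch there shifts
    # by the distance to that character's previous occurrence.
    for i, ch in enumerate(pattern[:-1]):
        table[ch] = m - 1 - i
    return table


def horspool_find(text, pattern):
    """Return the lowest index where pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    shifts = bad_char_table(pattern)
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Shift by the table entry for the text character aligned with
        # the last pattern position, or by m if that character never
        # occurs in the pattern.
        i += shifts.get(text[i + m - 1], m)
    return -1


if __name__ == "__main__":
    # Illustrative strings only.  The needle contains a
    # supplementary-plane character (U+1F40D), exactly the case where
    # a flat per-code-point table would be painful.
    haystack = "some text with a \U0001F40D hidden in it"
    needle = "a \U0001F40D hidden"
    print(horspool_find(haystack, needle))   # prints 15

And if the pre-processing found that every character of the pattern fell inside one plane (say the BMP), the dict could be swapped for a flat 65536-entry array indexed by ord(), trading a little memory for a slightly faster lookup.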