>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table.  The re.compile() gets to re-use
>> the table every time.
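
To make the "skip table" idea concrete, here is a minimal sketch using the simpler Horspool variant of Boyer-Moore. It is only an illustration under my own assumptions, not a claim about what "s1 in s2" or re.search() actually do inside CPython, and the names build_skip_table and horspool_find are made up for this example.

```python
def build_skip_table(pattern):
    """Map each character of pattern (except the last) to its distance
    from the end of the pattern.  Later occurrences overwrite earlier
    ones, so the rightmost occurrence (smallest shift) wins."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_find(text, pattern, skip=None):
    """Return the index of the first occurrence of pattern in text,
    or -1 if there is none.  A prebuilt skip table may be passed in
    so that it can be reused across many searches."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if skip is None:
        skip = build_skip_table(pattern)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift according to the text character currently aligned with
        # the last position of the pattern; characters that do not
        # appear in the table allow a full-length shift.
        pos += skip.get(text[pos + m - 1], m)
    return -1


# Build the table once and reuse it, in the spirit of re.compile().
table = build_skip_table("needle")
first = horspool_find("haystack with needle", "needle", table)
second = horspool_find("no match here", "needle", table)
```

Note that each shift depends on reading text[pos + m - 1] directly by index, which is cheap only if every character in the string occupies a fixed number of bytes.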

In article <mailman.2508.1307394262.9059.python-l...@python.org>
Ian <hobso...@gmail.com> wrote:
>Is that true?  My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.

As I understand it, strings in Python 3 are Unicode internally and
(apparently) use wchar_t.  Byte strings in Python 3 are of course
byte strings, not UTF-8 encoded.

>Or am I misunderstanding something.

Here's Python 2.7 on a Linux box:

    >>> import sys
    >>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
    38 39 40
    >>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
    56 60 64

Each extra character adds 1 byte to a plain string and 4 bytes to
a unicode string.  This implies that strings in Python 2.x are just
byte strings (same as b"..." in Python 3.x) and never actually
contain unicode; and unicode strings (same as "..." in Python 3.x)
use 4-byte "characters" per that box's wchar_t.
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html