Re: how to avoid leading white spaces

Chris Torek Wed, 08 Jun 2011 19:42:40 -0700

>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table.  The re.compile() gets to re-use
>> the table every time.


In article <mailman.2508.1307394262.9059.python-l...@python.org>
Ian  <hobso...@gmail.com> wrote:
>Is that true?  My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.

As I understand it, strings in python 3 are Unicode internally and
(apparently) use wchar_t.  Byte strings in python 3 are of course
byte strings, not UTF-8 encoded.

>Or am I misunderstanding something.

Here's python 2.7 on a Linux box:

    >>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
    38 39 40
    >>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
    56 60 64

This implies that strings in Python 2.x are just byte strings (same
as b"..." in Python 3.x) and never actually contain unicode; and
unicode strings (same as "..." in Python 3.x) use 4-byte "characters"
per that box's wchar_t.
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: how to avoid leading white spaces

Reply via email to