On 7/14/2013 10:56 AM, Chris Angelico wrote:
On Sun, Jul 14, 2013 at 11:44 PM,  <wxjmfa...@gmail.com> wrote:

timeit.repeat("a = 'hundred'; 'x' in a")
[0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
timeit.repeat("a = 'hundreœ'; 'x' in a")
[0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

An issue about finding strings in strings was opened last September and, as reported on this list, fixes were applied around last March. As I remember, some but not all of the optimizations were applied to 3.3. Perhaps some were applied too late for 3.3.1 (3.3.2 is 3.3.1 with some emergency patches to correct regressions).

Python 3.4.0a2:
>>> import timeit
>>> timeit.repeat("a = 'hundred'; 'x' in a")
[0.17396483610667152, 0.16277956641670813, 0.1627937074749941]
>>> timeit.repeat("a = 'hundreo'; 'x' in a")
[0.18441108179403187, 0.16277311071618783, 0.16270517215355085]

Again, as previously reported, the difference is gone.

jmf has raised an interesting point. Some string membership operations
do seem oddly slow.

He raised it a year ago and action was taken.


import timeit

# Get ourselves a longish ASCII string with no duplicates - escape
# apostrophe and backslash so it can be embedded in generated code
asciichars = (''.join(chr(i) for i in range(32, 128))
              .replace("\\", r"\\").replace("'", r"\'"))
haystack = [
        ("ASCII", asciichars + "\u0001"),
        ("BMP", asciichars + "\u1234"),
        ("SMP", asciichars + "\U00012345"),
]
needle = [
        ("ASCII", "\u0002"),
        ("BMP", "\u1235"),
        ("SMP", "\U00012346"),
]
useset = [
        ("", ""),
        (", as set", "; a=set(a)"),
]
# Time each needle/haystack/set combination and print, fastest first.
for time, desc in sorted(
        (min(timeit.repeat("'%s' in a" % n, ("a='%s'" % h) + s)),
         "%s in %s%s" % (nd, hd, sd))
        for nd, n in needle for hd, h in haystack for sd, s in useset):
    print("%.10f %s" % (time, desc))

0.1765129367 ASCII in ASCII, as set
0.1767096097 BMP in SMP, as set
0.1778647845 ASCII in BMP, as set
0.1785266004 BMP in BMP, as set
0.1789093307 SMP in SMP, as set
0.1790431465 SMP in BMP, as set
0.1796504863 BMP in ASCII, as set
0.1803854959 SMP in ASCII, as set
0.1810674262 ASCII in SMP, as set

Much of this time is overhead; 'pass' would not run too much faster.
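
A rough way to see that overhead is to time an empty statement as a baseline and compare (a minimal sketch along the same lines as the timings above):

import timeit

# An empty statement measures mostly the timing loop itself;
# subtract this baseline mentally from the set numbers above.
print(min(timeit.repeat("pass")))
print(min(timeit.repeat("'\\u0002' in a", "a = set('hundred')")))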

0.1817367850 SMP in BMP
0.1884555160 SMP in ASCII
0.2132371572 BMP in ASCII

For these, 3.3 does no searching: it knows from the internal character kind that the answer is No without looking at the data.
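
A Python-level sketch of that shortcut (CPython reads the stored kind in O(1); the max() calls here only illustrate the idea, they are not what the C code does):

def cannot_contain(needle, haystack):
    # If the needle's highest codepoint exceeds the haystack's, no
    # character can match, so membership is False without any search.
    # Python compares characters by codepoint, so max() finds the widest.
    return bool(needle) and max(needle) > max(haystack)

cannot_contain('\u1235', 'hundred')   # True: BMP needle, ASCII haystack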

0.3137454621 ASCII in ASCII
0.4472624314 BMP in BMP
0.6672795006 SMP in SMP
0.7493052888 ASCII in BMP
0.9261783271 ASCII in SMP
0.9865787412 BMP in SMP

...

Set membership is faster than string membership, though only marginally on
something this short. If the needle is wider than the haystack, it
obviously can't be present, so a false return comes back at the speed
of a set check.
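
The gap shows up more clearly on a longer haystack; a quick probe (sizes here are arbitrary, chosen just for illustration):

import timeit

setup = "h = 'abcdefgh' * 10000; s = set(h)"
# A failing string scan must examine the whole text: roughly O(len(h)).
print(min(timeit.repeat("'z' in h", setup, number=10000)))
# A set lookup hashes once: O(1) on average, whatever the size.
print(min(timeit.repeat("'z' in s", setup, number=10000)))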

Jim ignores these cases, where 3.3+ uses the stored maximum-codepoint information to do the operation much faster than 3.2 can.

Otherwise, an actual search must be done. Searching
for characters in strings of the same width gets slower as the strings
get larger in memory (unsurprising). What I'm seeing in the top-end
results, though, is that searching for a narrower string in a wider
one is quite significantly slower.

Even so, 50% longer is not bad.

I don't know of an actual proven use-case for this, but it seems
likely to happen; e.g., you take user input and want to know if there are
any HTML-sensitive characters in it, so you check ('<' in string or
'&' in string).
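
That kind of guard is a one-liner either way; a sketch of the two obvious spellings (function names are mine, just for illustration):

def has_html_sensitive(text):
    # Two membership tests, each a fast scan of the input string.
    return '<' in text or '&' in text

def has_html_sensitive_set(text, _sensitive=frozenset('<&>')):
    # Or scan once, testing each character against a small set.
    return any(c in _sensitive for c in text)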

In my editing of code, I nearly always search for words or long names.

The question is: is it worth
constructing an "expanded string" at the haystack's width prior to
doing the search?
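
Short of patching CPython, one way to probe whether the width mismatch is what costs is to time needles of each width against the same wide haystack (a sketch in the style of the timings above; sizes are arbitrary):

import timeit

setup = "h = 'x' * 10000 + '\\U00012345'"   # forces a 4-byte (SMP) haystack
for desc, n in [("ASCII", "'q'"), ("BMP", "'\\u1235'"), ("SMP", "'\\U00012346'")]:
    # Every needle is absent, so each timing measures a full failed search.
    print(desc, min(timeit.repeat(n + " in h", setup, number=100000)))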

I would not make any assumptions about what Python does or does not do without checking the code. All I know is that Python uses a modified version of one of the pre-process and skip-forward algorithms (Boyer-Moore? Knuth-Morris-Pratt? I forget). These are designed to work efficiently with needles longer than one char, and indeed may work better with longer needles. Searching for a single char in n chars is O(n). Searching for a length-m needle is potentially O(m*n), and the point of the fancy algorithms is to make all searches as close to O(n) as possible.
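
For reference, here is a minimal Boyer-Moore-Horspool sketch, one member of that skip-forward family (CPython's actual fastsearch is a different, tuned variant written in C):

def horspool_find(haystack, needle):
    # Precompute, for each needle char except the last, how far the
    # window may safely shift when that char ends the current window.
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    skip = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift by the skip for the window's last char (m if unseen).
        # Longer needles allow bigger jumps, which is what pushes the
        # average cost toward O(n) instead of O(m*n).
        pos += skip.get(haystack[pos + m - 1], m)
    return -1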

--
Terry Jan Reedy

