On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:

[...]

>> def is_ascii_text(text):
>>     for c in text:
>>         if c not in LEGAL:
>>             return False
>>     return True
>
> If text is 3.x bytes, this does not work ;-). OP did not specify bytes
> or unicode or Python version.
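A quick sketch of Terry's point, assuming Python 3 semantics: iterating over a bytes object yields ints, not one-character strings, so the str-based membership test raises instead of returning False.

```python
# Assumed reconstruction of the OP's function; LEGAL as a str of the
# printable ASCII range plus common whitespace.
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'

def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:   # TypeError in 3.x when c is an int (bytes)
            return False
    return True

print(is_ascii_text('hello'))    # str input works fine
try:
    is_ascii_text(b'hello')      # 3.x bytes: each c is an int like 104
except TypeError as err:
    print('bytes fail:', err)
```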
The OP specified *string*. Whether that's a byte-string in Python 2, or a
unicode string in Python 3, it will still work, so long as both `text`
and `LEGAL` are strings.

[steve@sylar ~]$ python2 -c "print 'a' in 'abc'"  # both byte strings
True
[steve@sylar ~]$ python2 -c "print u'a' in u'abc'"  # both unicode strings
True
[steve@sylar ~]$ python3 -c "print('a' in 'abc')"  # both unicode strings
True

Mixing bytes and characters may or may not do what you expect.

[steve@sylar ~]$ python2 -c "print 'a' in u'abc'"  # one of each
True

Whether that's a good thing or a bad thing is open for debate :)

>> Algorithmically, that's as efficient as possible:
>
> This is a bit strange since you go on to explain that it is inefficient
> -- O(n*k) where n = text length and k = legal length -- whereas below is
> O(n).

But since k is a constant, O(nk) is O(n). When using Big Oh notation, it
is acceptable to hand-wave away a lot of detail. For example, it's common
to assume that n+7, n//7 and n**7 all take the same constant amount of
time, which is patently untrue: division and exponentiation both require
more work than addition.

In this case, relative to the size of the input text, both a linear
string search and a constant-time hash lookup are O(1). That doesn't
imply that they *literally* take constant time.

[...]

>> Since all() is guaranteed to keep short-cut semantics, that will be as
>> fast as possible in Python,
>
> A dangerous statement to make.

I like living dangerously :)

> 'c in legal' has to get hash(c) and look that up in the hash table,
> possibly skipping around a bit. If text is a byte string rather than
> unicode, a simple lookup 'mask[c]', where mask is a 0-1 byte array,
> should be faster (see my other post).

Oooh, clever! I like!

It's not necessary to assume bytes, nor is it necessary to create a
bitmask the size of the entire Unicode range.
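For reference, here's a sketch of the set-based version being discussed (names assumed): frozenset membership is a single hash lookup per character, and all() short-circuits at the first illegal character.

```python
# LEGAL as a frozenset: printable ASCII plus common whitespace.
LEGAL = frozenset(chr(n) for n in range(32, 128)) | frozenset('\n\r\t\f')

def is_ascii_text(text):
    # all() stops as soon as any character fails the membership test.
    return all(c in LEGAL for c in text)

print(is_ascii_text('hello world\n'))   # True
print(is_ascii_text('caf\xe9'))         # False: é is outside ASCII
```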
Here's a version for strings (bytes or unicode in Python 2, unicode in
Python 3):

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))

# Untested
def is_ascii_text(text):
    for c in text:
        n = ord(c)
        if n >= len(MASK) or MASK[n] == '\0':
            return False
    return True

Optimizing it is left as an exercise :)

I *suspect*, even with any optimizations, that this will be slower than
the version using a set.

> On my new Pentium Win 7 machine, it is -- by about 5%. For 100,000,000
> legal bytes, a minimum of 8.69 versus 9.17 seconds.

On my old Linux box, I get the opposite result: set lookup is a tiny bit
faster, coincidentally also by almost 5%.

>>> from time import time
>>> legal_set = frozenset(range(32, 128))
>>> text = b'a' * 100000000
>>> t = time(); all(c in legal_set for c in text); time()-t
True
27.174341917037964
>>>
>>> legal_ray = 128 * b'\1'
>>> t = time(); all(legal_ray[c] for c in text); time()-t
True
28.39691996574402


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
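For anyone wanting to repeat the measurement, a self-contained sketch of the benchmark above using timeit rather than time(), so each approach gets a best-of-several timing; the text size is scaled down from the 100,000,000-byte run in the post.

```python
from timeit import repeat

# Same two lookup structures as the interpreter session above.
setup = r"""
legal_set = frozenset(range(32, 128))
legal_ray = b'\x01' * 128
text = b'a' * 200000
"""

# min() of several runs filters out scheduling noise.
set_time = min(repeat("all(c in legal_set for c in text)",
                      setup, number=3, repeat=3))
ray_time = min(repeat("all(legal_ray[c] for c in text)",
                      setup, number=3, repeat=3))
print("set:   %.4fs" % set_time)
print("array: %.4fs" % ray_time)
```

Which side wins is hardware- and version-dependent, as the two sets of timings in this thread already show.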