On 11/1/2011 2:56 AM, Steven D'Aprano wrote:
On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:
[...]
def is_ascii_text(text):
for c in text:
if c not in LEGAL:
return False
return True
If text is 3.x bytes, this does not work ;-). OP did not specify bytes
or unicode or Python version.
The OP specified *string*.
A. People sometimes use 'string' loosely, to include text stored in bytes.
B. We are solving slightly different problems. The OP specified
terabytes of ascii text on disk that he wants to check for
contamination. For that purpose, using 3.x, it is sensible to read the
data in binary mode into bytes objects rather than decoding into
unicode. (The exception would be if the text uses some 7 bit encoding
like UTF-7. But that is relatively unlikely for disk storage.)
It is irrelevant to that specified purpose whether one calls the
internal implementation type a 'string' or not. While my 3.2 bytes
version was only slightly faster, given data in memory, adding decoding
time for your string version and any other extra overhead for text mode
reading would make the bytes version look even better.
I am pretty sure the best disk reading speed would come from reading
blocks of 4k*N, for some N, in binary mode. If the Python code were
compliled (with Cython, for instance), the process might be input bound,
depending on the system.
'c in legal' has to get hash(c) and look
that up in the hash table, possible skipping around a bit if t If text
is byte string rather than unicode, a simple lookup 'mask[c]', where
mask is a 0-1 byte array, should be faster (see my other post).
Oooh, clever! I like! It's not necessary to assume bytes, nor is it
necessary to create a bitmask the size of the entire Unicode range.
You are right; I had been thinking it would be.
Here's a version for strings (bytes or unicode in Python 2, unicode in
Python 3):
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))
# Untested
def is_ascii_text(text):
for c in text:
n = ord(c)
if n>= len(MASK) or MASK[n] == '\0': return False
return True
Optimizing it is left as an exercise :)
The test for n >= len() can be accomplished with try:..except IndexError
around the loop construct. Then the explicit loop can be replaced with
any(), as before.
I *suspect*, even with any optimizations, that this will be slower than
the version using a set.
If you suspect that because of the need with true 'strings' for
MASK[ord(c)] instead of MASK[c], you are right. Redoing the test runs
with unicode strings instead of bytes, the set lookup time is about the
same (9.2 seconds) whereas the 100 000 000 ord() calls add over 4
seconds to the MASK lookups, raising them up to 13.1 seconds.
--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list