On 11/1/2011 2:56 AM, Steven D'Aprano wrote:
On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:

[...]
def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True

If text is 3.x bytes, this does not work ;-). OP did not specify bytes
or unicode or Python version.

The OP specified *string*.

A. People sometimes use 'string' loosely, to include text stored in bytes.

B. We are solving slightly different problems. The OP specified terabytes of ASCII text on disk that he wants to check for contamination. For that purpose, using 3.x, it is sensible to read the data in binary mode into bytes objects rather than decoding into unicode. (The exception would be if the text uses some 7-bit encoding like UTF-7. But that is relatively unlikely for disk storage.)

It is irrelevant to that specified purpose whether one calls the internal implementation type a 'string' or not. While my 3.2 bytes version was only slightly faster with the data already in memory, adding the decoding time for your string version, and any other extra overhead of text-mode reading, would make the bytes version look even better.

I am pretty sure the best disk reading speed would come from reading blocks of 4k*N, for some N, in binary mode. If the Python code were compiled (with Cython, for instance), the process might be input bound, depending on the system.
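Something along these lines, for instance (a sketch only: the block size, the function name, and the use of bytes.translate as the per-block test are my illustrative choices, not benchmarked):

# Sketch: scan the file in binary mode, one 4k*N block at a time (3.x).
LEGAL_BYTES = bytes(range(32, 128)) + b'\n\r\t\f'

def file_is_ascii_text(path, block_size=4096 * 256):
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:                   # end of file, no bad byte seen
                return True
            # translate(None, LEGAL_BYTES) deletes every legal byte;
            # anything left over is contamination.
            if block.translate(None, LEGAL_BYTES):
                return False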

'c in legal' has to get hash(c) and look
that up in the hash table, possibly skipping around a bit if there are
collisions. If text is a byte string rather than unicode, a simple lookup
'mask[c]', where mask is a 0-1 byte array, should be faster (see my
other post).
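(For concreteness, a 3.x sketch of that bytes version might look like the following; the names and the 256-entry mask are illustrative choices, not the code from the other post.)

# Sketch: 0-1 byte array indexed directly by byte value.
MASK = bytes(1 if (32 <= n < 128 or n in b'\n\r\t\f') else 0
             for n in range(256))

def is_ascii_text_bytes(data):
    # In 3.x, iterating a bytes object yields ints, so MASK[c] needs no ord().
    return all(MASK[c] for c in data)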

Oooh, clever! I like! It's not necessary to assume bytes, nor is it
necessary to create a bitmask the size of the entire Unicode range.

You are right; I had been thinking it would be.

Here's a version for strings (bytes or unicode in Python 2, unicode in
Python 3):

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))

# Untested
def is_ascii_text(text):
    for c in text:
        n = ord(c)
        if n >= len(MASK) or MASK[n] == '\0':
            return False
    return True

Optimizing it is left as an exercise :)

The test for n >= len(MASK) can be accomplished with a try: ... except IndexError around the loop construct. Then the explicit loop can be replaced with any(), as before.
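A sketch of that arrangement (one possible way to write it, untested like the original):

# Sketch: let MASK[n] raise IndexError for out-of-range ordinals
# instead of testing n >= len(MASK) on every character.
def is_ascii_text(text):
    try:
        return not any(MASK[ord(c)] == '\0' for c in text)
    except IndexError:          # ord(c) >= len(MASK): non-ASCII character
        return False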

I *suspect*, even with any optimizations, that this will be slower than
the version using a set.

If you suspect that because, with true 'strings', MASK[ord(c)] is needed instead of MASK[c], you are right. Redoing the test runs with unicode strings instead of bytes, the set lookup time is about the same (9.2 seconds), whereas the 100 000 000 ord() calls add over 4 seconds to the MASK lookups, raising them to 13.1 seconds.
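For anyone who wants to redo that kind of comparison, the setup might look roughly like this (a sketch; the repeated test string is a small stand-in, not the 100 000 000-character data behind the numbers above):

import timeit

LEGAL = set(chr(n) for n in range(32, 128)) | set('\n\r\t\f')
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))
text = 'Some plain ascii text, repeated many times. ' * 100000  # placeholder data

common = "from __main__ import LEGAL, MASK, text"
print(timeit.timeit("all(c in LEGAL for c in text)", common, number=10))
print(timeit.timeit("all(MASK[ord(c)] != '\\0' for c in text)", common, number=10))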

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
