On 11/1/2011 2:56 AM, Steven D'Aprano wrote:
On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:

[...]
def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True

If text is 3.x bytes, this does not work ;-). OP did not specify bytes
or unicode or Python version.

The OP specified *string*.

A. People sometimes use 'string' loosely, to include text stored in bytes.

B. We are solving slightly different problems. The OP specified terabytes of ASCII text on disk that he wants to check for contamination. For that purpose, using 3.x, it is sensible to read the data in binary mode into bytes objects rather than decoding into unicode. (The exception would be if the text uses some 7-bit encoding like UTF-7. But that is relatively unlikely for disk storage.)

It is irrelevant to that specified purpose whether one calls the internal implementation type a 'string' or not. While my 3.2 bytes version was only slightly faster with the data already in memory, adding the decoding time for your string version, and any other extra overhead of text-mode reading, would make the bytes version look even better.

I am pretty sure the best disk reading speed would come from reading blocks of 4k*N, for some N, in binary mode. If the Python code were compiled (with Cython, for instance), the process might be input bound, depending on the system.
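Something along these lines, for instance (a sketch only: the block size, the function name, and the use of bytes.translate as the per-block test are my illustrative choices, not benchmarked):

# Sketch: scan the file in binary mode, one 4k*N block at a time (3.x).
LEGAL_BYTES = bytes(range(32, 128)) + b'\n\r\t\f'

def file_is_ascii_text(path, block_size=4096 * 256):
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:                   # end of file, no bad byte seen
                return True
            # translate(None, LEGAL_BYTES) deletes every legal byte;
            # anything left over is contamination.
            if block.translate(None, LEGAL_BYTES):
                return False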

'c in legal' has to get hash(c) and look
that up in the hash table, possibly skipping around a bit if there are
collisions. If text is a byte string rather than unicode, a simple lookup
'mask[c]', where mask is a 0-1 byte array, should be faster (see my
other post).
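(For concreteness, a 3.x sketch of that bytes version might look like the following; the names and the 256-entry mask are illustrative choices, not the code from the other post.)

# Sketch: 0-1 byte array indexed directly by byte value.
MASK = bytes(1 if (32 <= n < 128 or n in b'\n\r\t\f') else 0
             for n in range(256))

def is_ascii_text_bytes(data):
    # In 3.x, iterating a bytes object yields ints, so MASK[c] needs no ord().
    return all(MASK[c] for c in data)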

Oooh, clever! I like! It's not necessary to assume bytes, nor is it
necessary to create a bitmask the size of the entire Unicode range.

You are right; I had been thinking it would be.

Here's a version for strings (bytes or unicode in Python 2, unicode in
Python 3):

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))

# Untested
def is_ascii_text(text):
    for c in text:
        n = ord(c)
        if n >= len(MASK) or MASK[n] == '\0':
            return False
    return True

Optimizing it is left as an exercise :)

The test for n >= len(MASK) can be accomplished with a try: ... except IndexError around the loop construct. Then the explicit loop can be replaced with any(), as before.
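A sketch of that arrangement (one possible way to write it, untested like the original):

# Sketch: let MASK[n] raise IndexError for out-of-range ordinals
# instead of testing n >= len(MASK) on every character.
def is_ascii_text(text):
    try:
        return not any(MASK[ord(c)] == '\0' for c in text)
    except IndexError:          # ord(c) >= len(MASK): non-ASCII character
        return False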

I *suspect*, even with any optimizations, that this will be slower than
the version using a set.

If you suspect that because, with true 'strings', MASK[ord(c)] is needed instead of MASK[c], you are right. Redoing the test runs with unicode strings instead of bytes, the set lookup time is about the same (9.2 seconds), whereas the 100 000 000 ord() calls add over 4 seconds to the MASK lookups, raising them to 13.1 seconds.
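For anyone who wants to redo that kind of comparison, the setup might look roughly like this (a sketch; the repeated test string is a small stand-in, not the 100 000 000-character data behind the numbers above):

import timeit

LEGAL = set(chr(n) for n in range(32, 128)) | set('\n\r\t\f')
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))
text = 'Some plain ascii text, repeated many times. ' * 100000  # placeholder data

common = "from __main__ import LEGAL, MASK, text"
print(timeit.timeit("all(c in LEGAL for c in text)", common, number=10))
print(timeit.timeit("all(MASK[ord(c)] != '\\0' for c in text)", common, number=10))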

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
