On 10/31/2011 3:54 PM, pyt...@bdurham.com wrote:
> Wondering if there's a fast/efficient built-in way to determine if a
> string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or Tab?
I presume you also want to disallow the other ascii control chars?
> I know I can look at the chars of a string individually and compare them
> against a set of legal chars using standard Python code (and this works
> fine), but I will be working with some very large files in the 100's Gb
> to several Tb size range so I'd thought I'd check to see if there was a
> built-in in C that might handle this type of check more efficiently.

If, by 'string', you mean a string of bytes 0-255, then I would, in
Python 3, where iterating bytes yields ints in [0,255], make a byte mask
of 256 0s and 1s (not '0's and '1's). Example:

mask = b'\1\0'*128
for c in b'\0\1help': print(mask[c])
1
0
1
0
1
1

In your case, use \1 for forbidden and replace the print with "if
mask[c]: <found illegal>; break". In 2.x, where iterating byte strings
gives length-1 byte strings, you would need ord(c) as the index, which
is much slower.
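The mask idea can be sketched end to end (a minimal sketch, assuming
Python 3 and the legal set from the question -- Tab, LF, CR, and
printable ASCII; `first_illegal` is an illustrative name, not from the
thread):

```python
# Legal bytes: Tab (9), LF (10), CR (13), and printable ASCII 32-126
# (the question's 32-127 includes DEL, a control char, so it is left out).
allowed = set(range(32, 127)) | {9, 10, 13}

# 256-entry table: \1 marks a forbidden byte, \0 an allowed one.
mask = bytes(0 if i in allowed else 1 for i in range(256))

def first_illegal(data):
    """Return the index of the first forbidden byte, or -1 if clean."""
    for i, c in enumerate(data):   # Python 3: iterating bytes yields ints
        if mask[c]:
            return i
    return -1

print(first_illegal(b'plain text\r\n'))   # -1: every byte is allowed
print(first_illegal(b'bad\x00byte'))      # 3: NUL at index 3
```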
> Does this sound like a use case for cython or pypy?
Cython should get close to C speed, especially with type hints. Make
sure you compile something like the above as Py 3 code.
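One stdlib alternative worth noting, not mentioned in the thread: the
two-argument form of bytes.translate already runs at C speed. Passing
the legal bytes as the delete argument strips them all; anything left
over is illegal (a sketch, assuming Python 3 bytes and the same legal
set; `has_illegal` is an illustrative name):

```python
# Legal bytes: printable ASCII plus Tab, LF, CR.
legal = bytes(range(32, 127)) + b'\t\n\r'

def has_illegal(data):
    """True if data contains any byte outside the legal set."""
    # translate(None, legal) deletes every legal byte in one C-level pass;
    # a non-empty result means at least one illegal byte was present.
    return bool(data.translate(None, legal))

print(has_illegal(b'hello\tworld\r\n'))   # False
print(has_illegal(b'binary\x00junk'))     # True
```

For multi-terabyte files this avoids a Python-level loop entirely, at
the cost of building a throwaway copy of the illegal bytes.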
--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list