pyt...@bdurham.com, 31.10.2011 20:54:
Wondering if there's a fast/efficient built-in way to determine
whether a string contains any chars outside the range ASCII 32-127,
other than CR, LF, or Tab?
I know I can look at the chars of a string individually and
compare them against a set of legal chars using standard Python
code (and this works fine), but I will be working with some very
large files in the 100s-of-GB to several-TB size range, so I
thought I'd check whether there is a built-in written in C that
might handle this type of check more efficiently.
Does this sound like a use case for Cython or PyPy?
Cython. For data of that size, presumably read from a fast local RAID
drive, you certainly don't want to read (part of) the file into a memory
buffer, copy that buffer into a Python bytes string, and then search it
character by character, copying each of the characters into a new string
object just to compare them.
Instead, you'd want to use low-level I/O to read a not-so-small part of the
file into a memory buffer, run through it looking for unwanted characters,
and then read the next chunk, without any further copying. The comparison
loop could look like this, for example:
from libc.stdlib cimport malloc, free

DEF BUFFER_SIZE = 1024 * 1024   # or whatever chunk size suits your setup

cdef unsigned char current_byte
cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)

# while read chunk ...
for current_byte in byte_buffer[:BUFFER_SIZE]:
    if current_byte < 32 or current_byte > 127:
        if current_byte not in b'\t\r\n':
            raise ValueError()
Which I/O API you use is up to you. You may want to use the
functions declared in libc.stdio (those declarations ship with Cython).
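Putting the pieces together, the surrounding read loop could look
something like this (an untested sketch; the scan_file name and the
error handling are just for illustration):

    from libc.stdio cimport FILE, fopen, fread, fclose
    from libc.stdlib cimport malloc, free

    DEF BUFFER_SIZE = 1024 * 1024

    def scan_file(bytes filename):
        cdef FILE* f = fopen(filename, b"rb")
        if f is NULL:
            raise IOError("cannot open file")
        cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)
        cdef size_t n
        cdef unsigned char current_byte
        try:
            while True:
                # read the next chunk straight into the C buffer
                n = fread(byte_buffer, 1, BUFFER_SIZE, f)
                if n == 0:
                    break
                for current_byte in byte_buffer[:n]:
                    if current_byte < 32 or current_byte > 127:
                        if current_byte not in b'\t\r\n':
                            raise ValueError("illegal byte found")
        finally:
            free(byte_buffer)
            fclose(f)

fread() writes straight into the C buffer, so no Python string gets
created until you actually hit a bad byte.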
Stefan