pyt...@bdurham.com, 31.10.2011 20:54:
Wondering if there's a fast/efficient built-in way to determine
whether a string has any chars outside the range ASCII 32-127,
other than CR, LF, or Tab?

I know I can look at the chars of a string individually and
compare them against a set of legal chars using standard Python
code (and this works fine), but I will be working with some very
large files in the hundreds-of-GB to several-TB size range, so I
thought I'd check whether there was a built-in implemented in C
that might handle this type of check more efficiently.

Does this sound like a use case for Cython or PyPy?
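
For reference, the per-chunk check described above can already be pushed into C with the stdlib re module; this is only a minimal sketch with illustrative names, not the approach recommended below:

    import re

    # any byte outside Tab/LF/CR and the printable range 0x20-0x7F
    _illegal = re.compile(br'[^\t\n\r\x20-\x7f]')

    def chunk_is_clean(chunk):
        # 'chunk' is a bytes object read from the file
        return _illegal.search(chunk) is None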

Cython. For data of that size, likely read from a fast local RAID drive I guess, you certainly don't want to read (part of) the file into a memory buffer, copy that buffer into a Python bytes string, and then search it character by character, creating a new string object for each character just to compare it.

Instead, you'd want to use low-level I/O to read a not-so-small part of the file into a memory buffer, run through it looking for unwanted characters, and then read the next chunk, without any further copying. The comparison loop could look like this, for example:

    from libc.stdlib cimport malloc

    cdef unsigned char current_byte
    cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)

    # while read chunk of bytes_read bytes ...
    # (Cython compiles both the slice iteration and the b'\t\r\n'
    # membership test below into plain C code, no objects involved)
    for current_byte in byte_buffer[:bytes_read]:
        if current_byte < 32 or current_byte > 127:
            if current_byte not in b'\t\r\n':
                raise ValueError()

Which I/O API you use is up to you. You may want to use the functions declared in libc.stdio (the declarations ship with Cython).
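
To give the surrounding read loop a concrete shape, here is a minimal sketch using fopen()/fread() from libc.stdio; the function name, buffer size, and error handling are placeholders to adapt:

    from libc.stdio cimport FILE, fopen, fread, fclose
    from libc.stdlib cimport malloc, free

    DEF BUFFER_SIZE = 1024 * 1024  # 1 MiB chunks; tune for your I/O setup

    def scan_file(bytes filename):
        cdef unsigned char current_byte
        cdef size_t bytes_read
        cdef FILE* f = fopen(filename, b"rb")
        if f is NULL:
            raise IOError("cannot open file")
        cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)
        if byte_buffer is NULL:
            fclose(f)
            raise MemoryError()
        try:
            while True:
                # read the next chunk straight into the C buffer
                bytes_read = fread(byte_buffer, 1, BUFFER_SIZE, f)
                if bytes_read == 0:
                    break
                for current_byte in byte_buffer[:bytes_read]:
                    if current_byte < 32 or current_byte > 127:
                        if current_byte not in b'\t\r\n':
                            raise ValueError("illegal byte in file")
        finally:
            free(byte_buffer)
            fclose(f)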

Stefan
