pyt...@bdurham.com, 31.10.2011 20:54:
Wondering if there's a fast/efficient built-in way to determine
whether a string contains any chars outside the range ASCII 32-127,
other than CR, LF, or Tab?
I know I can look at the chars of a string individually and
compare them against a set of legal chars using standard Python
code (and this works fine), but I will be working with some very
large files in the 100s-of-GB to several-TB size range, so I
thought I'd check whether there is a built-in written in C that
might handle this type of check more efficiently.
Does this sound like a use case for Cython or PyPy?
Cython. For data of that size, presumably read from a fast local RAID
drive, you certainly don't want to read (part of) the file into a memory
buffer, copy that buffer into a Python bytes string, and then search it
character by character, copying each of the characters into a new string
object just to compare them.
Instead, you'd want to use low-level I/O to read a not-so-small part of the
file into a memory buffer, run through it looking for unwanted characters,
and then read the next chunk, without any further copying. The comparison
loop could look like this, for example:
from libc.stdlib cimport malloc, free

DEF BUFFER_SIZE = 1024 * 1024   # or whatever chunk size suits your setup

cdef unsigned char current_byte
cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)

# while read chunk ...
for current_byte in byte_buffer[:BUFFER_SIZE]:
    if current_byte < 32 or current_byte > 127:
        if current_byte not in b'\t\r\n':
            raise ValueError()
Which I/O API you use is up to you. You may want to use the
functions declared in libc.stdio (those declarations ship with Cython).
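Putting the pieces together, the surrounding read loop could look
something like this (an untested sketch; the scan_file name and the
error handling are just for illustration):

    from libc.stdio cimport FILE, fopen, fread, fclose
    from libc.stdlib cimport malloc, free

    DEF BUFFER_SIZE = 1024 * 1024

    def scan_file(bytes filename):
        cdef FILE* f = fopen(filename, b"rb")
        if f is NULL:
            raise IOError("cannot open file")
        cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)
        cdef size_t n
        cdef unsigned char current_byte
        try:
            while True:
                # read the next chunk straight into the C buffer
                n = fread(byte_buffer, 1, BUFFER_SIZE, f)
                if n == 0:
                    break
                for current_byte in byte_buffer[:n]:
                    if current_byte < 32 or current_byte > 127:
                        if current_byte not in b'\t\r\n':
                            raise ValueError("illegal byte found")
        finally:
            free(byte_buffer)
            fclose(f)

fread() writes straight into the C buffer, so no Python string gets
created until you actually hit a bad byte.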
Stefan