rbt wrote: > The script is too long to post in its entirety. In short, I open the > files, do a binary read (in 1MB chunks for ease of memory usage) on them > before placing that read into a variable and that in turn into a list > that I then apply the following re to > > ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b') > > like this: > > for chunk in whole_file: > search = ss.findall(chunk) > if search: > validate(search)
This seems so obvious that I hesitate to ask, but is the above really a simplification of the real code, which actually handles the case of SSNs that lie over the boundary between chunks? In other words, what happens if the first chunk has only the first four digits of the SSN, and the rest lies in the second chunk? -Peter -- http://mail.python.org/mailman/listinfo/python-list