Dave Angel wrote:
Jorge wrote:
Hi there,
I'm making an application that reads ASCII files generated by a third
party, but sometimes the files are totally or partially corrupted, and I
need to know whether a given file is an ASCII file with *nix line
terminators.
On Linux I can run the file command, but the application has to run on
Windows.
Any help will be great.
Thank you in advance.
So, which is the assignment:
1) determine if a file has non-ASCII characters
2) determine whether the line-endings are crlf or just lf
In the former case, look at translating the file contents to Unicode,
specifying ASCII as source. If it fails, you have non-ASCII.
In the latter case, investigate the 'U' (universal newlines) flag of the
mode parameter in the open() function.
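A minimal sketch of the two checks, assuming Python 3 and a file small
enough to read into memory. The names is_ascii and line_endings are made
up for illustration. Instead of the mode flag of open(), the second
function counts terminators in the raw bytes, which is a different
technique from the one suggested above. A lone '\r' is not treated as a
terminator here.

```python
def is_ascii(data):
    """Return True if the bytes object decodes cleanly as 7-bit ASCII."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True


def line_endings(data):
    """Classify terminators in raw bytes: 'lf', 'crlf', 'mixed' or 'none'."""
    crlf = data.count(b'\r\n')
    lf = data.count(b'\n') - crlf  # bare LF, i.e. not preceded by CR
    if crlf and lf:
        return 'mixed'
    if crlf:
        return 'crlf'
    if lf:
        return 'lf'
    return 'none'
```

To use it on a file, read the contents with open(name, 'rb').read() and
pass the result to both functions. Binary mode matters on Windows, because
text mode translates line endings before the program sees them.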
You also need to ask yourself whether you're doing a validation of the
file, or doing a "best guess" like the file command.
Also, realize that ASCII is a 7-bit code, with printing characters all
greater than space, and very few people use delete ('\x7F'), so you
can define a function to determine if a file contains only printing
ASCII and a few control characters. This one is False unless some ink
would be printed.
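Python 3.X:
def ascii_file(name, controls=b'\t\n'):
    # Bytes allowed besides printing characters: the controls plus space.
    ctrls = set(controls + b' ')
    with open(name, 'rb') as f:
        chars = set(f.read())  # distinct byte values (ints on 3.X)
    printing = chars - ctrls
    if not printing:  # empty or whitespace-only file: no ink
        return False
    return (min(chars) >= min(ctrls) and ord('~') >= max(chars)
            and min(printing) > ord(' '))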
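Python 2.X:
def ascii_file(name, controls='\t\n'):
    # Characters allowed besides printing ones: the controls plus space.
    ctrls = set(controls + ' ')
    with open(name, 'rb') as f:
        chars = set(f.read())  # distinct one-character strings on 2.X
    printing = chars - ctrls
    if not printing:  # empty or whitespace-only file: no ink
        return False
    return (min(chars) >= min(ctrls) and '~' >= max(chars)
            and min(printing) > ' ')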
For potentially more performance (at least on 2.X), you could do min
and max on the data read, and only do the set(data) if the min and
max are OK.
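One way that shortcut might look on Python 3.X, as a sketch only. The
name ascii_file_fast is hypothetical, and whether it is actually faster
than building the set up front would have to be measured on real files.

```python
def ascii_file_fast(name, controls=b'\t\n'):
    # Bytes allowed besides printing characters: the controls plus space.
    ctrls = set(controls + b' ')
    with open(name, 'rb') as f:
        data = f.read()
    if not data:  # empty file: nothing would be printed
        return False
    # Range check on the raw data first; skip building the set if it fails.
    if min(data) < min(ctrls) or max(data) > ord('~'):
        return False
    # Only now pay for the set, to reject stray controls such as '\r'.
    printing = set(data) - ctrls
    return bool(printing) and min(printing) > ord(' ')
```

With the default controls, a file containing '\r' passes the range check
and is rejected only by the final set test, so a file with Windows line
endings still comes back False.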
--Scott David Daniels
scott.dani...@acm.org