Lloyd Zusman wrote:
Perl has the following constructs to check whether a file is considered
to contain "text" or "binary" data:
if (-T $filename) { print "file contains 'text' characters\n"; }
if (-B $filename) { print "file contains 'binary' characters\n"; }
Is there already a Python analog to these? I'm happy to write them on
my own if no such constructs currently exist, but before I start, I'd
like to make sure that I'm not "re-inventing the wheel".
By the way, here's what the perl docs say about these constructs. I'm
looking for something similar in Python:
... The -T and -B switches work as follows. The first block or so
... of the file is examined for odd characters such as strange control
... codes or characters with the high bit set. If too many strange
... characters (>30%) are found, it's a -B file; otherwise it's a -T
... file. Also, any file containing null in the first block is
... considered a binary file. [ ... ]
While I agree with the others who have responded along the lines
of "that's a hinky heuristic", it's not too hard to write an analog:
import string

def is_text(fname,
        # bytes counted as "text"; reading in 'rb' mode yields ints
        chars=frozenset(string.printable.encode('ascii')),
        threshold=0.3,
        portion=1024, # read a kilobyte to find out
        mode='rb',
        ):
    assert portion is None or portion > 0
    assert 0 < threshold < 1
    with open(fname, mode) as f:
        if portion is None:
            content = f.read() # the whole file
        else:
            content = f.read(int(portion))
    if not content:
        return True # nothing odd in an empty file; avoids dividing by zero
    valid = sum(1 for c in content if c in chars)
    odd = len(content) - valid
    # as with perl's -T: text unless more than `threshold` of it is odd
    return odd / len(content) <= threshold

def is_bin(*args, **kwargs):
    return not is_text(*args, **kwargs)

for fname in (
        '/usr/bin/abiword',
        '/home/tkc/.bashrc',
        ):
    print(fname, is_text(fname))
It lets you tweak the set of characters considered "text", which
defaults to string.printable. If you're using unicode text, adjust
the "text" chars and the file-reading mode accordingly (perhaps
inverting the logic to make it a "binary chars" set).
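
A minimal sketch of that inverted logic, in case it helps: instead of
listing the "text" characters, it lists the "odd" bytes and applies the
NUL-byte rule from the perl docs quoted above. The names looks_binary,
ODD_BYTES and TEXT_CONTROLS are made up for this illustration, and the
choice of which control codes count as harmless is an assumption you
would probably want to adjust.

```python
# Hypothetical helper: count "odd" bytes rather than "text" ones.
TEXT_CONTROLS = frozenset(b"\t\n\r\f\b\x1b")  # assumed-harmless control codes

# Odd = remaining control codes, DEL, and anything with the high bit set.
ODD_BYTES = (frozenset(range(32)) - TEXT_CONTROLS) | frozenset(range(127, 256))


def looks_binary(data, odd=ODD_BYTES, threshold=0.3):
    """Guess whether a block of bytes is binary."""
    if not data:
        return False  # nothing odd in an empty block
    if 0 in data:  # any NUL byte settles it
        return True
    strange = sum(1 for b in data if b in odd)
    return strange / len(data) > threshold
```

Because every byte above 127 counts as odd here, UTF-8 text that is
mostly non-ASCII would be flagged as binary; passing a smaller set for
odd is one way around that.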
You can also change the threshold from 0.3 (30%) to whatever you
need, and test the entire file or a subset of it (this defaults
to just reading the first kilobyte of the file, but if you pass None
for the portion, it will read the whole thing, even if it's a TB file).
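
If the whole-file case matters, one option is to count in fixed-size
chunks so that even a very large file never sits in memory all at once.
This is only a sketch: text_ratio, PRINTABLE and chunk_size are names
invented for the example, and it assumes byte-oriented reading.

```python
import string

# Bytes treated as "text"; same idea as string.printable, but as ints.
PRINTABLE = frozenset(string.printable.encode("ascii"))


def text_ratio(fname, chars=PRINTABLE, chunk_size=64 * 1024):
    """Return the fraction of bytes in the file that are in `chars`."""
    total = valid = 0
    with open(fname, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += len(chunk)
            valid += sum(1 for b in chunk if b in chars)
    # An empty file has no odd bytes, so report it as fully "text".
    return valid / total if total else 1.0
```

Comparing 1 - text_ratio(fname) against your threshold then gives the
same kind of answer as the function above, just over the entire file.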
-tkc
--
http://mail.python.org/mailman/listinfo/python-list