Lloyd Zusman wrote:
Perl has the following constructs to check whether a file is considered
to contain "text" or "binary" data:

if (-T $filename) { print "file contains 'text' characters\n"; }
if (-B $filename) { print "file contains 'binary' characters\n"; }

Is there already a Python analog to these? I'm happy to write them on
my own if no such constructs currently exist, but before I start, I'd
like to make sure that I'm not "re-inventing the wheel".

By the way, here's what the perl docs say about these constructs. I'm
looking for something similar in Python:

... The -T  and -B  switches work as follows. The first block or so
... of the file is examined for odd characters such as strange control
... codes or characters with the high bit set. If too many strange
... characters (>30%) are found, it's a -B file; otherwise it's a -T
... file. Also, any file containing null in the first block is
... considered a binary file. [ ... ]

While I agree with the others who have responded along the lines of "that's a hinky heuristic", it's not too hard to write an analog:

  import string

  def is_text(fname,
      # byte values counted as "text" (iterating over bytes yields ints)
      chars=frozenset(string.printable.encode('ascii')),
      threshold=0.3, # largest fraction of non-text bytes still called "text"
      portion=1024, # read a kilobyte to find out
      mode='rb',
      ):
    assert portion is None or portion > 0
    assert 0 < threshold < 1
    # read the data before the file is closed; None means the whole file
    with open(fname, mode) as f:
      if portion is None:
        content = f.read()
      else:
        content = f.read(int(portion))
    if not content:
      return True # an empty file has nothing odd in it (and avoids 0/0)
    odd = sum(1 for c in content if c not in chars)
    return (float(odd) / len(content)) <= threshold

  def is_bin(*args, **kwargs):
    return not is_text(*args, **kwargs)

  for fname in (
      '/usr/bin/abiword',
      '/home/tkc/.bashrc',
      ):
    print(fname, is_text(fname))

It lets you tweak the set of characters considered "text", which defaults to the characters in string.printable. If you're working with Unicode text, adjust the "text" chars and the file-reading mode accordingly (perhaps inverting the logic to use a set of "binary" chars instead). You can also change the threshold from 0.3 (30%) to whatever you need, and test either the entire file or just a portion of it: by default it reads only the first kilobyte, but if you pass None for the portion it will read the whole thing, even if it's a terabyte file.
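
For something closer to the rule in the quoted Perl docs (any null means binary, otherwise binary if more than 30% of the block is "odd"), a rough Python 3 sketch with the inverted "binary chars" logic might look like the following. The names looks_binary and is_binary_file are made up for illustration, and the definition of "odd" (control codes other than common whitespace, DEL, and bytes with the high bit set) is my assumption, not Perl's exact rule. Treating empty input as text is also an assumption.

```python
# Assumption: these control bytes (tab, LF, VT, FF, CR) count as ordinary text.
ORDINARY_CONTROLS = b"\t\n\x0b\x0c\r"

def looks_binary(block, threshold=0.3):
    """Guess whether a bytes block is binary, loosely following the quoted docs."""
    if not block:
        return False  # assumption: call empty input text
    if b"\x00" in block:
        return True  # any null in the block means binary
    # Count "odd" bytes: unusual control codes, DEL, or high-bit bytes.
    odd = sum(
        1 for b in block
        if (b < 32 and b not in ORDINARY_CONTROLS) or b >= 127
    )
    return odd / len(block) > threshold

def is_binary_file(path, blocksize=1024):
    """Apply looks_binary to the first blocksize bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(blocksize))
```

Note that UTF-8 text with many non-ASCII characters has lots of high-bit bytes, so this will call it binary once those bytes pass the threshold; raise the threshold or change the "odd" test if that matters for your files.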

-tkc

--
http://mail.python.org/mailman/listinfo/python-list
