Re: Python version of perl's "if (-T ..)" and "if (-B ...)"?

Tim Chase Fri, 12 Feb 2010 07:04:41 -0800

Lloyd Zusman wrote:

Perl has the following constructs to check whether a file is considered
to contain "text" or "binary" data:


if (-T $filename) { print "file contains 'text' characters\n"; }
if (-B $filename) { print "file contains 'binary' characters\n"; }

Is there already a Python analog to these? I'm happy to write them on
my own if no such constructs currently exist, but before I start, I'd
like to make sure that I'm not "re-inventing the wheel".

By the way, here's what the perl docs say about these constructs. I'm
looking for something similar in Python:

... The -T  and -B  switches work as follows. The first block or so
... of the file is examined for odd characters such as strange control
... codes or characters with the high bit set. If too many strange
... characters (>30%) are found, it's a -B file; otherwise it's a -T
... file. Also, any file containing null in the first block is
... considered a binary file. [ ... ]

While I agree with the others who have responded along the lines of "that's a hinky heuristic", it's not too hard to write an analog:


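  import string
  def is_text(fname,
      # in 'rb' mode the file yields byte values (ints), so store those
      chars=set(string.printable.encode('ascii')),
      threshold=0.3,
      portion=1024, # read a kilobyte to find out
      mode='rb',
      ):
    assert portion is None or portion > 0
    assert 0 < threshold < 1
    # read while the file is still open; "with" closes it afterwards
    with open(fname, mode) as f:
      if portion is None:
        content = f.read() # the whole file
      else:
        content = f.read(int(portion))
    total = valid = 0
    for c in content:
      if c in chars:
        valid += 1
      total += 1
    if not total:
      return True # empty file: nothing odd in it, so call it text
    # as in perl, it's binary if more than threshold (30%) is odd
    return (float(total - valid)/total) <= threshold

  def is_bin(*args, **kwargs):
    return not is_text(*args, **kwargs)

  for fname in (
      '/usr/bin/abiword',
      '/home/tkc/.bashrc',
      ):
    print(fname, is_text(fname))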

It should allow you to tweak the charset to consider "text", defaulting to string.printable, but adjust the "text" chars and the file-reading-mode accordingly if you're using unicode text (perhaps inverting the logic to make it a "binary chars" set). You can also change the threshold from 0.3 (30%) to whatever you need, and test the entire file or a subset of it (this defaults to just reading the first K of the file, but if you pass None for the portion, it will read the whole thing, even if it's a TB file).
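
For something closer to the perl docs quoted above, here is a rough sketch with the logic inverted into a "binary" test: a null byte anywhere in the first block means binary, otherwise it is binary when more than 30% of the bytes look odd. The names TEXT_BYTES, block_looks_binary and looks_binary are made up for illustration, and both the set of bytes counted as ordinary text and the 1024-byte block size are assumptions, not perl's actual rules:

```python
# Assumption: "ordinary" text bytes are printable ASCII plus common
# whitespace; control codes and high-bit bytes count as odd.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | frozenset(b"\t\n\r\f")


def block_looks_binary(block, threshold=0.3):
    """Apply the heuristic to a bytes object (hypothetical helper)."""
    if not block:
        return False  # assumption: an empty block is treated as text
    if 0 in block:
        return True  # any null byte in the block means binary
    odd = sum(1 for byte in block if byte not in TEXT_BYTES)
    return odd / len(block) > threshold


def looks_binary(fname, threshold=0.3, blocksize=1024):
    """Read only the first block of fname and apply the heuristic."""
    with open(fname, "rb") as f:
        block = f.read(blocksize)
    return block_looks_binary(block, threshold)
```

Calling looks_binary(fname) on each path in the loop above should give the opposite of what is_text reports in most cases, apart from files with a null byte in the first block, which this version always calls binary. Like the original, it only guesses: UTF-16 text contains null bytes and will be reported as binary.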


-tkc





