On Thu, Sep 18, 2014 at 11:11 AM, David Alban <exta...@extasia.org> wrote:
> #!/usr/bin/python
>
> import argparse
> import hashlib
> import os
> import re
> import socket
> import sys
>
> from stat import *

Generally, "from module import *" imports are discouraged, as they tend to pollute your namespace and can accidentally override existing functions and variables. It's more Pythonic to use the other import forms (or "import ... as ...") and reference names through the namespace, as you are doing everywhere else. The main case where "from module import *" is recommended is API imports (for example, importing the API of one module into another, such as for inter-platform, inter-version, or accelerator support).

> ascii_nul = chr(0)
>
> # from: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
> # except that i use hexdigest() rather than digest()
> def md5_for_file(f, block_size=2**20):
>     md5 = hashlib.md5()
>     while True:
>         data = f.read(block_size)
>         if not data:
>             break
>         md5.update(data)
>     return md5.hexdigest()
>
> thishost = socket.gethostname()
>
> parser = argparse.ArgumentParser(description='scan files in a tree and print a line of information about each regular file')
> parser.add_argument('--start-directory', '-d', default='.', help='specifies the root of the filesystem tree to be processed')
> args = parser.parse_args()
>
> start_directory = re.sub( '/+$', '', args.start_directory )

I'm not sure this is actually needed. It's also not platform-independent, as some platforms (e.g. Windows) primarily use "\" instead.

> for directory_path, directory_names, file_names in os.walk( start_directory ):
>     for file_name in file_names:
>         file_path = "%s/%s" % ( directory_path, file_name )

os.path.join would be more cross-platform than the string formatting. Basically, this line would become:

    file_path = os.path.join(directory_path, file_name)

os.path.join will also ensure that, regardless of the inputs, the paths are joined by only a single slash.

>         lstat_info = os.lstat( file_path )
>
>         mode = lstat_info.st_mode
>
>         if not S_ISREG( mode ) or S_ISLNK( mode ):
>             continue
>
>         f = open( file_path, 'r' )
>         md5sum = md5_for_file( f )

The Pythonic thing to do here would be to use a "with" statement to ensure the file is closed in a timely manner. This requires Python 2.6 or newer (2.5 works as well with a future import). The above two lines would become:

    with open( file_path, 'r' ) as f:
        md5sum = md5_for_file( f )

I do note that you never explicitly close the files (which is done via the "with" statement in my example). While that is generally fine, since CPython will close them automatically once they are no longer referenced, it's not a good practice to get into. Other Python implementations may delay closing the file, which could then result in errors when processing a huge number of files. The "with" statement ensures the file is closed immediately after the md5 computation finishes, even if there is an error computing the md5. Note that in any case the OS will close the file when the process exits, but relying on that is likely even worse than relying on Python to close the files for you.

Additionally, you may want to specify binary mode with open(file_path, 'rb') to ensure platform independence ('r' uses universal newlines, which means that on Windows, Python will convert "\r\n" to "\n" while reading the file). Some platforms will also treat binary files differently.
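To make those suggestions concrete, here is a rough sketch of how the inner loop might end up looking (this reuses md5_for_file and start_directory from your script and only replaces the lines discussed above):

    from stat import S_ISREG, S_ISLNK   # explicit names instead of "from stat import *"

    for directory_path, directory_names, file_names in os.walk(start_directory):
        for file_name in file_names:
            # cross-platform path construction, joined by a single separator
            file_path = os.path.join(directory_path, file_name)
            mode = os.lstat(file_path).st_mode
            if not S_ISREG(mode) or S_ISLNK(mode):
                continue
            # binary mode avoids newline translation; "with" closes the file promptly
            with open(file_path, 'rb') as f:
                md5sum = md5_for_file(f)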
You may also want to put some additional error handling in here. For example, the file could be deleted between the "walk" call and the "open" call, or the file may not be readable (locked by another process, incorrect permissions, etc.). Without knowing your use case, you may need to deal with those cases, or maybe having the script fail with an error message is good enough.

>         dev = lstat_info.st_dev
>         ino = lstat_info.st_ino
>         nlink = lstat_info.st_nlink
>         size = lstat_info.st_size
>
>         sep = ascii_nul
>
>         print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep, dev, sep, ino, sep, nlink, sep, size, sep, file_path )

You could use sep.join([thishost, md5sum, dev, ino, nlink, size, file_path]) rather than a string format here, presuming all the values are strings (you can call str() on the non-string values to convert them, which does the same as the "%s" formatter); note that join takes a single iterable rather than separate arguments.

I don't know how much control you have over the output format (you said you intend to pipe this output into other code), but if you can change it, I would suggest either using a pure binary format, using a more human-readable separator than chr(0), or at least providing an argument to the script to set the separator (many Linux command-line tools have a -0 or --null option for the same purpose).

Also, it seems odd that you include socket.gethostname() in the output, as that will always be the system you are running the code on, and not necessarily the system you are retrieving data from (os.walk will work on network paths, including UNC paths).

> exit( 0 )

The only other thing I see is that I would probably break the code into a few additional functions, and put the argument parsing and the initial call into an "if __name__ == '__main__':" block. This would allow your code to be imported as a module and called by other Python scripts in the future, as well as still being executable as a script from the command line. It will not matter if you only ever intend to use this as a command-line script, but it could be useful if you want to reuse the code later in a larger project. To do this, however, you would need to make the function yield/return the results rather than print them directly.
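For what it's worth, here is a rough sketch of that restructuring. The function names (scan_tree, main) and the --separator option are my own additions, not part of your original script, and the sketch also folds in the earlier suggestions about explicit stat imports, os.path.join, and the "with" statement:

    #!/usr/bin/python
    import argparse
    import hashlib
    import os
    import socket
    from stat import S_ISREG, S_ISLNK

    def md5_for_file(f, block_size=2**20):
        md5 = hashlib.md5()
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
        return md5.hexdigest()

    def scan_tree(start_directory):
        # Generator: yields one tuple per regular file, so other Python code
        # can import and reuse this without having to capture printed output.
        thishost = socket.gethostname()
        for directory_path, directory_names, file_names in os.walk(start_directory):
            for file_name in file_names:
                file_path = os.path.join(directory_path, file_name)
                st = os.lstat(file_path)
                if not S_ISREG(st.st_mode) or S_ISLNK(st.st_mode):
                    continue
                with open(file_path, 'rb') as f:
                    md5sum = md5_for_file(f)
                yield (thishost, md5sum, st.st_dev, st.st_ino,
                       st.st_nlink, st.st_size, file_path)

    def main():
        parser = argparse.ArgumentParser(
            description='scan files in a tree and print a line of '
                        'information about each regular file')
        parser.add_argument('--start-directory', '-d', default='.',
                            help='root of the filesystem tree to be processed')
        parser.add_argument('--separator', '-s', default=chr(0),
                            help='output field separator (defaults to NUL)')
        args = parser.parse_args()
        for record in scan_tree(args.start_directory):
            # str() converts the non-string fields, as the %s formatter did
            print args.separator.join(str(field) for field in record)

    if __name__ == '__main__':
        main()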
-- https://mail.python.org/mailman/listinfo/python-list