On Thu, Sep 18, 2014 at 11:11 AM, David Alban <exta...@extasia.org> wrote:
> #!/usr/bin/python
>
> import argparse
> import hashlib
> import os
> import re
> import socket
> import sys
>
> from stat import *

Generally, "from module import *" imports are discouraged, as they tend to pollute your namespace and can accidentally override existing functions and variables. It's more Pythonic to use the other import forms (or "import ... as ...") and reference names through the namespace, as you are doing everywhere else. The main case where "from module import *" is recommended is API imports (for example, importing the API of one module into another, such as for inter-platform, inter-version, or accelerator support).

> ascii_nul = chr(0)
>
> # from: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
> # except that i use hexdigest() rather than digest()
> def md5_for_file(f, block_size=2**20):
>     md5 = hashlib.md5()
>     while True:
>         data = f.read(block_size)
>         if not data:
>             break
>         md5.update(data)
>     return md5.hexdigest()
>
> thishost = socket.gethostname()
>
> parser = argparse.ArgumentParser(description='scan files in a tree and print a line of information about each regular file')
> parser.add_argument('--start-directory', '-d', default='.', help='specifies the root of the filesystem tree to be processed')
> args = parser.parse_args()
>
> start_directory = re.sub( '/+$', '', args.start_directory )

I'm not sure this is actually needed. It's also not platform-independent, as some platforms (e.g. Windows) primarily use "\" instead.

> for directory_path, directory_names, file_names in os.walk( start_directory ):
>     for file_name in file_names:
>         file_path = "%s/%s" % ( directory_path, file_name )

os.path.join would be more cross-platform than the string formatting. Basically, this line would become:

    file_path = os.path.join(directory_path, file_name)

os.path.join will also ensure that, regardless of the inputs, the paths are joined by only a single slash.

>         lstat_info = os.lstat( file_path )
>
>         mode = lstat_info.st_mode
>
>         if not S_ISREG( mode ) or S_ISLNK( mode ):
>             continue
>
>         f = open( file_path, 'r' )
>         md5sum = md5_for_file( f )

The Pythonic thing to do here would be to use a "with" statement to ensure the file is closed in a timely manner. This requires Python 2.6 or newer (2.5 works as well with a future import). The above two lines would become:

    with open( file_path, 'r' ) as f:
        md5sum = md5_for_file( f )

I do note that you never explicitly close the files (which is done via the "with" statement in my example). While that is generally fine, since CPython will close them automatically once they are no longer referenced, it's not a good practice to get into. Other Python implementations may delay closing the file, which could then result in errors when processing a huge number of files. The "with" statement ensures the file is closed immediately after the md5 computation finishes, even if there is an error computing the md5. Note that in any case the OS will close the file when the process exits, but relying on that is likely even worse than relying on Python to close the files for you.

Additionally, you may want to specify binary mode with open(file_path, 'rb') to ensure platform independence ('r' uses universal newlines, which means that on Windows, Python will convert "\r\n" to "\n" while reading the file). Some platforms will also treat binary files differently.
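To make those suggestions concrete, here is a rough sketch of how the inner loop might end up looking (this reuses md5_for_file and start_directory from your script and only replaces the lines discussed above):

    from stat import S_ISREG, S_ISLNK   # explicit names instead of "from stat import *"

    for directory_path, directory_names, file_names in os.walk(start_directory):
        for file_name in file_names:
            # cross-platform path construction, joined by a single separator
            file_path = os.path.join(directory_path, file_name)
            mode = os.lstat(file_path).st_mode
            if not S_ISREG(mode) or S_ISLNK(mode):
                continue
            # binary mode avoids newline translation; "with" closes the file promptly
            with open(file_path, 'rb') as f:
                md5sum = md5_for_file(f)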
You may also want to put some additional error handling in here. For example, the file could be deleted between the "walk" call and the "open" call, or the file may not be readable (locked by another process, incorrect permissions, etc.). Without knowing your use case, you may need to deal with those cases, or maybe having the script fail with an error message is good enough.

>         dev = lstat_info.st_dev
>         ino = lstat_info.st_ino
>         nlink = lstat_info.st_nlink
>         size = lstat_info.st_size
>
>         sep = ascii_nul
>
>         print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep, dev, sep, ino, sep, nlink, sep, size, sep, file_path )

You could use sep.join([thishost, md5sum, dev, ino, nlink, size, file_path]) rather than a string format here, presuming all the values are strings (you can call str() on the non-string values to convert them, which does the same as the "%s" formatter); note that join takes a single iterable rather than separate arguments.

I don't know how much control you have over the output format (you said you intend to pipe this output into other code), but if you can change it, I would suggest either using a pure binary format, using a more human-readable separator than chr(0), or at least providing an argument to the script to set the separator (many Linux command-line tools have a -0 or --null option for the same purpose).

Also, it seems odd that you include socket.gethostname() in the output, as that will always be the system you are running the code on, and not necessarily the system you are retrieving data from (os.walk will work on network paths, including UNC paths).

> exit( 0 )

The only other thing I see is that I would probably break the code into a few additional functions, and put the argument parsing and the initial call into an "if __name__ == '__main__':" block. This would allow your code to be imported as a module and called by other Python scripts in the future, as well as still being executable as a script from the command line. It will not matter if you only ever intend to use this as a command-line script, but it could be useful if you want to reuse the code later in a larger project. To do this, however, you would need to make the function yield/return the results rather than print them directly.
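For what it's worth, here is a rough sketch of that restructuring. The function names (scan_tree, main) and the --separator option are my own additions, not part of your original script, and the sketch also folds in the earlier suggestions about explicit stat imports, os.path.join, and the "with" statement:

    #!/usr/bin/python
    import argparse
    import hashlib
    import os
    import socket
    from stat import S_ISREG, S_ISLNK

    def md5_for_file(f, block_size=2**20):
        md5 = hashlib.md5()
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
        return md5.hexdigest()

    def scan_tree(start_directory):
        # Generator: yields one tuple per regular file, so other Python code
        # can import and reuse this without having to capture printed output.
        thishost = socket.gethostname()
        for directory_path, directory_names, file_names in os.walk(start_directory):
            for file_name in file_names:
                file_path = os.path.join(directory_path, file_name)
                st = os.lstat(file_path)
                if not S_ISREG(st.st_mode) or S_ISLNK(st.st_mode):
                    continue
                with open(file_path, 'rb') as f:
                    md5sum = md5_for_file(f)
                yield (thishost, md5sum, st.st_dev, st.st_ino,
                       st.st_nlink, st.st_size, file_path)

    def main():
        parser = argparse.ArgumentParser(
            description='scan files in a tree and print a line of '
                        'information about each regular file')
        parser.add_argument('--start-directory', '-d', default='.',
                            help='root of the filesystem tree to be processed')
        parser.add_argument('--separator', '-s', default=chr(0),
                            help='output field separator (defaults to NUL)')
        args = parser.parse_args()
        for record in scan_tree(args.start_directory):
            # str() converts the non-string fields, as the %s formatter did
            print args.separator.join(str(field) for field in record)

    if __name__ == '__main__':
        main()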
-- https://mail.python.org/mailman/listinfo/python-list