Greetings,

> If you are parsing files in a directory what is the best way to
> record which files were actioned?
>
> So that if i re-parse the directory i only parse the new files in
> the directory?
How will you know that the files are new?  If a file has exactly the
same content as another file, but a different name, is it new?  Often
this depends on the characteristics of the system in which your
(planned) software is operating.

Peter Otten has also asked for some more context, which would help us
give you some tips that are more targeted to the problem you are
trying to solve.  But, I'll just forge ahead and make some
assumptions:

  * You are watching a directory for new/changed files.
  * New files are appearing regularly.
  * Contents of old files get updated and you want to know.

Have you ever seen an MD5SUMS file?  Do you know what a content hash
is?  You could find a place to store the content hash (a.k.a. digest)
of each file that you process.  Below is a program that should work
in Python 2 and Python 3.  You could use this sort of approach as
part of your solution.

In order to make sure you have handled a file before, you should
store and compare two things:

  1. The filename.
  2. The content hash.

Note: If you are sure the content is not going to change, then just
use the filename to track whether you have handled something or not.

How would you use this tracking info?

  * Create a dictionary (or a set), e.g.:

        handled = dict()
        handled[('410c35da37b9a25d9b5d701753b011e5', 'setup.py')] = time.time()

    This lasts only as long as the program runs, but you will know
    that you have handled any file by the tuple of its content hash
    and filename.

  * Store the filename (and/or digest) in a database.  So many
    options: sqlite, pickle, anydbm, a text file of your own
    crafting, SQLAlchemy ...
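If you like the database idea, here is a minimal sketch using the
stdlib sqlite3 module (the table layout and the function names are
just my invention, not anything standard):

```python
import sqlite3
import time


def open_tracking_db(path='handled.db'):
    """Create (if needed) and return a connection to the tracking DB."""
    db = sqlite3.connect(path)
    db.execute('''CREATE TABLE IF NOT EXISTS handled
                  (digest TEXT, fname TEXT, seen REAL,
                   PRIMARY KEY (digest, fname))''')
    return db


def already_handled(db, digest, fname):
    """True if this (digest, fname) pair has been recorded before."""
    row = db.execute('SELECT 1 FROM handled WHERE digest=? AND fname=?',
                     (digest, fname)).fetchone()
    return row is not None


def mark_handled(db, digest, fname):
    """Record that we have processed this (digest, fname) pair."""
    db.execute('INSERT OR IGNORE INTO handled VALUES (?, ?, ?)',
               (digest, fname, time.time()))
    db.commit()
```

Because the tracking data lives in a file on disk (pass ':memory:'
while experimenting), the knowledge of what you have handled survives
across runs of your parser, unlike the in-memory dict.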
  * Create a file, hardlink or symlink in the filesystem (in the same
    directory or another directory), e.g.:

        trackingfile = os.path.join('another-directory', 'setup.py')
        with open(trackingfile, 'w') as f:
            f.write('410c35da37b9a25d9b5d701753b011e5')

    OR

        os.symlink('setup.py', '410c35da37b9a25d9b5d701753b011e5-setup.py')

Now, you can also examine your little cache of handled files to
compare for when the content hash changes.  If the system is an
automated system, then this can be perfectly fine.  If humans create
the files, I would suggest not doing this.  Humans tend to be easily
confused by such things (and then want to delete the files or just be
intimidated by them; scary hashes!).

There are lots of options, but without some more context, we can only
make generic suggestions.  So, I'll stop with my generic suggestions
now.

Have fun and good luck!

-Martin

  #! /usr/bin/python

  from __future__ import print_function

  import os
  import sys
  import logging
  import hashlib

  logformat = '%(levelname)-9s %(name)s %(filename)s#%(lineno)s ' \
              + '%(funcName)s %(message)s'
  logging.basicConfig(stream=sys.stderr, format=logformat,
                      level=logging.ERROR)
  logger = logging.getLogger(__name__)


  def hashthatfile(fname):
      contenthash = hashlib.md5()
      try:
          with open(fname, 'rb') as f:
              contenthash.update(f.read())
          return contenthash.hexdigest()
      except IOError as e:
          logger.warning("See exception below; skipping file %s", fname)
          logger.exception(e)
          return None


  def main(dirname):
      for fname in os.listdir(dirname):
          # os.listdir() returns bare names; join with dirname so this
          # works even when dirname is not the current working directory
          fpath = os.path.join(dirname, fname)
          if not os.path.isfile(fpath):
              logger.debug("Skipping non-file %s", fname)
              continue
          logger.info("Found file %s", fname)
          digest = hashthatfile(fpath)
          logger.info("Computed MD5 hash digest %s", digest)
          print('%s  %s' % (digest, fname,))
      return os.EX_OK


  if __name__ == '__main__':
      if len(sys.argv) == 1:
          sys.exit(main(os.getcwd()))
      else:
          sys.exit(main(sys.argv[1]))

  # -- end of file --

Martin A. Brown
http://linux-ip.net/
-- 
https://mail.python.org/mailman/listinfo/python-list
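P.S.  To tie the pieces together: on a re-scan you can compare the
digest of each file on disk against the digests you stored last time,
and sort the files into new, changed and unchanged.  This sketch is
only my illustration of that comparison (the classify() name and the
returned lists are my own, not part of any library):

```python
import hashlib
import os


def hashthatfile(fname):
    """MD5 hex digest of a file's contents."""
    contenthash = hashlib.md5()
    with open(fname, 'rb') as f:
        contenthash.update(f.read())
    return contenthash.hexdigest()


def classify(dirname, previous):
    """Compare files in dirname against previous, a {fname: digest}
    mapping from the last scan.  Returns (new, changed, unchanged)."""
    new, changed, unchanged = [], [], []
    for fname in sorted(os.listdir(dirname)):
        fpath = os.path.join(dirname, fname)
        if not os.path.isfile(fpath):
            continue
        digest = hashthatfile(fpath)
        if fname not in previous:
            new.append(fname)
        elif previous[fname] != digest:
            changed.append(fname)
        else:
            unchanged.append(fname)
    return new, changed, unchanged
```

Your parser would then process only the new and changed lists, and
rewrite the stored {fname: digest} mapping afterwards.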