On 15May2014 15:34, Benjamin Schollnick <benja...@schollnick.net> wrote:
I am going to be using this code as part of a web system, and I would love
any feedback, comments and criticism. [...]
I am using scandir from benhoyt to speed up the directory listings, and
data collection. [...]
I had considered using OrderedDicts, but I really didn't see how that would
help the system.

I'm not completely happy with the return_sort_* functions, since they
return two different tuples, one goal was to try to keep everything in the
dictionary, but I couldn't think of a better method.

So any suggestions are welcome.

Firstly, I'm with ChrisA on the whole: let the filesystem/OS do the stat caching. It _is_ actually slower to rely on this than an in-memory cache such as yours, but at least it is reliable because you will then have exactly the same file info state as any other user of the OS. Stale caches are a usability PITA for the end user; witness the continued presence of "Shift-Reload" in most browsers because browser (and proxy) caches get stale.

His remark about relying on the mtime of a directory is correct also, and not just because of clock changes. On UNIX, the mtime of a directory changes _only_ when a filename is added/removed from the directory. To the point, it will not change if a file in the directory is just modified, and therefore your cache will become stale (and stay stale) in that circumstance.

Of course, you wouldn't even be bothering with scandir if you were not concerned about relying of the filesystem/OS. So, it is your call about remembering this stuff. I would include a real-time staleness in addition to the mtime check, eg if the mtime is updated _or_ the system time says it is more than N seconds since the last directory scan, with N being smallish.

Most of my other remarks have to do with style and implementation.

Preqs -
   Scandir - https://github.com/benhoyt/scandir
   scandir is a module which provides a generator version of
   os.listdir() that also exposes the extra file information the
   operating system returns when you iterate a directory.

Personally, my habit is to use os.listdir() and accept the memory cost for two reasons: it is simple and the list is stable. If you iterate over a directory (eg via the C library readdir() facility) to my mind there is scope for the directory changing underneath you. A bit like iterating over a dictionary which is in active use by other code. Using os.listdir() minimises that window.

So I'd be going:

  names = list(scandir(directory_path))

to get a snapshot, and then working with that list.

from stat import ST_MODE, ST_INO, ST_DEV, ST_NLINK, ST_UID, ST_GID, \
                   ST_SIZE, ST_ATIME, ST_MTIME, ST_CTIME

You don't need these. os.lstat() and os.stat() return stat objects, and they have convenient .st_mtime etc attributes. So all your:

  data["st_nlink"] = st[ST_NLINK]

stuff can become:

  data["st_nlink"] = st.st_nlink

I'd also be making your "data" object just an object, and assigning directory to an attribute on it, thus:

  data.st_nlink = st.st_nlink

Much more readable.

You could also avoid the entire issue of manually copying each st.st_blah attribute by just going:

  data.st = st

Then just reach for data.st.st_mtime (or whatever) as needed. Short, quick, avoids needing to worry about optional stat fields - just keep the original stat result.


import time
import scandir

plugin_name = "dir_cache"

#####################################################
class   CachedDirectory(object):
   """
   For example:

       To be added shortly.

It is _well_ work sketching this part out. Since this is just a cache, an example should be quite short.

I find it enormously convenient to sketch some of the use cases here - it helps keep the required method names nicely defined. i.e. sketch a little code that you need to write, then write methods to support the code. You get a much better match that way. I often revisit classes when I use them, because I write some code using the class and think "that was very painful, how would I _like_ to have been able to use this"?

   """
   def __init__(self):
       self.files_to_ignore = ['.ds_store', '.htaccess']
       self.root_path = None

You don't seem to use .root_path?

           # This is the path in the OS that is being examined
           #    (e.g. /Volumes/Users/username/)
       self.directory_cache = {}

This is a CachedDirectory class. I would have just called this .cache - it will make all the code using it less wordy, and hopefully more readable. Unless there may be caches of other stuff too. This is almost entirely a matter of personal style though.

   def _scan_directory_list(self, scan_directory):
       """
           Scan the directory "scan_directory", and save it to the
           self.directory_cache dictionary.

           Low Level function, intended to be used by the populate
function.
       """
       scan_directory = os.path.abspath(scan_directory)
       directories = {}
       files = {}
       self.directory_cache[scan_directory.strip().lower()] = {}
       self.directory_cache[scan_directory.strip().lower()]["number_dirs"] = 0

You write "scan_directory.strip().lower()". It if probably worth going:

  norm_directory = scan_directory.strip().lower()

and saying "norm_directory" throughout. Also, you say:

  self.directory_cache[scan_directory.strip().lower()]

You can also bind:

  cache = self.directory_cache[norm_directory]

right after binding "norm_directory", and say "cache" throughout instead of the longer term. It is naming the same object.

self.directory_cache[scan_directory.strip().lower()]["number_files"] = 0
       for x in scandir.scandir(scan_directory):
           st = x.lstat()
           data = {}

As mentioned earlier, you could usefully define a tiny class:

  class Dirent(object):
    pass

and go:

  data = Dirent()

which will allow you to replace:

           data["fq_filename"] = os.path.realpath(scan_directory).lower() + \
                   os.sep+x.name.strip().lower()

with:

  data.fq_filename = os.path.realpath( os.path.join(norm_directory,
                                                    x.name.strip().lower()) )

Note also the use of os.path.join: it is portable across OSes (eg UNIX versus Windows).

Also, well worth defining:

  norm_name = x.name.strip().lower()

and just saying "norm_name" throughout. More readable, less prone to typing accidents.

Um. A serious point here: why do you go .strip().lower() to all your filenames and pathnames? On a real UNIX system that will usually mean a different object. Many other platforms may have case insensitive names, but will still complaint about names with and without whitespace as different objects. I would even expect os.path.realname to be unreliable if handed such mangled names.

If you are sanitising some input I would recommend doing that elsewhere, and not binding some much special knowledge into this class.

Maybe you have good reasons for that. At the least those reasons need to be expressed in comments (best in the opening docstring), because otherwise I would recommend strongly to make the cache class as naive about names as possible (i.e. do not mangle them).

           data["parentdirectory"] = os.sep.join(\
                   os.path.split(scan_directory)[0:-1])

Use os.path.dirname.

           data["st_mode"] = st[ST_MODE]
           data["st_inode"] = st[ST_INO]
           data["st_dev"] = st[ST_DEV]
           data["st_nlink"] = st[ST_NLINK]
           data["st_uid"] = st[ST_UID]
           data["st_gid"] = st[ST_GID]
           data["compressed"] = st[ST_SIZE]

"compressed" is worth a comment.

           data["st_size"] = st[ST_SIZE]       #10
           data["st_atime"] = st[ST_ATIME]     #11
           data["raw_st_mtime"] = st[ST_MTIME] #12
           data["st_mtime"] = time.asctime(time.localtime(st[ST_MTIME]))

I strongly recommend against this.

Keep "st_mtime" (or data.st.st_mtime per my earlier suggestion) as the raw mtime. Any human readable transcription of the time should get a name like "human_st_mtime".

Better still, if you make a Dirent class for "data", you can give it a property like:

  @property
  def human_st_mtime(self):
    return time.asctime(time.localtime(self.st.st_mtime))

and just use ".human.st.mtime" like any other attribute later when needed. Note: no brackets.

           data["st_ctime"] = st[ST_CTIME]
           if not x.name.strip().lower() in self.files_to_ignore:

Personally I phrase this as "a not in b" instead of "not a in b".

Also, if you really are ignoring it, you could go:

  if norm_name in self.files_to_ignore:
    # do nothing else with this entry
    continue

and save yourself some indentation below. Not everyone likes this.

   def directory_in_cache(self, scan_directory):
       """
           Pass the target directory

           Will return True if the directory is already cached
           Will return False if the directory is not already cached
       """
       scan_directory = os.path.realpath(scan_directory).lower().strip()

Probably worth making normalisation function:

  def norm_name(name):
    return name.lower().strip()

becuase you do this a lot. Using a function ensures you do it the same way everywhere and makes it obvious that it is the same thing with the same purpose everywhere.

       return scan_directory in self.directory_cache.keys()

You can just say:

  return scan_directory in self.directory_cache

and for added points you can rename "directory_in_cache" to "__contains__". That way from outside (or inside) you use "in" directly. Example:

  if "foo" in self:

Cheers,
Cameron Simpson <c...@zip.com.au>
--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to