Re: Directory Caching, suggestions and comments?

Cameron Simpson Thu, 15 May 2014 17:29:27 -0700

On 15May2014 15:34, Benjamin Schollnick <benja...@schollnick.net> wrote:

I am going to be using this code as part of a web system, and I would love
any feedback, comments and criticism. [...]
I am using scandir from benhoyt to speed up the directory listings, and
data collection. [...]
I had considered using OrderedDicts, but I really didn't see how that would
help the system.


I'm not completely happy with the return_sort_* functions, since they
return two different tuples, one goal was to try to keep everything in the
dictionary, but I couldn't think of a better method.

So any suggestions are welcome.

Firstly, I'm with ChrisA on the whole: let the filesystem/OS do the stat caching. It _is_ actually slower to rely on this than on an in-memory cache such as yours, but at least it is reliable because you will then have exactly the same file info state as any other user of the OS. Stale caches are a usability PITA for the end user; witness the continued presence of "Shift-Reload" in most browsers because browser (and proxy) caches get stale.

His remark about relying on the mtime of a directory is correct also, and not just because of clock changes. On UNIX, the mtime of a directory changes _only_ when a filename is added to or removed from the directory. To the point, it will not change if a file in the directory is just modified, and therefore your cache will become stale (and stay stale) in that circumstance.
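A small sketch of that behaviour, assuming a typical UNIX filesystem. The helper names (dir_mtime_changed_by, demo) are made up for illustration; the directory's mtime is first forced to an old value so that any update is easy to spot:

```python
import os
import tempfile

OLD = 1000  # arbitrary old timestamp, far away from "now"

def dir_mtime_changed_by(action, directory):
    """Return True if running action() changes the directory's own mtime."""
    os.utime(directory, (OLD, OLD))  # reset atime/mtime to a known value
    action()
    return os.stat(directory).st_mtime != OLD

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "a.txt")

        def create():
            # Adds a new name to the directory.
            with open(path, "w") as f:
                f.write("one")

        def modify():
            # Changes the file's contents only; the directory is untouched.
            with open(path, "a") as f:
                f.write("two")

        created = dir_mtime_changed_by(create, d)
        modified = dir_mtime_changed_by(modify, d)
        return created, modified

if __name__ == "__main__":
    print(demo())
```

On such a filesystem the first value should be True and the second False, which is exactly the case where an mtime-only cache goes stale.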

Of course, you wouldn't even be bothering with scandir if you were not concerned about relying on the filesystem/OS. So, it is your call about remembering this stuff. I would include a real-time staleness check in addition to the mtime check, eg treat the cache as stale if the mtime is updated _or_ the system time says it is more than N seconds since the last directory scan, with N being smallish.
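A rough sketch of that combined check. The function name, its parameters and the 5 second default are my own inventions, not anything from your code:

```python
import os
import time

def is_stale(directory, cached_mtime, last_scan, max_age=5.0, now=None):
    """Return True if the cached listing for directory should be rescanned.

    cached_mtime: directory mtime recorded at the last scan
    last_scan:    time.time() value recorded at the last scan
    max_age:      the "N seconds" limit; the default is arbitrary
    now:          injectable clock value, mostly to make testing easy
    """
    if now is None:
        now = time.time()
    # Time-based check: too long since the last scan, regardless of mtime.
    if now - last_scan > max_age:
        return True
    # mtime-based check: a name was added to or removed from the directory.
    return os.stat(directory).st_mtime != cached_mtime
```

The time-based test comes first because it needs no system call; the stat() only happens when the cache is still young.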


Most of my other remarks have to do with style and implementation.

Preqs -
   Scandir - https://github.com/benhoyt/scandir
   scandir is a module which provides a generator version of
   os.listdir() that also exposes the extra file information the
   operating system returns when you iterate a directory.

Personally, my habit is to use os.listdir() and accept the memory cost for two reasons: it is simple and the list is stable. If you iterate over a directory (eg via the C library readdir() facility) to my mind there is scope for the directory changing underneath you. A bit like iterating over a dictionary which is in active use by other code. Using os.listdir() minimises that window.


So I'd be going:

  names = list(scandir.scandir(directory_path))

to get a snapshot, and then working with that list.

from stat import ST_MODE, ST_INO, ST_DEV, ST_NLINK, ST_UID, ST_GID, \
                   ST_SIZE, ST_ATIME, ST_MTIME, ST_CTIME

You don't need these. os.lstat() and os.stat() return stat objects, and they have convenient .st_mtime etc attributes. So all your:


  data["st_nlink"] = st[ST_NLINK]

stuff can become:

  data["st_nlink"] = st.st_nlink

I'd also be making your "data" object just an object, and assigning directly to an attribute on it, thus:


  data.st_nlink = st.st_nlink

Much more readable.

You could also avoid the entire issue of manually copying each st.st_blah attribute by just going:


  data.st = st

Then just reach for data.st.st_mtime (or whatever) as needed. Short, quick, avoids needing to worry about optional stat fields - just keep the original stat result.

import time
import scandir

plugin_name = "dir_cache"

#####################################################
class   CachedDirectory(object):
   """
   For example:

       To be added shortly.

It is _well_ worth sketching this part out. Since this is just a cache, an example should be quite short.

I find it enormously convenient to sketch some of the use cases here - it helps keep the required method names nicely defined. i.e. sketch a little code that you need to write, then write methods to support the code. You get a much better match that way. I often revisit classes when I use them, because I write some code using the class and think "that was very painful, how would I _like_ to have been able to use this"?

   """
   def __init__(self):
       self.files_to_ignore = ['.ds_store', '.htaccess']
       self.root_path = None


You don't seem to use .root_path?

           # This is the path in the OS that is being examined
           #    (e.g. /Volumes/Users/username/)
       self.directory_cache = {}

This is a CachedDirectory class. I would have just called this .cache - it will make all the code using it less wordy, and hopefully more readable. Unless there may be caches of other stuff too. This is almost entirely a matter of personal style though.

   def _scan_directory_list(self, scan_directory):
       """
           Scan the directory "scan_directory", and save it to the
           self.directory_cache dictionary.

           Low Level function, intended to be used by the populate
           function.
       """
       scan_directory = os.path.abspath(scan_directory)
       directories = {}
       files = {}
       self.directory_cache[scan_directory.strip().lower()] = {}
       self.directory_cache[scan_directory.strip().lower()]["number_dirs"] = 0


You write "scan_directory.strip().lower()". It is probably worth going:

  norm_directory = scan_directory.strip().lower()

and saying "norm_directory" throughout. Also, you say:

  self.directory_cache[scan_directory.strip().lower()]

You can also bind:

  cache = self.directory_cache[norm_directory]

right after binding "norm_directory", and say "cache" throughout instead of the longer term. It is naming the same object.

       self.directory_cache[scan_directory.strip().lower()]["number_files"] = 0
       for x in scandir.scandir(scan_directory):
           st = x.lstat()
           data = {}


As mentioned earlier, you could usefully define a tiny class:

  class Dirent(object):
    pass

and go:

  data = Dirent()

which will allow you to replace:

           data["fq_filename"] = os.path.realpath(scan_directory).lower() + \
                   os.sep+x.name.strip().lower()


with:

  data.fq_filename = os.path.realpath( os.path.join(norm_directory,
                                                    x.name.strip().lower()) )

Note also the use of os.path.join: it is portable across OSes (eg UNIX versus Windows).


Also, well worth defining:

  norm_name = x.name.strip().lower()

and just saying "norm_name" throughout. More readable, less prone to typing accidents.

Um. A serious point here: why do you go .strip().lower() to all your filenames and pathnames? On a real UNIX system that will usually mean a different object. Many other platforms may have case insensitive names, but will still complain about names with and without whitespace as different objects. I would even expect os.path.realpath to be unreliable if handed such mangled names.

If you are sanitising some input I would recommend doing that elsewhere, and not binding so much special knowledge into this class.

Maybe you have good reasons for that. At the least those reasons need to be expressed in comments (best in the opening docstring), because otherwise I would recommend strongly to make the cache class as naive about names as possible (i.e. do not mangle them).

           data["parentdirectory"] = os.sep.join(\
                   os.path.split(scan_directory)[0:-1])


Use os.path.dirname.
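To illustrate, here is a tiny sketch; parent_directory is just a name I made up. For a path like the one in your comment, both forms give the same answer, but dirname says what it means:

```python
import os

def parent_directory(path):
    """Return the directory containing path, without manual split/join."""
    return os.path.dirname(path)

path = os.path.join("Volumes", "Users", "username")

# Your expression: split into (head, tail), drop the tail, rejoin the rest.
manual = os.sep.join(os.path.split(path)[0:-1])

print(parent_directory(path) == manual)
```

Note that os.path.split only ever returns two parts, so the slice and join in the manual form are doing nothing except obscuring the intent.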

           data["st_mode"] = st[ST_MODE]
           data["st_inode"] = st[ST_INO]
           data["st_dev"] = st[ST_DEV]
           data["st_nlink"] = st[ST_NLINK]
           data["st_uid"] = st[ST_UID]
           data["st_gid"] = st[ST_GID]
           data["compressed"] = st[ST_SIZE]


"compressed" is worth a comment.

           data["st_size"] = st[ST_SIZE]       #10
           data["st_atime"] = st[ST_ATIME]     #11
           data["raw_st_mtime"] = st[ST_MTIME] #12
           data["st_mtime"] = time.asctime(time.localtime(st[ST_MTIME]))


I strongly recommend against this.

Keep "st_mtime" (or data.st.st_mtime per my earlier suggestion) as the raw mtime. Any human readable transcription of the time should get a name like "human_st_mtime".

Better still, if you make a Dirent class for "data", you can give it a property like:


  @property
  def human_st_mtime(self):
    return time.asctime(time.localtime(self.st.st_mtime))

and just use ".human_st_mtime" like any other attribute later when needed. Note: no brackets.
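Putting those pieces together, a minimal sketch of such a class might look like this. The constructor signature is my assumption; only the property comes from the suggestion above:

```python
import os
import time

class Dirent(object):
    """Hypothetical cache entry: keeps the name and the original stat result."""

    def __init__(self, name, st):
        self.name = name
        self.st = st  # raw os.stat_result, kept whole

    @property
    def human_st_mtime(self):
        # Computed on demand from the raw mtime; nothing extra is stored.
        return time.asctime(time.localtime(self.st.st_mtime))

entry = Dirent(".", os.lstat("."))
print(entry.st.st_mtime)     # raw number, suitable for comparisons
print(entry.human_st_mtime)  # readable string; attribute access, no call
```

The raw value stays available for sorting and staleness checks, and the readable form cannot drift out of sync with it.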

           data["st_ctime"] = st[ST_CTIME]
           if not x.name.strip().lower() in self.files_to_ignore:


Personally I phrase this as "a not in b" instead of "not a in b".

Also, if you really are ignoring it, you could go:

  if norm_name in self.files_to_ignore:
    # do nothing else with this entry
    continue

and save yourself some indentation below. Not everyone likes this.
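A sketch of the shape I mean, using a plain list of names rather than scandir entries so it stands alone. The function name is invented, and the ignore set is copied from your __init__:

```python
FILES_TO_IGNORE = {'.ds_store', '.htaccess'}

def wanted_names(names, ignore=FILES_TO_IGNORE):
    """Return the names that are not in the ignore set."""
    wanted = []
    for name in names:
        norm_name = name.strip().lower()
        if norm_name in ignore:
            # do nothing else with this entry
            continue
        # The main body stays at this indentation level, not one deeper.
        wanted.append(name)
    return wanted

print(wanted_names(["a.txt", ".DS_Store", ".htaccess", "b"]))
```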

   def directory_in_cache(self, scan_directory):
       """
           Pass the target directory

           Will return True if the directory is already cached
           Will return False if the directory is not already cached
       """
       scan_directory = os.path.realpath(scan_directory).lower().strip()


Probably worth making a normalisation function:

  def norm_name(name):
    return name.lower().strip()

because you do this a lot. Using a function ensures you do it the same way everywhere and makes it obvious that it is the same thing with the same purpose everywhere.

       return scan_directory in self.directory_cache.keys()


You can just say:

  return scan_directory in self.directory_cache

and for added points you can rename "directory_in_cache" to "__contains__". That way from outside (or inside) you use "in" directly. Example:


  if "foo" in self:
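A cut down sketch of how that fits together. This is not your full class; I have used the shorter .cache name and left out the name normalisation to keep it small:

```python
class CachedDirectory(object):
    """Stripped-down illustration of supporting the "in" operator."""

    def __init__(self):
        self.cache = {}

    def __contains__(self, scan_directory):
        # Called by Python for: scan_directory in self
        return scan_directory in self.cache

cdir = CachedDirectory()
cdir.cache["/some/dir"] = {}

print("/some/dir" in cdir)
print("/other/dir" in cdir)
```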

Cheers,
Cameron Simpson <c...@zip.com.au>
--
https://mail.python.org/mailman/listinfo/python-list
