On Jan 31, 2:44 pm, Peter Otten <__pete...@web.de> wrote:
> Kyp wrote:
> > I have a dir with a large # of files that I need to perform operations
> > on, but only needing to access a subset of the files, i.e. the first
> > 100 files.
>
> > Using glob is very slow, so I ran across iglob, which returns an
> > iterator, which seemed just like what I wanted. I could iterate over
> > the files that I wanted, not having to read the entire dir.
>
> > So the iglob was faster, but accessing the first file took about the
> > same time as glob.glob.
>
> > Here's some code to compare glob vs. iglob performance, it outputs
> > the time before/after a glob.iglob('*.*') files.next() sequence and a
> > glob.glob('*.*') sequence.
>
> > #!/usr/bin/env python
>
> > import glob,time
> > print '\nTest of glob.iglob'
> > print 'before iglob:', time.asctime()
> > files = glob.iglob('*.*')
> > print 'after iglob:',time.asctime()
> > print files.next()
> > print 'after files.next():', time.asctime()
>
> > print '\nTest of glob.glob'
> > print 'before glob:', time.asctime()
> > files = glob.glob('*.*')
> > print 'after glob:',time.asctime()
>
> > Here are the results:
>
> > Test of glob.iglob
> > before iglob: Sun Jan 31 11:09:08 2010
> > after iglob: Sun Jan 31 11:09:08 2010
> > foo.bar
> > after files.next(): Sun Jan 31 11:09:59 2010
>
> > Test of glob.glob
> > before glob: Sun Jan 31 11:09:59 2010
> > after glob: Sun Jan 31 11:10:51 2010
>
> > The results are about the same for the 2 approaches, both took about
> > 51 seconds. Am I doing something wrong with iglob?
>
> No, but iglob() being lazy is pointless in your case because it uses
> os.listdir() and fnmatch.filter() underneath which both read the whole
> directory before returning anything.
>
> > Is there a way to get the first X # of files from a dir with lots of
> > files, that does not take a long time to run?
>
> Here's my attempt. It turned out to be more work than expected, so I cut a
> few corners.
> It's Linux-only "works on my machine" code, but may give you
> some hints on how to proceed.
>
> from ctypes import *
> import fnmatch
> import glob
> import os
> import re
> from itertools import ifilter, imap
>
> class dirent(Structure):
>     "works on my machine ;)"
>     _fields_ = [
>         ("d_ino", c_long),
>         ("d_off", c_long),
>         ("d_reclen", c_ushort),
>         ("d_type", c_ubyte),
>         ("d_name", c_char*256)]
>
> direntp = POINTER(dirent)
>
> LIBC = "libc.so.6"
> cdll.LoadLibrary(LIBC)
> libc = CDLL(LIBC)
> libc.readdir.restype = direntp
>
> def diriter(dir):
>     "lazy partial replacement for os.listdir()"
>     # errors? what errors?
>     dirp = libc.opendir(dir)
>     if not dirp:
>         return
>     try:
>         while True:
>             ep = libc.readdir(dirp)
>             if not ep:
>                 break
>             yield ep.contents.d_name
>     finally:
>         libc.closedir(dirp)
>
> def filter(names, pattern):
>     "lazy partial replacement for fnmatch.filter()"
>     import posixpath
>
>     pattern = os.path.normcase(pattern)
>     r = fnmatch.translate(pattern)
>     r = re.compile(r)
>
>     if os.path is not posixpath:
>         names = imap(os.path.normcase, names)
>
>     return ifilter(r.match, names)
>
> def globiter(path):
>     "lazy partial replacement for glob.glob()"
>     dir, filename = os.path.split(path)
>     if glob.has_magic(dir):
>         raise ValueError("wildcards in directory not supported")
>     return filter(diriter(dir), filename)
>
> if __name__ == "__main__":
>     import sys
>     [pattern] = sys.argv[1:]
>     for name in globiter(pattern):
>         print name
>
> Peter
I'll give it a try, thanks for the reply.

mark
--
http://mail.python.org/mailman/listinfo/python-list
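
A note for anyone reading this thread on a newer Python: the ctypes code above wraps readdir() by hand, whereas os.scandir() in the standard library (Python 3.5+, usable as a context manager from 3.6) also hands back directory entries as an iterator. The following is a minimal sketch under that assumption, not something from the original thread; iter_matches and first_matches are hypothetical helper names. It has not been timed against the directory discussed here, so any speedup is something to measure rather than assume.

```python
import fnmatch
import os
from itertools import islice


def iter_matches(directory, pattern):
    """Lazily yield names in `directory` that match the shell-style `pattern`."""
    # os.scandir() returns an iterator over directory entries, so no
    # complete list of names is built before the first one is yielded.
    with os.scandir(directory) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name


def first_matches(directory, pattern, limit):
    """Return at most `limit` matching names, in directory order (unsorted)."""
    # islice() stops pulling from the generator once `limit` names are found.
    return list(islice(iter_matches(directory, pattern), limit))


if __name__ == "__main__":
    # Same task as in the thread: the first 100 names matching '*.*'.
    for name in first_matches(".", "*.*", 100):
        print(name)
```

Two differences from glob are worth keeping in mind: names come back in whatever order the filesystem returns them, and fnmatch does not special-case a leading dot, so '*.*' also matches hidden files here.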