Michael Spencer wrote:
> George Sakkis wrote:
> > Michael Spencer wrote:
> >
> >> Here's a small update to the generator that allows optional handling of
> >> the head and the tail:
> >>
> >> def chunker(s, chunk_size=3, sentry=".", keep_first=False, keep_last=False):
> >>     buffer=[]
> ...
> >
> > And here's a (probably) more efficient version, using a deque as a
> > buffer:
> >
>
> Perhaps the deque-based solution is more efficient under some conditions, but
> it's significantly slower for all the cases I tested:
First off, if you're going to profile something, better use the standard timeit
module rather than a quick, dirty and most likely buggy handmade function. From
"Dive into Python": "The most important thing you need to know about optimizing
Python code is that you shouldn't write your own timing function." As it turns
out, none of chunk_size, words_per_group and word_length is taken into account
in your tests; they all have their default values. By the way, word_length is
irrelevant since the input is already a sequence, not a big string to be split.

Second, the output of the two functions is different, so you're not comparing
apples to apples:

>>> s = " this . is a . test to . check if it . works . well . it looks . like ."
>>> for c in chunkerMS(s.split(), keep_last=True, keep_first=True): print c
...
['this', '.']
['this', '.', 'is', 'a', '.']
['this', '.', 'is', 'a', '.', 'test', 'to', '.']
['is', 'a', '.', 'test', 'to', '.', 'check', 'if', 'it', '.']
['test', 'to', '.', 'check', 'if', 'it', '.', 'works', '.']
['check', 'if', 'it', '.', 'works', '.', 'well', '.']
['works', '.', 'well', '.', 'it', 'looks', '.']
['well', '.', 'it', 'looks', '.', 'like', '.']
['it', 'looks', '.', 'like', '.']
['like', '.']
>>> for c in chunkerGS(s.split(), keep_last=True, keep_first=True): print c
...
['this']
['this', 'is a']
['this', 'is a', 'test to']
['is a', 'test to', 'check if it']
['test to', 'check if it', 'works']
['check if it', 'works', 'well']
['works', 'well', 'it looks']
['well', 'it looks', 'like']
['it looks', 'like']
['like']

Third, and most important for the measured difference: the performance hit in
my function came from joining the words of each group
(['check', 'if', 'it'] -> 'check if it') every time it is yielded. If the
groups are left unjoined, as in Michael's version, the results are quite
different:

* chunk_size=3
chunkerGS: 0.98 seconds
chunkerMS: 1.04 seconds
* chunk_size=30
chunkerGS: 0.98 seconds
chunkerMS: 1.17 seconds
* chunk_size=300
chunkerGS: 0.98 seconds
chunkerMS: 2.44 seconds

As expected, the deque solution is not affected by chunk_size.
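The reason chunkerMS slows down as chunk_size grows is that
del buffer[:buffer.index(sentry)+1] has to shift every element remaining in the
list, so dropping the oldest group gets more expensive as the window grows,
whereas deque.popleft() is constant time no matter how many groups are
buffered. Here's a minimal, illustrative micro-benchmark of just that
difference (not part of the test below; the buffer size of 10000 is arbitrary):

# Illustrative only: compare dropping one item from the front of a list
# (O(n) shift of the remaining elements) with deque.popleft() (O(1)).
import timeit

setup = """
from collections import deque
buf_list = list(range(10000))
buf_deque = deque(range(10000))
"""

# Each statement appends one item and drops one from the front, so the buffers
# keep a constant length; only the cost of the front removal differs.
t_list = timeit.Timer("buf_list.append(0); del buf_list[:1]", setup)
t_deque = timeit.Timer("buf_deque.append(0); buf_deque.popleft()", setup)

print "list front-delete: %.2f usec/loop" % (min(t_list.repeat(3, 100000)) * 10)
print "deque popleft:     %.2f usec/loop" % (min(t_deque.repeat(3, 100000)) * 10)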
Here is the full test:

# chunkers.py
from itertools import islice
from collections import deque

def chunkerMS(s, chunk_size=3, sentry=".", keep_first=False, keep_last=False):
    buffer = []
    sentry_count = 0
    for item in s:
        buffer.append(item)
        if item == sentry:
            sentry_count += 1
            if sentry_count < chunk_size:
                if keep_first:
                    yield buffer
            else:
                yield buffer
                del buffer[:buffer.index(sentry)+1]
    if keep_last:
        while buffer:
            yield buffer
            del buffer[:buffer.index(sentry)+1]

def chunkerGS(seq, sentry='.', chunk_size=3, keep_first=False, keep_last=False):
    buf = deque()
    iterchunks = itersplit(seq, sentry)
    for chunk in islice(iterchunks, chunk_size-1):
        buf.append(chunk)
        if keep_first:
            yield buf
    for chunk in iterchunks:
        buf.append(chunk)
        yield buf
        buf.popleft()
    if keep_last:
        while buf:
            yield buf
            buf.popleft()

def itersplit(seq, sentry='.'):
    # split the iterable `seq` on each `sentry`
    buf = []
    for x in seq:
        if x != sentry:
            buf.append(x)
        else:
            yield buf
            buf = []
    if buf:
        yield buf

def make_seq(groups=1000, words_per_group=3, sentry='.'):
    group = ['w']*words_per_group
    group.append(sentry)
    return group*groups

if __name__ == '__main__':
    import timeit
    for chunk_size in 3, 30, 300:
        print "* chunk_size=%d" % chunk_size
        for func in chunkerGS, chunkerMS:
            name = func.__name__
            t = timeit.Timer(
                "for p in %s(s, chunk_size=%d, keep_last=True,"
                "keep_first=True): pass" % (name, chunk_size),
                "from chunkers import make_seq,chunkerGS,chunkerMS;"
                "s=make_seq(groups=5000, words_per_group=500)")
            print "%s: %.2f seconds" % (name, min(t.repeat(3, 1)))

--
http://mail.python.org/mailman/listinfo/python-list