MRAB wrote:
James wrote:
Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.
The filters take a token as input, and may or may not yield a token as
a result. For example, I might have filters which lowercase the
input, filter out boring words and filter out duplicates, chained
together.
I originally had code like this:
    for t0 in token_stream:
        for t1 in lowercase_token(t0):
            for t2 in remove_boring(t1):
                for t3 in remove_dupes(t2):
                    yield t3
For that to work at all, the three functions would have to turn each
token into an iterable of 0 or 1 tokens. Hence the inner 'loops' would
execute 0 or 1 times. Better to return a token or None, and replace the
three inner 'loops' with three conditional statements (ugly too) or less
efficiently (due to lack of short circuiting),
    t = remove_dupes(remove_boring(lowercase_token(t0)))
    if t is not None: yield t
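A minimal sketch of the conditional version, assuming each filter takes one token and returns a token or None. The BORING set, the function bodies, and the names make_remove_dupes and filter_tokens are made up for illustration; they are not James's actual code.

```python
BORING = {"the", "a", "of"}  # hypothetical stop-word list


def lowercase_token(t):
    return t.lower()


def remove_boring(t):
    # Return None to drop the token.
    return None if t in BORING else t


def make_remove_dupes():
    # The duplicate filter needs state, so build it as a closure.
    seen = set()

    def remove_dupes(t):
        if t in seen:
            return None
        seen.add(t)
        return t

    return remove_dupes


def filter_tokens(token_stream):
    remove_dupes = make_remove_dupes()
    for t in token_stream:
        t = lowercase_token(t)
        if t is None:
            continue  # short circuit: later filters never run
        t = remove_boring(t)
        if t is None:
            continue
        t = remove_dupes(t)
        if t is not None:
            yield t
```

The explicit tests give the short-circuiting back, at the cost of one conditional per filter.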
Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.
That puzzles me. Your actual code must be slightly different from the
above and what I imagine the functions to be. But never mind, because
Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?
MRAB's suggestion is the way to go. You automatically get
short-circuiting because each generator only gets what is passed on.
And resuming a generator is much faster than re-calling a function.
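To see the short-circuiting, here is a small sketch with two generator filters that record every token they are handed. The calls list and the inline word set are assumptions added purely for the demonstration.

```python
calls = []  # (stage, token) pairs, in the order the stages see them


def remove_boring(stream):
    for t in stream:
        calls.append(("boring", t))
        if t not in {"the", "of"}:  # hypothetical boring words
            yield t


def remove_dupes(stream):
    seen = set()
    for t in stream:
        calls.append(("dupes", t))
        if t not in seen:
            seen.add(t)
            yield t


result = list(remove_dupes(remove_boring(["the", "cat", "of", "cat"])))
seen_by_dupes = [t for stage, t in calls if stage == "dupes"]
```

remove_dupes is only ever resumed for tokens that remove_boring let through, so 'the' and 'of' never reach it.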
What you should be doing is letting the filters accept an iterator and
yield values on demand:
    def lowercase_token(stream):
        for t in stream:
            yield t.lower()

    def remove_boring(stream):
        # 'boring' is assumed to be a collection of words defined elsewhere.
        for t in stream:
            if t not in boring:
                yield t

    def remove_dupes(stream):
        seen = set()
        for t in stream:
            if t not in seen:
                yield t
                seen.add(t)

    def compound_filter(token_stream):
        # Each stage wraps the previous one; nothing runs until iteration.
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        for t in stream:
            yield t
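If the list of filters varies, the chaining itself can be factored out. The following is only a sketch: chain_filters is a hypothetical helper name, not something from the posts above, and the two filters are repeated so the block runs on its own.

```python
from functools import reduce


def lowercase_token(stream):
    for t in stream:
        yield t.lower()


def remove_dupes(stream):
    seen = set()
    for t in stream:
        if t not in seen:
            seen.add(t)
            yield t


def chain_filters(stream, *filters):
    # Hypothetical helper: wrap the stream in each filter, in order.
    return reduce(lambda s, f: f(s), filters, stream)


tokens = chain_filters(["Spam", "spam", "Eggs"], lowercase_token, remove_dupes)
first = next(tokens)   # pulls just enough input to produce one token
rest = list(tokens)    # consumes whatever is left
```

Because every stage is a generator, the result of chain_filters is itself lazy and can be passed on to further processing.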
I also recommend the Beazley reference Herron gave.
tjr