Here's my version (not tested much). Main differences from yours:

1. It defines a Python class to hold row data, and defines the __cmp__
   operation on the class, so given two Row objects r1, r2, you can
   simply say

       if r1 > r2: ...

   to see which is "better".

2. Instead of reading all the rows into memory and then scanning the
   list of records for each piid, it simply remembers the best row it
   has seen so far for each piid.

Putting the "better than" logic into the class definition makes the
main loop very simple. It does parse out and store fields on the Row
objects, which consumes some extra memory, but you could eliminate that
at the cost of a little code and speed by re-parsing as needed in the
comparison function.

================================================================

#! /usr/bin/env python

import sys

class Row:
    def __init__(self, row):
        # keep the original line (minus newline) so it can be printed as-is
        self.row = row.rstrip('\n')
        fields = self.row.split('\t')
        self.piid = fields[0]
        self.state = fields[1]
        self.expiration_date = fields[5]
        self.desired_state = fields[6]

    def __cmp__(self, other):
        # return +1 if self is better than other, -1 if other is better
        # than self, or 0 if they are equally good
        if self.state == self.desired_state:
            if other.state != other.desired_state:
                return 1
            return cmp(self.expiration_date, other.expiration_date)
        elif other.expiration_date > self.expiration_date:
            # other record is better only if its exp date is newer
            return -1
        return 0

best = {}
input = sys.stdin

for row in input:
    r = Row(row)
    # keep only the best row seen so far for each piid
    if r.piid not in best or r > best[r.piid]:
        best[r.piid] = r

for piid, r in best.iteritems():
    print r.row
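
In case you end up running this under Python 3, where __cmp__ and cmp() no longer exist, here is a rough sketch of the same keep-the-best-per-piid idea. It is only lightly checked. The names compare() and best_rows() and the sample rows are made up for illustration. It assumes the same tab-separated layout as above (piid, state, expiration date and desired state in fields 0, 1, 5 and 6), that the dates sort correctly as plain strings (e.g. YYYY-MM-DD), and that "other is better" should return -1 as the comment says.

```python
class Row:
    def __init__(self, line):
        self.row = line.rstrip('\n')
        fields = self.row.split('\t')
        self.piid = fields[0]
        self.state = fields[1]
        self.expiration_date = fields[5]
        self.desired_state = fields[6]

    def compare(self, other):
        # hypothetical helper: +1 if self is better, -1 if other is
        # better, 0 if they are equally good
        if self.state == self.desired_state:
            if other.state != other.desired_state:
                return 1
            a, b = self.expiration_date, other.expiration_date
            # (a > b) - (a < b) is the usual stand-in for Python 2's cmp()
            return (a > b) - (a < b)
        elif other.expiration_date > self.expiration_date:
            return -1
        return 0

    def __gt__(self, other):
        # lets the main loop keep saying "r > best[r.piid]"
        return self.compare(other) > 0


def best_rows(lines):
    # remember only the best row seen so far for each piid
    best = {}
    for line in lines:
        r = Row(line)
        if r.piid not in best or r > best[r.piid]:
            best[r.piid] = r
    return best


# made-up sample input; fields 2-4 are placeholders
sample = [
    "p1\tactive\tx\tx\tx\t2020-01-01\tactive\n",
    "p1\tactive\tx\tx\tx\t2021-01-01\tactive\n",
    "p2\tidle\tx\tx\tx\t2020-05-01\tactive\n",
    "p2\tidle\tx\tx\tx\t2019-05-01\tactive\n",
]

for r in best_rows(sample).values():
    print(r.row)
```

To run it on real data, pass sys.stdin (or an open file) to best_rows() in place of the sample list. Only __gt__ is defined because that is the only comparison the loop uses. If you want the full set of operators, functools.total_ordering can fill in the rest from __eq__ plus one ordering method.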