Shawn Milo a écrit : > I'm new to Python and fairly experienced in Perl, although that > experience is limited to the things I use daily. > > I wrote the same script in both Perl and Python, and the output is > identical. The run speed is similar (very fast) and the line count is > similar. > > Now that they're both working, I was looking at the code and wondering > what Perl-specific and Python-specific improvements to the code would > look like, as judged by others more knowledgeable in the individual > languages. > > I am not looking for the smallest number of lines, or anything else > that would make the code more difficult to read in six months. Just > any instances where I'm doing something inefficiently or in a "bad" > way. > (snip) > > #! /usr/bin/env python > > import sys > > input = sys.stdin > > recs = {} > > for row in input: > row = row.rstrip('\n') > piid = row.split('\t')[0] > if recs.has_key(piid) is False:
'is' is the identity operator - practically, in CPython, it compares memory addresses. You *dont* want to use it here. Another way to test if a key is already in a dict is to use the 'in' operator, ie: if not piid in recs: > recs[piid] = [] > recs[piid].append(row) As a last point, dicts have a setdefault(key, default) method that returns the value associated with key if it already exists, else add the key=>default pair and returns default, but it's a little bit slower. for row in input: row = row.rstrip('\n') piid = row.split('\t')[0] recs.setdefault(piid, []).append(row) or for row in input: row = row.rstrip('\n') piid = row.split('\t')[0] if piid not in recs: recs[piid] = [] recs[piid].append(row) > for piid in recs.keys(): You don't need to call key() here - it's already the default iteration for dicts. But since you want the values too, you should use dict.items() or dict.iteritems(): for piid, rows in recs.iteritems() > best = "" > for current in recs[piid]: > if best == "": > best = current; Get rid of this ";" !-) Better to use None - it's the "no value" object, and it let you use an identity test: best = None for current in rows: if best is None: best = row And while we're at it, you can save yourself a test here: best = row[0] for current in row[1:]: # start tests > #If the current record is the correct state > if current.split("\t")[1] == current.split("\t")[6]: > #If the existing record is the correct state > if best.split("\t")[1] == best.split("\t")[6]: > #If the new record has a newer exp. date > if current.split("\t")[5] > best.split("\t")[5]: You're repeatingly calling split() on the same objects. This is a serious waste of time and CPU. > best = current > else: > best = current > else: > #If the existing record does not have the correct state > #and the new record has a newer exp. date > if best.split("\t")[1] != best.split("\t")[6] and > current.split("\t")[5] > best.split("\t")[5]: > best = current > > print best Here's a somewhat corrected version: # --- import sys def findbests(input=sys.stdin, output=sys.stdout): DATE = 5 TARGET = 6 STATE = 1 recs = {} for row in input: row = row.rstrip('\n').split("\t") piid = row[0] if piid not in recs: recs[piid] = [] recs[piid].append(row) for piid, rows in recs.iteritems(): best = rows[0] for current in rows[1:]: if current[STATE] == current[TARGET]: if best[STATE] == best[TARGET]: if current[DATE] > best[DATE]: best = current else: best = current elif best[STATE] != best[TARGET] \ and current[DATE] > best[DATE]: best = current print >> output, "\t".join(best) if __name__ == '__main__': findbdest() # --- It's somewhat shorter, a bit more readable IMHO, and somewhat faster too (=~ 30/40% faster on my machine). Also, it's usable as both a program and an importable module, and with any line input and file-like output. Now for the bad news: I'm afraid your algorithm is broken : here are my test data and results: input = [ #ID STATE ... ... ... TARG DATE "aaa\tAAA\t...\t...\t...\tBBB\t20071212\n", "aaa\tAAA\t...\t...\t...\tAAA\t20070120\n", "aaa\tAAA\t...\t...\t...\tAAA\t20070101\n", "aaa\tAAA\t...\t...\t...\tBBB\t20071010\n", "aaa\tAAA\t...\t...\t...\tBBB\t20071111\n", "ccc\tAAA\t...\t...\t...\tBBB\t20071201\n", "ccc\tAAA\t...\t...\t...\tAAA\t20070101\n", "ccc\tAAA\t...\t...\t...\tBBB\t20071212\n", "ccc\tAAA\t...\t...\t...\tAAA\t20071212\n", # oops ! "bbb\tAAA\t...\t...\t...\tAAA\t20070101\n", "bbb\tAAA\t...\t...\t...\tAAA\t20070101\n", "bbb\tAAA\t...\t...\t...\tAAA\t20071212\n", "bbb\tAAA\t...\t...\t...\tAAA\t20070612\n", "bbb\tAAA\t...\t...\t...\tBBB\t20071212\n", ] => aaa AAA ... ... ... BBB 20071212 bbb AAA ... ... ... BBB 20071212 ccc AAA ... ... ... BBB 20071201 At least for piid 'ccc', there's a record with matching state and newest date. (exact same result as with your original version). HTH -- http://mail.python.org/mailman/listinfo/python-list