Hi, I'm pretty new to Python and I have an optimization problem. I'll show you the piece of code that's causing it, with some pseudo-code before it and comments. I'm accessing a gigantic table (about 15 million rows) in SQL.
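
For context, the outer loop is set up roughly like this. This is a simplified sketch, not my exact code: the sqlite3 module, the "mydata.db" and "mytable" names, the regex pattern, and the row counts are all placeholders for illustration.

    import re
    import time
    import sqlite3                # stand-in for whatever DB-API module is in use

    conn = sqlite3.connect("mydata.db")    # hypothetical database
    cur = conn.cursor()

    d = {}                        # counts of the things I'm looking for
    r = re.compile(r"\w+")        # the real pattern is precompiled the same way
    delta = 2500                  # rows fetched per chunk
    max_rowid = 15000000          # the table has about 15 million rows

    n = 0
    while n < max_rowid:
        cur.execute("select * from mytable where rowID >= %d and rowID < %d"
                    % (n, n + delta))
        a = cur.fetchall()        # a[n1] is a row, a[n1][n2] is a column of that row
        # ... the snippet shown below runs here, once per chunk ...
        n = n + delta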
Within that loop, d is a dictionary and r is a precompiled regex. Each chunk of the table is fetched with the query "select * from table where rowID >= n and rowID < (n + delta)", and the result is stored in a: each individual row is a[n1], and the columns of a row are a[n1][n2]. Here is the snippet that runs on each chunk:

    t1 = time.clock()                  # to track speed
    for m in a:
        for temp in m:
            if str(temp) == "None":    # skip the columns that are null for this row
                continue
            s += temp                  # concatenate the columns into one long string
        s = s.replace("between", "")
        s = s.replace("and", "")
        s = s.replace("where", "")
        s = s.replace("like", "")      # these words cause problems, so strip them out
        b = re.findall(r, s)           # look for the stuff I want; always at least one
                                       # match per row, about 3-4 on average
        for t in b:                    # store counts of the matches in the dictionary
            if t in d:
                d[t] += 1
            else:
                d[t] = 1
    print n, (time.clock() - t1)       # to track speed

I am 100% sure this snippet is the cause of my problem. Here's what I can tell you: each chunk of rows I grab is essentially the same size (rowID skips over values, but fairly arbitrarily), and the time it takes to run the SQL query doesn't change. But as the program progresses, this snippet gets slower. Here's the output (n, then the seconds spent on that chunk):

    2500   0.441551299341
    5000   1.26162739664
    7500   2.35092688403
    10000  3.48417469666
    12500  4.59031305491
    15000  5.78972588775
    17500  6.28305527139
    20000  6.73344570903
    22500  8.31732146487
    25000  9.65322872159
    27500  8.98186042757
    30000  11.8042818095
    32500  12.1965593712
    35000  13.2735763291
    37500  14.0282617344

What is it in this snippet that slows down as n increases? Is there something about how the low-level Python functions work that I don't understand and that is slowing me down? Thanks in advance for your time.

-Wei