On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote: > On 12/21/2012 03:36 PM, larry.mart...@gmail.com wrote: > > > On Friday, December 21, 2012 10:57:19 AM UTC-7, larry....@gmail.com wrote: > > >> <snip> > > >> Didn't know about bisect. Thanks. I thought it would be my savior for > >> sure. But unfortunaly when I added that, it blows up with out of memory. > > > The out of memory error had nothing to do with using bisect. I had > > introduced a typo that I really though would have caused a variable > > referenced before assignment error. But it did not do that, and instead > > somehow caused all the memory in my machine to get used up. When I fixed > > that, it worked really well with bisect. The code that was taking 2 hours > > was down to 20 minutes, and even better, a query that was taking 40 minutes > > was down to 8 seconds. > > > > > > Thanks very much for all your help. > > Congratulations. But you're not done. A fourfold improvement isn't > nearly as much as I'd expect. When you go from a n-squared algorithm to > an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold > improvement. Assuming we're talking just about time spent in the cdata > = list-comprehension list. > > It's also probably possible to simplify, and get some speedup from other > parts of the code other than that one loop. For example, if the bisect > is really working right, it's probably unnecessary to even copy the > original cdata. You said it was sorted. So bisect it directly, and > skip the messageTimes intermediate data structure. It may not help, but > i suspect it will. > > > > >> This was the code I had: > >> > >> times = messageTimes[tup[0]] > > >> > > >> le = bisect.bisect_right(times, tup[1]) > >> ge = bisect.bisect_left(times, tup[1]) > >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and > >> times[ge]-tup[1] <= tdiff) > > > > I'm stymied making further suggestions to this fragment. Without seeing > the changes you apparently made to messageTimes, I can't even be sure > this is equivalent. And I wonder if even looking up tup[1] is > worthwhile, since your caller already knows the index in cdata. If you > used cdata directly, you might not need a sort at all. > > I wish you could post some sample data, and identify which records are > intended to be True. Obviously you could simplify, and just use ints > where it's using datetimes here. But if your rule is simply accept all > records that come in a cluster with no delta more than a threshold, then > you don't even need any search at all. Just pass the index in, and > compare the datetime with its predecessor and successor. > > Could we see the whole fragment as you originally had it, but with the > optimizations you've done so far? The more I look at that determine() > function, the more futile it seems. I just don't understand what the > criteria are supposed to be. Your list-comp loops through all of cdata, > and for each item, passes that item to determine(). determine() loops > through a copy of cdata, checking to see if any of them is within tdiff > of tup. But won't that always return True, since you'll encounter the > record and compare it to itself?
I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold. cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match. Here's my original code: # record time for each message matching the specified message for each tool messageTimes = {} for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0], row[1]] = 1 # now pull out each message that is within the time diff for each matched message # as well as the matched messages themselves def determine(tup): if self.message in tup[2]: return True # matched message for (tool, date_time) in messageTimes: if tool == tup[0]: if abs(date_time-tup[1]) <= tdiff: return True return False cdata[:] = [tup for tup in cdata if determine(tup)] Here's the code now: # Parse data and make a list of the time for each message matching the specified message for each tool messageTimes = defaultdict(list) # a dict with sorted lists for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0]].append(row[1]) # now pull out each message that is within the time context for each matched message # as well as the matched messages themselves # return true is we should keep this message def determine(tup): if self.message in tup[2]: return True # matched message if seconds == 0: return False # no time context specified times = messageTimes[tup[0]] # get the list of matched messages for this tool le = bisect.bisect_right(times, tup[1]) # find time less than or equal to tup[1] ge = bisect.bisect_left(times, tup[1]) # find time greater then to equal to tup[1] return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) cdata = [tup for tup in cdata if determine(tup)] In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. I hope I've explained this better. Thanks again. -- http://mail.python.org/mailman/listinfo/python-list