On 2013-12-18 03:51, Denis McMahon wrote: > I need to keep the timestamp / data association because I need to > generate output that identifies (a) the longest repeated sequence > (b) how many elements in the longest repeated sequence (c) at what > timestamps each occurrence started. > > I'm not sure of the best way, programatically, to aproach this > task, which means I'm unsure whether eg a list of tuples ( time, > data ) or an OrderedDict keyed on the timestamp is the best > starting point. [snip] > Is there a better way to do this?
You might have to define your criteria for "best". Things that might play into it: - what happens if there's more than one "longest" sequence? - can you specify a minimum length that you don't have to track? (a maximum sequence of length-1 might be uninteresting) - how large can your input data grow to be? If it's huge, you're looking at searching a problem space something like O(N^2) or possibly O(N log N) if you're lucky. Which in turn can involve lots of memory consumption. - do you just want to know what that sequence is, or do you want to know all the locations/offsets where it the longest sequence was found? I've played with two simple versions, the first is naive, while the second tries to use a little bit of smarts to eliminate possibilities when you know you have at least one match of a certain length. The result of the function does allow for finding a plurality of longest sequences, but my demo code just snags one with the max() function. I leave the "find multiple maxes" as an exercise for the reader. :) -tkc def build1(d): sequences = {} # sequence -> [list of start offsets] for i in range(len(d)): for j in range(i+1, len(d)): key = tuple(value for timestamp, value in d[i:j]) sequences.setdefault(key, []).append(i) return sequences def build2(d): sequences = {} # sequence -> [list of start offsets] longest = 1 for i in range(len(d)): for j in range(i+longest, len(d)): key = tuple(value for timestamp, value in d[i:j]) if key in sequences: sequences[key].append(i) if len(sequences[key]) > longest: longest = len(sequences[key]) print longest else: sequences[key] = [i] return sequences data = { 0: "d", 1: "x", 2: "y", 3: "t", 4: "d", 5: "y", 77: "g", 78: "h", 79: "x", 80: "y", 206: "t", 210: "d", 211: "x", } d = data.items() d.sort() for fn in (build1, build2): print fn.__name__ sequences = fn(d) longest_key, offsets = max(( pair for pair in sequences.iteritems() if len(pair[1]) > 1 ), key=lambda (k,v): len(k)) print "Longest sequence %r found at %s" % ( longest_key, [d[i][0] for i in offsets], ) for k,v in sequences.iteritems(): if len(v) > 1: print "%s: %r" % (k,v) print '=' * 50 . -- https://mail.python.org/mailman/listinfo/python-list