On Thursday 21 April 2016 16:35, Michael Selik wrote: > On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <st...@pearwood.info> > wrote: > >> I want to group [repeated] subsequences. For example, I have: >> "ABCABCABCDEABCDEFABCABCABCB" >> and I want to group it into repeating subsequences. I can see two >> ways... How can I do this? Does this problem have a standard name and/or >> solution? >> > > I'm not aware of a standard name. This sounds like an unsupervised > learning problem. There's no objectively correct answer unless you add > more specificity to the problem statement. > > Regexes may sound tempting at first,
Ah, I may have mislead you all. I cannot use regexes, since the *actual* sequences I'm working on are sequences (lists, tuples, etc) or iterators of arbitrary items. The items themselves may be strings, but the sequence itself is definitely not a string. I just showed a string for convenience. Sorry about that. So no regexes. > but because a repeating subsequence > may have nested repeating subsequences and this can go on infinitely, I > think we at least need a push-down automata. Fortunately, for my *immediate* problem, I would be good with some fairly arbitrary restrictions on subsequence detection. > I checked out some links for clustering algorithms that work on series > subsequences and I found some fun results. > > Clustering is meaningless! > http://www.cs.ucr.edu/~eamonn/meaningless.pdf > > I think you're in "no free lunch" territory. "Clustering of subsequence > time series remains an open issue in time series clustering" > http://www.hindawi.com/journals/tswj/2014/312521/ > > Any more detail on the problem to add constraints? The specific problem I am trying to solve is that I have a sequence of strings (in this case, error messages from a Python traceback) and I'm looking for repeated groups that may indicate mutually recursive calls. E.g. suppose I have a function f which calls g, and g calls h, and h calls f again, and there's an exception, you will see a traceback in part: File "<stdin>", line 2, in f File "<stdin>", line 5, in g File "<stdin>", line 9, in h File "<stdin>", line 2, in f File "<stdin>", line 5, in g File "<stdin>", line 9, in h File "<stdin>", line 2, in f File "<stdin>", line 5, in g File "<stdin>", line 9, in h File "<stdin>", line 7, in f File "<stdin>", line 5, in g File "<stdin>", line 9, in h etc. Note that I only care about lines which are identical, e.g. if the line numbers differ (as in the last call to f), they will be treated as distinct items. So I'd like to group the above as: File "<stdin>", line 2, in f File "<stdin>", line 5, in g File "<stdin>", line 9, in h *** above 3 calls repeated 3 times *** File "<stdin>", line 7, in f File "<stdin>", line 5, in g File "<stdin>", line 9, in h -- Steve -- https://mail.python.org/mailman/listinfo/python-list