Re: Pimp my algorithm: finding repeating subsequences

Andy Fingerhut Sun, 13 Feb 2011 11:42:00 -0800


On Feb 13, 2011, at 10:11 AM, Mark Fredrickson wrote:

Hello friends,

I am writing a program to generate directions for humans to read. The
directions are composed of a collection with a series of steps, some
of which may be duplicated:

[:a :b :c :a :b :c]

I would like to compress them by indicating repeats, using this
notation:

[[2 [:a :b :c]]

Since the directions will be read by humans (after some formatting), I
am limiting nesting to one level. So the following would be an invalid
transformation:

[:a :b :a :b :c :a :b :a :b :c] => [[2 [2 [:a :b]] :c]]

My current algorithm for the compress has O(2^n)  complexity, which
makes it useless except for the most trivial cases. I'm looking for
something that runs in polynomial time or better. To give some
context, my current algorithm is as follows:

1. Generate the set indices for the power set of the directions (e.g.
(powerset [:a :b :c]) => [[] [1 0 0] [1 1 0] ...]).
2. For each index, use partition-by to generate a set of groups of the
vector (e.g. [1 1 0] => [[:a :b] [:c]]). Since [1 1 0] and [0 0 1] are
identical for my needs, take the distinct groupings (only generating
the distinct groups would only be O(2^{n-1]), so I haven't bothered to
shave computations here).
3. For each grouping, merge identical groups from left to right.
4. Score all the groupings by counting the number of steps (symbols in
these examples). The grouping with the fewest steps is the "best".

I've looked at some string searching algorithms (Boyer-Moore, e.g.),
but these seem geared towards finding a specific substring in a larger
string. Perhaps these algorithms would apply to my problem, but I'm
just not seeing it.

Thanks in advance for your help. All suggestions are welcome.

-Mark

Here are some ideas that won't necessarily give an optimal answer,i.e. a shortest one, but sometimes it will, and even when it doesn'tit should give something reasonable.


There are O(n^2) subsequences of your original sequence.

Step 1. For each one, you can in linear time determine whether thatsubsequence repeats immediately afterwards, and if so, how many timesit repeats. Calculate those answers R[i,j] for each of the O(n^2)subsequences and save them, e.g. in a 2-d array indexed by startposition i and end position j. That should take O(n^3) time in theworst case, but often closer to O(n^2) depending upon the sequence youhave.

Example: I'll use just strings of characters to represent thesequences, to save typing. So your sequence[:a :b :a :b :c :a :b :a :b :c] above I will write as: ababcababc

If we use [i,j] to represent a subsequence starting at index i (0-based, so the first element of the sequence is at 0) and up to but notincluding the element at index j, then [1,4] represents thesubsequence "bab".


Here is part of the table calculated in part 1:

R[0,1] 1 (i.e. "a" repeats 1 time starting at index 0, consecutively.It does not immediately repeat after that)

R[0,2] 2 (i.e. "ab" repeats 2 times starting at index 0, consecutively.)
R[0,3] 1 ("aba" appears only 1 time starting at index 0, consecutively.)
R[0,4] 1
R[0,5] 2
R[0,6] 1
etc.

Step 2. Now go over every subsequence again, this time looking for theshortest way to represent that subsequence using your notation. Thiswill depend upon exactly what you consider the quantitative "savings"is for your repetition notation. In this example, I'll use the numberof characters used to write down the notation.


So ababcababc has "cost" 10

but 2[ababc] has cost 8 (I'm counting the digits and the squarebrackets, too -- you can choose differently in whatever algorithm youchoose to implement)

For [0,10], we can either represent it as its original 10-elementsequence (cost 10), or maybe as 2 repetitions of [0,5], or maybe as 5repetitions of [0,2]. In general, a sequence of length N can eitherbe represented as itself, or for each value K that evenly divides N,as (N/K) repetitions of the sequence of length K at its beginning.For each one of those possibilities, use table R to tell in constanttime which of those is possible and which is not.


[0,10] - cost 10
2 repetitions of [0,5]? - Yes, this is possible, because R[0,5] >= 2.
5 repetitions of [0,2]? - No, impossible, because R[0,2] < 5.

So for [0,10], there are two ways to represent it: cost 10 in itsoriginal unrepeated form, or cost 8 for "2[ababc]".

Let us make a table S[i,j] for all subsequences that tells us how muchwe can save by representing [i,j] (the whole thing, not just part ofit) as a repetition.

So S[i,j]=10-8=2, because we can save 2 characters by using therepetition 2[ababc] in place of writing out the full sequence.


Part of table S would look like this:

S[0,1]=0   No shorter way than writing out "ab", so no savings

S[0,2]=0 No way to write _all_ of "aba" as a repetition, so nosavings possible.S[0,3]=0 We could write "abab" as that or as "2[ab]", but both havecost 4, so no savings possible.

S[0,4]=0  Same as S[0,2]
S[0,5]=0
S[0,6]=0
S[0,7]=0
S[0,8]=0
S[0,9]=0
S[0,10]=2

I believe that step 2 would take at most O(n^3) time as well, if youprecomputed a table of factors once.

Step 3. Steps 1 and 2 give exact answers for what they calculate, butin this step, I don't yet see an efficient algorithm that isguaranteed to find a "least cost" way of representing the originalsequence. Maybe someone else will.

You start with the whole sequence, and do a greedy algorithm on thetable S looking for big savings. So you find the largest number in S,and say "I'm going to use that cost savings to represent that part ofthe original sequence, since it saves so much". Let's say you pickedS[3,9] because it was a largest element of S. After that, you cannotpick any subsequence of [3,9], or any subsequence that overlaps with[3,9] at all, and use their savings, because you can't overlap or nestrepetitions. So you look for the largest savings possible among allremaining subsequences that do not overlap [3,9]. Keep doing thatuntil you've completely covered the original sequence, or until theonly remaining choices give 0 savings.


Andy


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Pimp my algorithm: finding repeating subsequences

Reply via email to