On Feb 13, 2011, at 10:11 AM, Mark Fredrickson wrote:

Hello friends,

I am writing a program to generate directions for humans to read. The
directions are composed of a collection with a series of steps, some
of which may be duplicated:

[:a :b :c :a :b :c]

I would like to compress them by indicating repeats, using this
notation:

[[2 [:a :b :c]]

Since the directions will be read by humans (after some formatting), I
am limiting nesting to one level. So the following would be an invalid
transformation:

[:a :b :a :b :c :a :b :a :b :c] => [[2 [2 [:a :b]] :c]]

My current algorithm for the compress has O(2^n)  complexity, which
makes it useless except for the most trivial cases. I'm looking for
something that runs in polynomial time or better. To give some
context, my current algorithm is as follows:

1. Generate the set indices for the power set of the directions (e.g.
(powerset [:a :b :c]) => [[] [1 0 0] [1 1 0] ...]).
2. For each index, use partition-by to generate a set of groups of the
vector (e.g. [1 1 0] => [[:a :b] [:c]]). Since [1 1 0] and [0 0 1] are
identical for my needs, take the distinct groupings (only generating
the distinct groups would only be O(2^{n-1]), so I haven't bothered to
shave computations here).
3. For each grouping, merge identical groups from left to right.
4. Score all the groupings by counting the number of steps (symbols in
these examples). The grouping with the fewest steps is the "best".

I've looked at some string searching algorithms (Boyer-Moore, e.g.),
but these seem geared towards finding a specific substring in a larger
string. Perhaps these algorithms would apply to my problem, but I'm
just not seeing it.

Thanks in advance for your help. All suggestions are welcome.

-Mark

Here are some ideas that won't necessarily give an optimal answer, i.e. a shortest one, but sometimes it will, and even when it doesn't it should give something reasonable.

There are O(n^2) subsequences of your original sequence.

Step 1. For each one, you can in linear time determine whether that subsequence repeats immediately afterwards, and if so, how many times it repeats. Calculate those answers R[i,j] for each of the O(n^2) subsequences and save them, e.g. in a 2-d array indexed by start position i and end position j. That should take O(n^3) time in the worst case, but often closer to O(n^2) depending upon the sequence you have.

Example: I'll use just strings of characters to represent the sequences, to save typing. So your sequence [:a :b :a :b :c :a :b :a :b :c] above I will write as: ababcababc

If we use [i,j] to represent a subsequence starting at index i (0- based, so the first element of the sequence is at 0) and up to but not including the element at index j, then [1,4] represents the subsequence "bab".

Here is part of the table calculated in part 1:

R[0,1] 1 (i.e. "a" repeats 1 time starting at index 0, consecutively. It does not immediately repeat after that)
R[0,2] 2 (i.e. "ab" repeats 2 times starting at index 0, consecutively.)
R[0,3] 1 ("aba" appears only 1 time starting at index 0, consecutively.)
R[0,4] 1
R[0,5] 2
R[0,6] 1
etc.

Step 2. Now go over every subsequence again, this time looking for the shortest way to represent that subsequence using your notation. This will depend upon exactly what you consider the quantitative "savings" is for your repetition notation. In this example, I'll use the number of characters used to write down the notation.

So ababcababc has "cost" 10
but 2[ababc] has cost 8 (I'm counting the digits and the square brackets, too -- you can choose differently in whatever algorithm you choose to implement)

For [0,10], we can either represent it as its original 10-element sequence (cost 10), or maybe as 2 repetitions of [0,5], or maybe as 5 repetitions of [0,2]. In general, a sequence of length N can either be represented as itself, or for each value K that evenly divides N, as (N/K) repetitions of the sequence of length K at its beginning. For each one of those possibilities, use table R to tell in constant time which of those is possible and which is not.

[0,10] - cost 10
2 repetitions of [0,5]? - Yes, this is possible, because R[0,5] >= 2.
5 repetitions of [0,2]? - No, impossible, because R[0,2] < 5.

So for [0,10], there are two ways to represent it: cost 10 in its original unrepeated form, or cost 8 for "2[ababc]".

Let us make a table S[i,j] for all subsequences that tells us how much we can save by representing [i,j] (the whole thing, not just part of it) as a repetition.

So S[i,j]=10-8=2, because we can save 2 characters by using the repetition 2[ababc] in place of writing out the full sequence.

Part of table S would look like this:

S[0,1]=0   No shorter way than writing out "ab", so no savings
S[0,2]=0 No way to write _all_ of "aba" as a repetition, so no savings possible. S[0,3]=0 We could write "abab" as that or as "2[ab]", but both have cost 4, so no savings possible.
S[0,4]=0  Same as S[0,2]
S[0,5]=0
S[0,6]=0
S[0,7]=0
S[0,8]=0
S[0,9]=0
S[0,10]=2

I believe that step 2 would take at most O(n^3) time as well, if you precomputed a table of factors once.


Step 3. Steps 1 and 2 give exact answers for what they calculate, but in this step, I don't yet see an efficient algorithm that is guaranteed to find a "least cost" way of representing the original sequence. Maybe someone else will.

You start with the whole sequence, and do a greedy algorithm on the table S looking for big savings. So you find the largest number in S, and say "I'm going to use that cost savings to represent that part of the original sequence, since it saves so much". Let's say you picked S[3,9] because it was a largest element of S. After that, you cannot pick any subsequence of [3,9], or any subsequence that overlaps with [3,9] at all, and use their savings, because you can't overlap or nest repetitions. So you look for the largest savings possible among all remaining subsequences that do not overlap [3,9]. Keep doing that until you've completely covered the original sequence, or until the only remaining choices give 0 savings.

Andy


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Reply via email to