On Feb 13, 2011, at 10:11 AM, Mark Fredrickson wrote:
Hello friends,
I am writing a program to generate directions for humans to read. The
directions are composed of a collection with a series of steps, some
of which may be duplicated:
[:a :b :c :a :b :c]
I would like to compress them by indicating repeats, using this
notation:
[[2 [:a :b :c]]
Since the directions will be read by humans (after some formatting), I
am limiting nesting to one level. So the following would be an invalid
transformation:
[:a :b :a :b :c :a :b :a :b :c] => [[2 [2 [:a :b]] :c]]
My current algorithm for the compress has O(2^n) complexity, which
makes it useless except for the most trivial cases. I'm looking for
something that runs in polynomial time or better. To give some
context, my current algorithm is as follows:
1. Generate the set indices for the power set of the directions (e.g.
(powerset [:a :b :c]) => [[] [1 0 0] [1 1 0] ...]).
2. For each index, use partition-by to generate a set of groups of the
vector (e.g. [1 1 0] => [[:a :b] [:c]]). Since [1 1 0] and [0 0 1] are
identical for my needs, take the distinct groupings (only generating
the distinct groups would only be O(2^{n-1]), so I haven't bothered to
shave computations here).
3. For each grouping, merge identical groups from left to right.
4. Score all the groupings by counting the number of steps (symbols in
these examples). The grouping with the fewest steps is the "best".
I've looked at some string searching algorithms (Boyer-Moore, e.g.),
but these seem geared towards finding a specific substring in a larger
string. Perhaps these algorithms would apply to my problem, but I'm
just not seeing it.
Thanks in advance for your help. All suggestions are welcome.
-Mark
Here are some ideas that won't necessarily give an optimal answer,
i.e. a shortest one, but sometimes it will, and even when it doesn't
it should give something reasonable.
There are O(n^2) subsequences of your original sequence.
Step 1. For each one, you can in linear time determine whether that
subsequence repeats immediately afterwards, and if so, how many times
it repeats. Calculate those answers R[i,j] for each of the O(n^2)
subsequences and save them, e.g. in a 2-d array indexed by start
position i and end position j. That should take O(n^3) time in the
worst case, but often closer to O(n^2) depending upon the sequence you
have.
Example: I'll use just strings of characters to represent the
sequences, to save typing. So your sequence
[:a :b :a :b :c :a :b :a :b :c] above I will write as: ababcababc
If we use [i,j] to represent a subsequence starting at index i (0-
based, so the first element of the sequence is at 0) and up to but not
including the element at index j, then [1,4] represents the
subsequence "bab".
Here is part of the table calculated in part 1:
R[0,1] 1 (i.e. "a" repeats 1 time starting at index 0, consecutively.
It does not immediately repeat after that)
R[0,2] 2 (i.e. "ab" repeats 2 times starting at index 0, consecutively.)
R[0,3] 1 ("aba" appears only 1 time starting at index 0, consecutively.)
R[0,4] 1
R[0,5] 2
R[0,6] 1
etc.
Step 2. Now go over every subsequence again, this time looking for the
shortest way to represent that subsequence using your notation. This
will depend upon exactly what you consider the quantitative "savings"
is for your repetition notation. In this example, I'll use the number
of characters used to write down the notation.
So ababcababc has "cost" 10
but 2[ababc] has cost 8 (I'm counting the digits and the square
brackets, too -- you can choose differently in whatever algorithm you
choose to implement)
For [0,10], we can either represent it as its original 10-element
sequence (cost 10), or maybe as 2 repetitions of [0,5], or maybe as 5
repetitions of [0,2]. In general, a sequence of length N can either
be represented as itself, or for each value K that evenly divides N,
as (N/K) repetitions of the sequence of length K at its beginning.
For each one of those possibilities, use table R to tell in constant
time which of those is possible and which is not.
[0,10] - cost 10
2 repetitions of [0,5]? - Yes, this is possible, because R[0,5] >= 2.
5 repetitions of [0,2]? - No, impossible, because R[0,2] < 5.
So for [0,10], there are two ways to represent it: cost 10 in its
original unrepeated form, or cost 8 for "2[ababc]".
Let us make a table S[i,j] for all subsequences that tells us how much
we can save by representing [i,j] (the whole thing, not just part of
it) as a repetition.
So S[i,j]=10-8=2, because we can save 2 characters by using the
repetition 2[ababc] in place of writing out the full sequence.
Part of table S would look like this:
S[0,1]=0 No shorter way than writing out "ab", so no savings
S[0,2]=0 No way to write _all_ of "aba" as a repetition, so no
savings possible.
S[0,3]=0 We could write "abab" as that or as "2[ab]", but both have
cost 4, so no savings possible.
S[0,4]=0 Same as S[0,2]
S[0,5]=0
S[0,6]=0
S[0,7]=0
S[0,8]=0
S[0,9]=0
S[0,10]=2
I believe that step 2 would take at most O(n^3) time as well, if you
precomputed a table of factors once.
Step 3. Steps 1 and 2 give exact answers for what they calculate, but
in this step, I don't yet see an efficient algorithm that is
guaranteed to find a "least cost" way of representing the original
sequence. Maybe someone else will.
You start with the whole sequence, and do a greedy algorithm on the
table S looking for big savings. So you find the largest number in S,
and say "I'm going to use that cost savings to represent that part of
the original sequence, since it saves so much". Let's say you picked
S[3,9] because it was a largest element of S. After that, you cannot
pick any subsequence of [3,9], or any subsequence that overlaps with
[3,9] at all, and use their savings, because you can't overlap or nest
repetitions. So you look for the largest savings possible among all
remaining subsequences that do not overlap [3,9]. Keep doing that
until you've completely covered the original sequence, or until the
only remaining choices give 0 savings.
Andy
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en