Re: Pimp my algorithm: finding repeating subsequences

Jules Mon, 21 Feb 2011 07:21:14 -0800

Sorry that should be:

(def represent (memoize represent))


On Feb 20, 5:18 pm, Jules <julesjac...@gmail.com> wrote:
> Here is my solution that solves the problem optimally. Observe that
> given a sequence we can do one of two things: (1) represent it as a
> repetition or (2) represent it as a concatenation of repetitions. If
> we had two functions to calculate all the ways of doing (1) and (2) we
> could solve the problem as follows:
>
> (defn represent [s] (apply min-key cost (concat (reps s) (concats
> s))))
>
> That is, we pick out the minimum cost way of representing s as a
> repetition or as a concatenation of repetitions, where reps returns
> all repetitions and concats returns all concatenations.
>
> The cost function to count the number of symbols is:
>
> (defn cost [repr] (apply + (map #(count (% 1)) repr)))
>
> Of course you can use another cost function.
>
> The function reps can be defined as follows:
>
> (defn reps [s]
>   (map vector (filter #(= s (apply concat (repeat (first %) (second
> %))))
>                       (map #(vector (/ (count s) %) (take % s)) (range 1 (+ 1 
> (count
> s)))))))
>
> For example, reps on ababab gives: [3 ab] and [1 ababab]. Note that
> this algorithm is very naive: it tries all repetitions [n (take n s)]
> and eliminates invalid ones. The code may look involved but the
> algorithm is rather simplistic.
>
> The function concats is as follows:
>
> (defn concats [s]
>   (map #(apply concat (map represent (split-at % s))) (range 1 (-
> (count s) 1))))
>
> It works like this. We split s at all possible boundaries: ababab => a
> babab, ab abab, aba bab, abab ab, ababa b. Then we recursively
> calculate the optimal representation of each of the pairs, and
> concatenate the representations. This gives us all ways to represent s
> as a concatenation of optimal representations.
>
> That is all it takes :)
>
> (represent '(a b b a b b a)) => ([1 (a)] [2 (b b a)])
>
> Unfortunately the algorithm is exponential time. We can fix this by
> memoizing represent:
>
> (memoize represent)
>
> Now the algorithm is polynomial time. It is still rather slow for
> large sequences (but there is a lot of room for optimization), and
> gives stack overflows. You can fix this by eliminating the recursion
> and filling the memoization table iteratively.
>
> Jules

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Pimp my algorithm: finding repeating subsequences

Reply via email to