Very interesting approach. I implemented an LZ77-like solution, which appears to be working, but it's good to have alternatives. As you point out, different match patterns might be interesting. I tune the matching by disallowing small matches based on length, but a pattern string would be more flexible than hard-coded logic rules (a quick sketch of pushing that length limit into the pattern itself follows the quoted message below).
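
Roughly, the shape of the thing I mean is this -- a simplified, hypothetical sketch of an LZ77-style pass over a token sequence, not my actual implementation (the lz77-ish name, the [:ref offset length] output form, and the min-len parameter are just made up for illustration):

;; Scan left to right; at each position find the longest earlier
;; occurrence of the upcoming tokens. Emit a back-reference only when
;; the match is at least min-len tokens long, otherwise emit a literal.
(defn lz77-ish [tokens min-len]
  (let [v (vec tokens)
        n (count v)]
    (loop [pos 0, out []]
      (if (>= pos n)
        out
        (let [[off len]
              (reduce (fn [[_ best :as acc] start]
                        (let [l (count (take-while true?
                                                   (map = (subvec v start)
                                                          (subvec v pos))))]
                          (if (> l best) [(- pos start) l] acc)))
                      [0 0]
                      (range pos))]
          (if (>= len min-len)
            (recur (+ pos len) (conj out [:ref off len]))
            (recur (inc pos) (conj out (v pos)))))))))

For example, (lz77-ish [:a :b :a :b :c :a :b :a :b :c] 2) gives
[:a :b [:ref 2 2] :c [:ref 5 5]] -- the leading :a and :b stay literal
because no match of length 2 exists yet, and the trailing :a :b :a :b :c
run is picked up as a single back-reference.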
Thanks,
-M

On Feb 19, 1:05 pm, Tyler Perkins <thinks.outs...@gmail.com> wrote:
> Interesting. This problem is like compression, but the first thing I
> thought of was pattern recognition. If your tokens are just characters
> in a string, you can let regular expressions do all the heavy lifting:
>
> (defn rep-groups [s] (re-seq #"(.+)\1+|." s))
>
> Then (rep-groups ".,ababcababc.,cc,.abab."), for example, gives
>
> (["." nil] ["," nil] ["ababcababc" "ababc"] ["." nil] ["," nil]
>  ["cc" "c"] ["," nil] ["." nil] ["abab" "ab"] ["." nil])
>
> Now we just need to convert each of those match vectors into the form
> you want, like this:
>
> (map (fn [[mtch subst]] (if subst
>                           [(/ (count mtch) (count subst)) subst]
>                           mtch))
>      *1)  ;; Where *1 is the match results, above.
>
> Gives
>
> ("." "," [2 "ababc"] "." "," [2 "c"] "," "." [2 "ab"] ".")
>
> Cool! Pretty close to what you need. But you wanted to use arbitrary
> tokens, like :a, :b, etc. So let's just represent each token by a
> unique character. (There are LOTS of characters available.) Here's a
> function that creates two 1-to-1 maps -- one that converts tokens to
> characters, and the other its inverse, converting characters to their
> tokens:
>
> (defn token-to-char-map-and-inverse [token-collection]
>   (let [tokens (seq (set token-collection))
>         chars  (map char (range (count tokens)))]
>     [(zipmap tokens chars) (zipmap chars tokens)]))
>
> We'll need a presentation function like the (fn ...) above. Since it
> will also convert the characters back to tokens, it will also depend
> upon the char-to-token map. So the following function takes such a map
> and returns such a function:
>
> (defn count-subst-fn [c-to-t]
>   (fn [[mtch subst]]
>     (if subst
>       [(/ (count mtch) (count subst)) (map c-to-t subst)]
>       (apply c-to-t mtch))))
>
> Finally, put it all together:
>
> (defn parse-reps [token-seq]
>   (let [[t-to-c c-to-t] (token-to-char-map-and-inverse token-seq)
>         s (apply str (map t-to-c token-seq))]
>     (map (count-subst-fn c-to-t) (rep-groups s))))
>
> For example, (parse-reps [:_ :- :a :b :c :a :b :c :a :a :_]) gives
>
> (:_ :- [2 (:a :b :c)] [2 (:a)] :_)
>
> Notice that the input itself is used to define the set of tokens we're
> using. We could even use a string as input. (parse-reps
> ".,ababcababc.,cc,.abab.") gives
>
> (\. \, [2 (\a \b \a \b \c)] \. \, [2 (\c)] \, \. [2 (\a \b)] \.)
>
> These four (defn ...) provide a simple solution to your problem. Using
> regular expressions also yields a nice separation of concerns, I
> think, and it allows for experimentation with the pattern. As defined
> above, (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives
>
> ([2 (:a :b :a :b :c)])
>
> But let's try redefining the regular expression (note the "?"):
>
> (defn rep-groups [s] (re-seq #"(.+?)\1+|." s))
>
> Then the same (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives the
> less greedy result,
>
> ([2 (:a :b)] :c [2 (:a :b)] :c)
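
Following up on the tuning point above: the minimum repeat length can also be pushed into the pattern itself, so it stays data rather than hard-coded logic. A small hypothetical variant of the quoted rep-groups (the rep-groups-min name and min-len parameter are mine, not from Tyler's post):

;; Build the pattern from a minimum repeat length: only repeating units
;; of at least min-len characters are collapsed; anything shorter falls
;; through to the single-character alternative.
(defn rep-groups-min [s min-len]
  (re-seq (re-pattern (str "(.{" min-len ",}?)\\1+|.")) s))

With min-len 2, (rep-groups-min ".,ababcababc.,cc,.abab." 2) no longer
collapses the "cc" run (each c comes through as a plain single-character
match), and the lazy quantifier turns "ababcababc" into two "abab"
matches plus two literal c's rather than one "ababcababc" match.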