Very interesting approach. I implemented an LZ77-like solution, which appears to be working, but it's good to have alternatives. As you point out, different match patterns might be interesting. I tune the matching by disallowing small matches based on length, but a pattern string would be more flexible than hard-coded logic rules (a quick sketch of pushing that length limit into the pattern itself follows the quoted message below).
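
Roughly, the shape of the thing I mean is this -- a simplified, hypothetical sketch of an LZ77-style pass over a token sequence, not my actual implementation (the lz77-ish name, the [:ref offset length] output form, and the min-len parameter are just made up for illustration):

;; Scan left to right; at each position find the longest earlier
;; occurrence of the upcoming tokens. Emit a back-reference only when
;; the match is at least min-len tokens long, otherwise emit a literal.
(defn lz77-ish [tokens min-len]
  (let [v (vec tokens)
        n (count v)]
    (loop [pos 0, out []]
      (if (>= pos n)
        out
        (let [[off len]
              (reduce (fn [[_ best :as acc] start]
                        (let [l (count (take-while true?
                                                   (map = (subvec v start)
                                                          (subvec v pos))))]
                          (if (> l best) [(- pos start) l] acc)))
                      [0 0]
                      (range pos))]
          (if (>= len min-len)
            (recur (+ pos len) (conj out [:ref off len]))
            (recur (inc pos) (conj out (v pos)))))))))

For example, (lz77-ish [:a :b :a :b :c :a :b :a :b :c] 2) gives
[:a :b [:ref 2 2] :c [:ref 5 5]] -- the leading :a and :b stay literal
because no match of length 2 exists yet, and the trailing :a :b :a :b :c
run is picked up as a single back-reference.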
Thanks,
-M

On Feb 19, 1:05 pm, Tyler Perkins <thinks.outs...@gmail.com> wrote:
> Interesting. This problem is like compression, but the first thing I
> thought of was pattern recognition. If your tokens are just characters
> in a string, you can let regular expressions do all the heavy lifting:
>
> (defn rep-groups [s] (re-seq #"(.+)\1+|." s))
>
> Then (rep-groups ".,ababcababc.,cc,.abab."), for example, gives
>
> (["." nil] ["," nil] ["ababcababc" "ababc"] ["." nil] ["," nil]
>  ["cc" "c"] ["," nil] ["." nil] ["abab" "ab"] ["." nil])
>
> Now we just need to convert each of those match vectors into the form
> you want, like this:
>
> (map (fn [[mtch subst]] (if subst
>                           [(/ (count mtch) (count subst)) subst]
>                           mtch))
>      *1)  ;; Where *1 is the match results, above.
>
> Gives
>
> ("." "," [2 "ababc"] "." "," [2 "c"] "," "." [2 "ab"] ".")
>
> Cool! Pretty close to what you need. But you wanted to use arbitrary
> tokens, like :a, :b, etc. So let's just represent each token by a
> unique character. (There are LOTS of characters available.) Here's a
> function that creates two 1-to-1 maps -- one that converts tokens to
> characters, and the other its inverse, converting characters to their
> tokens:
>
> (defn token-to-char-map-and-inverse [token-collection]
>   (let [tokens (seq (set token-collection))
>         chars  (map char (range (count tokens)))]
>     [(zipmap tokens chars) (zipmap chars tokens)]))
>
> We'll need a presentation function like the (fn ...) above. Since it
> will also convert the characters back to tokens, it will also depend
> upon the char-to-token map. So the following function takes such a map
> and returns such a function:
>
> (defn count-subst-fn [c-to-t]
>   (fn [[mtch subst]]
>     (if subst
>       [(/ (count mtch) (count subst)) (map c-to-t subst)]
>       (apply c-to-t mtch))))
>
> Finally, put it all together:
>
> (defn parse-reps [token-seq]
>   (let [[t-to-c c-to-t] (token-to-char-map-and-inverse token-seq)
>         s (apply str (map t-to-c token-seq))]
>     (map (count-subst-fn c-to-t) (rep-groups s))))
>
> For example, (parse-reps [:_ :- :a :b :c :a :b :c :a :a :_]) gives
>
> (:_ :- [2 (:a :b :c)] [2 (:a)] :_)
>
> Notice that the input itself is used to define the set of tokens we're
> using. We could even use a string as input. (parse-reps
> ".,ababcababc.,cc,.abab.") gives
>
> (\. \, [2 (\a \b \a \b \c)] \. \, [2 (\c)] \, \. [2 (\a \b)] \.)
>
> These four (defn ...) provide a simple solution to your problem. Using
> regular expressions also yields a nice separation of concerns, I
> think, and it allows for experimentation with the pattern. As defined
> above, (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives
>
> ([2 (:a :b :a :b :c)])
>
> But let's try redefining the regular expression (note the "?"):
>
> (defn rep-groups [s] (re-seq #"(.+?)\1+|." s))
>
> Then the same (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives the
> less greedy result,
>
> ([2 (:a :b)] :c [2 (:a :b)] :c)
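
Following up on the tuning point above: the minimum repeat length can also be pushed into the pattern itself, so it stays data rather than hard-coded logic. A small hypothetical variant of the quoted rep-groups (the rep-groups-min name and min-len parameter are mine, not from Tyler's post):

;; Build the pattern from a minimum repeat length: only repeating units
;; of at least min-len characters are collapsed; anything shorter falls
;; through to the single-character alternative.
(defn rep-groups-min [s min-len]
  (re-seq (re-pattern (str "(.{" min-len ",}?)\\1+|.")) s))

With min-len 2, (rep-groups-min ".,ababcababc.,cc,.abab." 2) no longer
collapses the "cc" run (each c comes through as a plain single-character
match), and the lazy quantifier turns "ababcababc" into two "abab"
matches plus two literal c's rather than one "ababcababc" match.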