Here is the link to the gist for the code https://gist.github.com/yubrshen/63ffda973aff27d39868
On Wednesday, October 15, 2014 4:12:56 PM UTC+8, Yu Shen wrote: > > In the following code, I'm struggling to be idiomatic and at the same time > to consider save times of iterations. I sometimes find it conflicts between > the two objectives of expressiveness and performance, for example, in order > to save time of iterations over a sequence, I have to use (loop, and recur) > a lot, but for readability and expressiveness, some of such loop structure > could be more elegantly expressed as list expressions of filters, etc. > > I'd like to seek your advise, whether my following code could be improved > in both purposes? > > Thanks a lot for your help! > > Yu Shen > > > (ns dw-pos-study.pairwise-match > (:use incanter.stats) > (:use utils-yushen.util) > (:use clojure.test)) > > > ;; This package tries to find the most likely match between words > independent of order in two phrases. > ;; Care is taken to be efficient in order to support plausible > approximate match of names between users' text nad a database. > > > (defn switch [[x y]] > [y x]) > > > (defn pairwise-distances-sorted > "Returns sorted the pairwise distance between words of search and > candidate in increasing distance order. > The keyword parameter perfect-match is used to control how strict two > words are considered to be perfect match. > By default, it's 0, thus identical words are perfect match. Relax the > degree for perfect match may reduce the number of pairwise distances, and > the sorting them." > > > [search candidate & {:keys [perfect-match] :or {perfect-match 0 }}] > > > (->> > (loop [pairs (for [x search y candidate] [x y]) > distance-map {}] > (if (empty? pairs) > distance-map > (let [pair (first pairs)] > (cond > (get distance-map pair) (recur (rest pairs) distance-map) > (get distance-map (switch pair)) (recur (rest pairs) (merge > distance-map {pair (get distance-map (switch pair))})) > :else (let [distance (apply incanter.stats/levenshtein-distance > pair) > relative-distance (/ distance (apply max (map count > pair))) ; use relative distance to allow comparable tolerance in > determining prefect match across different pair. > ; the maximum length of the words in pair is the > maximum of the distance possible for the pair. > perfect-match? (<= relative-distance perfect-match) ; > try to optimize when there is perfect match found > ] > (recur (if perfect-match? (filter (fn [[x y]] (and (not= x > (first pair)) (not= y (second pair)))) (rest pairs)) (rest pairs)) ; > further reduce the search space of pairwise distances. > (merge distance-map {pair distance}))))))) > (into [], ) ; use sequence, as map's order is not satble > (sort-by second, ))) > > > ;; Copided from > http://programming-puzzler.blogspot.com/2010/07/translating-code-from-python-and-scheme.html > > thanks to mbAugust 2, 2010 at 10:59 PM > (defn remove-first > [item coll] > (lazy-seq > (when-let [s (seq coll)] ; use when-let, it automatically takes care > of returning nil when there are no more items in the sequence. > ; caching of the result of the seq call is that you save one level > of indirection for any following first, next or rest call. > (let [fst (first s)] > (if (= fst item) > (next s) > (cons fst (remove-first item (rest coll)))))))) > > > (defn matched-closest > "Find the closest matches of words between search and candidate > independent of word order, returning the matches, and the residual of not > matching. > The match data contains the matched word pair, the levenshtein distance > between them, and the maximum of word lengths of the words in the pair, > which serves as the context to tell how significant the distance is. > The keyword parameter perfect-match is used to control how strict two > words are considered to be perfect match. > By default, it's 0, thus identical words are perfect match." > > > [search candidate & {:keys [perfect-match] :or {perfect-match 0}}] > > > (let [word-distances-sorted (pairwise-distances-sorted search candidate > :perfect-match perfect-match)] > (loop [matched [] > filtered word-distances-sorted > [not-matched-search not-matched-candidate] [search candidate]] > (if (empty? filtered) > [matched [not-matched-search not-matched-candidate]] > (let [[[a b] d] (first filtered) > matched-updated (conj matched [[a b] d (max (count a) (count > b))]) > ; remove any element with key containing a or b in their > respective position, as [a b] is already considered, there is no point to > cornsider them anymore. > filtered-updated (filter (fn [[[x y] _]] (and (not= a x) > (not= b y))) > filtered)] > (recur matched-updated > filtered-updated > [(remove-first a not-matched-search) (remove-first b > not-matched-candidate)])))))) > > > (is (matched-closest (clojure.string/split "garbage real stuff" #"\s") > (clojure.string/split "real stuff" #"\s")) > [[[["stuff" "stuff"] 0 5] [["real" "real"] 0 4]] ['("garbage") '()]]) > > > (is (matched-closest (clojure.string/split "garbage real stuff" #"\s") > (clojure.string/split "real stuff nonsense" #"\s")) > [[[["stuff" "stuff"] 0 5] [["real" "real"] 0 4] [["garbage" "nonsense"] 7 > 8]] ['() '()]]) > > > (is (matched-closest (clojure.string/split "garbage real staff" #"\s") > (clojure.string/split "reel stuff nonsense" #"\s") :perfect-match 0.5) > [[[["staff" "stuff"] 1 5] [["real" "reel"] 1 4] [["garbage" "nonsense"] 7 > 8]] ['() '()]]) > > > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.