Hey there, Thank you so much for all the feedback, I really appreciate that you took the time to review you my code!
I have made few changes since then let me address them, 1) I am aware of the bottom up dynamic algorithm for edit distance (using two dimensional array as a table), I just didn't know enough Clojure implement something like that (without mutation). That's why I picked the top down algorithm with memoization because it was functional and easier to implement in Clojure. I agree with your points, unbounded memo would just not work in a long running program/process as it will eat up all the memory. To tackle that I have used a memoized with LU strategy for now (with 5k threshold value). https://github.com/Who828/fuzzy_matcher/commit/9643ad245f13d07e8e5b213b8597127d8f5d1913 I am grateful for the code example you have shown me, I will go over it this weekend and change my algorithm to the bottom up dynamic programming version. It would be a great opportunity for me to learn more about Clojure as well :) 2) Yes, I don't know what I was thinking when I implemented that search function. I took your advice and refactored it, https://github.com/Who828/fuzzy_matcher/commit/33c1e0b0e6c4c9ba376b9df14a04ed22a3b526c1 Please have a look. Again thank you so much for the feedback, please let me know if you have more feedback or any API/algorithm request for the library. On Thursday, 27 June 2013 23:09:09 UTC+5:30, Jim foo.bar wrote: > > Hi again, > > I see you have implemented your 'search' fn like this: > > > > (defn search [word lst & {:keys [n rank] :or {n 15 rank 2}}] > "Get a list of words based on the minimum distance" > (let [ranked-words (apply hash-map > (mapcat (fn [x] [x (edit-distance word x)]) lst))] > (take n (keys (sort-by val < (filter #(<= (second %) rank) >   > ; ranked-words)))))) > > > > > > > You must have been very tired when you wrote this, weren't you? I bet if > you sit down with a clean head you'll find a much simpler way of doing > this...many people will hopefully agree...allow me to poke your brain :) > > Forget the above fn for a second...fundamentally, given a string and a > list of other strings you're looking *to keep only the strings* (from the > list) whose edit-distance (with respect to that first string) is lower or > equal to some tolerance value. If we generalise it a bit, you're looking > to *filter* in, all the elements that match some predicate. I'm not > giving you the answer but I hope it's pretty obvious now that it is > half-a-line of code :) > > I would perhaps add that the '(take n...' is not needed there as it is the > client's job to consume whatever he needs, but if you consider it > nitpicking, well, I can see your point... :) > > this is all in the spirit of wanting to help... > > Jim > > > > > On 27/06/13 10:55, Jim wrote: > > Hi Smit, > > I hope you don't mind a couple of comments :) > > I had a look at your edit-distance implementation and you've not followed > the recommended approach found in various textbooks (dynamic-programming) > and that's probably why you resorted to memoization... > > You see, trying your code without memoization does not perform very > well...In fact, for strings larger than 8-9 characters your code needs > nearly 10 seconds!!! Also, unrestricted memoization is not really a viable > option here. If you were to do any serious work (on a large corpus for > instance), you'll find that your memory will keep increasing up to the > point where your heap will fill up... > > > So, I can suggest 2 things for you...if you want to keep the 'naive' > levenhstein-distance version, a more clever memoization strategy is what I > 'd go for in order to avoid filling up your cache in a real > program...However, if you want to do it the right way, which will get rid > of your memoization needs, I suggest you look at my hotel_nlp project, in > the 'algorithms' namespace: > > > https://github.com/jimpil/hotel-nlp/blob/master/src/hotel_nlp/algorithms/levenshtein.clj > > One thing that you might find confusing is these 2 let bindings: > > m (inc (count s1)) ; do not skip the last character > n (inc (count s2)) ; do not skip the last character > > I'm only doing that because my 'matrix' fn uses 'range' which excludes the > upper limit...I guess I could change that to be more evident... > > even without memoization, the dynamic-programming solution can do two > 16-character-long strings in just over 1ms which is pretty sweet... :) > > hope that helps, > > Jim > > > > On 27/06/13 05:59, Smit Shah wrote: > > Hey folks, > > This is my first post on the mailing list, I would like to introduce > myself. I have been love with Lisp since I started reading SICP, the way > you express problems and decompose them it's just amazing. I always wanted > to build a real world application in Lisp but I never got around it > (working in a startup leaves you with very little time). > > Recently, I discovered talks by Rich Hickey (thanks to Baishampayan > Ghose!) and I was very impressed. I decided to learn Clojure so I picked up > "Joy of Clojure" and wrote a small library for fuzzy matching > using Levenshtein distance. > > So I have an very early version of the > fuzzy-matcher<https://github.com/Who828/fuzzy_matcher>, > a library to fuzzily match strings in Clojure. > > Currently it has two APIs, > > - edit_distance which gives you Levenshtein distance between two > strings. > - search which takes a string and a list of strings, finally it gives > you a list of strings which sorted by the edit_distance. > > > I am planning to add more algorithms to this library like Sellers, > Hamming, PairDistance etc. I just wanted feedback from you folks. > I really want to help out the Clojure community by contributing back and > learn lot of new things in the process. > > Please if you have any API suggestions or algorithms you want me to add, > let me know. I would do my best to add that feature as soon as possible. > > Regards, > Smit > -- > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clo...@googlegroups.com <javascript:> > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+u...@googlegroups.com <javascript:> > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > --- > You received this message because you are subscribed to the Google Groups > "Clojure" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to clojure+u...@googlegroups.com <javascript:>. > For more options, visit https://groups.google.com/groups/opt_out. > > > > > > -- -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.