Re: New Fuzzy Matching Library - Fuzzy Matcher

Smit Shah Fri, 28 Jun 2013 05:26:01 -0700

Hey there,

Thank you so much for all the feedback, I really appreciate that you took 
the time to review you my code!


I have made few changes since then let me address them,

1)  I am aware of the bottom up dynamic algorithm for edit distance (using 
two dimensional array as a table), I just didn't know enough Clojure 
implement something like that (without mutation). That's why I picked the 
top down algorithm with memoization because it was functional and easier to 
implement in Clojure.

I agree with your points, unbounded memo would just not work in a long 
running program/process as it will eat up all the memory. To tackle that I 
have used a memoized with LU strategy for now (with 5k threshold value). 

https://github.com/Who828/fuzzy_matcher/commit/9643ad245f13d07e8e5b213b8597127d8f5d1913

I am grateful for the code example you have shown me, I will go over it 
this weekend and change my algorithm to the bottom up dynamic programming 
version. It would be a great opportunity for me to learn more about Clojure 
as well :)


2) Yes, I don't know what I was thinking when I implemented that search 
function. I took your advice and refactored it,

https://github.com/Who828/fuzzy_matcher/commit/33c1e0b0e6c4c9ba376b9df14a04ed22a3b526c1

Please have a look.

Again thank you so much for the feedback, please let me know if you have 
more feedback or any API/algorithm request for the library.

On Thursday, 27 June 2013 23:09:09 UTC+5:30, Jim foo.bar wrote:
>
>  Hi again,
>
> I see you have implemented your 'search' fn like this:
>
>   
>
> (defn search [word lst & {:keys [n rank] :or {n 15 rank 2}}]
>   "Get a list of words based on the minimum distance"
>   (let [ranked-words (apply hash-map
>                             (mapcat (fn [x] [x (edit-distance word x)]) lst))]
>     (take n (keys (sort-by val < (filter #(<= (second %) rank)
>              &nbsp
> ;                           ranked-words))))))
>
>
>
>
>
>
> You must have been very tired when you wrote this, weren't you? I bet if 
> you sit down with a clean head you'll find a much simpler way of doing 
> this...many people will hopefully agree...allow me to poke your brain :)
>
> Forget the above fn for a second...fundamentally, given a string and a 
> list of other strings you're looking *to keep only the strings* (from the 
> list) whose edit-distance (with respect to that first string) is lower or 
> equal to some tolerance value. If we generalise it a bit,  you're looking 
> to *filter* in, all the elements that match some predicate.  I'm not 
> giving you the answer but I hope it's pretty obvious now that it is 
> half-a-line of code :)
>
> I would perhaps add that the '(take n...' is not needed there as it is the 
> client's job to consume whatever he needs, but if you consider it 
> nitpicking, well, I can see your point... :)
>
> this is all in the spirit of wanting to help...
>
> Jim
>
>
>
>
> On 27/06/13 10:55, Jim wrote:
>  
> Hi Smit,
>
> I hope you don't mind a couple of comments :)
>
> I had a look at your edit-distance implementation and you've not followed 
> the recommended approach found in various textbooks (dynamic-programming) 
> and that's probably why you resorted to memoization...
>
> You see, trying your code without memoization does not perform very 
> well...In fact, for strings larger than 8-9 characters your code needs 
> nearly 10 seconds!!! Also, unrestricted memoization is not really a viable 
> option here. If you were to do any serious work (on a large corpus for 
> instance), you'll find that your memory will keep increasing up to the 
> point where your heap will fill up...
>
>
> So, I can suggest 2 things for you...if you want to keep the 'naive' 
> levenhstein-distance version,  a more clever memoization strategy is what I 
> 'd go for in order to avoid filling up your cache in a real 
> program...However, if you want to do it the right way, which will get rid 
> of your memoization needs, I suggest you look at my hotel_nlp project, in 
> the 'algorithms' namespace: 
>
>  
> https://github.com/jimpil/hotel-nlp/blob/master/src/hotel_nlp/algorithms/levenshtein.clj
>
> One thing that you might find confusing is these 2 let bindings:
>
> m (inc (count s1)) ; do not skip the last character
> n (inc (count s2))  ; do not skip the last character
>
> I'm only doing that because my 'matrix' fn uses 'range' which excludes the 
> upper limit...I guess I could change that to be more evident...
>
> even without memoization, the dynamic-programming solution can do two 
> 16-character-long strings in just over 1ms which is pretty sweet... :)
>
> hope that helps,
>
> Jim 
>  
>
>
> On 27/06/13 05:59, Smit Shah wrote:
>  
> Hey folks, 
>
>  This is my first post on the mailing list, I would like to introduce 
> myself. I have been love with Lisp since I started reading SICP, the way 
> you express problems and decompose them it's just amazing. I always wanted 
> to build a real world application in Lisp but I never got around it 
> (working in a startup leaves you with very little time).
>
>  Recently, I discovered talks by Rich Hickey (thanks to Baishampayan 
> Ghose!) and I was very impressed. I decided to learn Clojure so I picked up 
> "Joy of Clojure"  and wrote a small library for fuzzy matching 
> using Levenshtein distance.
>
>  So I have an very early version of the 
> fuzzy-matcher<https://github.com/Who828/fuzzy_matcher>, 
> a library to fuzzily match strings in Clojure.
>
>  Currently it has two APIs,
>  
>    - edit_distance which gives you Levenshtein distance between two 
>    strings. 
>    - search which takes a string and a list of strings, finally it gives 
>    you a list of strings which sorted by the edit_distance. 
>
>
>  I am planning to add more algorithms to this library like Sellers, 
> Hamming, PairDistance etc. I just wanted feedback from you folks.  
> I really want to help out the Clojure community by contributing back and 
> learn lot of new things in the process.
>
>  Please if you have any API suggestions or algorithms you want me to add, 
> let me know. I would do my best to add that feature as soon as possible.
>
>  Regards,
>  Smit
> -- 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com <javascript:>
> Note that posts from new members are moderated - please be patient with 
> your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com <javascript:>
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> --- 
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to clojure+u...@googlegroups.com <javascript:>.
> For more options, visit https://groups.google.com/groups/opt_out.
>  
>  
>
>
>  
>  

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: New Fuzzy Matching Library - Fuzzy Matcher

Reply via email to