Thanks for running the second experiment. The more I think about it, getting "typical" performance numbers for something like this is very difficult, just like getting "typical" numbers for sorting algorithms is difficult. Beware the evils of micro-benchmarking.
Does anyone have ideas on how to perform a more rigorous experiment?

Sean

On Feb 18, 12:44 pm, Rowdy Rednose <rowdy.redn...@gmx.net> wrote:
> Yeah, the numbers depend a bit on the data, so distinct will look
> better if there are many dupes. Running tests against
>
>     (mapcat vector (range 500000) (range 500000))
>
> so we have 50% dupes, gave me these numbers:
>
>     user=> (time (count (unique (mapcat vector (range 500000) (range 500000)))))
>     "Elapsed time: 4319.796441 msecs"
>     500000
>     user=> (time (count (better-unique (mapcat vector (range 500000) (range 500000)))))
>     "Elapsed time: 4107.580937 msecs"
>     500000
>     user=> (time (count (distinct (mapcat vector (range 500000) (range 500000)))))
>     "Elapsed time: 7279.815677 msecs"
>     500000
>     user=> (time (count (parti-by-unique (mapcat vector (range 500000) (range 500000)))))
>     "Elapsed time: 5889.013135 msecs"
>     500000
>
> So now my original unique is actually a little bit slower than
> better-unique in all the runs I did. I guess this is due to the
> overhead of atoms?
>
> Still, distinct will never be faster than better-unique, as it has to
> do more.
>
> On Feb 19, 1:26 am, Sean Devlin <francoisdev...@gmail.com> wrote:
> > Rowdy,
> >
> > Does your profiling test do any work? There aren't any duplicates in
> > (range 1000000). Perhaps you should run a few more benchmarks to get
> > a better idea of what is going on.
> >
> > Sean
> >
> > On Feb 18, 10:34 am, Rowdy Rednose <rowdy.redn...@gmx.net> wrote:
> > > Various interesting approaches to this problem... thanks guys!
> > >
> > > Stu, you're right, distinct builds up a set of all items
> > > encountered, which I don't need, and it's costly. But that
> > > function is indeed a good model.
> > > So here's a side-effect-free version modeled after distinct:
> > >
> > >     (defn better-unique
> > >       [coll]
> > >       (let [step (fn step [xs last]
> > >                    (lazy-seq
> > >                      ((fn [[f :as xs] last]
> > >                         (when-let [s (seq xs)]
> > >                           (if (= last f)
> > >                             (recur (rest s) f)
> > >                             (cons f (step (rest s) f)))))
> > >                       xs last)))]
> > >         (step coll (Object.))))
> > >
> > > Doing some micro-benchmarking, it turns out my initial version is
> > > the fastest, but better-unique is comparable.
> > >
> > > This one
> > >
> > >     (defn parti-by-unique [sc] (map first (partition-by identity sc)))
> > >
> > > is much slower, and distinct is the slowest of the ones I tried,
> > > as it does much more than I need. Here are the numbers (and I
> > > don't wanna hear anybody laugh about the absolute values I achieve
> > > on my box ;)
> > >
> > >     user=> (time (count (unique (range 1000000))))
> > >     "Elapsed time: 1070.535608 msecs"
> > >     1000000
> > >     user=> (time (count (better-unique (range 1000000))))
> > >     "Elapsed time: 1510.92021 msecs"
> > >     1000000
> > >     user=> (time (count (parti-by-unique (range 1000000))))
> > >     "Elapsed time: 3344.861758 msecs"
> > >     1000000
> > >     user=> (time (count (distinct (range 1000000))))
> > >     "Elapsed time: 6724.705348 msecs"
> > >     1000000
> > >
> > > And yes, for a generic function like this, one that I want to use
> > > everywhere and anywhere, I do consider performance.
> > >
> > > better-unique is not too far off from my initial approach, but
> > > still, why no side effects? Even though I do not know when the
> > > side effects happen (lazily, eagerly), shouldn't the order be
> > > fixed in the case of filter?
> > >
> > > At least I'm assuming that every reset! of the last atom will
> > > happen at a time such that the result of unique will not be
> > > affected.
> > >
> > > On Feb 18, 11:34 pm, Stuart Halloway <stuart.hallo...@gmail.com>
> > > wrote:
> > > > Rowdy's question asks for less than what core/distinct
> > > > delivers--he only wanted to remove *adjacent* duplicates.
> > > > That said, core/distinct is a better demo of "how do I maintain
> > > > state across a lazy sequence without requiring any mutation?"
> > > >
> > > > Stu
> > > >
> > > > > Hi,
> > > > >
> > > > > On Feb 18, 3:04 pm, Rowdy Rednose <rowdy.redn...@gmx.net> wrote:
> > > > > > "Returns a lazy sequence of the items in coll for which
> > > > > > (pred item) returns true. pred must be free of side-effects."
> > > > > >
> > > > > > So that means I should not write a function like this:
> > > > > >
> > > > > >     (defn unique
> > > > > >       "Returns a lazy sequence with all consecutive duplicates removed"
> > > > > >       [sc]
> > > > > >       (let [last (atom (Object.))]
> > > > > >         (filter #(let [ok (not= @last %)] (reset! last %) ok) sc)))
> > > > > >
> > > > > >     user=> (unique [1 2 3 3 4 5 5 6])
> > > > > >     (1 2 3 4 5 6)
> > > > > >
> > > > > > But in contrast to functions that can be retried
> > > > > > (compare-and-swap etc.), I don't immediately see why having
> > > > > > side effects in filter would be bad. Can anybody enlighten
> > > > > > me? And how should I do this instead?
> > > > >
> > > > > Besides the other suggestions: clojure.core/distinct and its
> > > > > implementation.
> > > > >
> > > > > Sincerely
> > > > > Meikel

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient
with your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
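[A postscript for readers revisiting this thread: two later developments speak to both halves of the discussion. Clojure 1.7 added clojure.core/dedupe, which removes consecutive duplicates lazily and without side effects, and the third-party Criterium library is the usual answer to Sean's call for a more rigorous experiment, since it warms up the JIT and reports statistics over many runs instead of a single (time ...) measurement. A minimal sketch; the Criterium dependency and its version are assumptions, and the test data mirrors Rowdy's 50%-dupes run:]

```clojure
;; clojure.core/dedupe (Clojure 1.7+) removes *consecutive* duplicates --
;; exactly what unique and better-unique do -- with no atoms involved:
(dedupe [1 2 3 3 4 5 5 6])
;; => (1 2 3 4 5 6)

;; Criterium (third-party; assumes e.g. [criterium "0.4.6"] is on the
;; classpath) warms up the JIT and reports mean and variance over many
;; trials, avoiding most single-run micro-benchmarking pitfalls:
(require '[criterium.core :refer [quick-bench]])

(let [data (doall (mapcat vector (range 500000) (range 500000)))]
  ;; realize the input up front so we benchmark dedupe itself,
  ;; not the cost of building the lazy input sequence
  (quick-bench (count (dedupe data))))
```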