I would have thought so - it's only my first cut - seems to work, but I wouldn't like to stake my life on it. It really needs a developer who is familiar with PersistentHashMap to look it over and give it the thumbs up... Still, I guess if it was marked "experimental"... :-)
Jules

On Sunday, 16 February 2014 10:49:38 UTC, Mikera wrote:
>
> Wow - that's a pretty big win. I think we should try to get this into
> Clojure ASAP.
>
> Are we too late for 1.6?
>
> On Sunday, 16 February 2014 18:48:09 UTC+8, Jules wrote:
>>
>> Thanks, Mikera.
>>
>> You are right about merge:
>>
>> user=> (def m1 (apply hash-map (range 10000000)))
>> #'user/m1
>> user=> (def m2 (apply hash-map (range 5000000 15000000)))
>> #'user/m2
>> user=> (time (def m3 (merge m1 m2)))
>> "Elapsed time: 5432.184582 msecs"
>> #'user/m3
>> user=> (time (def m4 (clojure.lang.PersistentHashMap/splice m1 m2)))
>> "Elapsed time: 1064.268269 msecs"
>> #'user/m4
>> user=> (= m3 m4)
>> true
>>
>> As you would expect, the splice is faster and causes less of a memory
>> spike.
>>
>> Jules
>>
>> On Sunday, 16 February 2014 10:01:04 UTC, Mikera wrote:
>>>
>>> +1 for this approach - I've wanted something like this several times.
>>>
>>> It's only an "optimisation", but it's a very useful one. The same
>>> technique could probably be used to accelerate "merge" significantly,
>>> which is a pretty common operation when you are building map-like
>>> structures.
>>>
>>> On Sunday, 16 February 2014 07:06:24 UTC+8, Jules wrote:
>>>>
>>>> Guys,
>>>>
>>>> I've been playing with reducers on and off for a while, but have been
>>>> frustrated because they don't seem to fit a particular use case that I
>>>> have in mind: getting as many associations into a hash-map as I can in
>>>> as short a time as possible.
>>>>
>>>> My understanding of the reason for this is that reducers follow a
>>>> divide-and-conquer strategy. The incoming sequence is divided up, each
>>>> sub-sequence is reduced into a sub-result (possibly in parallel), and
>>>> then the sub-results are combined into the final outgoing result.
>>>>
>>>> Unfortunately, there does not seem to be a better way of combining two
>>>> hash-maps than to read each entry from one and create a new
>>>> corresponding association in the other. This means that each
>>>> recombination in the above process essentially repeats most of the
>>>> work already performed in the previous reduction stage.
>>>>
>>>> Hash-sets are implemented via hash-maps, and are simpler to
>>>> demonstrate the problem with:
>>>>
>>>> user=> (def a (doall (range 10000000)))
>>>> #'user/a
>>>> user=> (def b (doall (range 5000000 15000000)))
>>>> #'user/b
>>>> user=> (time (def c (into #{} a)))
>>>> "Elapsed time: 6319.392669 msecs"
>>>> #'user/c
>>>> user=> (time (def d (into #{} b)))
>>>> "Elapsed time: 5389.805233 msecs"
>>>> #'user/d
>>>> user=> (time (def e (into c d)))
>>>> "Elapsed time: 8486.032191 msecs"
>>>> #'user/e
>>>>
>>>> In the example above, you can see that reducing two overlapping lists
>>>> of 10,000,000 elements each into hash-sets takes 6.3 and 5.4 seconds.
>>>> This stage can be carried out in parallel, i.e. the elapsed time for
>>>> it would be 6.3 seconds - but we now have two hash-sets and we want
>>>> one, so we have to combine them:
>>>>
>>>> user=> (time (def e (into c d)))
>>>> "Elapsed time: 8486.032191 msecs"
>>>> #'user/e
>>>>
>>>> As you can see, all the advantage of splitting the original sequence
>>>> in two and processing the halves in parallel is lost, since the
>>>> recombination of their results takes 8.5 seconds - more than we saved
>>>> by doing the reductions in parallel.
>>>>
>>>> So, what can we do about it?
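(In reducer terms, the recombination step described above is the combining function of a clojure.core.reducers fold. A minimal sketch of that pattern on stock Clojure - the function name is illustrative, not from the thread:

(require '[clojure.core.reducers :as r])

;; Parallel reduction into a hash-set: conj builds each chunk's set,
;; and the combining function merges pairs of chunk results. With
;; `into` as the combiner, every entry of one set is re-hashed into
;; the other - the repeated work described above.
(defn parallel-into-set [v]
  (r/fold
   (fn combinef
     ([] #{})                 ; seed for each chunk's reduce
     ([s1 s2] (into s1 s2)))  ; the expensive recombination
   conj
   v))

;; e.g. (parallel-into-set (vec (range 10000000))))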
>>>> I had a look at the code for PersistentHashMap (PersistentHashSet
>>>> uses PersistentHashMap internally). I realised that it was possible
>>>> to "splice" together the internal structure of two hash-maps into a
>>>> single one without repeating most of the work required to build one
>>>> from scratch. So, I had a go at implementing it:
>>>>
>>>> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d)))
>>>> "Elapsed time: 3052.690911 msecs"
>>>> #'user/f
>>>>
>>>> and:
>>>>
>>>> user=> (= e f)
>>>> true
>>>>
>>>> Whilst this still adds 3 seconds to our time, that 3 seconds is half
>>>> the time that we would have added had we executed the second
>>>> reduction in serial rather than in parallel.
>>>>
>>>> This means that we can now reduce large datasets into sets/maps more
>>>> quickly in parallel than we can in serial :-) As an added benefit,
>>>> because splice reuses as much of the internal structure of both
>>>> inputs as possible, its impact in terms of heap consumption and churn
>>>> is smaller - although I think that a full implementation might add
>>>> some Java-side code complexity.
>>>>
>>>> If you would like to give 'splice' a try, you will need to clone my
>>>> fork of Clojure on GitHub:
>>>>
>>>> https://github.com/JulesGosnell/clojure
>>>>
>>>> Please let me know if you try out the code. I would be interested to
>>>> hear whether people think it is worth pursuing.
>>>>
>>>> I was also thinking that it should be possible to use a similar trick
>>>> to quickly and cheaply split a map/set into [roughly] equal-sized
>>>> pieces (assuming a good hash distribution). This would enable the use
>>>> of a map/set as an input sequence to the parallel reduction process
>>>> outlined above. Currently, I believe that only a vector can be used
>>>> in this way. It would be harder to arrange for 'count' to be
>>>> implemented efficiently on these sub-maps/sets, but this is not
>>>> important for the reduction process.
>>>>
>>>> BTW - benchmarks were run on a 3.2GHz Phenom II / clojure/master /
>>>> openjdk-1.7.0_51 / Fedora 20, with min and max heap of 4GB.
>>>>
>>>> regards,
>>>>
>>>> Jules
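(For anyone trying the fork: a sketch of how splice could slot into the parallel fold above, as the combining step of a reducers fold. The function name is illustrative; splice exists only on the fork linked above, not in mainline Clojure:

(require '[clojure.core.reducers :as r])

;; As before, conj builds each chunk's hash-set, but the chunk
;; results are now combined by splicing their internal trees
;; together instead of re-hashing every entry with `into`.
(defn parallel-splice-set [v]
  (r/fold
   (fn combinef
     ([] #{})
     ([s1 s2] (clojure.lang.PersistentHashSet/splice s1 s2)))
   conj
   v)))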