Re: fast parallel reduction into hash-set/map

Jules Sun, 16 Feb 2014 02:58:16 -0800

I would have thought so - it's only my first cut - seems to work but I 
wouldn't like to stake my life on it. It really needs a developer who is 
familiar with PersistentHashMap to look it over and give it the thumbs 
up...Still, I guess if it was marked "experimental" ...:-)


Jules


On Sunday, 16 February 2014 10:49:38 UTC, Mikera wrote:
>
> Wow - that's a pretty big win. I think we should try and get this into 
> Clojure ASAP.
>
> Are we too late for 1.6?
>
> On Sunday, 16 February 2014 18:48:09 UTC+8, Jules wrote:
>>
>> Thanks, Mikera
>>
>> You are right about merge:
>>
>> user=> (def m1 (apply hash-map (range 10000000)))
>> #'user/m1
>> user=> (def m2 (apply hash-map (range 5000000 15000000)))
>> #'user/m2
>> user=> (time (def m3 (merge m1 m2)))
>> "Elapsed time: 5432.184582 msecs"
>> #'user/m3
>> user=> (time (def m4 (clojure.lang.PersistentHashMap/splice m1 m2)))
>> "Elapsed time: 1064.268269 msecs"
>> #'user/m4
>> user=> (= m3 m4)
>> true
>> user=> 
>>
>> as you would expect, a splice is faster and causes less of a memory spike.
>>
>>
>> Jules
>>
>>
>> On Sunday, 16 February 2014 10:01:04 UTC, Mikera wrote:
>>>
>>> +1 for this approach - I've wanted something like this several times.
>>>
>>> It's only an "optimisation", but it's a very useful one. Same technique 
>>> can probably be used to accelerate "merge" significantly  which is a pretty 
>>> common operation when you are building map-like structures.
>>>
>>> On Sunday, 16 February 2014 07:06:24 UTC+8, Jules wrote:
>>>>
>>>> Guys,
>>>>
>>>> I've been playing with reducers on and off for a while but have been 
>>>> frustrated because they don't seem to fit a particular usecase that I have 
>>>> in mind... specifically: getting as many associations into a hash-map as 
>>>> as 
>>>> I can in as short a time as possible.
>>>>
>>>> My understanding of the reason for this is that reducers practice a 
>>>> divide and conquer strategy. The incoming sequence is divided up. Each 
>>>> sub-sequence is reduced into a sub-result (possibly in parallel) and then 
>>>> the sub-results are combined into the the final outgoing result.
>>>>
>>>> Unfortunately, there does not seem to be a better way of combining two 
>>>> hash-maps other than to read each entry from one and create a new 
>>>> corresponding association in the other. This means that each recombination 
>>>> in the above process essentially repeats most of the work already 
>>>> performed 
>>>> in the previous reduction stage.
>>>>
>>>> Hash-sets are implemented via hash-maps, and simpler with which to 
>>>> demonstrate this problem:
>>>>
>>>> user=> (def a (doall (range 10000000)))
>>>> #'user/a
>>>> user=> (def b (doall (range 5000000 15000000)))
>>>> #'user/b
>>>> user=> (time (def c (into #{} a)))
>>>> "Elapsed time: 6319.392669 msecs"
>>>> #'user/c
>>>> user=> (time (def d (into #{} b)))
>>>> "Elapsed time: 5389.805233 msecs"
>>>> #'user/d
>>>> user=> (time (def e (into c d)))
>>>> "Elapsed time: 8486.032191 msecs"
>>>> #'user/e
>>>>
>>>>
>>>> In the example above, you can see that the reduction into hash-sets of 
>>>> two overlapping lists of 10,000,000 elements takes 6.3 and 5.4 seconds. 
>>>> This stage can be carried out in parallel i.e. time elapsed for this stage 
>>>> would be 6.3 seconds - but we now have two hash-sets and we want one, so 
>>>> we 
>>>> have to combine them.
>>>>
>>>>
>>>> user=> (time (def e (into c d)))
>>>> "Elapsed time: 8486.032191 msecs"
>>>> #'user/e
>>>>
>>>> As you can see, all the advantages of splitting the original sequence 
>>>> into 2 and processing the two halves in parallel are lost since the 
>>>> recombination or their results takes 8.5 seconds - more than we saved by 
>>>> doing the reduction in parallel.
>>>>
>>>> So, what can we do about it ?
>>>>
>>>> I had a look at the code for PersistentHashMap (PersistentHashSet uses 
>>>> PersistantHashMap internally). I realised that it was possible to "splice" 
>>>> together the internal structure of two hash maps into a single one without 
>>>> repeating most of the work required to build one from scratch. So, I had a 
>>>> go at implementing it:
>>>>
>>>>
>>>> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d)))
>>>> "Elapsed time: 3052.690911 msecs"
>>>> #'user/f
>>>>
>>>> and:
>>>>
>>>> user=> (= e f)
>>>> true
>>>>
>>>> Whilst this is still adding 3 seconds to our time, that 3 seconds is 
>>>> half the time that we would have added had we executed the second 
>>>> reduction 
>>>> in serial, rather than in parallel.
>>>>
>>>> This means that we can now reduce large datasets into sets/maps more 
>>>> quickly in parallel than we can in serial :-) As an added benefit, because 
>>>> splice reuses as much of the internal structure of both inputs as 
>>>> possible, 
>>>> it's impact in terms of heap consumption and churn is less - although I 
>>>> think that a full implementation might add some Java-side code complexity.
>>>>
>>>> If you would like to give 'splice' a try out, you will need to clone my 
>>>> fork of clojure at github 
>>>>
>>>> https://github.com/JulesGosnell/clojure
>>>>
>>>> Please let me know if you try out the code. I would be interested to 
>>>> hear if people think it is worth pursuing.
>>>>
>>>> I was also thinking that it should be possible to use a similar trick 
>>>> to quickly and cheaply split a map/set into [roughly] equal sized pieces 
>>>> (assuming an good hash distribution). This would enable the use of a 
>>>> map/set as an input sequence into the parallel reduction process outlined 
>>>> above. Currently, I believe that only a vector can be used in this way. It 
>>>> would be harder to arrange that 'count' could be implemented efficiently 
>>>> on 
>>>> these sub-maps/sets, but this is not important for the reduction process.
>>>>
>>>> BTW - benchmarks were run on a 3.2ghz Phenom II / clojure/master / 
>>>> openjdk-1.7.0_51 / Fedora 20 with min and max 4gb ram.
>>>>
>>>> regards,
>>>>
>>>>
>>>>
>>>> Jules
>>>>
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: fast parallel reduction into hash-set/map

Reply via email to