Hi, Andy. No I haven't - Thanks for the pointer - I'll take a look.
I have a very specific usecase - the speeding up of the re-combination of the results of parallel reductions. I've found that the cost of this recombination often impacts badly on the overall timing of such a reduction, but with a little thought we can speed it up because we know that both seqs involved in the recombination are of the same type... Given the proliferation of cores and FP's promise of simple and elegant parallel code, I think that it is critical that Clojure really delivers on this promise and this issue is the last blocker for me for a couple of usecases that I have :-) I hope that explains where I am coming from - does it seem reasonable ? am I missing anything ? thanks again, Jules On Thursday, 20 February 2014 23:25:29 UTC, Andy Fingerhut wrote: > > Have you looked at core.rrb-vector, which implements Relaxed Radix Trees > for vectors that implement subvec and concatenating two arbitrary vectors > in O(log n) time? > > https://github.com/clojure/core.rrb-vector > > Unless I am missing something, the fast concatenation is what you are > trying to achieve? > > Andy > > > On Thu, Feb 20, 2014 at 2:59 PM, Jules <jules....@gmail.com > <javascript:>>wrote: > >> >> So, having broken the back of fast re-combination of hash sets and maps, >> I wanted to take a look at doing a similar sort of thing for vectors - >> another type of seq that I use very heavily in this sort of situation. >> >> Let's see the cake first: >> >> seqspert.vector=> (def a (into [] (range 10000000))) >> #'seqspert.vector/a >> seqspert.vector=> (time (def b (into a a))) >> "Elapsed time: 3695.682654 msecs" >> #'seqspert.vector/b >> seqspert.vector=> (time (def c (supervec a a))) >> "Elapsed time: 0.253922 msecs" >> #'seqspert.vector/c >> seqspert.vector=> (= b c) >> true >> >> As you can see, the usual method of combining two vectors into a third >> takes 3-4 secs to combine two 10M entry vectors. >> >> Using 'supervec' it takes MUCH less time..... >> >> How ? >> >> I looked at vector's impl and quickly realised that I would not be able >> to use the same trick that I had used for hash set/map. Then I looked at >> subvec and realised that I could do a very similar trick to create supervec. >> >> Subvec provides a view on a subseq of a vector that behaves like a full >> vector. Supervec provides a similar view that makes two vectors behave like >> a single one: >> >> seqspert.vector=> (supervec [1 2 3] [4 5 6]) >> [1 2 3 4 5 6] >> seqspert.vector=> (type (supervec [1 2 3] [4 5 6])) >> clojure.lang.APersistentVector$SuperVector >> >> We can use seqspert (see my other posting this evening) to look inside a >> SuperVector: >> >> seqspert.vector=> (decloak (supervec [1 2 3] [4 5 6])) >> #seqspert.vector.SuperVector{:left #seqspert.vector.Vector{:cnt 3, :shift >> 5, :root #seqspert.vector.VectorNode{:array [nil nil nil nil nil nil nil >> nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil >> nil nil nil nil nil nil]}, :tail [1 2 3]}, :right >> #seqspert.vector.Vector{:cnt 3, :shift 5, :root >> #seqspert.vector.VectorNode{:array [nil nil nil nil nil nil nil nil nil nil >> nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil >> nil nil nil]}, :tail [4 5 6]}, :middle 3, :count 6} >> seqspert.vector=> (p/pprint (decloak (supervec [1 2 3] [4 5 6]))) >> {:left >> {:cnt 3, >> :shift 5, >> :root >> {:array >> [nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil]}, >> :tail [1 2 3]}, >> :right >> {:cnt 3, >> :shift 5, >> :root >> {:array >> [nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil >> nil]}, >> :tail [4 5 6]}, >> :middle 3, >> :count 6} >> nil >> seqspert.vector=> >> >> It looks very complicated - that is the two clojurer vectors held inside >> ! A SuperVector has fields left, right (the two vectors that we are >> combining), middle (the length of the left vector), and count (the sum of >> the counts of left and right). >> >> With just this info and very little code, I found myself with a working >> impl (except 'pop' but who 'pop's a vector?). >> >> OK - there is a tiny extra runtime overhead associated with using a >> supervec - but, when you look at what is involved in dereffing into a >> vector, this quickly appears insignificant. It's about the same a using a >> subvec. >> >> The SuperVector code is checked into my clojure fork on github - see >> beginning of this thread. Seqspert is also at github - see the announcement >> I posted this evening. >> >> Finally 'supervec' is currently defined as: >> >> (defn supervec [l r] (APersistentVector$SuperVector. {} l r)) >> >> I am thinking or renaming 'splice' from my earlier posting to 'combine' >> and creating a Combinable interface, to be implemented by any seq type that >> is able to do a fast re-combination with another seq of the same type. This >> could be leveraged by e.g. 'into' to deliver significant performance >> benefits and footprint reduction. >> >> more as and when, >> >> >> Jules >> >> >> >> >> >> On Saturday, 15 February 2014 23:06:24 UTC, Jules wrote: >> >>> Guys, >>> >>> I've been playing with reducers on and off for a while but have been >>> frustrated because they don't seem to fit a particular usecase that I have >>> in mind... specifically: getting as many associations into a hash-map as as >>> I can in as short a time as possible. >>> >>> My understanding of the reason for this is that reducers practice a >>> divide and conquer strategy. The incoming sequence is divided up. Each >>> sub-sequence is reduced into a sub-result (possibly in parallel) and then >>> the sub-results are combined into the the final outgoing result. >>> >>> Unfortunately, there does not seem to be a better way of combining two >>> hash-maps other than to read each entry from one and create a new >>> corresponding association in the other. This means that each recombination >>> in the above process essentially repeats most of the work already performed >>> in the previous reduction stage. >>> >>> Hash-sets are implemented via hash-maps, and simpler with which to >>> demonstrate this problem: >>> >>> user=> (def a (doall (range 10000000))) >>> #'user/a >>> user=> (def b (doall (range 5000000 15000000))) >>> #'user/b >>> user=> (time (def c (into #{} a))) >>> "Elapsed time: 6319.392669 msecs" >>> #'user/c >>> user=> (time (def d (into #{} b))) >>> "Elapsed time: 5389.805233 msecs" >>> #'user/d >>> user=> (time (def e (into c d))) >>> "Elapsed time: 8486.032191 msecs" >>> #'user/e >>> >>> >>> In the example above, you can see that the reduction into hash-sets of >>> two overlapping lists of 10,000,000 elements takes 6.3 and 5.4 seconds. >>> This stage can be carried out in parallel i.e. time elapsed for this stage >>> would be 6.3 seconds - but we now have two hash-sets and we want one, so we >>> have to combine them. >>> >>> >>> user=> (time (def e (into c d))) >>> "Elapsed time: 8486.032191 msecs" >>> #'user/e >>> >>> As you can see, all the advantages of splitting the original sequence >>> into 2 and processing the two halves in parallel are lost since the >>> recombination or their results takes 8.5 seconds - more than we saved by >>> doing the reduction in parallel. >>> >>> So, what can we do about it ? >>> >>> I had a look at the code for PersistentHashMap (PersistentHashSet uses >>> PersistantHashMap internally). I realised that it was possible to "splice" >>> together the internal structure of two hash maps into a single one without >>> repeating most of the work required to build one from scratch. So, I had a >>> go at implementing it: >>> >>> >>> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d))) >>> "Elapsed time: 3052.690911 msecs" >>> #'user/f >>> >>> and: >>> >>> user=> (= e f) >>> true >>> >>> Whilst this is still adding 3 seconds to our time, that 3 seconds is >>> half the time that we would have added had we executed the second reduction >>> in serial, rather than in parallel. >>> >>> This means that we can now reduce large datasets into sets/maps more >>> quickly in parallel than we can in serial :-) As an added benefit, because >>> splice reuses as much of the internal structure of both inputs as possible, >>> it's impact in terms of heap consumption and churn is less - although I >>> think that a full implementation might add some Java-side code complexity. >>> >>> If you would like to give 'splice' a try out, you will need to clone my >>> fork of clojure at github >>> >>> https://github.com/JulesGosnell/clojure >>> >>> Please let me know if you try out the code. I would be interested to >>> hear if people think it is worth pursuing. >>> >>> I was also thinking that it should be possible to use a similar trick to >>> quickly and cheaply split a map/set into [roughly] equal sized pieces >>> (assuming an good hash distribution). This would enable the use of a >>> map/set as an input sequence into the parallel reduction process outlined >>> above. Currently, I believe that only a vector can be used in this way. It >>> would be harder to arrange that 'count' could be implemented efficiently on >>> these sub-maps/sets, but this is not important for the reduction process. >>> >>> BTW - benchmarks were run on a 3.2ghz Phenom II / clojure/master / >>> openjdk-1.7.0_51 / Fedora 20 with min and max 4gb ram. >>> >>> regards, >>> >>> >>> >>> Jules >>> >>> >>> -- >> You received this message because you are subscribed to the Google >> Groups "Clojure" group. >> To post to this group, send email to clo...@googlegroups.com<javascript:> >> Note that posts from new members are moderated - please be patient with >> your first post. >> To unsubscribe from this group, send email to >> clojure+u...@googlegroups.com <javascript:> >> For more options, visit this group at >> http://groups.google.com/group/clojure?hl=en >> --- >> You received this message because you are subscribed to the Google Groups >> "Clojure" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to clojure+u...@googlegroups.com <javascript:>. >> For more options, visit https://groups.google.com/groups/opt_out. >> > > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.