I've been experimenting with reducers using a small example that counts the
words in Wikipedia pages by parsing the Wikipedia XML dump. The basic structure
of the code is:
    (frequencies (flatten (map get-words (get-pages))))
where get-pages returns a lazy sequence of pages from the XML dump and
get-words takes a page and returns a sequence of the words on that page. The
above code takes ~40s to count the words on the first 10000 pages.
If I convert that code to use reducers, it runs in ~22s (yay!).
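As a rough sketch of what a reducers version of that pipeline can look like (not the exact code I'm running: get-pages and get-words below are toy stand-ins for the real XML-parsing functions, and count-words-reducers is just a name made up for illustration), r/mapcat can replace the map/flatten pair so that no intermediate lazy sequence of words is built:

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; Hypothetical stand-ins for the real functions, for illustration only.
;; The real get-pages returns a lazy sequence of pages from the XML dump.
(defn get-pages []
  ["the quick brown fox" "the lazy dog"])

(defn get-words [page]
  (str/split page #"\s+"))

(defn count-words-reducers [pages]
  ;; r/mapcat returns a reducible rather than a lazy seq; frequencies
  ;; consumes it via reduce, so the words are counted as they are produced.
  (frequencies (r/mapcat get-words pages)))

(count-words-reducers (get-pages))
```

Because core's frequencies is built on reduce, it accepts the reducible directly.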
If I convert it to use fold and therefore run in parallel, it runs in ~13s on
my 4-core MacBook Pro. So it's faster (yay!) but nowhere near 4x faster (boo).
The primary reason for this is that, in order to be able to use fold, I've had
to write my own version of frequencies:
    (defn frequencies-parallel [words]
      (r/fold (partial merge-with +)
              (fn [counts x] (assoc counts x (inc (get counts x 0))))
              words))
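A small usage sketch of the function above, assuming r is an alias for clojure.core.reducers and using a made-up vector of words in place of the Wikipedia data. The input is a vector because fold only splits work across threads for foldable collections; for anything else it falls back to a sequential reduce:

```clojure
(require '[clojure.core.reducers :as r])

(defn frequencies-parallel [words]
  (r/fold (partial merge-with +)   ; combine: merge per-chunk count maps
          (fn [counts x] (assoc counts x (inc (get counts x 0))))
          words))

;; Illustrative input only: 10000 words cycling through four values.
;; A vector is foldable, so fold can split it into chunks.
(def sample-words (vec (take 10000 (cycle ["a" "b" "c" "d"]))))

(frequencies-parallel sample-words)
```

Note that calling the combine function with no arguments gives nil here, which works as the starting value because assoc and get both treat nil as an empty map.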
And, unlike the version in core, this doesn't use transients. If I replace the
fold with reduce (i.e. make it run sequentially) it runs in ~43s.
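For comparison, here is a sketch of the two sequential shapes side by side. The first is essentially how frequencies in core is written (a transient map updated with assoc! and made persistent at the end); the second is the reduce-only variant of my function. The names frequencies-transient and frequencies-persistent are just labels for this example:

```clojure
;; Essentially core's frequencies: one transient map, mutated in place
;; during the reduce and converted back with persistent! at the end.
(defn frequencies-transient [coll]
  (persistent!
    (reduce (fn [counts x]
              (assoc! counts x (inc (get counts x 0))))
            (transient {})
            coll)))

;; The same reduction over an ordinary persistent map: every assoc
;; returns a new map, which is where the extra cost comes from.
(defn frequencies-persistent [coll]
  (reduce (fn [counts x]
            (assoc counts x (inc (get counts x 0))))
          {}
          coll))

(= (frequencies-transient ["a" "b" "a"])
   (frequencies-persistent ["a" "b" "a"]))
```

Both produce the same map; only the cost of each update differs.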
So, I *am* getting close to a 4x speedup from parallelising the code, but
unfortunately I'm also seeing a 2x slowdown because I can't use transients.
Can anyone think of any way that it would be possible to modify this code to
use transients? Or any way to modify reducers to allow transients to be used?
--
paul.butcher->msgCount++
Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?
http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: [email protected]
AIM: paulrabutcher
Skype: paulrabutcher
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en