Sorry to jump in, but I thought it worthwhile to add a couple of points
(apologies for being brief):

1. Reducers work fine with data much larger than memory; you just need to
mmap() the data you're working with so Clojure thinks everything is in
memory when it isn't. Reducer access is largely sequential rather than
random, so even spinning disks work well here (a rough mmap sketch
follows the list).

2. A 40GB XML file is very often many, many smaller XML documents
aggregated together. It's often faster to separate each document onto its
own line (via various UNIX tools) and parse each line separately. I
typically do something like:

    $ zcat bigxml.gz | tr '\n' ' ' | sed 's/<foo>/\n<foo>/g' | grep '^<foo>' > records.xml

(The g flag on sed matters: after tr collapses the file onto a single
line, every <foo> must be split out, not just the first. A sketch of
parsing the resulting lines follows the list.)

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often
use it for reducing over hundreds of GBs of text data. It does what Jozef
suggests and makes a text file a foldable collection (example after the
list).

4. While pmap is great for advertising the power of Clojure, it's likely
safe to say that it should be ignored if you're actually looking for
performance. It's lazy and only keeps a handful of futures in flight, so
its coordination overhead dominates unless each call is expensive (see
the last sketch below).
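
To expand on point #1, here's a minimal sketch of memory-mapping a file
from Clojure via java.nio. This is just an illustration, not anyone's
production code: a single FileChannel.map call tops out around 2GB, so
libraries like Iota map large files in chunks.

    (import '[java.io RandomAccessFile]
            '[java.nio.channels FileChannel FileChannel$MapMode])

    ;; Hypothetical helper: map an entire file read-only into memory.
    ;; Assumes the file fits in a single <2GB mapping. The mapping
    ;; remains valid after the channel is closed.
    (defn mmap-file [^String path]
      (with-open [raf (RandomAccessFile. path "r")]
        (let [chan (.getChannel raf)]
          (.map chan FileChannel$MapMode/READ_ONLY 0 (.size chan)))))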
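
For point #2, once each record is on its own line, parsing becomes
simple and parallel-friendly. A sketch assuming clojure.data.xml is on
the classpath and records.xml is the output of the pipeline above
(parse-records is a made-up helper name):

    (require '[clojure.data.xml :as xml]
             '[clojure.java.io :as io])

    ;; Parse each line as an independent XML document, applying f to
    ;; each parsed record.
    (defn parse-records [path f]
      (with-open [rdr (io/reader path)]
        (doall (map (comp f xml/parse-str) (line-seq rdr)))))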
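
And a small example of point #3, folding over a file with Iota and
reducers, along the lines of Iota's README (double-check the current
API and version before depending on it):

    (require '[iota :as iota]
             '[clojure.core.reducers :as r])

    ;; Total character count across a huge file, computed in parallel.
    ;; iota/vec memory-maps the file; blank lines come back as nil,
    ;; which (r/filter identity) drops.
    (->> (iota/vec "records.xml")
         (r/filter identity)
         (r/map count)
         (r/fold +))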
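
Finally, a quick illustration of point #4. pmap spawns one future per
element and only runs a small window ahead of consumption, so for cheap
per-item work a reducers fold over the same data typically wins:

    (require '[clojure.core.reducers :as r])

    (def xs (vec (range 1000000)))

    ;; Parallel, but coordination-bound: one future per element.
    (time (reduce + (pmap inc xs)))

    ;; Parallel fold over chunks of the vector; far less overhead.
    (time (r/fold + (r/map inc xs)))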


Hope this helps,
    Alan Busby
