Re: Parallelising over a lazy sequence - request for help

2013-10-02 Thread Paul Butcher
Alan, Apologies for the delayed reply - I remember Iota well (there was some cross-fertilisation between it and foldable-seq a few months back IIRC :-) Having said that, I don't think that Iota will help in my particular situation (although I'd be delighted to be proven wrong)? Given that the f

Re: Parallelising over a lazy sequence - request for help

2013-09-30 Thread Alan Busby
Sorry to jump in, but I thought it worthwhile to add a couple of points (sorry for being brief): 1. Reducers work fine with data much larger than memory; you just need to mmap() the data you're working with so Clojure thinks everything is in memory when it isn't. Reducer access is fairly sequential,

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Brian Craft
On the other hand, it is 2013, not 2003. 40G is small in terms of modern hardware. Servers with a terabyte of RAM have been available for a while, at prices within the reach of many projects. "Large" data in this decade is measured in petabytes, at least. On Sunday, September 29, 2013 5:13:14 PM UTC-7, Pa

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Thanks - when I said "small", I was referring to the fact that your tests were using the first 1 pages, as opposed to the entire data dump. Sorry if I was unclear or misunderstood. On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote: > > The dataset I'm using is a Wikipedia d

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Butcher
On 29 Sep 2013, at 22:58, Paul Mooser wrote: > Paul, is there any easy way to get the (small) dataset you're working with, > so we can run your actual code against the same data? The dataset I'm using is a Wikipedia dump, which hardly counts as "small" :-) Having said that, the first couple of

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data? On Saturday, May 25, 2013 9:34:15 AM UTC-7, Paul Butcher wrote: > > > The example counts the words contained within a Wikipedia dump. It should > respond well to para

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Stuart Halloway
To be clear, I don't object to the approach, only to naming it "fold" and/or tying it to interfaces related to folding. Stu On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher wrote: > On 28 Sep 2013, at 22:00, Alex Miller wrote: > > Reducers (and fork/join in general) are best suited for fine-grai

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 22:00, Alex Miller wrote: > Reducers (and fork/join in general) are best suited for fine-grained > computational parallelism on in-memory data. The problem in question involves > processing more data than will fit in memory. > > So the question is then what is the best way t

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Thanks Alex - I've made both of these changes. The shutdown-agents did get rid of the pause at the end of the pmap solution, and the -server argument made a very slight across-the-board performance improvement. But neither of them fundamentally changes the basic result (that the implementation th

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Can't your last possible solution rather be implemented on top of f/j pool? Is it possible to beat f/j pool performance with ad-hoc thread-pool in situations where there are thousands of tasks? JW On Sat, Sep 28, 2013 at 11:00 PM, Alex Miller wrote: > Reducers (and fork/join in general) are be

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
I am hoping that this will be fixed for 1.6 but no one is actually "working" on it afaik. If someone wants to take it on, I would GREATLY appreciate a patch on this ticket (must be a contributor of course). On Saturday, September 28, 2013 11:24:18 AM UTC-5, Paul Butcher wrote: > > On 28 Sep 2013

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are man

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
For your timings, I would also strongly recommend altering your project.clj to force the -server hotspot: :jvm-opts ^:replace ["-Xmx1g" "-server" ... and whatever else you want here ... ] By default lein will use tiered compilation to optimize repl startup, which is not what you want for ti
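
A minimal sketch of what such a project.clj might look like. Only the :jvm-opts line comes from the message above; the project name, version, dependency version and :main namespace are assumptions added for illustration, and the heap size should be adjusted to the machine doing the timings.

```clojure
;; Hypothetical project.clj; everything except :jvm-opts is an
;; illustrative assumption, not taken from the thread.
(defproject wordcount "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  :main wordcount.core
  ;; ^:replace discards Leiningen's default JVM options instead of
  ;; appending to them, so the tiered-compilation defaults are dropped.
  :jvm-opts ^:replace ["-Xmx1g" "-server"])
```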

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 19:51, Jozef Wagner wrote: > Anyway, I think the bottleneck in your code is at > https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 > Instead of creating a new persistent map for each word, you should use a > transient here. I would love
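
To illustrate the transient suggestion quoted above, here is a minimal sketch of counting words into a transient map. It is not the code at the linked line of core.clj; the function name count-words is an assumption made for this example.

```clojure
;; Hypothetical helper, not the code from the linked repository.
;; A transient map is mutated in place during the reduction, which avoids
;; allocating a new persistent map for every word, and is turned back
;; into a persistent map once at the end.
(defn count-words [words]
  (persistent!
    (reduce (fn [counts word]
              (assoc! counts word (inc (get counts word 0))))
            (transient {})
            words)))

(count-words ["the" "cat" "the"])
```

clojure.core/frequencies is implemented along the same lines, so it is a reasonable drop-in where the input is already a sequence of words.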

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
If a Clojure ticket is triaged, it means that one of the Clojure screeners believes the ticket's description describes a real issue with Clojure that ought to be changed in some way, and would like Rich Hickey to look at it and see whether he agrees. If he does, it becomes vetted. A diagram of the

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Or even better, use Guava's Multiset there... On Saturday, September 28, 2013 8:51:56 PM UTC+2, Jozef Wagner wrote: > > Well it should be possible to implement a foldseq variant which takes a > reducible collection as an input. This would speed things up, as you don't > create so much garbage with

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Well it should be possible to implement a foldseq variant which takes a reducible collection as an input. This would speed things up, as you don't create so much garbage with reducers. An XML parser which produces a reducible collection will be a bit harder :). Anyway, I think the bottleneck in your c

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:42, Jozef Wagner wrote: > I mean that you should forget about lazy sequences and sequences in general > if you want cutting-edge performance with reducers. An example of a > reducible slurp, https://gist.github.com/wagjo/6743885, does not hold onto > the head. OK

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I mean that you should forget about lazy sequences and sequences in general if you want cutting-edge performance with reducers. An example of a reducible slurp, https://gist.github.com/wagjo/6743885, does not hold onto the head. JW On Sat, Sep 28, 2013 at 6:24 PM, Paul Butcher wrote: >
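
As a rough illustration of a reducible input source, the sketch below wraps a line reader in clojure.core.protocols/CollReduce. It is not the code from the linked gist; the name reducible-lines is an assumption for this example. Because each line is handed straight to the reducing function and no sequence is ever built, there is no head to retain.

```clojure
(require '[clojure.core.protocols :as p]
         '[clojure.java.io :as io])

;; Hypothetical helper, not the gist's implementation.
;; Returns an object that can only be reduced: the source is opened when
;; the reduction starts and closed when it finishes.
(defn reducible-lines [source]
  (reify p/CollReduce
    (coll-reduce [this f]
      (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [^java.io.BufferedReader rdr (io/reader source)]
        (loop [acc init]
          (if (reduced? acc)
            @acc                          ; honour early termination
            (if-let [line (.readLine rdr)]
              (recur (f acc line))
              acc)))))))

;; Total number of characters across all lines, excluding line breaks.
(reduce (fn [n line] (+ n (count line)))
        0
        (reducible-lines (java.io.StringReader. "a\nbb\nccc")))
```

A collection like this is reducible but not foldable, so on its own it gives sequential reduction only; parallelising it still needs some form of chunking on top.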

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:14, Jozef Wagner wrote: > I would go a bit further and suggest that you do not use sequences at > all and work only with reducible/foldable collections. Make an input reader > which returns a foldable collection and you will have the most performant > solution. The t

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Ah - one mystery down. Thanks Andy! On 28 Sep 20

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding onto the head is being worked on righ

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light. See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should go
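
A small sketch of the shutdown-agents point, assuming a pmap-based word count roughly in the spirit of the example under discussion. The function names and the sample pages are made up for illustration. pmap runs its work on futures, whose thread pool keeps idle non-daemon threads alive for a while, which is what delays JVM exit.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers, not the code from the thread.
(defn count-words-in-page [page]
  (frequencies (re-seq #"\w+" (str/lower-case page))))

(defn count-words-parallel [pages]
  ;; Count each page in parallel, then merge the per-page maps.
  (reduce (partial merge-with +) {} (pmap count-words-in-page pages)))

(defn -main [& _]
  (println (count-words-parallel ["one fish two fish" "red fish blue fish"]))
  ;; Without this call the JVM waits for the idle pool threads to time out.
  (shutdown-agents))
```

shutdown-agents should only be called once all work is finished, since no further futures or agent actions can be submitted afterwards.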

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 01:22, Rich Morin wrote: >> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher wrote: >> I'm currently working on a book on concurrent/parallel development for The >> Pragmatic Programmers. ... > > Ordered; PDF just arrived (:-). Cool - very interested to hear your feedback onc

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 00:27, Stuart Halloway wrote: > I have posted an example that shows partition-then-fold at > https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. > > I would be curious to know how this approach performs with your data. W

Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Rich Morin
> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher wrote: > I'm currently working on a book on concurrent/parallel development for The > Pragmatic Programmers. ... Ordered; PDF just arrived (:-). I don't know yet whether the book has anything like this, but I'd like to see a table that shows whi

Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Stuart Halloway
Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold
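
For readers without the linked file to hand, the general shape of a partition-then-fold approach can be sketched as follows. This is an assumption-laden illustration, not the code in reducing_apple_pie.clj; the function names and the chunk-size parameter are invented here. The lazy input is cut into chunks, each chunk is realised as a vector (which is foldable), folded in parallel, and the per-chunk results are merged.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical helpers, not the code from the linked example.
(defn count-words-chunk [words]
  ;; r/monoid supplies the empty map as the identity for each sub-fold.
  (r/fold (r/monoid (partial merge-with +) hash-map)
          (fn [counts w] (assoc counts w (inc (get counts w 0))))
          words))

(defn partition-then-fold [chunk-size words]
  (->> (partition-all chunk-size words)
       ;; Only one chunk at a time is held as a vector.
       (map #(count-words-chunk (vec %)))
       (reduce (partial merge-with +) {})))

(partition-then-fold 2 ["a" "b" "a" "b" "a"])
```

The chunk size trades memory for parallelism: larger chunks give fold more to split across threads, but each chunk must fit in memory.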

Parallelising over a lazy sequence - request for help

2013-05-25 Thread Paul Butcher
I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here. The example coun