Re: Question about pmap

Andy Fingerhut Thu, 06 Aug 2009 11:05:05 -0700

On Aug 6, 10:00 am, Bradbev <brad.beveri...@gmail.com> wrote:
> On Aug 6, 3:07 am, Andy Fingerhut <andy_finger...@alum.wustl.edu>
> wrote:
>
>
>
> > On Aug 5, 6:09 am, Rich Hickey <richhic...@gmail.com> wrote:
>
> > > On Wed, Aug 5, 2009 at 8:29 AM, Johann Kraus<johann.kr...@gmail.com> 
> > > wrote:
>
> > > >> Could it be that your CPU has a single floating-point unit shared by 4
> > > >> cores on a single die, and thus only 2 floating-point units total for
> > > >> all 8 of your cores?  If so, then that fact, plus the fact that each
> > > >> core has its own separate ALU for integer operations, would seem to
> > > >> explain the results you are seeing.
>
> > > > Exactly, this would explain the behaviour. But unfortunately it is not
> > > > the case. I implemented a small example using Java (Java Threads) and
> > > > C (PThreads) and both times I get a linear speedup. See the attached
> > > > code below. The cores only share 12 MB cache, but this should be
> > > > enough memory for my micro-benchmark. Seeing the linear speedup in
> > > > Java and C, I would negate a hardware limitation.
>
> > > > _
> > > > Johann
>
> > > I looked briefly at your problem and don't see anything right off the
> > > bat. Do you have a profiler and could you try that out? I'm
> > > interested.
> > > Rich
>
> > I ran these tests on my iMac with 2.16 GHz Intel Core 2 Duo (2 cores)
> > using latest Clojure and clojure-contrib from git as of some time on
> > Aug 4, 2009.  The Java implementation is from Apple, version 1.6.0_13.
>
> > ----------------------------------------------------------------------
> > For int, there are 64 "jobs" run, each of which consists of doing
> > (inc 0) 1,000,000,000 times.  See pmap-batch.sh and pmap-testing.clj
> > for details.
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
>
> > Yes, yes, I know.  I should really use a library for command line
> > argument parsing to avoid so much repetitive code.  I may do that some
> > day.
>
> > Results for int 1 thread - jobs run sequentially
>
> > "Elapsed time: 267547.789 msecs"
> > real       269.22
> > user       268.61
> > sys          1.79
>
> > int 2 threads - jobs run in 2 threads using modified-pmap, which
> > limits the number of futures causing threads to run jobs to be at most
> > 2 at a time.
>
> > "Elapsed time: 177428.626 msecs"
> > real       179.14
> > user       330.30
> > sys         15.46
>
> > Comment: Elapsed time with 2 threads is about 2/3 of elapsed time with
> > 1 thread.  Not as good as the 1/2 as we'd like with a 2 core machine,
> > but better than not being faster at all.
>
> > ----------------------------------------------------------------------
> > For double, there are 16 "jobs" run, each of which consists of doing
> > (inc 0.1) 1,000,000,000 times.
>
> > double 1 thread
>
> > "Elapsed time: 258659.424 msecs"
> > real       263.28
> > user       247.29
> > sys         12.17
>
> > double 2 threads
>
> > "Elapsed time: 229382.68 msecs"
> > Dumping CPU usage by sampling running threads ... done.
> > real       231.05
> > user       380.79
> > sys         11.49
>
> > Comment: Elapsed time with 2 threads is about 7/8 of elapsed time with
> > 1 thread.  Hardly any improvement at all for something that should be
> > "embarrassingly parallel", and the user time reported by Mac OS X's
> > /usr/bin/time increased by a factor of about 1.5.  That seems like way
> > too much overhead for thread coordination.
>
> > Here are hprof output files for the "double 1 thread" and "double 2
> > threads" tests:
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> >http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> > In both cases, over 98% of the time is spent in
> > java.lang.Double.valueOf(double d).  See the files for the full stack
> > backtraces if you are curious.
>
> > I don't see any reason why that method should have any kind of
> > contention or worse performance when running on 2 cores vs. 1 core,
> > but I don't know the guts of how it is implemented.  At least in
> > OpenJDK all it does is "return new Double(d)", where d is the double
> > arg to valueOf().  Is there any reason why "new" might exhibit
> > contention between parallel threads?
>
> Can you run your benchmarks with the number of concurrent threads
> being equal to the number of cores that you have?  The increase in
> system time is interesting to me - is it possible that the JVM or OS
> can detect threads that don't use floating point registers & therefore
> doesn't bother to save them when doing a thread context switch?  If
> so, that is a significant amount of memory that doesn't need to be
> touched during context switch.
>
> Brad

Johann, who started this thread, has an 8-core machine, but I don't;
mine has 2 cores.  All of my tests used either 1 thread (plain map) or
2 threads (my modified-pmap).  I've tested modified-pmap and confirmed
that it evaluates at most 2 future calls at a time, so at most 2
threads run jobs at once, which is the number of cores I have.
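
For anyone who wants to try the same shape of test outside Clojure,
here is a rough Java sketch of "at most N jobs at a time".  It is not
my modified-pmap and not the code in pmap-testing.clj.  The names
BoundedJobs, job and runJobs are made up for illustration, the
iteration count is far smaller than the 1,000,000,000 used above, and
the JIT may optimize the boxing differently than it does for Clojure
code, so don't expect the timings to match mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; names and sizes are assumptions.
final class BoundedJobs {

    // One "job": box the result of 0.1 + 1.0 `iterations` times through
    // Double.valueOf, the method that dominated the hprof samples.
    static Double job(long iterations) {
        double x = 0.1;
        Double boxed = Double.valueOf(x);
        for (long i = 0; i < iterations; i++) {
            boxed = Double.valueOf(x + 1.0);
        }
        return boxed;
    }

    // Run `jobs` jobs on a pool of exactly `threads` worker threads, so at
    // most `threads` jobs execute at once. Results come back in job order.
    static List<Double> runJobs(int jobs, int threads, long iterations)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int j = 0; j < jobs; j++) {
                Callable<Double> task = () -> job(iterations);
                futures.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get());  // blocks until that job has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long iterations = 10_000_000L;  // assumption: small enough to run quickly
        for (int threads = 1; threads <= 2; threads++) {
            long start = System.nanoTime();
            List<Double> results = runJobs(16, threads, iterations);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + results.size()
                    + " jobs in " + ms + " ms");
        }
    }
}
```

Running main does the same 16 jobs first on 1 worker thread and then
on 2, so the two elapsed times can be compared the way I compared map
against modified-pmap.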

Andy
