On Thu, 07 Feb 2013 08:32:46 -0800, Phil Steitz wrote:
On 2/7/13 8:04 AM, Gilles wrote:
On Thu, 07 Feb 2013 07:01:42 -0800, Phil Steitz wrote:
On 2/7/13 4:58 AM, Gilles wrote:
On Wed, 06 Feb 2013 09:46:55 -0800, Phil Steitz wrote:
On 2/6/13 9:03 AM, Gilles wrote:
On Wed, 06 Feb 2013 07:19:47 -0800, Phil Steitz wrote:
On 2/5/13 6:08 AM, Gilles wrote:
Hi.
In the thread about "static import", Stephen noted that decisions on a
component's evolution are dependent on whether the future of the Java
language is taken into account, or not.
A question on the same theme also arose after the presentation of
Commons Math at FOSDEM 2013.
If we assume that efficiency is among the important qualities for
Commons Math, the future is to allow usage of the tools provided by the
standard Java library in order to ease the development of multi-threaded
algorithms.
Maintaining Java 1.5 source compatibility for the reason that we may
need to support legacy applications will turn out to be self-defeating:
1. New users will not consider Commons Math's features that are notably
apt to parallel processing.
2. Current users might at some point simply switch to another library if
it proves more efficient (because it actually uses multi-threading).
3. New Java developers will be turned away because they will want to use
the more convenient features of the language in order to provide
potential contributions.
If maintaining 1.5 source compatibility is kept as a requirement, the
consequence is that Commons Math will _become_ a legacy library.
In that perspective, implementing/improving algorithms for which a
parallel version is known to be more efficient is plainly a waste of
development and maintenance time.
In order to mitigate the risks (both of upgrading and of not upgrading
the source compatibility requirement), I would propose to create a new
project (say, "Commons Math MT") where we could implement new
features[1] without being encumbered with the 1.5 requirement.[2]
The "Commons Math MT" would depend on "Commons Math" where we would
continue developing single-thread (and thread-safe) "tasks", i.e.
independent units of processing that could be used in algorithms
located in "Commons Math MT".
In summary:
- Commons Math (as usual):
* single-thread (sequential) algorithms,
* (pure) Java 5,
* no dependencies.
- Commons Math MT:
* multi-thread (parallel) algorithms,
* Java 7 and beyond,
* JNI allowed,
* dependencies allowed (jCuda).
What do you think?
There are several other possibilities to consider:
0) Implement multithreading using JDK 1.5 primitives
1) Set things up within [math] to support parallel execution in JDK 1.7,
Hadoop or other frameworks
2) Instead of a new project, start a 4.x branch targeting JDK 1.7
I think we should maintain a version that has no dependencies and no
JNI in any case.
Starting a branch and getting concrete about how to parallelize some
algorithms would be a good way to start. One thing I have not really
investigated and would be interested in details on is what you actually
get in efficiency gain (or loss?) using fork / join vs just using 1.5+
concurrency for the kinds of problems we would end up using this stuff
for.
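To make the fork / join vs 1.5+ comparison concrete, here is a minimal sketch that computes the same sum of squares three ways: sequentially, with the Java 7 fork/join framework, and with a Java 5 fixed thread pool. The class `SumOfSquares`, its method names and the `THRESHOLD` cut-off are all hypothetical and chosen for illustration; nothing here is Commons Math API, and the sketch makes no claim about which variant is faster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.concurrent.RecursiveTask;

/** Hypothetical comparison harness; not part of Commons Math. */
class SumOfSquares {
    /** Java 7 style: recursive splitting with fork/join. */
    static class ForkJoinSum extends RecursiveTask<Double> {
        private static final int THRESHOLD = 10000; // arbitrary cut-off (assumption)
        private final double[] data;
        private final int from;
        private final int to;

        ForkJoinSum(double[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Double compute() {
            if (to - from <= THRESHOLD) {
                return sequential(data, from, to); // small enough: no more splitting
            }
            int mid = (from + to) >>> 1;
            ForkJoinSum left = new ForkJoinSum(data, from, mid);
            ForkJoinSum right = new ForkJoinSum(data, mid, to);
            left.fork();                      // schedule the left half asynchronously
            double rightSum = right.compute(); // compute the right half in this thread
            return left.join() + rightSum;
        }
    }

    /** The single-threaded unit of work shared by all variants. */
    static double sequential(double[] data, int from, int to) {
        double sum = 0;
        for (int i = from; i < to; i++) {
            sum += data[i] * data[i];
        }
        return sum;
    }

    static double withForkJoin(double[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new ForkJoinSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    /** Java 5 style: fixed-size chunks submitted to a thread pool. */
    static double withExecutor(final double[] data, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Double>> parts = new ArrayList<Future<Double>>();
            int step = (data.length + chunks - 1) / chunks;
            for (int start = 0; start < data.length; start += step) {
                final int from = start;
                final int to = Math.min(data.length, start + step);
                parts.add(pool.submit(new Callable<Double>() {
                    public Double call() {
                        return sequential(data, from, to);
                    }
                }));
            }
            double total = 0;
            for (Future<Double> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Timing `withForkJoin` against `withExecutor` on representative [math] workloads would be one way to get the numbers asked for above. For a workload this regular, the main visible difference is in how the work is split (recursively vs. in fixed chunks), not in the amount of code.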
Thinking about specific parallelization problem instances would also
help decide whether 1) makes sense (i.e., whether it makes sense as you
mention above to maintain a single-threaded library that provides task
execution for a multithreaded version or multithreaded frameworks).
One more thing to consider is that for at least some users of [math],
having the library internally spawn threads and/or peg multiple
processors may not be desirable. It is a little misleading to say that
multithreading is the way to get "efficiency." It is really the way to
*use* more compute resources and unless there are real algorithmic
improvements, the overall efficiency may actually be less, due to task
coordination overhead. What you get is faster execution due to more
greedy utilization of available cores. Actual efficiency (how much
overall compute resource it takes to complete a job) partly depends on
how efficiently the coordination itself is done (which JDK 1.7 claims to
do very well - I have just not seen substantiation or any benchmarks
demonstrating this) and how the parallelization affects overall compute
requirements. In any case, for environments where library
thread-spawning is not desirable, I think we should maintain a
single-threaded version.
Unless I missed the point, those reasons are exactly why I propose to
have 2 projects/components. One, "Commons-Math", does not fiddle with
resources, while the other would provide a "parallelizationLevel"
setting for the algorithms written to possibly take advantage of the
Java 5+ "task framework".
OK, what about the 4.x option?
Yes, we could still be good by using only Java 5's concurrency features
but the issue I raise is not only about concurrency but about
evolution/progress/maintenance, all things that require raising
interest from new contributors (unless it's fine that Commons Math be
tagged as a "library of the past"...).
+1 for experimenting with parallelization. I would just like to
understand if the JDK 7 stuff really adds much - in particular, does it
handle coordination / cpu allocation better than you could easily do it
with 1.5? More supported JDKs == more potential users, so I would like
to see a real reason to bump the JDK level.
But using concurrency features in "Commons Math" would also contradict
your own point ("we should maintain a single-threaded version"): I
agree, and that's why I proposed this other project...
As for efficiency (or faster execution, if you want), I don't see the
point in doubting that tasks like global search (e.g. in a genetic
algorithm) will complete in less time when run in parallel...
As I summarized previously, having a "Commons Math MT" would bring no
inconvenience, contrary to either your points 0, 1, or 2. [No
inconvenience to me, that is, but to people with requirements like
"Java 5 compatible" or "no multi-threading".]
As I indicated, the basic "task" could be defined in "Commons Math" and
"Commons Math MT" would provide the parallelization "glue" (e.g. to
divide the search space of the GA).
I think it is best at this point to cut a branch and actually start
working on specific algorithms. Having a set of candidate algorithms
for parallelization will help us decide what we actually need and how
it might work. I would personally favor the 4.x approach, with
thread-spawning behavior configurable.
It seems fair to wait until parallel algorithms are actually
implemented.
However it is not clear what you mean with "the 4.x approach": if it is
actually allowing Java 7, that would mean that, starting from 4.0,
we'll indeed drop support of earlier JVMs!
Why would this be preferred to having 2 projects? Of course, if
everyone agrees to that move to Java 7, that's fine. :-)
What I meant was that instead of creating a new component, we would
just create a new release line. Like what Tomcat does for servlet
spec versions. I guess this does mean that we end up having to
stabilize the 3.x APIs because no additional "major" release would
be allowed in that line. That would be a *good thing* IMO as long
as we can do it cleanly. If not, maybe we end up having to use 5.x
for the JDK 1.7+ version, using 4.0 to get to a stable API for the
current trunk code.
There's still the human resource problem: we don't have the people to
maintain even a single branch; having two will only make it worse.
Yes, but the "new project" approach has the same problem.
Yes.
However, I meant it as a way to separate concerns, as shown
by diverging opinions, even in the few people who take part
in this discussion or in previous ones about the same subject.
A sibling (not separate!) project could allow interested
people to experiment while not adding yet another "distraction"
to the main project, where people more focused on the
mathematical (for lack of a better word) side can continue
their own improvements.
A healthy interaction could even come out of having a "public"
use-case in the form of a project that needs certain facilities
(algorithms as tasks) in order to provide multi-thread
utilities to users (who might prefer not to have to implement
them themselves at a higher level).
On the other hand, if we keep Java 5, at least until we get use cases
or contributions that would benefit from features in JDKs newer than
1.5, there is no need to create a branch; we can just go on with adding
multi-threaded code to the trunk (to become part[1] of the upcoming 3.x
releases).
That is why I wanted to get a feel for what the JDK 1.7 stuff really
buys you. Has anyone seen benchmarks showing better performance using
1.7 than can be obtained just using 1.5 concurrency primitives?
Again, there are separate issues:
1. Coding in Java 7
2. Running with the JVM shipped with JDK 1.7
The newer JVMs are faster, independently of whether new features of the
language are used.
But it could well be that some of the new features allow even better
performance (as is foreseen for Java 8).
Agreed. I am interested in understanding better both how much
easier it actually is to code and whether the 1.7 framework
materially improves scheduling / allocation over what you could do
just using 1.5 primitives.
I cannot provide proof, but nor is anyone on this list
eager to prove the contrary; hence the proposal to set
up a "playground".
Has anyone used 1.7 to parallelize numerical algorithms
and found it really easier / more performant?
Where are those people who could answer?
This is a public list :)
That is one of the points I raised. If we maintain source compatibility
with a language version that is 9 years old, not many contributors are
going to be interested, which reduces the chance to get answers...
Any opinions / responses to Konstantin's comment on where
parallelization should be implemented - i.e. in the library vs
somewhere up the stack?
What was the _question_? ...
The question he implicitly raised was whether or not it makes sense
for a low-level library to parallelize tasks / run across cores.
In several areas, CM is not a low-level library (GA, multi-start
optimizers for example). In other areas like FFT, a user can
legitimately expect top performance without having to handle
parallelization by himself.
This is a legitimate question. It may be better actually to set
things up so that higher-level frameworks or applications can
arrange parallel execution rather than embedding it in the low-level
library itself. This is also what I was referring to when I said
that in some contexts, thread-spawning / cpu hogging may not be
desirable.
For several cases (GA, FFT, multi-start optimizers), I have the
opposite viewpoint: multi-threading is an implementation detail,
that could be handled at a _lower_ level. Of course, the user can
decide whether to enable more than one thread.
Any ideas how to set things up so that [math] code can play nicely with
concurrency frameworks?
That's a strange question in the context of a project that tries hard
not to have any dependency.
I did not mean necessarily to bring in dependencies; but rather to
make it easy for computational tasks executed by [math] code to be
managed by external concurrency frameworks, e.g. Hadoop.
In the context of Commons Math, we often heard that "no dependency"
is good. Then, it is also good to not impose _implicit_ dependencies
(like: "If you use Hadoop, you could have better performance"). In a
way, the CM development "model" is: "We provide a toolkit of efficient
procedures, and you, the user, get top performance (on a best effort
basis of course)."
If we can provide better performance through multi-threading, why not?
Nobody will be forced to use it: they will use the "basic" (sequential)
tasks, or set the "parallelizationLevel" setting to 1.
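One way such a setting could behave is sketched below, under stated assumptions: the class `MultiStartRunner`, its `parallelizationLevel` constructor argument and the `gridSearch` task are hypothetical names invented for this illustration, not existing Commons Math API. With a level of 1 the tasks run in the caller's thread and no thread is created; with a higher level the same tasks are handed to a Java 5 thread pool.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical multi-start runner; names are illustrative only. */
class MultiStartRunner {
    private final int parallelizationLevel;

    MultiStartRunner(int parallelizationLevel) {
        if (parallelizationLevel < 1) {
            throw new IllegalArgumentException("parallelizationLevel must be >= 1");
        }
        this.parallelizationLevel = parallelizationLevel;
    }

    /** Runs every task and returns the smallest value found. */
    double minimum(List<Callable<Double>> tasks) {
        try {
            if (parallelizationLevel == 1) {
                // Sequential mode: everything runs in the caller's thread.
                double best = Double.POSITIVE_INFINITY;
                for (Callable<Double> task : tasks) {
                    best = Math.min(best, task.call());
                }
                return best;
            }
            ExecutorService pool = Executors.newFixedThreadPool(parallelizationLevel);
            try {
                double best = Double.POSITIVE_INFINITY;
                // invokeAll blocks until all tasks have completed.
                for (Future<Double> result : pool.invokeAll(tasks)) {
                    best = Math.min(best, result.get());
                }
                return best;
            } finally {
                pool.shutdown();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** A "basic" sequential task: crude grid search of (x - 2)^2 on [lo, hi]. */
    static Callable<Double> gridSearch(final double lo, final double hi) {
        return new Callable<Double>() {
            public Double call() {
                double best = Double.POSITIVE_INFINITY;
                for (int i = 0; i <= 1000; i++) {
                    double x = lo + (hi - lo) * i / 1000.0;
                    best = Math.min(best, (x - 2) * (x - 2));
                }
                return best;
            }
        };
    }
}
```

A caller would build a list such as `gridSearch(-10, 0)` and `gridSearch(0, 10)` and pass it to `new MultiStartRunner(1).minimum(tasks)` or `new MultiStartRunner(4).minimum(tasks)`. The tasks themselves are identical in both cases, which is the separation between sequential "tasks" and parallelization "glue" discussed in this thread.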
Gilles
Phil
If the requirement is to only depend on the standard JDK: the framework
is in java.util.concurrent and all we need to do is to define "tasks"
that can be submitted to an executor:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/AbstractExecutorService.html#submit(java.util.concurrent.Callable)
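As a minimal sketch of that idea, assuming nothing beyond the standard JDK: a task is a class implementing `Callable`, and `ExecutorService.submit` returns a `Future` holding its result. The class `MeanTask` and the helper `runOnExecutor` are hypothetical names made up for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical task: arithmetic mean of an array (illustrative only). */
class MeanTask implements Callable<Double> {
    private final double[] values;

    MeanTask(double[] values) {
        this.values = values.clone(); // defensive copy keeps the task thread-safe
    }

    /** The sequential unit of work; usable with or without an executor. */
    public Double call() {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Submits the task to a single-thread executor and waits for the result. */
    static double runOnExecutor(double[] values) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Double> result = executor.submit(new MeanTask(values));
            return result.get(); // blocks until the task has finished
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

Because the task only implements `Callable`, the choice of executor (and therefore of threading policy) stays with the caller; calling `new MeanTask(values).call()` directly runs it with no executor at all.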
Regards,
Gilles