By the way, all my work on BiocParallel is going to end up here:
https://github.com/DarwinAwardWinner/BiocParallel
If you want to read through the multicore-only pvectorize, it is here:
https://github.com/DarwinAwardWinner/BiocParallel/blob/a3699cf/R/pvectorize.R
It's a little more than one line of code now. A lot of the code deals
with proper recycling in the case of multiple vectorized args and
merging the signature of the function with that of pvec, as well as
corner cases like not vectorizing anything and being passed length-1
vectors.
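The signature handling builds on R's formals() idiom; here is the
general technique in isolation (this is not the actual code at the
link above):

f <- function(x, y, scale = 1) (x + y) * scale

# Build a wrapper, then copy f's formals onto it, so the parallelized
# function presents exactly the same arguments as f.
pf <- function() {
    cl <- match.call()
    cl[[1]] <- quote(f) # the real code dispatches to a parallel driver
    eval(cl, parent.frame())
}
formals(pf) <- formals(f)

identical(formals(pf), formals(f)) # TRUE
pf(1:3, 4:6, scale = 2) # same result as f(1:3, 4:6, scale = 2)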
There's also an mcmapply function that is to mapply as mclapply is to
lapply. I plan to implement a param-generic version called bpmapply,
which may become the backend for bpvectorize.
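To make the analogy concrete (mclapply and mcmapply ship in R's
parallel package):

library(parallel)

lapply(1:4, sqrt) # serial, one vectorized argument
mclapply(1:4, sqrt) # forked-parallel analogue

mapply(function(a, b) a + b, 1:4, 4:1) # serial, multiple vectorized args
mcmapply(function(a, b) a + b, 1:4, 4:1) # forked-parallel analogue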
On Tue 04 Dec 2012 01:15:24 PM PST, Michael Lawrence wrote:
On Tue, Dec 4, 2012 at 12:47 PM, Ryan C. Thompson
<r...@thompsonclan.org> wrote:
One issue that I see is that for some kinds of parallel backends,
there may not be any way for "bpworkers" to return something
meaningful. For example, a backend that submits jobs to a large
cluster may not know exactly how many nodes are in the cluster,
and in any case returning the total number of nodes may not be
appropriate, since those nodes are shared with other cluster
users. This is primarily important for the pvec function, which
uses the result of bpworkers to decide how many chunks to split
the input into.
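Concretely, the splitting step looks roughly like this (splitIndices
is from the parallel package; the actual internals of pvec may
differ):

library(parallel) # for splitIndices()
library(BiocParallel)

param <- MulticoreParam(workers = 4)
x <- 1:1e6

# One chunk per worker: if bpworkers(param) cannot report a meaningful
# count (e.g. on a shared batch cluster), this step has nothing to go on.
idx <- splitIndices(length(x), bpworkers(param))
chunks <- lapply(idx, function(i) x[i])
length(chunks) # 4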
I guess one solution is to make sure that for any backend that
cannot natively determine a number of available workers, we
require the number of workers as an argument when creating the
param object for that backend. e.g.:
param <- IndeterminateSizedClusterParam(workers = 50)
I think this is on the right track. Since the nature of the request
affects how the jobs are scheduled (earlier or later), there's no way
to automatically make the decision, even if we could detect the total
cluster size. As I noted in the previous email, having a consistent
means of specifying resource requests across backends would be helpful.
I could see an API like:
request <- ResourceRequest(num.cores = 5)
cluster <- LSFCluster(request) # or MulticoreCluster(request)
pvec(v, cluster = cluster)
Depending on the cluster, the 'cluster' object could be queried for
whether the requested resources are currently available (or the jobs
will need to wait). A default cluster object could be registered in
options(). The Cluster constructors could take the arguments of
ResourceRequest directly for simple tasks.
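For instance (all of the names below are hypothetical, just to sketch
the idea):

request <- ResourceRequest(num.cores = 5)
cluster <- LSFCluster(request)
isAvailable(cluster) # hypothetical query: can the request be met right now?

# Register a default so pvec() needs no cluster argument, and let the
# constructors take ResourceRequest's arguments directly:
options(BiocParallel.cluster = MulticoreCluster(num.cores = 4))
pvec(v, sqrt)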
Then the question is whether pvec returns the result of evaluation, or
the promise of evaluation. Probably best to have pvec always behave
synchronously, then have variants like apvec() for asynchronous
execution. The promise would be backend-specific and support status
queries. For multicore, this is basically mcparallel/mccollect.
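For reference, with the real parallel-package primitives that looks
like:

library(parallel)

job <- mcparallel(sum(sqrt(1:1e6))) # returns a job handle immediately
# ... do other work in the parent; mccollect(job, wait = FALSE) polls
# without blocking (fork-based, so POSIX only) ...
res <- mccollect(job)[[1]] # block until the forked child finishes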
Additionally, as discussed previously, it makes sense to be able
to explicitly choose a chunk size or number of chunks for pvec,
rather than splitting into exactly as many chunks as there are
parallel workers. I implemented this in the non-generic
multicore-only version of pvec, but I still need to port it to the
generic version that works for any param. Do people think that the
chunk options should be included in the MulticoreParam class, or
specified when pvec is called?
What about supporting both? If a value is passed directly to pvec, it
overrides the one stored in the param.
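Something like this, say (the argument names are made up):

# An explicit argument to pvec() wins over whatever the param stores.
resolveChunks <- function(call.chunks = NULL, param.chunks = NULL,
                          workers = 4) {
    if (!is.null(call.chunks)) return(call.chunks) # passed to pvec()
    if (!is.null(param.chunks)) return(param.chunks) # stored in the param
    workers # default: one chunk per worker
}

resolveChunks() # 4, from the worker count
resolveChunks(param.chunks = 20) # 20, from the param
resolveChunks(call.chunks = 100, param.chunks = 20) # 100, the call wins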
I have also written a non-generic multicore-only version of
pvectorize that allows for multiple vectorized arguments instead
of just one, and furthermore gives the parallelized function an
identical signature to the original function. Again, this needs to
be ported to the generic bpvectorize.
Awesome.
Michael
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel