On Tue, Dec 4, 2012 at 12:47 PM, Ryan C. Thompson <r...@thompsonclan.org> wrote:
> One issue that I see is that for some kinds of parallel backends, there
> may not be any way for "bpworkers" to return something meaningful. For
> example, a backend that submits jobs to a large cluster may not know
> exactly how many nodes are in the cluster, and in any case returning
> the total number of nodes may not be appropriate, since those nodes are
> shared with other cluster users. This is primarily important for the
> pvec function, which uses the result of bpworkers to decide how many
> chunks to split the input into.
>
> I guess one solution is to make sure that for any backend that cannot
> natively determine a number of available workers, we require the number
> of workers as an argument when creating the param object for that
> backend, e.g.:
>
>   param <- IndeterminateSizedClusterParam(workers=50)

I think this is on the right track. Since the nature of the request
affects how the jobs are scheduled (earlier or later), there is no way to
make the decision automatically, even if we could detect the total
cluster size. As I noted in the previous email, having a consistent means
of specifying resource requests across backends would be helpful. I could
see an API like:

  request <- ResourceRequest(num.cores = 5)
  cluster <- LSFCluster(request)  # or MulticoreCluster(request)
  pvec(v, cluster = cluster)

Depending on the cluster, the 'cluster' object could be queried for
whether the requested resources are currently available (or whether the
jobs will need to wait). A default cluster object could be registered in
options(), and the Cluster constructors could take the arguments of
ResourceRequest directly for simple cases. (A rough sketch of the
options() fallback is appended at the end of this message.)

Then the question is whether pvec returns the result of evaluation or a
promise of evaluation. Probably best to have pvec always behave
synchronously, and then have variants like apvec() for asynchronous
execution. The promise would be backend-specific and would support status
queries. For multicore, this is basically mcparallel/mccollect (also
sketched below).

> Additionally, as discussed previously, it makes sense to be able to
> explicitly choose a chunk size or number of chunks for pvec, rather
> than splitting into exactly as many chunks as there are parallel
> workers. I implemented this in the non-generic multicore-only version
> of pvec, but I still need to port it to the generic version that works
> for any param. Do people think that the chunk options should be
> included in the MulticoreParam class, or specified when pvec is called?

What about supporting both? If the chunk options are passed directly to
pvec, they would override the ones stored in the param (see the
chunk-resolution sketch below).

> I have also written a non-generic multicore-only version of pvectorize
> that allows for multiple vectorized arguments instead of just one, and
> furthermore gives the parallelized function an identical signature to
> the original function. Again, this needs to be ported to the generic
> bpvectorize.

Awesome. (A rough sketch of one way a multi-argument pvectorize might
look is also appended at the end of this message.)

Michael
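
P.S. A few rough sketches of the ideas above, in case they help the
discussion. Everything here uses hypothetical names ("ResourceRequest",
"MulticoreCluster", the "BiocCluster" option key, and so on); these are
meant to illustrate the shape of the API, not to be working BiocParallel
code.

First, the options() fallback for a default cluster. The constructors
could accept either a ResourceRequest or its arguments directly:

  ResourceRequest <- function(num.cores = 1L) {
      structure(list(num.cores = as.integer(num.cores)),
                class = "ResourceRequest")
  }

  MulticoreCluster <- function(request = ResourceRequest(), ...) {
      ## allow MulticoreCluster(5) as shorthand for a simple request
      if (is.numeric(request))
          request <- ResourceRequest(num.cores = request)
      structure(list(request = request),
                class = c("MulticoreCluster", "Cluster"))
  }

  ## pvec() and friends would fall back to the registered default
  ## when no cluster is supplied
  defaultCluster <- function() {
      cluster <- getOption("BiocCluster")
      if (is.null(cluster))
          cluster <- MulticoreCluster(parallel::detectCores())
      cluster
  }

  options(BiocCluster = MulticoreCluster(ResourceRequest(num.cores = 5)))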
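
Second, the asynchronous apvec() idea for the multicore case, where the
backend-specific promise is little more than the jobs returned by
mcparallel() plus a blocking value() accessor (again, the names are
placeholders, and this is Unix-only):

  library(parallel)

  ## launch one forked job per chunk and return a "promise" holding them
  apvec <- function(v, FUN, ..., num.cores = 2L) {
      chunks <- splitIndices(length(v), num.cores)
      jobs <- lapply(chunks, function(idx) mcparallel(FUN(v[idx], ...)))
      structure(list(jobs = jobs), class = "MulticorePromise")
  }

  ## block until every job has finished, then splice the chunks together
  value <- function(promise) {
      results <- mccollect(promise$jobs, wait = TRUE)
      do.call(c, unname(results))
  }

  p <- apvec(1:1e6, sqrt, num.cores = 4L)
  ## ... do other work while the forked jobs run ...
  head(value(p))

A status query could be layered on top with mccollect(wait = FALSE),
caching whatever results have already arrived.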
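
Third, the chunk-resolution rule for supporting both a chunk setting in
the param and an override at the pvec() call (the field and argument
names here are made up):

  library(parallel)

  ## options given to pvec() win; otherwise fall back to the param
  resolveChunks <- function(n, param.tasks, tasks = NULL, chunk.size = NULL) {
      if (!is.null(chunk.size))
          return(split(seq_len(n), ceiling(seq_len(n) / chunk.size)))
      if (is.null(tasks))
          tasks <- param.tasks
      splitIndices(n, tasks)
  }

  str(resolveChunks(10, param.tasks = 4))                  # 4 chunks, from the param
  str(resolveChunks(10, param.tasks = 4, chunk.size = 3))  # override: chunks of 3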
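
Finally, one way a multicore-only pvectorize() with several vectorized
arguments and a preserved signature might look. This is my own
illustration, not Ryan's implementation, and it assumes the vectorized
arguments all have the same length:

  library(parallel)

  pvectorize <- function(FUN, vectorize.args, mc.cores = 2L) {
      wrapper <- function() {
          ## collect the supplied arguments under FUN's own formal names
          args <- mget(names(formals(sys.function())), envir = environment())
          vec <- args[vectorize.args]
          fixed <- args[setdiff(names(args), vectorize.args)]
          chunks <- splitIndices(length(vec[[1L]]), mc.cores)
          res <- mclapply(chunks, function(idx)
              do.call(FUN, c(lapply(vec, `[`, idx), fixed)),
              mc.cores = mc.cores)
          do.call(c, res)
      }
      formals(wrapper) <- formals(FUN)   # identical signature to FUN
      wrapper
  }

  pweighted <- pvectorize(function(x, w = 1) x * w,
                          vectorize.args = c("x", "w"))
  pweighted(1:10, w = 10:1)   # same result as the serial version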