By the way, all my work on BiocParallel is going to end up here:
https://github.com/DarwinAwardWinner/BiocParallel
If you want to read through the multicore-only pvectorize, it is here:
https://github.com/DarwinAwardWinner/BiocParallel/blob/a3699cf/R/pvectorize.R
It's a little more than one line of code now. A lot of the code deals
with proper recycling in the case of multiple vectorized args and
merging the signature of the function with that of pvec, as well as
corner cases like not vectorizing anything and being passed length-1
vectors.
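The signature handling builds on R's formals() idiom; here is the
general technique in isolation (this is not the actual code at the
link above):

f <- function(x, y, scale = 1) (x + y) * scale

# Build a wrapper, then copy f's formals onto it, so the parallelized
# function presents exactly the same arguments as f.
pf <- function() {
    cl <- match.call()
    cl[[1]] <- quote(f) # the real code dispatches to a parallel driver
    eval(cl, parent.frame())
}
formals(pf) <- formals(f)

identical(formals(pf), formals(f)) # TRUE
pf(1:3, 4:6, scale = 2) # same result as f(1:3, 4:6, scale = 2)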
There's also an mcmapply function that is to mapply as mclapply is to
lapply. I plan to implement a param-generic version called bpmapply,
which may become the backend for bpvectorize.
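To make the analogy concrete (mclapply and mcmapply ship in R's
parallel package):

library(parallel)

lapply(1:4, sqrt) # serial, one vectorized argument
mclapply(1:4, sqrt) # forked-parallel analogue

mapply(function(a, b) a + b, 1:4, 4:1) # serial, multiple vectorized args
mcmapply(function(a, b) a + b, 1:4, 4:1) # forked-parallel analogue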
On Tue 04 Dec 2012 01:15:24 PM PST, Michael Lawrence wrote:
On Tue, Dec 4, 2012 at 12:47 PM, Ryan C. Thompson
<r...@thompsonclan.org> wrote:
One issue that I see is that for some kinds of parallel backends,
there may not be any way for "bpworkers" to return something
meaningful. For example, a backend that submits jobs to a large
cluster may not know exactly how many nodes are in the cluster,
and in any case returning the total number of nodes may not be
appropriate, since those nodes are shared with other cluster
users. This is primarily important for the pvec function, which
uses the result of bpworkers to decide how many chunks to split
the input into.
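Concretely, the splitting step looks roughly like this (splitIndices
is from the parallel package; the actual internals of pvec may
differ):

library(parallel) # for splitIndices()
library(BiocParallel)

param <- MulticoreParam(workers = 4)
x <- 1:1e6

# One chunk per worker: if bpworkers(param) cannot report a meaningful
# count (e.g. on a shared batch cluster), this step has nothing to go on.
idx <- splitIndices(length(x), bpworkers(param))
chunks <- lapply(idx, function(i) x[i])
length(chunks) # 4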
I guess one solution is to make sure that for any backend that
cannot natively determine a number of available workers, we
require the number of workers as an argument when creating the
param object for that backend. e.g.:
param <- IndeterminateSizedClusterParam(workers = 50)
I think this is on the right track. Since the nature of the request
affects how the jobs are scheduled (earlier or later), there's no way
to automatically make the decision, even if we could detect the total
cluster size. As I noted in the previous email, having a consistent
means of specifying resource requests across backends would be helpful.
I could see an API like:
request <- ResourceRequest(num.cores = 5)
cluster <- LSFCluster(request) # or MulticoreCluster(request)
pvec(v, cluster = cluster)
Depending on the cluster, the 'cluster' object could be queried for
whether the requested resources are currently available (or the jobs
will need to wait). A default cluster object could be registered in
options(). The Cluster constructors could take the arguments of
ResourceRequest directly for simple tasks.
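For instance (all of the names below are hypothetical, just to sketch
the idea):

request <- ResourceRequest(num.cores = 5)
cluster <- LSFCluster(request)
isAvailable(cluster) # hypothetical query: can the request be met right now?

# Register a default so pvec() needs no cluster argument, and let the
# constructors take ResourceRequest's arguments directly:
options(BiocParallel.cluster = MulticoreCluster(num.cores = 4))
pvec(v, sqrt)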
Then the question is whether pvec returns the result of evaluation, or
the promise of evaluation. Probably best to have pvec always behave
synchronously, then have variants like apvec() for asynchronous
execution. The promise would be backend-specific and support status
queries. For multicore, this is basically mcparallel/mccollect.
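For reference, with the real parallel-package primitives that looks
like:

library(parallel)

job <- mcparallel(sum(sqrt(1:1e6))) # returns a job handle immediately
# ... do other work in the parent; mccollect(job, wait = FALSE) polls
# without blocking (fork-based, so POSIX only) ...
res <- mccollect(job)[[1]] # block until the forked child finishes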
Additionally, as discussed previously, it makes sense to be able
to explicitly choose a chunk size or number of chunks for pvec,
rather than splitting into exactly as many chunks as there are
parallel workers. I implemented this in the non-generic
multicore-only version of pvec, but I still need to port it to the
generic version that works for any param. Do people think that the
chunk options should be included in the MulticoreParam class, or
specified when pvec is called?
What about supporting both? If a value is passed directly to pvec, it
overrides the one stored in the param.
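Something like this, say (the argument names are made up):

# An explicit argument to pvec() wins over whatever the param stores.
resolveChunks <- function(call.chunks = NULL, param.chunks = NULL,
                          workers = 4) {
    if (!is.null(call.chunks)) return(call.chunks) # passed to pvec()
    if (!is.null(param.chunks)) return(param.chunks) # stored in the param
    workers # default: one chunk per worker
}

resolveChunks() # 4, from the worker count
resolveChunks(param.chunks = 20) # 20, from the param
resolveChunks(call.chunks = 100, param.chunks = 20) # 100, the call wins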
I have also written a non-generic multicore-only version of
pvectorize that allows for multiple vectorized arguments instead
of just one, and furthermore gives the parallelized function an
identical signature to the original function. Again, this needs to
be ported to the generic bpvectorize.
Awesome.
Michael
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel