Re: [Bioc-devel] BiocParallel

Martin Morgan Thu, 15 Nov 2012 10:58:50 -0800

On 11/15/2012 6:21 AM, Kasper Daniel Hansen wrote:

I'll second Ryan's patch (at least in principle).  When I parallelize
across multiple cores, I have always found mc.preschedule to be an
important option to expose (that, and the number of cores, is all I
use routinely).

Yes, Ryan provided a pull request and I've incorporated it. His pvectorize is also quite fun...


  options(mc.cores=8)
  pcountOverlaps <- pvectorize(countOverlaps)

and for a GappedAlignments object with about 10 million ranges

> system.time(yy <- countOverlaps(gal, exByTx, ignore.strand=TRUE))
   user  system elapsed
 79.997   6.941  87.208
> system.time(xx <- pcountOverlaps(gal, exByTx, ignore.strand=TRUE))
   user  system elapsed
717.273  34.443  18.622
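
For reference, the speedup implied by the two timings above can be worked out directly. This is only arithmetic on the reported numbers, assuming all 8 workers requested via options(mc.cores=8) were actually in use; the variable names are illustrative.

```r
# Elapsed times (seconds) copied from the two system.time() calls above
serial_elapsed   <- 87.208
parallel_elapsed <- 18.622
cores            <- 8      # from options(mc.cores=8)

speedup    <- serial_elapsed / parallel_elapsed   # roughly 4.7x
efficiency <- speedup / cores                     # roughly 0.59 per core

# Total CPU time (user + system) consumed by each run
serial_cpu   <- 79.997 + 6.941
parallel_cpu <- 717.273 + 34.443
cpu_overhead <- parallel_cpu / serial_cpu         # roughly 8.6x more CPU time
```

So the wall-clock time drops by a bit less than 5x, while the total CPU time goes up considerably.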

expect a fluid interface, though...

git / svn gurus: Ryan gets attribution in the git repo, but using git svn dcommit I get all the credit. Tricks?


Martin


Kasper

On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson <r...@thompsonclan.org> wrote:

I just submitted a pull request. I'll add tests shortly if I can figure out
how to write them.


On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:


On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:


Here are two alternative implementations of pvec. pvec2 is just a simple
rewrite of pvec to use mclapply. pvec3 then extends pvec2 to accept a
specified chunk size or a specified number of chunks. If the number of
chunks exceeds the number of cores, then multiple chunks will get run
sequentially on each core. pvec3 also exposes the "mc.preschedule" argument
of mclapply, since that is relevant when there are more chunks than cores.
Lastly, I provide a "pvectorize" function which can be called on a regular
vectorized function to make it into a pvec'd version of itself. Usage is
like: sqrt.parallel <- pvectorize(sqrt); sqrt.parallel(1:1000).

library(parallel)  # provides mclapply, splitIndices, mc.reset.stream

pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
                    mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
{
    env <- parent.frame()
    cores <- as.integer(mc.cores)
    if(cores < 1L) stop("'mc.cores' must be >= 1")
    if(cores == 1L) return(FUN(v, ...))

    if(mc.set.seed) mc.reset.stream()

    n <- length(v)
    si <- splitIndices(n, cores)
    res <- do.call(c,
                   mclapply(si, function(i) FUN(v[i], ...),
                            mc.set.seed=mc.set.seed,
                            mc.silent=mc.silent,
                            mc.cores=mc.cores,
                            mc.cleanup=mc.cleanup))
    if (length(res) != n)
      warning("some results may be missing, folded or caused an error")
    res
}
pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
                    mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE,
                    mc.preschedule=FALSE, num.chunks, chunk.size)
{
    env <- parent.frame()
    cores <- as.integer(mc.cores)
    if(cores < 1L) stop("'mc.cores' must be >= 1")
    if(cores == 1L) return(FUN(v, ...))

    if(mc.set.seed) mc.reset.stream()

    n <- length(v)
    if (missing(num.chunks)) {
      if (missing(chunk.size)) {
        num.chunks <- cores
      } else {
        num.chunks <- ceiling(n/chunk.size)
      }
    }
    si <- splitIndices(n, num.chunks)
    res <- do.call(c,
                   mclapply(si, function(i) FUN(v[i], ...),
                            mc.set.seed=mc.set.seed,
                            mc.silent=mc.silent,
                            mc.cores=mc.cores,
                            mc.cleanup=mc.cleanup,
                            mc.preschedule=mc.preschedule))
    if (length(res) != n)
      warning("some results may be missing, folded or caused an error")
    res
}

pvectorize <- function(FUN) {
    function(...) pvec3(FUN=FUN, ...)
}
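
The chunking rule used by pvec3 can be tried on its own. The following is a minimal sketch, not part of the patch, using only splitIndices from the parallel package; the values of n and chunk.size are made up for illustration.

```r
library(parallel)

n          <- 10L   # length of the input vector
chunk.size <- 3L    # requested chunk size

# Same rule as pvec3: derive the number of chunks from the chunk size
num.chunks <- ceiling(n / chunk.size)

# splitIndices() returns a list of index vectors, one per chunk
si <- splitIndices(n, num.chunks)

# Every index appears exactly once and in order, so c()-ing the per-chunk
# results reproduces the ordering of the input
stopifnot(length(si) == num.chunks,
          all(unlist(si) == seq_len(n)))

lengths(si)   # chunk sizes: roughly balanced, not necessarily equal to chunk.size
```

Note that chunk.size only determines how many chunks are made; splitIndices then balances the chunks itself.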



would be great to have these as 'pull' requests in github; pvec3 as a
replacement for pvec, if it's implementing the same concept but better.

Unit tests would be good (yes being a little hypocritical).
inst/unitTests, using RUnit, examples in


https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests


with username / password readonly
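
A minimal sketch of what such a test could look like, assuming the RUnit package is installed. The function name is hypothetical, and the test exercises parallel::pvec rather than any BiocParallel function, since the package's own interface is still changing.

```r
# Hypothetical test file under inst/unitTests/. Which file and function
# names get picked up depends on the regular expressions passed to
# RUnit::defineTestSuite().
library(parallel)

test_pvec_matches_serial <- function() {
    # Forking is not available on Windows, so use a single core there.
    cores <- if (.Platform$OS.type == "windows") 1L else 2L
    x <- 1:100
    # The parallel result should agree with the plain vectorized call.
    RUnit::checkEquals(sqrt(x), pvec(x, sqrt, mc.cores = cores))
}
```

The same pattern (compare against the serial result) should carry over to pvec3 and to functions produced by pvectorize.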

Martin

On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:



On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan <mtmor...@fhcrc.org>
wrote:


Interested developers -- I added the start of a BiocParallel package to
the Bioconductor subversion repository and build system.

The package is mirrored on github to allow for social coding; I encourage
people to contribute via that mechanism.


https://github.com/Bioconductor/BiocParallel



The purpose is to help focus our efforts at developing appropriate
parallel paradigms. Currently the package Imports: parallel and implements
pvec and mclapply in a way that allows for operation on any vector or list
supporting length(), [, and [[ (the latter for mclapply). pvec in
particular seems to be appropriate for GRanges-like objects, where we don't
necessarily want to extract many thousands of S4 instances of individual
ranges with [[.



Makes sense. Besides, [[ does not even work on GRanges. One limitation of
pvec is that it does not support a chunk size; it just uses length(x) /
ncores. It would be nice to be able to restrict that, which would then
require multiple jobs per core. Unless I'm missing something.
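
To illustrate "multiple jobs per core", here is a small sketch that runs more chunks than cores through mclapply, using only `[` on the input. It uses a plain integer vector as a stand-in for a GRanges-like object; the chunk count and core count are arbitrary choices for illustration.

```r
library(parallel)

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

x      <- 1:20
chunks <- splitIndices(length(x), 5L)   # 5 chunks, more than `cores`

# mc.preschedule = FALSE forks one job per chunk, with at most `cores`
# jobs running at a time. mc.preschedule = TRUE would instead divide the
# chunks among the cores up front.
res <- mclapply(chunks, function(i) sqrt(x[i]),
                mc.cores = cores, mc.preschedule = FALSE)

# Concatenate per-chunk results and compare with the serial answer
out <- do.call(c, res)
stopifnot(isTRUE(all.equal(out, sqrt(x))))
```

Only length() and `[` are applied to x here, which is the property that makes the chunk-wise approach attractive for objects where `[[` is expensive or unsupported.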



Hopefully the ideas in the package can be folded back in to parallel as
they mature.

Martin
--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

_______________________________________________
Bioc-devel@r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/bioc-devel






