Personally, having used memcached in the past for distributed shared-memory
caching, I am most interested in (3) and doRedis.  Many cluster/batch
processing systems are a colossal PITA, and a worker queue would go a long
way towards fixing that.  Less checkpointing, more results... I hope.
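
For concreteness, here is roughly the worker-queue setup I have in mind,
sketched with doRedis + foreach (the queue name and worker count are
arbitrary, and it assumes a Redis server is already running locally):

        library(doRedis)

        ## register a Redis-backed foreach queue; "jobs" is an arbitrary name
        registerDoRedis(queue = "jobs")

        ## a couple of local workers for testing; on a cluster these would be
        ## started on other machines pointing at the same Redis host
        startLocalWorkers(n = 2, queue = "jobs")

        ## tasks are pulled off the queue as workers become free
        res <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)

        removeQueue("jobs")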

As an aside, if I want to do the following:

        library(GenomicRanges)  # findOverlaps(), disjoin(), mcols(), ...

        ## take the overlapping subsets of 'object' and 'x', disjoin them, and
        ## label each disjoint piece with the 'state' of the 'object' ranges
        ## that overlap it
        ol <- findOverlaps(object, x)
        so <- object[queryHits(ol)]
        sx <- x[subjectHits(ol)]
        disjoint <- subsetByOverlaps(disjoin(c(sx, so, ignore.mcols = TRUE)),
            so)
        mcols(disjoint)[, "state"] <- rep("", length(disjoint))
        byState <- split(so, mcols(so)[, "state"])
        for (state in names(byState)) {
            overlapping <- queryHits(findOverlaps(disjoint, byState[[state]]))
            if (length(overlapping) > 0)
                mcols(disjoint[overlapping])[, "state"] <- state
        }
        mcols(disjoint)[, "state"] <- as.factor(mcols(disjoint)[, "state"])

often and fast, where 'object' and 'x' are ranges with large numbers of
intervals, is there a clever way to speed it up a lot?
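
One untested idea, just to make the question concrete: since each disjoint
piece simply inherits the 'state' of an overlapping 'object' range, the
per-state loop might collapse into a single findOverlaps, roughly:

        ## loop-free variant, using the same objects as above
        hits <- findOverlaps(disjoint, so)
        st <- rep("", length(disjoint))
        st[queryHits(hits)] <- as.character(mcols(so)[subjectHits(hits), "state"])
        mcols(disjoint)[, "state"] <- as.factor(st)

though a piece overlapping several states would then take the state of its
last hit rather than the last level of byState, and I have not checked
whether this is actually faster on large inputs.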




On Thu, Nov 15, 2012 at 10:53 AM, Henrik Bengtsson <h...@biostat.ucsf.edu> wrote:

> Is there any write up/discussion/plans on the various types of
> parallel computations out there:
>
> (1) one machine / multi-core/multi-threaded
> (2) multiple machines / multiple processes
> (3) batch / queue processing (on large compute clusters with many users).
> (4) ...
>
> Are we/you mainly focusing on (1) and (2)?
>
> /Henrik
>
> On Thu, Nov 15, 2012 at 6:21 AM, Kasper Daniel Hansen
> <kasperdanielhan...@gmail.com> wrote:
> > I'll second Ryan's patch (at least in principle).  When I parallelize
> > across multiple cores, I have always found mc.preschedule to be an
> > important option to expose (that, and the number of cores, is all I
> > use routinely).
> >
> > Kasper
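
(For anyone following along who has not used it: mc.preschedule controls
whether mclapply forks one job per core up front or one job per element on
demand.  A toy illustration, with made-up tasks:

        library(parallel)

        ## mc.preschedule = TRUE: the input is split into mc.cores blocks up
        ## front, one fork per block; low overhead, good for many cheap,
        ## uniform tasks
        a <- mclapply(1:10000, sqrt, mc.cores = 4, mc.preschedule = TRUE)

        ## mc.preschedule = FALSE: one fork per element, at most mc.cores at
        ## a time; better load balancing when tasks are slow or uneven
        b <- mclapply(1:8, function(i) { Sys.sleep(runif(1)); i },
                      mc.cores = 4, mc.preschedule = FALSE)

Only the scheduling behaviour matters here, not the toy tasks.)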
> >
> > On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson
> > <r...@thompsonclan.org> wrote:
> >> I just submitted a pull request. I'll add tests shortly if I can figure
> >> out how to write them.
> >>
> >>
> >> On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:
> >>>
> >>> On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:
> >>>>
> >>>> Here are two alternative implementations of pvec. pvec2 is just a
> >>>> simple rewrite
> >>>> of pvec to use mclapply. pvec3 then extends pvec2 to accept a
> >>>> specified chunk
> >>>> size or a specified number of chunks. If the number of chunks exceeds
> >>>> the number
> >>>> of cores, then multiple chunks will get run sequentially on each
> >>>> core. pvec3
> >>>> also exposes the "mc.preschedule" argument of mclapply, since that is
> >>>> relevant
> >>>> when there are more chunks than cores. Lastly, I provide a
> >>>> "pvectorize" function
> >>>> which can be called on a regular vectorized function to make it into
> >>>> a pvec'd
> >>>> version of itself. Usage is like: sqrt.parallel <- pvectorize(sqrt);
> >>>> sqrt.parallel(1:1000).
> >>>>
> >>>> pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
> >>>>                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
> >>>> {
> >>>>    env <- parent.frame()
> >>>>    cores <- as.integer(mc.cores)
> >>>>    if(cores < 1L) stop("'mc.cores' must be >= 1")
> >>>>    if(cores == 1L) return(FUN(v, ...))
> >>>>
> >>>>    if(mc.set.seed) mc.reset.stream()
> >>>>
> >>>>    n <- length(v)
> >>>>    si <- splitIndices(n, cores)
> >>>>    res <- do.call(c,
> >>>>                   mclapply(si, function(i) FUN(v[i], ...),
> >>>>                            mc.set.seed=mc.set.seed,
> >>>>                            mc.silent=mc.silent,
> >>>>                            mc.cores=mc.cores,
> >>>>                            mc.cleanup=mc.cleanup))
> >>>>    if (length(res) != n)
> >>>>      warning("some results may be missing, folded or caused an error")
> >>>>    res
> >>>> }
> >>>> pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
> >>>>                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE,
> >>>>                   mc.preschedule = FALSE, num.chunks, chunk.size)
> >>>> {
> >>>>    env <- parent.frame()
> >>>>    cores <- as.integer(mc.cores)
> >>>>    if(cores < 1L) stop("'mc.cores' must be >= 1")
> >>>>    if(cores == 1L) return(FUN(v, ...))
> >>>>
> >>>>    if(mc.set.seed) mc.reset.stream()
> >>>>
> >>>>    n <- length(v)
> >>>>    if (missing(num.chunks)) {
> >>>>      if (missing(chunk.size)) {
> >>>>        num.chunks <- cores
> >>>>      } else {
> >>>>        num.chunks <- ceiling(n/chunk.size)
> >>>>      }
> >>>>    }
> >>>>    si <- splitIndices(n, num.chunks)
> >>>>    res <- do.call(c,
> >>>>                   mclapply(si, function(i) FUN(v[i], ...),
> >>>>                            mc.set.seed=mc.set.seed,
> >>>>                            mc.silent=mc.silent,
> >>>>                            mc.cores=mc.cores,
> >>>>                            mc.cleanup=mc.cleanup,
> >>>>                            mc.preschedule=mc.preschedule))
> >>>>    if (length(res) != n)
> >>>>      warning("some results may be missing, folded or caused an error")
> >>>>    res
> >>>> }
> >>>>
> >>>> pvectorize <- function(FUN) {
> >>>>    function(...) pvec3(FUN=FUN, ...)
> >>>> }
> >>>
> >>>
> >>> would be great to have these as 'pull' requests in github; pvec3 as a
> >>> replacement for pvec, if it's implementing the same concept but better.
> >>>
> >>> Unit tests would be good (yes being a little hypocritical).
> >>> inst/unitTests, using RUnit, examples in
> >>>
> >>>
> >>>
> >>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests
> >>>
> >>>
> >>> with username / password readonly
> >>>
> >>> Martin
> >>>
> >>>> On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan <mtmor...@fhcrc.org>
> >>>>> wrote:
> >>>>>
> >>>>>>
> >>>>>> Interested developers -- I added the start of a BiocParallel
> >>>>>> package to
> >>>>>> the Bioconductor subversion repository and build system.
> >>>>>>
> >>>>>> The package is mirrored on github to allow for social coding; I
> >>>>>> encourage
> >>>>>> people to contribute via that mechanism.
> >>>>>>
> >>>>>>
> >>>>>> https://github.com/Bioconductor/BiocParallel
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> The purpose is to help focus our efforts at developing appropriate
> >>>>>> parallel paradigms. Currently the package Imports: parallel and
> >>>>>> implements
> >>>>>> pvec and mclapply in a way that allows for operation on any vector
> >>>>>> or list
> >>>>>> supporting length(), [, and [[ (the latter for mclapply). pvec in
> >>>>>> particular seems to be appropriate for GRanges-like objects, where
> >>>>>> we don't
> >>>>>> necessarily want to extract many thousands of S4 instances of
> >>>>>> individual
> >>>>>> ranges with [[.
> >>>>>>
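To make Martin's point concrete: pvec-style chunking only touches its input
through length() and [, so a GRanges can be handed over whole.  A small
sketch, using the pvec2() quoted above as a stand-in for BiocParallel's
pvec() (the toy object is made up):

        library(parallel)       # splitIndices(), mclapply(), mc.reset.stream()
        library(GenomicRanges)

        ## a toy GRanges; no per-range [[ extraction of S4 objects is needed
        gr <- GRanges("chr1", IRanges(start = seq_len(1e5), width = 50))

        ## width() is vectorized over ranges, so each worker processes a chunk
        w <- pvec2(gr, width, mc.cores = 2)
        stopifnot(length(w) == length(gr))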
> >>>>>
> >>>>>
> >>>>> Makes sense. Besides, [[ does not even work on GRanges. One
> >>>>> limitation of pvec is that it does not support a chunk size; it just
> >>>>> uses length(x) / ncores. It would be nice to be able to restrict that,
> >>>>> which would then require multiple jobs per core. Unless I'm missing
> >>>>> something.
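
This is the gap the pvec3 above is meant to fill; a quick sketch of how its
chunk.size argument would be used (the input and numbers are arbitrary):

        ## cap the work per fork at 10,000 elements: ceiling(1e6 / 1e4) = 100
        ## chunks spread over 4 cores; with mc.preschedule = FALSE each chunk
        ## is a separate job, so cores that finish early pick up the rest
        x <- runif(1e6)
        res <- pvec3(x, sqrt, mc.cores = 4, chunk.size = 10000,
                     mc.preschedule = FALSE)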
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hopefully the ideas in the package can be folded back into
> >>>>>> parallel as they mature.
> >>>>>>
> >>>>>> Martin
> >>>>>> --
> >>>>>> Dr. Martin Morgan, PhD
> >>>>>> Fred Hutchinson Cancer Research Center
> >>>>>> 1100 Fairview Ave. N.
> >>>>>> PO Box 19024 Seattle, WA 98109
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>>
> >>
>



-- 
"A model is a lie that helps you see the truth."
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
