On Fri, Nov 16, 2012 at 10:33 AM, Hahne, Florian <florian.ha...@novartis.com> wrote:
> Sort of. My implementation assumes parLapply to be a generic function, and there is an object called SGEcluster which, in a way, is equivalent to the 'cluster' class objects in the parallel package. Rather than providing a bunch of nodes to compute on, it contains the necessary information for BatchJobs to run the parallel processes on an SGE cluster (it could make use of other queuing systems too, I just don't have one around to test this). Essentially this is a path to a shared file system for all the serialization, a queue name, and an optional list of additional requirements. Currently this is still rather rudimentary, and I typically just have the system create as many jobs as there are items in the vector object that gets passed to parLapply, but obviously one could make use of the sharding option in the BatchJobs package to mimic mc.preschedule in the mclapply equivalent.
>
> The reason why I went for the parLapply re-implementation is that I wanted this to play nicely with existing tools, hence my request to create a generic in the parallel package to allow for dispatching on my SGEcluster class. But since we seem to aim for something more generic anyway, I could just turn this into yet another new backend.
>
> Martin: how would I add my code to this GitHub repository? I have to admit that I am a bit of a GitHub virgin…

Hi, Florian. This might be helpful.
https://help.github.com/articles/fork-a-repo

Sean

> Florian
> --
>
> From: Michael Lawrence <lawrence.mich...@gene.com>
> Date: Friday, November 16, 2012 3:00 PM
> To: NIBR <florian.ha...@novartis.com>
> Cc: Michael Lawrence <lawrence.mich...@gene.com>, Martin Morgan <mtmor...@fhcrc.org>, "bioc-devel@r-project.org" <bioc-devel@r-project.org>
> Subject: Re: [Bioc-devel] BiocParallel
>
> This sounds very useful when mixing batch jobs with an interactive session. In fact, it's something I was planning to do, since I noticed their execution model is completely asynchronous. Is it actually a new cluster backend for the parallel package?
>
> Michael
>
> On Fri, Nov 16, 2012 at 12:18 AM, Hahne, Florian <florian.ha...@novartis.com> wrote:
>
> I've hacked up some code that uses BatchJobs but makes it look like a normal parLapply operation. Currently the main R process checks the state of the queue at regular intervals and fetches results once a job has finished. It seems to work quite nicely, although there certainly are more elaborate ways to deal with the synchronous/asynchronous issue. Is that something that could be interesting for a broader audience? I could add the code to BiocParallel for folks to try it out.
>
> The whole thing may be a dumb idea, but I find it kind of useful to be able to start parallel jobs directly from R on our huge SGE cluster, have the calling script wait for all jobs to finish and then continue with some downstream computations, rather than having to manually check the job status and start another script once the results are there.
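[The blocking submit-and-wait scheme Florian describes could be sketched roughly as below. This is not Florian's actual code: the wrapper name sgeLapply is invented for illustration, and only the core BatchJobs verbs (makeRegistry, batchMap, submitJobs, waitForJobs, loadResults) are assumed to behave as documented.]

```r
library(BatchJobs)

## Illustrative sketch, not Florian's implementation: run FUN over each
## element of X as one cluster job apiece, then block, polling the queue,
## until all jobs finish, and finally collect the results.
sgeLapply <- function(X, FUN, ..., file.dir = tempfile("registry")) {
    reg <- makeRegistry(id = "sgeLapply", file.dir = file.dir)
    batchMap(reg, FUN, X, more.args = list(...))
    submitJobs(reg)     # hand the jobs to the configured scheduler
    waitForJobs(reg)    # poll the queue until everything has finished
    loadResults(reg)    # fetch the finished results as a list
}
```

[Which scheduler actually runs the jobs is determined by the BatchJobs configuration (e.g. an SGE template), not by the wrapper itself.]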
> Florian
> --
>
> On 11/15/12 9:38 PM, "Michael Lawrence" <lawrence.mich...@gene.com> wrote:
>
>> On Thu, Nov 15, 2012 at 11:00 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>
>>> On 11/15/2012 10:53 AM, Henrik Bengtsson wrote:
>>>
>>>> Is there any write-up/discussion/plans on the various types of parallel computation out there:
>>>>
>>>> (1) one machine / multi-core / multi-threaded
>>>> (2) multiple machines / multiple processes
>>>> (3) batch / queue processing (on large compute clusters with many users)
>>>> (4) ...
>>>>
>>>> Are we/you mainly focusing on (1) and (2)?
>>>
>>> Open for discussion; 1 & 2 are a good starting point for the current scope. r-pbd.org is relevant for 3.
>>
>> We have all three of those configurations here, so I've been looking into ways to facilitate each of them. One interesting package is BatchJobs. It handles simple clusters via ssh, as well as large managed clusters via e.g. LSF.
>>
>>> Not sure how best to facilitate this conversation / prioritization on GitHub? If possible we should move the conversation there.
>>>
>>> Martin
>>>
>>>> /Henrik
>>>>
>>>> On Thu, Nov 15, 2012 at 6:21 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>>
>>>>> I'll second Ryan's patch (at least in principle). When I parallelize across multiple cores, I have always found mc.preschedule to be an important option to expose (that, and the number of cores, is all I use routinely).
>>>>>
>>>>> Kasper
>>>>>
>>>>> On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson <r...@thompsonclan.org> wrote:
>>>>>
>>>>>> I just submitted a pull request. I'll add tests shortly if I can figure out how to write them.
>>>>>>
>>>>>> On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:
>>>>>>
>>>>>>> On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:
>>>>>>>
>>>>>>>> Here are two alternative implementations of pvec. pvec2 is just a simple rewrite of pvec to use mclapply. pvec3 then extends pvec2 to accept a specified chunk size or a specified number of chunks. If the number of chunks exceeds the number of cores, then multiple chunks will get run sequentially on each core. pvec3 also exposes the "mc.preschedule" argument of mclapply, since that is relevant when there are more chunks than cores. Lastly, I provide a "pvectorize" function which can be called on a regular vectorized function to make it into a pvec'd version of itself. Usage is like: sqrt.parallel <- pvectorize(sqrt); sqrt.parallel(1:1000).
>>>>>>>>
>>>>>>>> pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>>>>                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
>>>>>>>> {
>>>>>>>>     env <- parent.frame()
>>>>>>>>     cores <- as.integer(mc.cores)
>>>>>>>>     if (cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>>>>     if (cores == 1L) return(FUN(v, ...))
>>>>>>>>
>>>>>>>>     if (mc.set.seed) mc.reset.stream()
>>>>>>>>
>>>>>>>>     n <- length(v)
>>>>>>>>     si <- splitIndices(n, cores)
>>>>>>>>     res <- do.call(c,
>>>>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>>>>                             mc.set.seed = mc.set.seed, mc.silent = mc.silent,
>>>>>>>>                             mc.cores = mc.cores, mc.cleanup = mc.cleanup))
>>>>>>>>     if (length(res) != n)
>>>>>>>>         warning("some results may be missing, folded or caused an error")
>>>>>>>>     res
>>>>>>>> }
>>>>>>>>
>>>>>>>> pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>>>>                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE,
>>>>>>>>                   mc.preschedule = FALSE, num.chunks, chunk.size)
>>>>>>>> {
>>>>>>>>     env <- parent.frame()
>>>>>>>>     cores <- as.integer(mc.cores)
>>>>>>>>     if (cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>>>>     if (cores == 1L) return(FUN(v, ...))
>>>>>>>>
>>>>>>>>     if (mc.set.seed) mc.reset.stream()
>>>>>>>>
>>>>>>>>     n <- length(v)
>>>>>>>>     if (missing(num.chunks)) {
>>>>>>>>         if (missing(chunk.size)) {
>>>>>>>>             num.chunks <- cores
>>>>>>>>         } else {
>>>>>>>>             num.chunks <- ceiling(n / chunk.size)
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>     si <- splitIndices(n, num.chunks)
>>>>>>>>     res <- do.call(c,
>>>>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>>>>                             mc.set.seed = mc.set.seed, mc.silent = mc.silent,
>>>>>>>>                             mc.cores = mc.cores, mc.cleanup = mc.cleanup,
>>>>>>>>                             mc.preschedule = mc.preschedule))
>>>>>>>>     if (length(res) != n)
>>>>>>>>         warning("some results may be missing, folded or caused an error")
>>>>>>>>     res
>>>>>>>> }
>>>>>>>>
>>>>>>>> pvectorize <- function(FUN) {
>>>>>>>>     function(...) pvec3(FUN = FUN, ...)
>>>>>>>> }
>>>>>>>
>>>>>>> would be great to have these as 'pull' requests in github; pvec3 as a replacement for pvec, if it's implementing the same concept but better.
>>>>>>>
>>>>>>> Unit tests would be good (yes, being a little hypocritical): inst/unitTests, using RUnit; examples in
>>>>>>>
>>>>>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests
>>>>>>>
>>>>>>> with username / password readonly
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:
>>>>>>>
>>>>>>>> On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>>>>>>>
>>>>>>>>> Interested developers -- I added the start of a BiocParallel package to the Bioconductor subversion repository and build system.
>>>>>>>>>
>>>>>>>>> The package is mirrored on github to allow for social coding; I encourage people to contribute via that mechanism.
>>>>>>>>>
>>>>>>>>> https://github.com/Bioconductor/BiocParallel
>>>>>>>>>
>>>>>>>>> The purpose is to help focus our efforts at developing appropriate parallel paradigms.
>>>>>>>>> Currently the package Imports: parallel and implements pvec and mclapply in a way that allows for operation on any vector or list supporting length(), [, and [[ (the latter for mclapply). pvec in particular seems to be appropriate for GRanges-like objects, where we don't necessarily want to extract many thousands of S4 instances of individual ranges with [[.
>>>>>>>>
>>>>>>>> Makes sense. Besides, [[ does not even work on GRanges. One limitation of pvec is that it does not support a chunk size; it just uses length(x) / ncores. It would be nice to be able to restrict that, which would then require multiple jobs per core. Unless I'm missing something.
>>>>>>>>
>>>>>>>>> Hopefully the ideas in the package can be folded back into parallel as they mature.
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>> --
>>>>>>>>> Dr. Martin Morgan, PhD
>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
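[A side note on the chunking discussed above: pvec3 partitions its input with parallel::splitIndices, and the effect of asking for more chunks than cores is easy to inspect interactively. A small sketch using only the documented parallel API; the numbers are arbitrary:]

```r
library(parallel)

n <- 10L

## One chunk per core (the pvec/pvec2 behaviour): three contiguous
## blocks of indices, one per core.
splitIndices(n, 3L)

## chunk.size = 2 gives ceiling(10 / 2) = 5 chunks; with mc.cores = 3
## and mc.preschedule = FALSE, mclapply then runs several chunks per
## core, which is exactly the case pvec3 exposes.
splitIndices(n, ceiling(n / 2L))
```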