On Fri, Nov 16, 2012 at 10:33 AM, Hahne, Florian <florian.ha...@novartis.com> wrote:
> Sort of. My implementation assumes parLapply to be a generic function, and there is an object called SGEcluster which, in a way, is equivalent to the 'cluster' class objects in the parallel package. Rather than providing a bunch of nodes to compute on, it contains the necessary information for BatchJobs to run the parallel processes on an SGE cluster (it could make use of other queuing systems too, I just don't have one around to test this). Essentially this is a path to a shared file system for all the serialization, a queue name, and an optional list of additional requirements. Currently this is still rather rudimentary, and I typically just have the system create as many jobs as there are items in the vector object that gets passed to parLapply, but obviously one could make use of the sharding option in the BatchJobs package to mimic mc.preschedule in the mclapply equivalent.
>
> The reason why I went for the parLapply re-implementation is that I wanted this to play nicely with existing tools, hence my request to create a generic in the parallel package to allow for dispatching on my SGEcluster class. But since we seem to aim for something more generic anyway, I could just turn this into yet another new backend.
>
> Martin: how would I add my code to this GitHub repository? I have to admit that I am a bit of a GitHub virgin…

Hi, Florian. This might be helpful.
https://help.github.com/articles/fork-a-repo

Sean

> Florian
> --
>
> From: Michael Lawrence <lawrence.mich...@gene.com>
> Date: Friday, November 16, 2012 3:00 PM
> To: NIBR <florian.ha...@novartis.com>
> Cc: Michael Lawrence <lawrence.mich...@gene.com>, Martin Morgan <mtmor...@fhcrc.org>, "bioc-devel@r-project.org" <bioc-devel@r-project.org>
> Subject: Re: [Bioc-devel] BiocParallel
>
> This sounds very useful when mixing batch jobs with an interactive session. In fact, it's something I was planning to do, since I noticed their execution model is completely asynchronous. Is it actually a new cluster backend for the parallel package?
>
> Michael
>
> On Fri, Nov 16, 2012 at 12:18 AM, Hahne, Florian <florian.ha...@novartis.com> wrote:
>
> I've hacked up some code that uses BatchJobs but makes it look like a normal parLapply operation. Currently the main R process checks the state of the queue at regular intervals and fetches results once a job has finished. It seems to work quite nicely, although there certainly are more elaborate ways to deal with the synchronous/asynchronous issue. Is that something that could be interesting for a broader audience? I could add the code to BiocParallel for folks to try it out.
>
> The whole thing may be a dumb idea, but I find it kind of useful to be able to start parallel jobs directly from R on our huge SGE cluster, have the calling script wait for all jobs to finish and then continue with some downstream computations, rather than having to manually check the job status and start another script once the results are there.
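[The blocking submit-and-wait scheme Florian describes could be sketched roughly as below. This is not Florian's actual code: the wrapper name sgeLapply is invented for illustration, and only the core BatchJobs verbs (makeRegistry, batchMap, submitJobs, waitForJobs, loadResults) are assumed to behave as documented.]

```r
library(BatchJobs)

## Illustrative sketch, not Florian's implementation: run FUN over each
## element of X as one cluster job apiece, then block, polling the queue,
## until all jobs finish, and finally collect the results.
sgeLapply <- function(X, FUN, ..., file.dir = tempfile("registry")) {
    reg <- makeRegistry(id = "sgeLapply", file.dir = file.dir)
    batchMap(reg, FUN, X, more.args = list(...))
    submitJobs(reg)     # hand the jobs to the configured scheduler
    waitForJobs(reg)    # poll the queue until everything has finished
    loadResults(reg)    # fetch the finished results as a list
}
```

[Which scheduler actually runs the jobs is determined by the BatchJobs configuration (e.g. an SGE template), not by the wrapper itself.]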
> Florian
> --
>
> On 11/15/12 9:38 PM, "Michael Lawrence" <lawrence.mich...@gene.com> wrote:
>
>> On Thu, Nov 15, 2012 at 11:00 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>
>>> On 11/15/2012 10:53 AM, Henrik Bengtsson wrote:
>>>
>>>> Is there any write-up/discussion/plans on the various types of parallel computation out there:
>>>>
>>>> (1) one machine / multi-core / multi-threaded
>>>> (2) multiple machines / multiple processes
>>>> (3) batch / queue processing (on large compute clusters with many users)
>>>> (4) ...
>>>>
>>>> Are we/you mainly focusing on (1) and (2)?
>>>
>>> Open for discussion; 1 & 2 are a good starting point for the current scope. r-pbd.org is relevant for 3.
>>
>> We have all three of those configurations here, so I've been looking into ways to facilitate each of them. One interesting package is BatchJobs. It handles simple clusters via ssh, as well as large managed clusters via e.g. LSF.
>>
>>> Not sure how best to facilitate this conversation / prioritization on GitHub? If possible we should move the conversation there.
>>>
>>> Martin
>>>
>>>> /Henrik
>>>>
>>>> On Thu, Nov 15, 2012 at 6:21 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>>
>>>>> I'll second Ryan's patch (at least in principle). When I parallelize across multiple cores, I have always found mc.preschedule to be an important option to expose (that, and the number of cores, is all I use routinely).
>>>>>
>>>>> Kasper
>>>>>
>>>>> On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson <r...@thompsonclan.org> wrote:
>>>>>
>>>>>> I just submitted a pull request. I'll add tests shortly if I can figure out how to write them.
>>>>>>
>>>>>> On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:
>>>>>>
>>>>>>> On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:
>>>>>>>
>>>>>>>> Here are two alternative implementations of pvec. pvec2 is just a simple rewrite of pvec to use mclapply. pvec3 then extends pvec2 to accept a specified chunk size or a specified number of chunks. If the number of chunks exceeds the number of cores, then multiple chunks will get run sequentially on each core. pvec3 also exposes the "mc.preschedule" argument of mclapply, since that is relevant when there are more chunks than cores. Lastly, I provide a "pvectorize" function which can be called on a regular vectorized function to make it into a pvec'd version of itself. Usage is like: sqrt.parallel <- pvectorize(sqrt); sqrt.parallel(1:1000).
>>>>>>>>
>>>>>>>> pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>>>>                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
>>>>>>>> {
>>>>>>>>     env <- parent.frame()
>>>>>>>>     cores <- as.integer(mc.cores)
>>>>>>>>     if (cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>>>>     if (cores == 1L) return(FUN(v, ...))
>>>>>>>>
>>>>>>>>     if (mc.set.seed) mc.reset.stream()
>>>>>>>>
>>>>>>>>     n <- length(v)
>>>>>>>>     si <- splitIndices(n, cores)
>>>>>>>>     res <- do.call(c,
>>>>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>>>>                             mc.set.seed = mc.set.seed, mc.silent = mc.silent,
>>>>>>>>                             mc.cores = mc.cores, mc.cleanup = mc.cleanup))
>>>>>>>>     if (length(res) != n)
>>>>>>>>         warning("some results may be missing, folded or caused an error")
>>>>>>>>     res
>>>>>>>> }
>>>>>>>>
>>>>>>>> pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>>>>                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE,
>>>>>>>>                   mc.preschedule = FALSE, num.chunks, chunk.size)
>>>>>>>> {
>>>>>>>>     env <- parent.frame()
>>>>>>>>     cores <- as.integer(mc.cores)
>>>>>>>>     if (cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>>>>     if (cores == 1L) return(FUN(v, ...))
>>>>>>>>
>>>>>>>>     if (mc.set.seed) mc.reset.stream()
>>>>>>>>
>>>>>>>>     n <- length(v)
>>>>>>>>     if (missing(num.chunks)) {
>>>>>>>>         if (missing(chunk.size)) {
>>>>>>>>             num.chunks <- cores
>>>>>>>>         } else {
>>>>>>>>             num.chunks <- ceiling(n / chunk.size)
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>     si <- splitIndices(n, num.chunks)
>>>>>>>>     res <- do.call(c,
>>>>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>>>>                             mc.set.seed = mc.set.seed, mc.silent = mc.silent,
>>>>>>>>                             mc.cores = mc.cores, mc.cleanup = mc.cleanup,
>>>>>>>>                             mc.preschedule = mc.preschedule))
>>>>>>>>     if (length(res) != n)
>>>>>>>>         warning("some results may be missing, folded or caused an error")
>>>>>>>>     res
>>>>>>>> }
>>>>>>>>
>>>>>>>> pvectorize <- function(FUN) {
>>>>>>>>     function(...) pvec3(FUN = FUN, ...)
>>>>>>>> }
>>>>>>>
>>>>>>> would be great to have these as 'pull' requests in github; pvec3 as a replacement for pvec, if it's implementing the same concept but better.
>>>>>>>
>>>>>>> Unit tests would be good (yes, being a little hypocritical): inst/unitTests, using RUnit; examples in
>>>>>>>
>>>>>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests
>>>>>>>
>>>>>>> with username / password readonly
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:
>>>>>>>
>>>>>>>> On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>>>>>>>
>>>>>>>>> Interested developers -- I added the start of a BiocParallel package to the Bioconductor subversion repository and build system.
>>>>>>>>>
>>>>>>>>> The package is mirrored on github to allow for social coding; I encourage people to contribute via that mechanism.
>>>>>>>>>
>>>>>>>>> https://github.com/Bioconductor/BiocParallel
>>>>>>>>>
>>>>>>>>> The purpose is to help focus our efforts at developing appropriate parallel paradigms.
>>>>>>>>> Currently the package Imports: parallel and implements pvec and mclapply in a way that allows for operation on any vector or list supporting length(), [, and [[ (the latter for mclapply). pvec in particular seems to be appropriate for GRanges-like objects, where we don't necessarily want to extract many thousands of S4 instances of individual ranges with [[.
>>>>>>>>
>>>>>>>> Makes sense. Besides, [[ does not even work on GRanges. One limitation of pvec is that it does not support a chunk size; it just uses length(x) / ncores. It would be nice to be able to restrict that, which would then require multiple jobs per core. Unless I'm missing something.
>>>>>>>>
>>>>>>>>> Hopefully the ideas in the package can be folded back into parallel as they mature.
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>> --
>>>>>>>>> Dr. Martin Morgan, PhD
>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
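[A side note on the chunking discussed above: pvec3 partitions its input with parallel::splitIndices, and the effect of asking for more chunks than cores is easy to inspect interactively. A small sketch using only the documented parallel API; the numbers are arbitrary:]

```r
library(parallel)

n <- 10L

## One chunk per core (the pvec/pvec2 behaviour): three contiguous
## blocks of indices, one per core.
splitIndices(n, 3L)

## chunk.size = 2 gives ceiling(10 / 2) = 5 chunks; with mc.cores = 3
## and mc.preschedule = FALSE, mclapply then runs several chunks per
## core, which is exactly the case pvec3 exposes.
splitIndices(n, ceiling(n / 2L))
```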