I think that's a good idea, Kylie.

Pete (DelayedMatrixStats developer)

On Thu., 2 Nov. 2017, 6:09 am Kasper Daniel Hansen,
<kasperdanielhan...@gmail.com> wrote:

> I think it makes sense. A lot of sense. Might be useful to involve Henrik
> (matrixStats) as well.
>
> Who are the players, apart from DelayedArray/DelayedMatrixStats and matter?
> (and some very old stuff in Biobase which should really be deprecated in
> favor of matrixStats).
>
> Best,
> Kasper
>
> On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.be...@northeastern.edu>
> wrote:
>
> > Hi all,
> >
> > To continue a variant of this conversation, with the latest BioC release,
> > we now have quite a few packages that are implementing various
> > matrix-related S4 generic functions, many of them relying on matrixStats
> > as a template.
> >
> > I was wondering if there is any interest or intention to create a common
> > MatrixGenerics/ArrayGenerics package on which we can depend to import the
> > relevant S4 generic functions. Although BiocGenerics has a few like
> > ‘rowSums()’ and ‘colMeans()’, etc., there are many more that are
> > implemented across ‘DelayedArray’, ‘DelayedMatrixStats’, my own package
> > ‘matter’, etc., including ‘apply()’, ‘rowSds()’, ‘colVars()’, and so
> > forth.
> >
> > It would be nice to have a single package with minimal additional
> > dependencies (a la BiocGenerics) where we could import the various S4
> > generics and avoid unwanted namespace collisions.
> >
> > Have there been any thoughts on this?
> >
> > Many thanks,
> > Kylie
> >
> > ~~~
> > Kylie Ariel Bemis
> > Future Faculty Fellow
> > College of Computer and Information Science
> > Northeastern University
> > kuwisdelu.github.io <https://kuwisdelu.github.io>
> >
> > On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen
> > <kasperdanielhan...@gmail.com> wrote:
> >
> > On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey
> > <st...@channing.harvard.edu> wrote:
> >
> > On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen
> > <kasperdanielhan...@gmail.com> wrote:
> > Some comment on Aaron's stuff
> >
> > One possibility for doing things like this is if your code can be done in
> > C++ using a subset of rows or columns. That can sometimes give the
> > necessary speed up. What I mean is this
> >
> > Say you can safely process 1000 cells (not matrix cells, but biological
> > cells, aka columns) at a time in RAM
> >
> > iterate in R:
> >   get chunk i containing 1000 cells from the backend data storage
> >   do something on this sub matrix where everything is in a normal matrix
> >   and you just use C++
> >   write results out to whatever backend you're using
> >
> > Then, with a million cells you iterate over 1000 chunks in R. And you
> > don't need to "touch" the full dataset which can be stored on an
> > arbitrary backend.
> >
> > you "touch" it, but you never ingest the whole thing at any time, is that
> > what you mean?
> >
> > Yes, you load the chunk into RAM and then just deal with it.
> >
> > Think of doing 10^10 linear models. If this was 10^6 I would just use
> > lmFit. But 10^10 doesn't fit into memory. So I load 10^7 into memory,
> > run lmFit, store results, redo. This is bound to be much more efficient
> > than loading a single row into memory and doing lm 10^10 times, because
> > lmFit is written to do many linear models at the same time.
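
The chunk loop described above can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python rather than R, the names `iter_column_chunks`, `process_in_chunks`, `column_means` and `read_chunk` are hypothetical, a small in-memory list stands in for the disk backend, and a column-mean kernel stands in for the compiled per-chunk routine (lmFit in the example above).

```python
from typing import Callable, Iterator, List

Block = List[List[float]]  # row-major block: block[row][col]


def iter_column_chunks(n_cols: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive column-index ranges, each at most chunk_size wide."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, n_cols, chunk_size):
        yield range(start, min(start + chunk_size, n_cols))


def column_means(block: Block) -> List[float]:
    """Per-chunk kernel: handles every column of the block in one call."""
    n_rows = len(block)
    return [sum(column) / n_rows for column in zip(*block)]


def process_in_chunks(
    read_chunk: Callable[[range], Block],
    n_cols: int,
    chunk_size: int,
    kernel: Callable[[Block], List[float]],
) -> List[float]:
    """Run kernel over column chunks; only one chunk is in memory at a time."""
    results: List[float] = []
    for cols in iter_column_chunks(n_cols, chunk_size):
        block = read_chunk(cols)       # fetch just these columns from the backend
        results.extend(kernel(block))  # a real pipeline would write these out
    return results


# Hypothetical stand-in backend: a small matrix of 3 genes x 7 cells.
data: Block = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    [0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 9.0],
]


def read_chunk(cols: range) -> Block:
    """Return the requested columns as an ordinary in-memory block."""
    return [[row[j] for j in cols] for row in data]


chunked = process_in_chunks(read_chunk, n_cols=7, chunk_size=3, kernel=column_means)
print(chunked == column_means(data))  # chunked and whole-matrix results agree
```

The loop only ever asks the backend for a range of columns, so the same driver would work unchanged with any storage that can answer that request. Because the chunks are independent, they could also be handed to different workers.
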
> >
> > I am suggesting that this is a potential general strategy.
> >
> > And this approach could be run even (potentially) with different chunks
> > on different nodes.
> >
> > that seems to me to be an important if not essential desideratum.
> >
> > what then is the role of C++? extracting a chunk? preexisting
> > utilities?
> >
> > When I say C++ I just mean write an efficient implementation that works
> > on a chunk, like lmFit. It is true that anything that works on a chunk
> > will work on a single row/column (like lmFit) but there are possibilities
> > for optimization when you work at the chunk level.
> >
> > Obviously not all computations can be done chunkwise. But for those that
> > can, this is a strategy which is independent of the data backend.
> >
> > I wonder whether this "obviously not" needs to be rethought. Algorithms
> > that are implemented to work with data holistically may need
> > to be reexpressed so that they can succeed with chunkwise access. Is
> > this a new mindset needed for holist developers, or can the
> > effective data decompositions occur autonomously?
> >
> > Well, I would say it is obvious that not all computations can be done
> > chunkwise. But of course, in the limit of extremely large data,
> > algorithms which need to cycle over everything no longer scale. So in
> > that case all practical computations can be done chunkwise, out of
> > necessity. For single cell right now where it is just millions of cells
> > on the horizon people will think that they can get "standard" holistic
> > approaches to work (and that is probably true). If they had a billion
> > cells they probably wouldn't think about that.
> >
> > Kasper
> >
> > If you need direct access to the data in the backend in C++ it will be
> > extremely backend dependent what is fast and how to do it. That doesn't
> > mean we shouldn't do it though.
> >
> > Best,
> > Kasper
> >
> > On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey
> > <st...@channing.harvard.edu> wrote:
> > Kylie, thanks for reminding us of matter -- I saw you speak about this at
> > the first Bioconductor Boston Meetup, but it went like lightning. For
> > developers contemplating an approach to representing high-volume
> > rectangular data, where there is no dominant legacy format, it is natural
> > to wonder whether HDF5 would be adequate, and, further, to wonder how to
> > demonstrate that it is or is not dominated by some other approach for a
> > given set of tasks. Should we devise a set of bioinformatic benchmark
> > problems to foster comparison and informed decisionmaking?
> > @becker.gabe: is ALTREP far enough along that one could contemplate
> > benchmarking with it?
> >
> > On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.be...@northeastern.edu>
> > wrote:
> >
> > > It’s not there yet, but I plan to expose a C++ API for my disk-backed
> > > matrix objects in the next version of my ‘matter’ package.
> > >
> > > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
> > > objects at the R level, especially if using a frontend like
> > > DelayedArray on top of them, but it would be nice to have a common C++
> > > API that I could hook into as well (a la Rcpp), so new C/C++ could be
> > > re-used across various backends more easily.
> > >
> > > Kylie
> > >
> > > ~~~
> > > Kylie Ariel Bemis
> > > Future Faculty Fellow
> > > College of Computer and Information Science
> > > Northeastern University
> > > kuwisdelu.github.io <https://kuwisdelu.github.io/>
> > >
> > > On Feb 24, 2017, at 4:50 PM, Aaron Lun <a...@wehi.edu.au> wrote:
> > >
> > > It's a good place to start, though it would be very handy to have a
> > > C(++) API that can be linked against. I'm not sure how much work that
> > > would entail but it would give downstream developers a lot more
> > > options. Sort of like how we can link to Rhtslib, which speeds up a lot
> > > of BAM file processing, instead of just relying on Rsamtools.
> > >
> > > -Aaron
> > >
> > > ________________________________
> > > From: Tim Triche, Jr. <tim.tri...@gmail.com>
> > > Sent: Saturday, 25 February 2017 8:34:58 AM
> > > To: Aaron Lun
> > > Cc: bioc-devel@r-project.org
> > > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
> > >
> > > yes
> > >
> > > the DelayedArray framework that handles HDF5Array, etc. seems like the
> > > right choice?
> > >
> > > --t
> > >
> > > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <a...@wehi.edu.au> wrote:
> > > Hi everyone,
> > >
> > > I just attended the Human Cell Atlas meeting in Stanford, and people
> > > were talking about gene expression matrices for >1 million cells.
> > > If we assume that we can get non-zero expression profiles for ~5000
> > > genes, we’d be talking about a 5000 x 1 million matrix for the raw
> > > count data. This would be 20-40 GB in size, which would clearly
> > > benefit from sparse (via Matrix) or disk-backed representations
> > > (bigmatrix, BufferedMatrix, rhdf5, etc.).
> > >
> > > I’m wondering whether there is any appetite amongst us for making a
> > > consistent BioC API to handle these matrices, sort of like what
> > > BiocParallel does for multicore and snow. It goes without saying that
> > > the different matrix representations should have consistent functions
> > > at the R level (rbind/cbind, etc.) but it would also be nice to have an
> > > integrated C/C++ API (accessible via LinkingTo). There’s many
> > > non-trivial things that can be done with this type of data, and it is
> > > often faster and more memory efficient to do these complex operations
> > > in compiled code.
> > >
> > > I was thinking of something that you could supply any supported matrix
> > > representation to a registered function via .Call; the C++ constructor
> > > would recognise the type of matrix during class instantiation; and
> > > operations (row/column/random read access, also possibly various ways
> > > of writing a matrix) would be overloaded and behave as required for the
> > > class. Only the implementation of the API would need to care about the
> > > nitty gritty of each representation, and we would all be free to write
> > > code that actually does the interesting analytical stuff.
> > >
> > > Anyway, just throwing some thoughts out there. Any comments
> > > appreciated.
> > >
> > > Cheers,
> > >
> > > Aaron
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioc-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
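
The design in Aaron's original message (the representation is recognised once, at construction, and analysis code then sees only uniform read access) can be sketched as below. This is a minimal illustration in Python, not the C++ API under discussion and not any existing Bioconductor interface. Every name here (`MatrixView`, `DenseView`, `SparseView`, `as_matrix_view`, `column_sums`) is hypothetical, and the two toy representations stand in for the real dense, sparse and disk-backed classes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class MatrixView(ABC):
    """Uniform read access; analysis code depends only on this interface."""

    @property
    @abstractmethod
    def shape(self) -> Tuple[int, int]:
        """Return (number of rows, number of columns)."""

    @abstractmethod
    def get_column(self, j: int) -> List[float]:
        """Return column j as a plain list of length shape[0]."""


class DenseView(MatrixView):
    """Wraps an ordinary row-major list of lists."""

    def __init__(self, rows: List[List[float]]) -> None:
        self._rows = rows

    @property
    def shape(self) -> Tuple[int, int]:
        return (len(self._rows), len(self._rows[0]) if self._rows else 0)

    def get_column(self, j: int) -> List[float]:
        return [row[j] for row in self._rows]


class SparseView(MatrixView):
    """Wraps dictionary-of-keys storage: {(row, col): value}, zeros omitted."""

    def __init__(self, entries: Dict[Tuple[int, int], float],
                 shape: Tuple[int, int]) -> None:
        self._entries = entries
        self._shape = shape

    @property
    def shape(self) -> Tuple[int, int]:
        return self._shape

    def get_column(self, j: int) -> List[float]:
        return [self._entries.get((i, j), 0.0) for i in range(self._shape[0])]


def as_matrix_view(obj: object) -> MatrixView:
    """Recognise the representation once and pick the matching wrapper."""
    if isinstance(obj, MatrixView):
        return obj
    if isinstance(obj, list):
        return DenseView(obj)
    if isinstance(obj, tuple) and len(obj) == 2 and isinstance(obj[0], dict):
        return SparseView(obj[0], obj[1])
    raise TypeError(f"unsupported matrix representation: {type(obj).__name__}")


def column_sums(matrix: object) -> List[float]:
    """Analysis code: identical for every supported representation."""
    view = as_matrix_view(matrix)
    return [sum(view.get_column(j)) for j in range(view.shape[1])]


dense = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
sparse = ({(0, 0): 1.0, (1, 1): 2.0, (2, 0): 3.0}, (3, 2))
print(column_sums(dense) == column_sums(sparse))
```

Only `as_matrix_view` and the wrapper classes know about storage details. Supporting another backend would mean adding one wrapper and one branch, with `column_sums` left untouched.
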