I think it makes sense. A lot of sense. Might be useful to involve Henrik (matrixStats) as well.
Who are the players, apart from DelayedArray/DelayedMatrixStats and matter?
(And some very old stuff in Biobase which should really be deprecated in
favor of matrixStats.)

Best,
Kasper

On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.be...@northeastern.edu> wrote:

> Hi all,
>
> To continue a variant of this conversation, with the latest BioC release,
> we now have quite a few packages that are implementing various
> matrix-related S4 generic functions, many of them relying on matrixStats as
> a template.
>
> I was wondering if there is any interest or intention to create a common
> MatrixGenerics/ArrayGenerics package on which we can depend to import the
> relevant S4 generic functions. Although BiocGenerics has a few like
> 'rowSums()' and 'colMeans()', etc., there are many more that are
> implemented across 'DelayedArray', 'DelayedMatrixStats', my own package
> 'matter', etc., including 'apply()', 'rowSds()', 'colVars()', and so forth.
>
> It would be nice to have a single package with minimal additional
> dependencies (a la BiocGenerics) where we could import the various S4
> generics and avoid unwanted namespace collisions.
>
> Have there been any thoughts on this?
>
> Many thanks,
> Kylie
>
> ~~~
> Kylie Ariel Bemis
> Future Faculty Fellow
> College of Computer and Information Science
> Northeastern University
> https://kuwisdelu.github.io
>
> On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen
> <kasperdanielhan...@gmail.com> wrote:
>
> On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey
> <st...@channing.harvard.edu> wrote:
>
> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen
> <kasperdanielhan...@gmail.com> wrote:
>
> Some comments on Aaron's stuff
>
> One possibility for doing things like this is if your code can be done in
> C++ using a subset of rows or columns. That can sometimes give the
> necessary speed up.
> What I mean is this
>
> Say you can safely process 1000 cells (not matrix cells, but biological
> cells, aka columns) at a time in RAM
>
> iterate in R:
>   get chunk i containing 1000 cells from the backend data storage
>   do something on this sub matrix where everything is in a normal matrix
>     and you just use C++
>   write results out to whatever backend you're using
>
> Then, with a million cells you iterate over 1000 chunks in R. And you
> don't need to "touch" the full dataset which can be stored on an arbitrary
> backend.
>
> you "touch" it, but you never ingest the whole thing at any time, is that
> what you mean?
>
> Yes, you load the chunk into RAM and then just deal with it.
>
> Think of doing 10^10 linear models. If this was 10^6 I would just use
> lmFit. But 10^10 doesn't fit into memory. So I load 10^7 into memory, run
> lmFit, store results, redo. This is bound to be much more efficient than
> loading a single row into memory and doing lm 10^10 times, because lmFit is
> written to do many linear models at the same time.
>
> I am suggesting that this is a potential general strategy.
>
> And this approach could be run even (potentially) with different chunks on
> different nodes.
>
> that seems to me to be an important if not essential desideratum.
>
> what then is the role of C++? extracting a chunk? preexisting utilities?
>
> When I say C++ I just mean write an efficient implementation that works on
> a chunk, like lmFit. It is true that anything that works on a chunk will
> work on a single row/column (like lmFit) but there are possibilities for
> optimization when you work at the chunk level.
>
> Obviously not all computations can be done chunkwise. But for those that
> can, this is a strategy which is independent of the data backend.
>
> I wonder whether this "obviously not" needs to be rethought. Algorithms
> that are implemented to work with data holistically may need
> to be reexpressed so that they can succeed with chunkwise access.
> Is this a new mindset needed for holist developers, or can the
> effective data decompositions occur autonomously?
>
> Well, I would say it is obvious that not all computations can be done
> chunkwise. But of course, in the limit of extremely large data, algorithms
> which need to cycle over everything no longer scale. So in that case all
> practical computations can be done chunkwise, out of necessity. For single
> cell right now where it is just millions of cells on the horizon people
> will think that they can get "standard" holistic approaches to work (and
> that is probably true). If they had a billion cells they probably wouldn't
> think about that.
>
> Kasper
>
> If you need direct access to the data in the backend in C++ it will be
> extremely backend dependent what is fast and how to do it. That doesn't
> mean we shouldn't do it though.
>
> Best,
> Kasper
>
> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey
> <st...@channing.harvard.edu> wrote:
>
> Kylie, thanks for reminding us of matter -- I saw you speak about this at
> the first Bioconductor Boston Meetup, but it went like lightning. For
> developers contemplating an approach to representing high-volume
> rectangular data, where there is no dominant legacy format, it is natural
> to wonder whether HDF5 would be adequate, and, further, to wonder how to
> demonstrate that it is or is not dominated by some other approach for a
> given set of tasks. Should we devise a set of bioinformatic benchmark
> problems to foster comparison and informed decision-making?
> @becker.gabe: is ALTREP far enough along that one could contemplate
> benchmarking with it?
>
> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.be...@northeastern.edu>
> wrote:
>
> > It's not there yet, but I plan to expose a C++ API for my disk-backed
> > matrix objects in the next version of my 'matter' package.
> >
> > It's getting easier to interchange matter/HDF5Array/bigmemory/etc.
> > objects at the R level, especially if using a frontend like DelayedArray
> > on top of them, but it would be nice to have a common C++ API that I
> > could hook into as well (a la Rcpp), so new C/C++ could be re-used across
> > various backends more easily.
> >
> > Kylie
> >
> > ~~~
> > Kylie Ariel Bemis
> > Future Faculty Fellow
> > College of Computer and Information Science
> > Northeastern University
> > https://kuwisdelu.github.io
> >
> > On Feb 24, 2017, at 4:50 PM, Aaron Lun <a...@wehi.edu.au> wrote:
> >
> > It's a good place to start, though it would be very handy to have a C(++)
> > API that can be linked against. I'm not sure how much work that would
> > entail but it would give downstream developers a lot more options. Sort
> > of like how we can link to Rhtslib, which speeds up a lot of BAM file
> > processing, instead of just relying on Rsamtools.
> >
> > -Aaron
> >
> > ________________________________
> > From: Tim Triche, Jr. <tim.tri...@gmail.com>
> > Sent: Saturday, 25 February 2017 8:34:58 AM
> > To: Aaron Lun
> > Cc: bioc-devel@r-project.org
> > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
> >
> > yes
> >
> > the DelayedArray framework that handles HDF5Array, etc. seems like the
> > right choice?
> >
> > --t
> >
> > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <a...@wehi.edu.au> wrote:
> >
> > Hi everyone,
> >
> > I just attended the Human Cell Atlas meeting in Stanford, and people were
> > talking about gene expression matrices for >1 million cells. If we assume
> > that we can get non-zero expression profiles for ~5000 genes, we'd be
> > talking about a 5000 x 1 million matrix for the raw count data. This
> > would be 20-40 GB in size, which would clearly benefit from sparse (via
> > Matrix) or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5,
> > etc.).
> >
> > I'm wondering whether there is any appetite amongst us for making a
> > consistent BioC API to handle these matrices, sort of like what
> > BiocParallel does for multicore and snow. It goes without saying that the
> > different matrix representations should have consistent functions at the
> > R level (rbind/cbind, etc.) but it would also be nice to have an
> > integrated C/C++ API (accessible via LinkingTo). There are many
> > non-trivial things that can be done with this type of data, and it is
> > often faster and more memory efficient to do these complex operations in
> > compiled code.
> >
> > I was thinking of something where you could supply any supported matrix
> > representation to a registered function via .Call; the C++ constructor
> > would recognise the type of matrix during class instantiation; and
> > operations (row/column/random read access, also possibly various ways of
> > writing a matrix) would be overloaded and behave as required for the
> > class. Only the implementation of the API would need to care about the
> > nitty gritty of each representation, and we would all be free to write
> > code that actually does the interesting analytical stuff.
> >
> > Anyway, just throwing some thoughts out there. Any comments appreciated.
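To make the shape of that proposal concrete, here is a minimal C++ sketch, not code from any existing package. Every name in it (`AnyMatrix`, `DenseMatrix`, `SparseMatrix`, `col_sum`) is hypothetical, and the sparse layout is a toy stand-in rather than the format used by Matrix or any disk-backed package. It only illustrates the idea that analysis code talks to one interface while each representation handles its own storage details.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical common interface; none of these names come from a real package.
class AnyMatrix {
public:
    virtual ~AnyMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Return column j as a dense vector of length nrow().
    virtual std::vector<double> get_col(std::size_t j) const = 0;
};

// Dense column-major storage, the layout of an ordinary R matrix.
class DenseMatrix : public AnyMatrix {
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> x)
        : nr_(nr), nc_(nc), x_(std::move(x)) {
        assert(x_.size() == nr_ * nc_);
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    std::vector<double> get_col(std::size_t j) const override {
        assert(j < nc_);
        const auto first = x_.begin() + static_cast<std::ptrdiff_t>(j * nr_);
        const auto last = first + static_cast<std::ptrdiff_t>(nr_);
        return std::vector<double>(first, last);
    }
private:
    std::size_t nr_, nc_;
    std::vector<double> x_;
};

// Toy sparse storage: one (row index -> value) map per column.
class SparseMatrix : public AnyMatrix {
public:
    SparseMatrix(std::size_t nr, std::size_t nc) : nr_(nr), cols_(nc) {}
    void set(std::size_t i, std::size_t j, double v) {
        assert(i < nr_ && j < cols_.size());
        cols_[j][i] = v;
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return cols_.size(); }
    std::vector<double> get_col(std::size_t j) const override {
        assert(j < cols_.size());
        std::vector<double> out(nr_, 0.0);  // absent entries are zero
        for (const auto& entry : cols_[j]) out[entry.first] = entry.second;
        return out;
    }
private:
    std::size_t nr_;
    std::vector<std::map<std::size_t, double>> cols_;
};

// Analysis code sees only the interface, never the storage details.
double col_sum(const AnyMatrix& m, std::size_t j) {
    double total = 0.0;
    for (double v : m.get_col(j)) total += v;
    return total;
}
```

With this layout, `col_sum` accepts a `DenseMatrix` or a `SparseMatrix` unchanged. A real version would also need the type detection from the R object and the write access mentioned above, which this sketch leaves out.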
> >
> > Cheers,
> >
> > Aaron
> >
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioc-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
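The chunkwise strategy quoted earlier in this thread (load a block of columns into RAM, process it with ordinary in-memory code, store the results, repeat) can be sketched in C++ roughly as follows. This is an illustration under assumptions, not anyone's actual implementation: `Chunk`, `ChunkReader`, and `chunked_col_means` are hypothetical names, the reader callback stands in for whatever backend holds the data, and column means stand in for a heavier per-chunk computation such as lmFit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A contiguous block of columns held in RAM as a plain column-major matrix.
struct Chunk {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> values;  // length nrow * ncol
};

// Hypothetical backend hook: returns columns [start, start + count).
// Only this callback needs to know how the data are stored.
using ChunkReader = std::function<Chunk(std::size_t start, std::size_t count)>;

// Per-column means, reading at most chunk_size columns at a time, so the
// full matrix is never held in memory at once.
std::vector<double> chunked_col_means(const ChunkReader& read_chunk,
                                      std::size_t total_cols,
                                      std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<double> means;
    means.reserve(total_cols);
    for (std::size_t start = 0; start < total_cols; start += chunk_size) {
        // The last chunk may be shorter than chunk_size.
        const std::size_t count = std::min(chunk_size, total_cols - start);
        const Chunk chunk = read_chunk(start, count);
        assert(chunk.values.size() == chunk.nrow * chunk.ncol);
        for (std::size_t j = 0; j < chunk.ncol; ++j) {
            double sum = 0.0;
            for (std::size_t i = 0; i < chunk.nrow; ++i) {
                sum += chunk.values[j * chunk.nrow + i];
            }
            means.push_back(chunk.nrow > 0
                                ? sum / static_cast<double>(chunk.nrow)
                                : 0.0);
        }
    }
    return means;
}
```

Because each chunk is processed independently, the loop body is also the unit that could be sent to different nodes, as suggested in the quoted discussion. Computations that need every column at once do not fit this pattern without being re-expressed.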