> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maech...@stat.math.ethz.ch> wrote:
>
>>>>>> C W <tmrs...@gmail.com>
>>>>>> on Fri, 20 Oct 2017 15:51:16 -0400 writes:
>
>> Thank you for your responses. I guess I don't feel alone. I don't find
>> that the documentation goes into any detail.
>
>> I also find it surprising that,
>
>>> object.size(train$data)
>> 1730904 bytes
>
>>> object.size(as.matrix(train$data))
>> 6575016 bytes
>
>> the dgCMatrix actually takes less memory, though it *looks* like the
>> opposite.
>
> to whom?
>
> The whole idea of these sparse matrix classes in the 'Matrix' package
> (and everywhere else in applied math, CS, ...) is that
> 1. they need much less memory, and
> 2. matrix arithmetic with them can be much faster, because it is based
>    on sophisticated sparse matrix linear algebra, notably the sparse
>    Cholesky decomposition for solve() etc.
>
> Of course the efficiency only applies if most of the matrix entries
> _are_ 0.
> You can measure the "sparsity", or rather the "density", of a matrix by
>
>    nnzero(A) / length(A)
>
> where length(A) == nrow(A) * ncol(A), as for regular matrices
> (but it does *not* suffer integer overflow),
> and nnzero(.) is a simple utility from Matrix which -- very efficiently
> for sparseMatrix objects -- gives the number of nonzero entries of the
> matrix.
>
> All of these classes are formally defined classes and therefore have
> help pages: here ?dgCMatrix-class, which then points to
> ?CsparseMatrix-class (and I forget whether RStudio really helps you
> find these ...; in emacs ESS they are found nicely via the usual key).
>
> To get started, you may further look at ?Matrix _and_ ?sparseMatrix
> (and possibly the Matrix package vignettes --- though they need work --
> I'm happy for collaborators there!)
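[Ed.: Martin's density measure and the thread's object.size() comparison can be reproduced on a self-contained toy matrix (my own example, not the agaricus data from xgboost):]

```r
# Illustrative sketch (a toy matrix, not train$data): measure density as
# nnzero(A) / length(A) and compare sparse vs. dense storage.
library(Matrix)

set.seed(1)
# 1000 x 1000 matrix with about 5000 nonzero entries (duplicate (i, j)
# positions are summed by sparseMatrix(), so possibly slightly fewer):
A <- sparseMatrix(i = sample(1000, 5000, replace = TRUE),
                  j = sample(1000, 5000, replace = TRUE),
                  x = rnorm(5000),
                  dims = c(1000, 1000))

nnzero(A) / length(A)      # density: roughly 0.5% of entries are nonzero
object.size(A)             # well under 100 KB for the sparse form
object.size(as.matrix(A))  # about 8 MB: 10^6 doubles, regardless of sparsity
```

With 99.5% of the entries exactly 0, the sparse representation is smaller by two orders of magnitude, which is the same effect the thread observes on train$data (just more pronounced, since train$data is less sparse).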
>
> Bill Dunlap's comment applies indeed:
> In principle all these matrices should work like regular numeric
> matrices, just faster and with a smaller memory footprint if they are
> really sparse (and not just formally of a sparseMatrix class)
> ((and there are quite a few more niceties in the package))
>
> Martin Maechler
> (here, maintainer of 'Matrix')
>
>
>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
>>>> On Oct 20, 2017, at 11:11 AM, C W <tmrs...@gmail.com> wrote:
>>>>
>>>> Dear R list,
>>>>
>>>> I came across dgCMatrix. I believe this class is associated with
>>>> sparse matrices.
>>>
>>> Yes. See:
>>>
>>>    help('dgCMatrix-class', pack=Matrix)
>>>
>>> If Martin Maechler happens to respond to this, you should listen to
>>> him rather than anything I write. Much of what the Matrix package does
>>> appears to be magical to one such as I.
>>>
>>>>
>>>> I see there are 8 attributes to train$data. I am confused why there
>>>> are so many; some are vectors. What do they do?
>>>>
>>>> Here's the R code:
>>>>
>>>> library(xgboost)
>>>> data(agaricus.train, package='xgboost')
>>>> data(agaricus.test, package='xgboost')
>>>> train <- agaricus.train
>>>> test <- agaricus.test
>>>> attributes(train$data)
>>>>
>>>
>>> I got a bit of an annoying surprise when I did something similar. It
>>> appeared to me that I did not need to load the xgboost library, since
>>> all that was being asked was "where is the data" in an object that
>>> should be loaded from that library using the `data` function. The last
>>> command, asking for the attributes, filled up my console with a
>>> 100K-length vector (actually 2 such vectors). The `str` function
>>> returns a more useful result.
>>>
>>>> data(agaricus.train, package='xgboost')
>>>> train <- agaricus.train
>>>> names( attributes(train$data) )
>>> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
>>>> str(train$data)
>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>>   ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>>   ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>>>   ..@ Dim     : int [1:2] 6513 126
>>>   ..@ Dimnames:List of 2
>>>   .. ..$ : NULL
>>>   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>>>   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>>   ..@ factors : list()
>>>
>>>> Where is the data, is it in $p, $i, or $x?
>>>
>>> So the "data" (meaning the values of the sparse matrix) are in the @x
>>> slot. The values all appear to be the number 1. The @i slot is the
>>> sequence of row locations for the value entries, while the @p items
>>> are somehow connected with the columns (I think, since 127 and
>>> 126 = the number of columns from the @Dim slot are only off by 1).
>
> You are right, David.
>
> Well, they follow sparse matrix standards, which (like C) start
> counting at 0.
>
>>>
>>> Doing this
>
>    colSums(as.matrix(train$data))
>
> The above colSums() again is "very" inefficient:
> all such R functions have smartly defined Matrix methods that work
> directly on sparse matrices.
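[Ed.: the off-by-one David notices is exactly the CSC ("compressed sparse column") column-pointer vector. A tiny hand-checkable matrix (my own, not the agaricus data) makes the three slots concrete:]

```r
# Decode the dgCMatrix slots on a 3 x 3 example small enough to read off.
library(Matrix)

m <- Matrix(c(0, 5, 0,
              1, 0, 0,
              0, 2, 3), nrow = 3, byrow = TRUE, sparse = TRUE)
class(m)   # "dgCMatrix"

m@x        # 1 5 2 3 -- the nonzero values, stored column by column
m@i        # 1 0 2 2 -- 0-based row index of each value in @x
m@p        # 0 1 3 4 -- column "pointers": column j (1-based) holds the
           #            values  x[(p[j] + 1):p[j + 1]]
diff(m@p)  # 1 2 1  -- number of nonzeros in each column
```

So @p has length ncol + 1 (hence the 127 vs. 126 above): its first entry is always 0 and its last is the total number of stored entries, length(@x).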
I did get an error with colSums(train$data):

> colSums(train$data)
Error in colSums(train$data) : 
  'x' must be an array of at least two dimensions

Which, as it turned out, was due to my not yet having loaded pkg:Matrix.
Perhaps the xgboost package only imports certain functions from
pkg:Matrix, and colSums is not one of them. This resembles the errors I
get when I try to use grid package functions on ggplot2 objects. Since
ggplot2 is built on top of grid, I am always surprised when this
happens, and after a headslap and explicitly loading pkg:grid I continue
on my stumbling way.

library(Matrix)
colSums(train$data)  # no error

> Note that as.matrix(M) can "blow up" your R, when the matrix M is
> really large and sparse, such that its dense version does not even fit
> in your computer's RAM.

I did know that, so I first calculated whether the dense matrix version
of that object would fit in my RAM space; it fit easily, so I proceeded.

I find the TsparseMatrix indexing easier for my more naive notion of
sparsity, although thinking about it now, I think I can see that the
CsparseMatrix more closely resembles the "folded vector" design of dense
R matrices. I will sometimes coerce CsparseMatrix objects to
TsparseMatrix objects if I am working on the "inner" indices. I should
probably stop doing that.

I sincerely hope my stumbling efforts have not caused any delays.

-- 
David.

>
>>>    cap-shape=bell   cap-shape=conical
>>>               369                   3
>>>  cap-shape=convex      cap-shape=flat
>>>              2934                2539
>>> cap-shape=knobbed    cap-shape=sunken
>>>               644                  24
>>> cap-surface=fibrous cap-surface=grooves
>>>              1867                   4
>>>  cap-surface=scaly  cap-surface=smooth
>>>              2607                2035
>>>    cap-color=brown      cap-color=buff
>>>              1816
>>> # now snipping the rest of that output.
>>>
>>>
>>> Now this makes me think that the @p vector gives you the cumulative
>>> sum of the number of items per column:
>>>
>>>> all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
>>> [1] TRUE
>>>
>>>>
>>>> Thank you very much!
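[Ed.: David's cumsum check works here because every stored value in train$data happens to be 1, so colSums() coincides with the per-column nonzero *counts*; in general it is the counts, not the sums, that @p accumulates. That point, and the TsparseMatrix (triplet) form he mentions, can be sketched on a toy matrix (mine, not train$data):]

```r
library(Matrix)

set.seed(42)
# Small sparse matrix with arbitrary (non-0/1) values; duplicate (i, j)
# positions are summed by sparseMatrix():
m <- sparseMatrix(i = sample(50, 40, replace = TRUE),
                  j = sample(20, 40, replace = TRUE),
                  x = rnorm(40),
                  dims = c(50, 20))

# @p always accumulates the per-column counts of stored entries
# (diff(m@p) is the count per column, and m@p starts at 0):
all(cumsum(diff(m@p)) == m@p[-1])   # TRUE, whatever the values are

# The triplet form carries an explicit 0-based column index @j for every
# entry, at the cost of storing it uncompressed:
mT <- as(m, "TsparseMatrix")
head(cbind(i = mT@i, j = mT@j, x = mT@x))
```

This is why coercing back and forth is usually unnecessary: the C form is just the T form with the sorted @j vector compressed into @p.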
>>>>
>>>> [[alternative HTML version deleted]]
>>>
>>> Please read the Posting Guide. Your code was not mangled in this
>>> instance, but HTML code often arrives in an unreadable mess.
>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>>> 'Any technology distinguishable from magic is insufficiently advanced.'
>>> -Gehm's Corollary to Clarke's Third Law