>>>>> David Winsemius <dwinsem...@comcast.net>
>>>>>     on Sat, 21 Oct 2017 09:05:38 -0700 writes:
>> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maech...@stat.math.ethz.ch> wrote:
>>
>>>>>>> C W <tmrs...@gmail.com>
>>>>>>>     on Fri, 20 Oct 2017 15:51:16 -0400 writes:
>>
>>> Thank you for your responses.  I guess I don't feel alone.
>>> I don't find that the documentation goes into any detail.
>>
>>> I also find it surprising that,
>>
>>>> object.size(train$data)
>>> 1730904 bytes
>>
>>>> object.size(as.matrix(train$data))
>>> 6575016 bytes
>>
>>> the dgCMatrix actually takes less memory, though it
>>> *looks* like the opposite.
>>
>> To whom?
>>
>> The whole idea of these sparse matrix classes in the 'Matrix'
>> package (and everywhere else in applied math, CS, ...) is that
>>   1. they need much less memory, and
>>   2. matrix arithmetic with them can be much faster, because it is
>>      based on sophisticated sparse matrix linear algebra, notably the
>>      sparse Cholesky decomposition for solve() etc.
>>
>> Of course the efficiency only applies if most of the
>> matrix entries _are_ 0.
>> You can measure the "sparsity", or rather the "density", of a
>> matrix by
>>
>>     nnzero(A) / length(A)
>>
>> where length(A) == nrow(A) * ncol(A), as for regular matrices
>> (but it does *not* integer overflow), and nnzero(.) is a simple
>> utility from Matrix which -- very efficiently for sparseMatrix
>> objects -- gives the number of nonzero entries of the matrix.
>>
>> All of these classes are formally defined classes and therefore
>> have help pages.  Here ?dgCMatrix-class, which then points
>> to ?CsparseMatrix-class (and I forget if RStudio really helps
>> you find these ..; in emacs ESS they are found nicely via the usual key).
>>
>> To get started, you may further look at ?Matrix _and_ ?sparseMatrix
>> (and possibly the Matrix package vignettes --- though they need
>> work -- I'm happy for collaborators there!).
>>
>> Bill Dunlap's comment indeed applies:
>> in principle all these matrices should work like regular numeric
>> matrices, just faster and with a smaller memory footprint if they are
>> really sparse (and not just formally of a sparseMatrix class)
>> ((and there are quite a few more niceties in the package)).
>>
>> Martin Maechler
>> (here, maintainer of 'Matrix')
>>
>>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>>>>> On Oct 20, 2017, at 11:11 AM, C W <tmrs...@gmail.com> wrote:
>>>>>
>>>>> Dear R list,
>>>>>
>>>>> I came across dgCMatrix. I believe this class is associated with
>>>>> sparse matrices.
>>>>
>>>> Yes. See:
>>>>
>>>>     help('dgCMatrix-class', pack=Matrix)
>>>>
>>>> If Martin Maechler happens to respond to this you should listen to him
>>>> rather than anything I write. Much of what the Matrix package does appears
>>>> to be magical to one such as I.
>>>> [............]
>>>>> data(agaricus.train, package='xgboost')
>>>>> train <- agaricus.train
>>>>> names( attributes(train$data) )
>>>> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
>>>>> str(train$data)
>>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>>>   ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>>>   ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>>>>   ..@ Dim     : int [1:2] 6513 126
>>>>   ..@ Dimnames:List of 2
>>>>   .. ..$ : NULL
>>>>   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>>>>   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>>>   ..@ factors : list()
>>>>
>>>>> Where is the data, is it in $p, $i, or $x?
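
[Aside: a minimal sketch of the object.size() / nnzero() point above.
The 1000 x 500 size and 2% density are arbitrary illustrative choices;
nothing beyond the 'Matrix' package is assumed.]

    library(Matrix)
    set.seed(1)
    A <- rsparsematrix(1000, 500, density = 0.02)  # random dgCMatrix, ~2% nonzero entries
    nnzero(A) / length(A)        # the "density": fraction of entries that are nonzero
    object.size(A)               # stores only the nonzero values plus their indices
    object.size(as.matrix(A))    # the dense copy stores all 1000 * 500 doubles

The gap between the two object.size() results grows as the matrix gets
larger and sparser, which is the effect seen with the agaricus training
data above.
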
>>>>
>>>> So the "data" (meaning the values of the sparse matrix) are in the @x
>>>> leaf. The values all appear to be the number 1. The @i leaf is the
>>>> sequence of row locations for the value entries, while the @p items are
>>>> somehow connected with the columns (I think, since 127 and 126 = the
>>>> number of columns from the @Dim leaf are only off by 1).
>>
>> You are right, David.
>>
>> Well, they follow sparse matrix standards, which (like C) start
>> counting at 0.
>>
>>>>
>>>> Doing this
>>>>
>>>>    > colSums(as.matrix(train$data))
>>
>> The above colSums() again is "very" inefficient:
>> all such R functions have smartly defined Matrix methods that
>> work directly on sparse matrices.

> I did get an error with colSums(train$data):
>
> > colSums(train$data)
> Error in colSums(train$data) :
>   'x' must be an array of at least two dimensions

The same problem C.W. saw with head().  It all works, e.g., after
calling str() on train$data.

But I am still puzzled, because head() is similar to str(): both are
S3 generics (in "utils"), but str()'s UseMethod() I think sees that
the class belongs to package "Matrix" and hence attaches it {not just
*loads* it -- hence, imports etc. do not matter}, whereas head() does
not.

Even more curiously, colSums() *also* attaches Matrix but still
fails, yet it works on a 2nd call.

Example 1, in a fresh R session:
--------------------------------------------------------------------------------
> data(agaricus.train, package="xgboost")
> M <- agaricus.train$data
> methods(str)
[1] str.data.frame* str.Date* str.default* str.dendrogram* str.logLik* str.POSIXt*
see '?methods' for accessing help and source code
> str(M)
Loading required package: Matrix          <<<<<<<<< SEE ! <<<<<<<<<<<<<<<<<
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  ..@ Dim     : int [1:2] 6513 126
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
>
> head(M)
6 x 126 sparse Matrix of class "dgCMatrix"
   [[ suppressing 126 column names ‘cap-shape=bell’, ‘cap-shape=conical’, ‘cap-shape=convex’ ... ]]
[1,] . . 1 . . . . . . 1 1 . . . . . . . . . 1 . . . . . . . . 1 . . . 1 . 1 . . . 1 1 . . . . . . . . . . . 1 . .
................
................
................
--------------------------------------------------------------------------------

See, str() is a nice generic function ==> it attaches Matrix (see the
message where I have added '<<<<<<<<< SEE ! <<<<.........'), but as we
know, head() strangely does not.

Now, the curious colSums() behavior:

Example 2, in a fresh R session:
-----------------------------------------------------------------------------
> data(agaricus.train, package='xgboost')
> M <- agaricus.train$data
> cm <- colSums(M)   ## first time, loads Matrix but then fails !!
Loading required package: Matrix
Error in colSums(M) : 'x' must be an array of at least two dimensions
> cm <- colSums(M)   ## 2nd time, works because the Matrix methods are all there
> str(cm)
 Named num [1:126] 369 3 2934 2539 644 ...
 - attr(*, "names")= chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>
-----------------------------------------------------------------------------
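
[Aside: a small sketch of how the three slots fit together in the
compressed-column layout described above.  The helper name col_entries()
is made up for illustration; only the 'Matrix' and 'xgboost' packages
used elsewhere in this thread are assumed.]

    library(Matrix)
    data(agaricus.train, package = "xgboost")
    M <- agaricus.train$data                   # 6513 x 126 dgCMatrix

    ## For column j, @p marks where that column's run of nonzeros starts and
    ## ends inside @i and @x; both @p and @i are 0-based, hence the "+ 1"s.
    col_entries <- function(M, j) {
      idx <- M@p[j] + seq_len(M@p[j + 1] - M@p[j])   # positions of column j's nonzeros
      data.frame(row   = M@i[idx] + 1,               # 0-based row indices -> 1-based
                 value = M@x[idx])
    }

    ce <- col_entries(M, 1)
    nrow(ce)                        # 369, matching the second entry of M@p
    all(M[ce$row, 1] == ce$value)   # the decoded entries agree with ordinary indexing

This is only a reading aid: in practice one would index M directly and
let the Matrix methods do the work.
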
> Which, as it turned out, was due to my having not yet loaded pkg:Matrix.
> Perhaps the xgboost package only imports certain functions from pkg:Matrix,
> and colSums is not one of them. This resembles the errors I get when I try
> to use grid package functions on ggplot2 objects. Since ggplot2 is built on
> top of grid, I am always surprised when this happens, and after a headslap
> and explicitly loading pkg:grid I continue on my stumbling way.
>
> library(Matrix)
> colSums(train$data)   # no error

>> Note that as.matrix(M) can "blow up" your R, when the matrix M
>> is really large and sparse, such that its dense version does not
>> even fit in your computer's RAM.

> I did know that, so I first calculated whether the dense matrix version of
> that object would fit in my RAM space, and it fit easily, so I proceeded.

> I find the TsparseMatrix indexing easier for my more naive notion of
> sparsity, although thinking about it now, I think I can see that the
> CsparseMatrix more closely resembles the "folded vector" design of dense
> R matrices. I will sometimes coerce CsparseMatrix objects to TsparseMatrix
> objects if I am working on the "inner" indices. I should probably stop
> doing that.

Well, it depends on whether speed and efficiency are the only important
issues.  The triplet representation (<==> TsparseMatrix) is of course much
easier to understand and explain than the column-compressed one
(CsparseMatrix) -- but the latter is the one that is used efficiently in
the C-level libraries for matrix multiplication, the Cholesky
decomposition, etc.

> I sincerely hope my stumbling efforts have not caused any delays.

Not at all; thank you, David, for all your help on R-help!!!

Martin

> --
> David.
[..................]
> David Winsemius
> Alameda, CA, USA

> 'Any technology distinguishable from magic is insufficiently advanced.'
>   -Gehm's Corollary to Clarke's Third Law

ok.... given your other statement, it may be that Matrix *is*
sufficiently advanced ;-) :-)
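
[Aside: a short sketch of the two practical points above, namely the
back-of-the-envelope RAM check before calling as.matrix() and the triplet
vs. column-compressed representations.  The 8-bytes-per-double estimate
ignores R's object overhead and is only an illustrative assumption, not
something the Matrix package provides.]

    library(Matrix)
    data(agaricus.train, package = "xgboost")
    M <- agaricus.train$data

    ## (1) rough size of the dense version, before risking as.matrix(M):
    ##     6513 * 126 doubles at 8 bytes each is about 6.6 MB, so safe here
    prod(dim(M)) * 8

    ## (2) the triplet form David finds easier to read: one (row, column, value)
    ##     triple per nonzero entry, with 0-based @i and @j slots
    Mt <- as(M, "TsparseMatrix")
    head(Mt@i); head(Mt@j); head(Mt@x)

    ## coercing back gives the column-compressed form used by the C-level code
    class(as(Mt, "CsparseMatrix"))   # "dgCMatrix"

Staying with the CsparseMatrix form is usually the sensible default, since
that is what the multiplication and Cholesky routines consume directly.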