Thanks for the valuable input, and for developing the package and
updating it regularly. I would be glad to contribute in any way.
Regarding problem three, however, I am interested in a generic way to
apply an arbitrary function to the columns of a big.matrix object
(without loading the data into R, obviously). Perhaps the source code
of the "colmean" function could help, if that is not too much to ask.
Or we could develop a function similar to base R's "apply".
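A minimal sketch of such a helper (hypothetical, not part of bigmemory) would loop over columns, pulling only one column at a time into R memory; the function name big.col.apply is my own invention:

```r
library(bigmemory)

# Hypothetical helper (not part of bigmemory): apply FUN to each column
# of a big.matrix, extracting only one column at a time into R memory.
big.col.apply <- function(x, FUN, ...) {
  sapply(seq_len(ncol(x)), function(j) FUN(x[, j], ...))
}

# Small in-memory demo; a filebacked big.matrix works the same way.
x <- as.big.matrix(matrix(rnorm(20), nrow = 5, ncol = 4))
big.col.apply(x, mean, na.rm = TRUE)
```

This keeps the memory footprint at roughly one column, at the cost of one extraction per column.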
Regards
Utkarsh
Jay Emerson wrote:
We also have ColCountNA(), which is not currently exposed to the user
but will be in the next version.
Jay
On Tue, Jun 2, 2009 at 2:08 PM, Jay Emerson <jayemer...@gmail.com> wrote:
Thanks for trying this out.
Problem 1. We'll check this. Options should certainly be available. Thanks!
Problem 2. Fascinating. We just (yesterday) implemented a
sub.big.matrix() function that does exactly this, creating something
that is a big.matrix but which just references a contiguous subset of
the original matrix. This will be available in an upcoming version
(hopefully within the next week). A more specialized function would
create an entirely new big.matrix from a subset of the first
big.matrix, making an actual copy, but that is something else
altogether. You could do this entirely within R without much work, by
the way, with only 2x memory overhead.
Problem 3. You can count missing values using mwhich(). For other
exploration (e.g. skewness), at the moment you should just extract a
single column (variable) at a time into R, study it, then get the next
column, and so on. We will not be implementing all of R's functions
directly for big.matrix objects. We will be creating a new package,
"bigmemoryAnalytics", and would welcome contributions to it.
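A sketch of both suggestions, assuming mwhich() accepts NA with the 'eq' comparison to locate missing entries (the skewness formula here is the plain moment estimator, my choice, not a package function):

```r
library(bigmemory)

# Demo matrix with some missing values.
m <- matrix(rnorm(40), nrow = 10, ncol = 4)
m[c(2, 7), 1] <- NA
x <- as.big.matrix(m)

# Count missing values in column 1 with mwhich(), which returns the row
# indices satisfying the comparison.
na.count <- length(mwhich(x, 1, NA, 'eq'))

# For a statistic bigmemory does not provide (e.g. skewness), extract
# one column at a time into R and compute it there.
v <- x[, 1]
v <- v[!is.na(v)]
skew <- mean((v - mean(v))^3) / sd(v)^3
```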
Feel free to email us directly with bugs, questions, etc...
Cheers,
Jay
----------------------------------------------------------
From: utkarshsinghal <utkarsh.sing...@global-analytics.com>
Date: Tue, Jun 2, 2009 at 8:25 AM
Subject: [R] bigmemory - extracting submatrix from big.matrix object
To: r help <r-help@r-project.org>
I am using library(bigmemory) to handle large datasets, say 1 GB, and
am facing the following problems. Any hints would be appreciated.
Problem 1:
I am using "read.big.matrix" function to create a filebacked big
matrix of my data and get the following warning:
x <- read.big.matrix("/home/utkarsh.s/data.csv", header = TRUE,
                     type = "double", shared = TRUE,
                     backingfile = "backup",
                     backingpath = "/home/utkarsh.s")
Warning message:
In filebacked.big.matrix(nrow = numRows, ncol = numCols, type = type, :
A descriptor file has not been specified. A descriptor named
backup.desc will be created.
However, there is no such argument in "read.big.matrix". There is a
"descriptorfile" argument in the function "as.big.matrix", but if I
try to use it in "read.big.matrix", I get an error reporting it as an
unused argument (as expected).
Problem 2:
I want to get a filebacked *sub*matrix of "x", say only selected
columns: x[, 1:100]. Is there any way of doing that without actually
loading the data into R memory?
Problem 3:
There are functions available, like summary, colmean, and colsd, for
standard summary statistics. But is there any way to calculate other
summaries, say the number of missing values or the skewness of each
variable, without loading the whole data into R memory?
Regards
Utkarsh
--
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.