On 12/21/2017 06:22 AM, Francesco Napolitano wrote:
Hi,

I need to deal with very large matrices and I was thinking of using
HDF5-based data models. However, from the documentation and examples
that I have been looking at, I'm not quite sure how to do this.

My use case is as follows.
I want to build a very large matrix one column at a time, and I need
to write columns directly to disk since I would otherwise run out of
memory. I need a format that, afterwards, will allow me to extract
subsets of rows or columns and rank them. The subsets will be small
enough to be loaded in memory. Can I achieve this with current HDF5
support in R?

This is basically straightforward in rhdf5. The idea is to create a dataset large enough to hold all of your data,

  library(rhdf5)
  fl <- tempfile()
  h5createFile(fl)

  nrow <- 10000
  ncol <- 100
  h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)

then to fill it in chunks by specifying the start row / column you'd like to write to and the 'count' of data points to write in each direction

  chunk_ncol <- ncol / 10          # number of columns written per call
  j <- 1                           # which column to start writing?

  while (j <= ncol) {              # '<=' so the last column is not skipped when chunk_ncol is 1
    m <- matrix(seq(1, length.out = nrow * chunk_ncol), nrow)
    h5write(m, fl, "big", start = c(1, j), count = c(nrow, chunk_ncol))
    j <- j + chunk_ncol
  }

You can then read arbitrary 'slabs':

  h5read(fl, "big", start = c(1, 1), count = c(5, 5))
  h5read(fl, "big", start = c(1, 9), count = c(5, 2))
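
For the "extract a subset of rows and rank them" part of the question, a minimal sketch using the 'index' argument of h5read() instead of start / count might look like the following. The file 'fl2', the dataset name "small", and the toy values are made up for illustration, and ranking within each column via apply() / rank() is just one reading of what "rank them" means.

```r
library(rhdf5)

# Hypothetical toy file, created the same way as "big" but tiny
fl2 <- tempfile(fileext = ".h5")
h5createFile(fl2)
h5createDataset(fl2, "small", c(6, 3), showWarnings = FALSE)

# Illustrative data: 6 rows, 3 columns
m <- cbind(c(5, 3, 9, 1, 7, 2),
           c(10, 40, 20, 60, 30, 50),
           c(0.5, 0.1, 0.9, 0.3, 0.7, 0.2))
h5write(m, fl2, "small", start = c(1, 1), count = c(6, 3))

# Read only rows 2, 5 and 6 (all columns); NULL means "everything" in that dimension
rows <- c(2, 5, 6)
sub <- h5read(fl2, "small", index = list(rows, NULL))

# Rank the selected rows within each column (assumed meaning of "rank them")
ranks <- apply(sub, 2, rank)
```

Only the selected rows are pulled into memory, so the same pattern should carry over to the large dataset as long as the subset itself is small.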

Probably you don't want to write 1 column at a time, but as many columns as comfortably fit into memory. This minimizes the number of R function calls needed to write / read the data.
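
As a rough sketch of how one might pick the number of columns per write, the helper below (hypothetical, base R only) divides a memory budget by the size of one column, assuming 8 bytes per double. The budget itself is an assumption you would choose yourself, and R typically needs some working memory beyond the raw size of the matrix, so a conservative value is advisable.

```r
# Hypothetical helper: how many columns of doubles fit into 'budget_bytes'?
cols_per_chunk <- function(nrow, ncol, budget_bytes, bytes_per_value = 8) {
  n <- floor(budget_bytes / (nrow * bytes_per_value))
  max(1, min(ncol, n))             # at least 1 column, at most all of them
}

nrow <- 10000
ncol <- 100
k <- cols_per_chunk(nrow, ncol, budget_bytes = 2.4e6)   # illustrative budget

# Start column and column count for each write; the last chunk may be smaller
starts <- seq(1, ncol, by = k)
counts <- pmin(k, ncol - starts + 1)
```

Each pair (starts[i], counts[i]) could then be passed as start = c(1, starts[i]) and count = c(nrow, counts[i]) in a call like the h5write() shown earlier.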

The HDF5Array package provides an easy abstraction for reading (probably writing is possible too, but it might be easier to understand the building blocks first).

> library(HDF5Array)
> hdf <- HDF5Array(fl, "big")
> hdf
HDF5Matrix object of 10000 x 100 doubles:
           [,1]   [,2]   [,3] ...  [,99] [,100]
    [1,]      1  10001  20001   .  80001  90001
    [2,]      2  10002  20002   .  80002  90002
    [3,]      3  10003  20003   .  80003  90003
    [4,]      4  10004  20004   .  80004  90004
    [5,]      5  10005  20005   .  80005  90005
     ...      .      .      .   .      .      .
 [9996,]   9996  19996  29996   .  89996  99996
 [9997,]   9997  19997  29997   .  89997  99997
 [9998,]   9998  19998  29998   .  89998  99998
 [9999,]   9999  19999  29999   .  89999  99999
[10000,]  10000  20000  30000   .  90000 100000
> hdf[1:5, 1:5]
DelayedMatrix object of 5 x 5 doubles:
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]     1 10001 20001 30001 40001
[2,]     2 10002 20002 30002 40002
[3,]     3 10003 20003 30003 40003
[4,]     4 10004 20004 30004 40004
[5,]     5 10005 20005 30005 40005
> as.matrix(hdf[1:5, 1:5])
     [,1]  [,2]  [,3]  [,4]  [,5]
[1,]    1 10001 20001 30001 40001
[2,]    2 10002 20002 30002 40002
[3,]    3 10003 20003 30003 40003
[4,]    4 10004 20004 30004 40004
[5,]    5 10005 20005 30005 40005
> rowSums(hdf[1:5, 1:5])
[1] 100005 100010 100015 100020 100025

Martin


Any help greatly appreciated.

thank you,
Francesco

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel