Re: [Bioc-devel] Reading and storing single cell ATAC data

Hervé Pagès Fri, 23 Sep 2016 17:08:47 -0700

Hi,

Yes, I would also encourage you to explore the SummarizedExperiment
route rather than ExpressionSet one. To be more precise,
RangedSummarizedExperiment is probably what you need: you can put
a GRanges object along the 1st dimension to describe your peaks and
a DataFrame object along the 2nd dimension to describe your samples.


For the matrix-like object that you use to store the assay data, it
can be anything (e.g. sparse matrix, on-disk matrix, etc...) as long
as it supports dim, dimnames, and 2D-style subsetting. That's the
bare minimum I think. It can support more (e.g. cbind, rbind), in
which case you'll be able to cbind and/or rbind 2 SummarizedExperiment
objects together. We've tried to keep the requirements for what can
be used to store the assay data as minimalist as possible.

One matrix-like container that you might want to consider is
DelayedArray defined in the HDF5Array package. Right now it uses
hdf5 for on-disk storage but other backends could be implemented
(something I'm planning to work on after the upcoming release).
You can stick a huge DelayedArray object that wouldn't fit in memory
in a SummarizedExperiment and manipulate it *almost* like you would
do with a regular SummarizedExperiment object. See ?HDF5Array for
an example of such an hdf5-backed SummarizedExperiment.

Hope this helps,
H.


On 09/23/2016 03:46 PM, Andrew McDavid wrote:

Hi Caleb,
Hopefully Herve will chime in regarding SummarizedExperiment, but yes, I
think you can and should inherit from that. The `assays` slot must be an
object of type `Assays`, but that does appear to include a sparse Matrix.
See the comments at the top of Assays-class.R in the tarball for
SummarizedExperiment.  For example:

library(SummarizedExperiment)
library(Matrix)
library(GenomicRanges)
Nrow=1e6
Ncol=1e4
assay=Matrix::Matrix(0, nrow=Nrow, ncol=Ncol, sparse=TRUE)
gr <- GRanges(Rle("chr2", Nrow),
              IRanges(seq_len(Nrow), width=10))
se <- SummarizedExperiment(assays=assay, rowRanges=gr)

As far as out-of-core storage of sparse matrices, I do not know of any good
(portable) solutions.   If it makes more sense to chunk the matrix along
some dimension, you could always pickle the chunked, (sparse) Matrix
objects. In my experience, the decision to adopt sparse vs out-of-core
dense arrays has often required empirical testing to determine what is
fastest/most scalable, since you lose caching benefits from sequential
memory access once you go sparse.  I know there has been talk of extending
SummarizedExperiment to easily permit the Assays to be hdf5-based.

Is disk space really going to be a limiting factor? If so, then you will
probably be IO-bound, so you will need to distribute the data across
computing nodes for your analysis to scale anyways, which suggests some
sort of map-reduce formalism.  Which to my knowledge no one has considered
yet in Bioconductor.  But unless you are generating > 1 TB of
semi-processed data, maybe you don't need to go there?

-Andrew

Message: 3
Date: Fri, 23 Sep 2016 17:37:04 -0400
From: Caleb Lareau <caleblar...@g.harvard.edu>
To: bioc-devel@r-project.org
Subject: [Bioc-devel] Reading and storing single cell ATAC data
Message-ID: <83b8a778-4daa-4a16-9da8-22def9ad1...@g.harvard.edu>
Content-Type: text/plain; charset="UTF-8"

Hi everyone?

I?m working with a team that?s generating single cell ATAC data in large
amounts and am designing the framework of an S4 object to facilitate
analyses in R. I have a couple of high-level questions that I wanted to
pose early to hopefully attain some community guidance in the
implementation of these data structures.

Question on S4 scATAC Structure--
It?s easy to imagine scATAC data as a matrix where the rows are particular
peaks and the columns are individual samples. We already have such an
impressive volume of data, such that if stored in an ordinary matrix, we
run into ~20 GB objects. As these data are very sparse, we store the peak
values in a sparse matrix (through the Matrix library). I wanted to collate
the peak information (probably in GRanges object) and sample information
(in a data frame) as well as some potential meta data in an S4 object.

Easy enough, sure, but after looking at the scRNA structure (e.g. scater <
https://bioconductor.org/packages/devel/bioc/vignettes/
scater/inst/doc/vignette.html>), I feel like I should be considering how
to inherit some of the nice properties from the canonical `ExpressionSet`
structure. However, since my constraints aren?t directly compatible (namely
the featureData slot really needs to be a GRanges and the exprs slot must
be an object from Matrix), it wasn?t clear to me how to maximize the
inheritance properties while adjusting to my unique constraints. Also, it
wasn?t clear to me whether or not I could inherent `SummarizedExperiment`
due to the different nature of the sparse matrix. Does anyone have any
advice on this structure?

Question on reading sparse matrices from disk--
I?m trying to work out the best to selectively read certain rows and
columns from a sparse matrix on disk into memory. I anticipate a time
fairly soon that loading our full scATAC data, even in sparse matrices, is
going to be untenable. Any matrix reading/slicing implementations that I?ve
seen don?t play friendly with sparse matrices. So, I hacked together two
solutions? 1) reads and subsets a gzipped matrix with 3 columns (row index;
column index; non-zero value) through a system call to awk. 2) converts
that same 3 column matrix into an SQLite object and send queries to read
values based on indices. The hiccups are that 1) doesn?t play friendly on
non-unix platforms and always scans the full file, and 2) is faster for
querying, but the binary object is ~7x larger than the gzipped object. I?ve
played around with hdf5 as well, but it didn?t seem to give me much back in
terms of speed or storage benefits comparatively. Has anyone found an
implementation that achieves a decent!
  lookup time and compression, or am I essentially needing to choose
between the two?

Thanks and have a great weekend!
-Caleb
        [[alternative HTML version deleted]]

------------------------------

Subject: Digest Footer

_______________________________________________
Bioc-devel mailing list
Bioc-devel@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-devel

------------------------------

End of Bioc-devel Digest, Vol 150, Issue 41
*******************************************


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Reading and storing single cell ATAC data

Reply via email to