Re: [Bioc-devel] Help in designing class based on SummarizedExperiment

Tim Triche, Jr. Fri, 14 Feb 2014 16:49:56 -0800

The problem seems to boil down to what you've noticed about DataFrames of
Rles; they don't seem to be amenable to matrix operations at the moment.
 That may be a showstopper unless it can be addressed directly.  Hopefully
Martin or Kasper or Michael Lawrence will have a more encouraging answer :-/


Good luck,

--t



Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>


On Fri, Feb 14, 2014 at 4:25 PM, Peter Hickey <hic...@wehi.edu.au> wrote:

> Apologies for the bump, but is anyone able to help me on this? Or are
> these questions more appropriate for the general Bioconductor mailing list
> rather than Bioc-Devel?
>
> Many thanks,
> Pete
>
> ----- Original Message -----
> Date: Mon, 10 Feb 2014 13:20:47 +1100
> From: Peter Hickey <hic...@wehi.edu.au>
> To: bioc-devel@r-project.org
> Subject: [Bioc-devel] Help in designing class based on
>         SummarizedExperiment
> Message-ID: <e3127fd3-87e5-43da-b056-a633525ee...@wehi.edu.au>
> Content-Type: text/plain
>
> Hi all,
>
> Apologies up front for the rather long post.
>
> I'm designing a class to store what I call co-methylation m-tuples. These
> are based on a very simple tab-delimited file format.
> For example, here are 1-tuples (m = 1):
> chr             pos1    M       U
> chr1    57691   0       1
> chr1    59276   1       0
> chr1    60408   1       0
> chr1    63495   1       0
> chr1    63568   2       0
> chr1    63627   3       0
>
> 2-tuples (m = 2):
> chr             pos1    pos2    MM      MU      UM      UU
> chr1    567438  567570  0       0       0       2
> chr1    567501  567549  0       0       0       35
> chr1    567549  567558  0       1       0       139
>
> 3-tuples (m = 3):
> chr             pos1    pos2    pos3    MMM     MMU     MUM     MUU
> UMM     UMU     UUM     UUU
> chr1    13644   13823   13828   1               0               0
>       0               0               0               0               0
> chr1    14741   14747   14773   1               0               0
>       0               0               0               0               0
>
> etc.
>
> 1-tuples are basically the standard input to an analysis of BS-seq data.
>
> I think of these files as being comprised of 3 parts: the 'chr' column
> (chr), the 'pos' matrix (pos1, pos2, pos3) and the 'counts' matrix (MMM,
> MMU, MUM, MUU, UMM, UMU, UUM, UUU), when m = 3. For a given value of 'm'
> there is one 'chr' column, m 'pos' columns and 2^m 'counts' columns.
>
> I want to implement a class for these objects as I'm writing a package for
> the analysis of this type of data. I'd like a GRanges-type object storing
> the genomic information and a matrix-like object storing the counts. After
> tinkering around for a while, and doing some reading of the code in
> packages such as GenomicRanges and bsseq, I decided to extend the
> SummarizedExperiment class.  I now have a prototype but I have some
> questions and would appreciate feedback on some of my design choices before
> I translate my existing functions to work with this class of object.
>
> Here is the code for the prototype:
> #####################################################################
> library(GenomicRanges)
>
> setClass("CoMeth", contains = "SummarizedExperiment")
>
> CoMeth <- function(seqnames, pos, counts, m, methylation_type,
> sample_name, strand = "*", seqlengths = NULL, seqinfo = NULL){
>
>   # Argument checks, etc. go here #
>
>   gr <- GRanges(seqnames = seqnames, ranges = IRanges(start = pos[[1]],
> end = pos[[length(pos)]]), strand = strand, seqlengths = seqlengths,
> seqinfo = seqinfo) # The width of each element is defined by the first and
> last 'pos', e.g. for 3-tuples it is defined by pos1 and pos3.
>   # Need to store the "extra" positions if m > 2. Each additional position
> is stored as a separate assay
>   if (m > 2){
>     extra_pos <- lapply(seq(2, m - 1, 1), function(i, pos){
>       pos[[i]]
>     }, pos = pos)
>     names(extra_pos) <- names(pos)[2:(m-1)]
>   } else {
>     extra_pos <- NULL
>   }
>   assays <- SimpleList(c(counts, extra_pos))
>   colData <- DataFrame(sample_name = sample_name, m = m, methylation_type
> = paste0(sort(methylation_type), collapse = '/'))
>   cometh <- SummarizedExperiment(assays = assays, rowData = gr, colData =
> colData)
>   cometh <- as(cometh, "CoMeth")
>
>   return(cometh)
> }
>
> And here's some example data:
> # A function that roughly imitates the output of a call to scan() to read
> in BS-seq m-tuple data
> # m is the size of the m-tuples
> # n is the number of m-tuples
> # z is the proportion of each column of 'counts' that is zero
> make_test_data <- function(m, n, z){
>   seqnames <- list(seqnames = rep('chr1', n))
>   pos <- lapply(1:m, function(x, n){matrix(seq(from = 1 + x - 1, to = n +
> x - 1, by = 1), ncol = 1)}, n = n) # Need these to be matrices rather than
> vectors
>   names(pos) <- paste0('pos', 1:m)
>   # A rough hack to simulate counts where a proportion (z) are 0 and the
> rest are sampled from Poisson(lambda). Small values of lambda will inflate
> the zero-count.
>   counts <- mapply(FUN = function(i, z, n, lambda){
>     nz <- floor(n * (1 - z))
>     matrix(sample(c(rpois(nz, lambda), rep(0, n - nz))), ncol = 1)
>     }, i = 1:(2 ^ m), z = z, n = n, lambda = 4, SIMPLIFY = FALSE) # Need
> these to be matrices rather than vectors
>   names(counts) <- sort(do.call(paste0, expand.grid(lapply(seq_len(m),
> function(x){c('M', 'U')}))))
>   return(c(seqnames, pos, counts))
> }
>
> m <- 3 # An example using 3-tuples
> n <- 1000 # A typical value for 3-tuples from a methylC-seq experiment is
> n = 17,000,000
> z <- c(0.2, 0.6, 0.6, 0.7, 0.6, 0.8, 0.8, 0.7) # Typical proportions of
> each column of 'counts' that are zero when using 3-tuples for a methylC-seq
> experiment
> test_data <- make_test_data(n = n, m = m, z = z)
> cometh <- CoMeth(seqnames = test_data[['seqnames']], pos =
> test_data[grepl('pos', names(test_data))], counts = test_data[grepl('[MU]',
> names(test_data))], m = m, methylation_type = 'CG', sample_name =
> 'test_data')
>
> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> base
>
> other attached packages:
> [1] GenomicRanges_1.14.4 XVector_0.2.0        IRanges_1.20.6
> BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] stats4_3.0.2 tools_3.0.2
> #####################################################################
>
> Questions
>         1. How can I move the 'extra_pos' columns from the assay slot but
> keep the copy-on-change behaviour? From a design perspective, I think it
> would make more sense for the 'extra_pos' columns, i.e. ('pos2') for
> 3-tuples and ('pos2', 'pos3') for 4-tuples etc., to be in their own slot
> rather than in the assays slot, after all, they aren't assays but rather
> are additional genomic co-ordinates. The 'extra_pos' fields are fixed (at
> least until I start subsetting or combining multiple CoMeth objects). My
> understanding of the the SummarizedExperiment class is that the assays slot
> is a reference class to avoid excessive copying when changing other slots
> of a SummarizedExperiment object. So if the 'extra_pos' columns were stored
> outside of the assays slot then these would have to be copied when any
> changes are made to the other slots of a CoMeth object, correct? Is there a
> way to avoid this, i.e. so that these 'extra_pos' columns are stored
> separately from the assays slot but with the !
>  copy-on-change behaviour of the assays slot?
>         2. Is the correct to compute something based on the 'counts' data
> via the assay() accessor? For example, I might want a helper function
> getCounts(cometh) that does the equivalent of sapply(X = 1:(2^m),
> function(i, cometh){assay(cometh, i)}, cometh = cometh). Similarly, I might
> want to compute the coverage of an m-tuple, which would be the equivalent
> of rowSums(getCounts(cometh)). Is this the correct way to do this sort of
> thing?
>         3. How do I measure the size of a SummarizedExperiment/CoMeth
> object? For example, with the test data, print(object.size(cometh), units =
> "auto") < print(object.size(assays(cometh)), units = "auto"), so it seems
> that the size of the assays slot isn't counted by object.size().
>         4. Is it possible to store an Rle-type object in the assays slot
> of a SummarizedExperiment? 20-80% of the entries in each column of 'counts'
> are zero and there are often runs of zeros. So I thought that perhaps an
> Rle representation (column-wise) might be more (memory) efficient. But I
> can't seem to get an Rle object in the assays slot (I tried via DataFrame);
> is it even possible?
>         5. Are there matrix-like objects with Rle columns? I found this
> thread started by Kasper Hansen (
> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046473.html)
> discussing the idea of matrix-like object where the columns are Rle's; I
> could imagine using such an object for a CoMeth object containing multiple
> samples, i.e. MMM is a matrix-like object with ncol = # of samples, MMU is
> matrix-like object with ncol = # of samples, etc. Was anything like this
> ever implemented? My reading of the previous thread was to use a DataFrame
> but the "matrix API", e.g. rowSums, doesn't work with DataFrames (and see
> (4) as to whether it's even possible to store such objects in the assays
> slot).
>
> Many thanks for your help in answering these questions. Any other
> suggestions on the design of the CoMeth class are appreciated.
>
> Thanks,
> Pete
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hic...@wehi.edu.au
> http://www.wehi.edu.au
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:10}}

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Help in designing class based on SummarizedExperiment

Reply via email to