The problem seems to boil down to what you've noticed about DataFrames of Rles; they don't seem to be amenable to matrix operations at the moment. That may be a showstopper unless it can be addressed directly. Hopefully Martin or Kasper or Michael Lawrence will have a more encouraging answer :-/
Good luck, --t Statistics is the grammar of science. Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science> On Fri, Feb 14, 2014 at 4:25 PM, Peter Hickey <hic...@wehi.edu.au> wrote: > Apologies for the bump, but is anyone able to help me on this? Or are > these questions more appropriate for the general Bioconductor mailing list > rather than Bioc-Devel? > > Many thanks, > Pete > > ----- Original Message ----- > Date: Mon, 10 Feb 2014 13:20:47 +1100 > From: Peter Hickey <hic...@wehi.edu.au> > To: bioc-devel@r-project.org > Subject: [Bioc-devel] Help in designing class based on > SummarizedExperiment > Message-ID: <e3127fd3-87e5-43da-b056-a633525ee...@wehi.edu.au> > Content-Type: text/plain > > Hi all, > > Apologies up front for the rather long post. > > I'm designing a class to store what I call co-methylation m-tuples. These > are based on a very simple tab-delimited file format. > For example, here are 1-tuples (m = 1): > chr pos1 M U > chr1 57691 0 1 > chr1 59276 1 0 > chr1 60408 1 0 > chr1 63495 1 0 > chr1 63568 2 0 > chr1 63627 3 0 > > 2-tuples (m = 2): > chr pos1 pos2 MM MU UM UU > chr1 567438 567570 0 0 0 2 > chr1 567501 567549 0 0 0 35 > chr1 567549 567558 0 1 0 139 > > 3-tuples (m = 3): > chr pos1 pos2 pos3 MMM MMU MUM MUU > UMM UMU UUM UUU > chr1 13644 13823 13828 1 0 0 > 0 0 0 0 0 > chr1 14741 14747 14773 1 0 0 > 0 0 0 0 0 > > etc. > > 1-tuples are basically the standard input to an analysis of BS-seq data. > > I think of these files as being comprised of 3 parts: the 'chr' column > (chr), the 'pos' matrix (pos1, pos2, pos3) and the 'counts' matrix (MMM, > MMU, MUM, MUU, UMM, UMU, UUM, UUU), when m = 3. For a given value of 'm' > there is one 'chr' column, m 'pos' columns and 2^m 'counts' columns. > > I want to implement a class for these objects as I'm writing a package for > the analysis of this type of data. I'd like a GRanges-type object storing > the genomic information and a matrix-like object storing the counts. After > tinkering around for a while, and doing some reading of the code in > packages such as GenomicRanges and bsseq, I decided to extend the > SummarizedExperiment class. I now have a prototype but I have some > questions and would appreciate feedback on some of my design choices before > I translate my existing functions to work with this class of object. > > Here is the code for the prototype: > ##################################################################### > library(GenomicRanges) > > setClass("CoMeth", contains = "SummarizedExperiment") > > CoMeth <- function(seqnames, pos, counts, m, methylation_type, > sample_name, strand = "*", seqlengths = NULL, seqinfo = NULL){ > > # Argument checks, etc. go here # > > gr <- GRanges(seqnames = seqnames, ranges = IRanges(start = pos[[1]], > end = pos[[length(pos)]]), strand = strand, seqlengths = seqlengths, > seqinfo = seqinfo) # The width of each element is defined by the first and > last 'pos', e.g. for 3-tuples it is defined by pos1 and pos3. > # Need to store the "extra" positions if m > 2. Each additional position > is stored as a separate assay > if (m > 2){ > extra_pos <- lapply(seq(2, m - 1, 1), function(i, pos){ > pos[[i]] > }, pos = pos) > names(extra_pos) <- names(pos)[2:(m-1)] > } else { > extra_pos <- NULL > } > assays <- SimpleList(c(counts, extra_pos)) > colData <- DataFrame(sample_name = sample_name, m = m, methylation_type > = paste0(sort(methylation_type), collapse = '/')) > cometh <- SummarizedExperiment(assays = assays, rowData = gr, colData = > colData) > cometh <- as(cometh, "CoMeth") > > return(cometh) > } > > And here's some example data: > # A function that roughly imitates the output of a call to scan() to read > in BS-seq m-tuple data > # m is the size of the m-tuples > # n is the number of m-tuples > # z is the proportion of each column of 'counts' that is zero > make_test_data <- function(m, n, z){ > seqnames <- list(seqnames = rep('chr1', n)) > pos <- lapply(1:m, function(x, n){matrix(seq(from = 1 + x - 1, to = n + > x - 1, by = 1), ncol = 1)}, n = n) # Need these to be matrices rather than > vectors > names(pos) <- paste0('pos', 1:m) > # A rough hack to simulate counts where a proportion (z) are 0 and the > rest are sampled from Poisson(lambda). Small values of lambda will inflate > the zero-count. > counts <- mapply(FUN = function(i, z, n, lambda){ > nz <- floor(n * (1 - z)) > matrix(sample(c(rpois(nz, lambda), rep(0, n - nz))), ncol = 1) > }, i = 1:(2 ^ m), z = z, n = n, lambda = 4, SIMPLIFY = FALSE) # Need > these to be matrices rather than vectors > names(counts) <- sort(do.call(paste0, expand.grid(lapply(seq_len(m), > function(x){c('M', 'U')})))) > return(c(seqnames, pos, counts)) > } > > m <- 3 # An example using 3-tuples > n <- 1000 # A typical value for 3-tuples from a methylC-seq experiment is > n = 17,000,000 > z <- c(0.2, 0.6, 0.6, 0.7, 0.6, 0.8, 0.8, 0.7) # Typical proportions of > each column of 'counts' that are zero when using 3-tuples for a methylC-seq > experiment > test_data <- make_test_data(n = n, m = m, z = z) > cometh <- CoMeth(seqnames = test_data[['seqnames']], pos = > test_data[grepl('pos', names(test_data))], counts = test_data[grepl('[MU]', > names(test_data))], m = m, methylation_type = 'CG', sample_name = > 'test_data') > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 > BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] stats4_3.0.2 tools_3.0.2 > ##################################################################### > > Questions > 1. How can I move the 'extra_pos' columns from the assay slot but > keep the copy-on-change behaviour? From a design perspective, I think it > would make more sense for the 'extra_pos' columns, i.e. ('pos2') for > 3-tuples and ('pos2', 'pos3') for 4-tuples etc., to be in their own slot > rather than in the assays slot, after all, they aren't assays but rather > are additional genomic co-ordinates. The 'extra_pos' fields are fixed (at > least until I start subsetting or combining multiple CoMeth objects). My > understanding of the the SummarizedExperiment class is that the assays slot > is a reference class to avoid excessive copying when changing other slots > of a SummarizedExperiment object. So if the 'extra_pos' columns were stored > outside of the assays slot then these would have to be copied when any > changes are made to the other slots of a CoMeth object, correct? Is there a > way to avoid this, i.e. so that these 'extra_pos' columns are stored > separately from the assays slot but with the ! > copy-on-change behaviour of the assays slot? > 2. Is the correct to compute something based on the 'counts' data > via the assay() accessor? For example, I might want a helper function > getCounts(cometh) that does the equivalent of sapply(X = 1:(2^m), > function(i, cometh){assay(cometh, i)}, cometh = cometh). Similarly, I might > want to compute the coverage of an m-tuple, which would be the equivalent > of rowSums(getCounts(cometh)). Is this the correct way to do this sort of > thing? > 3. How do I measure the size of a SummarizedExperiment/CoMeth > object? For example, with the test data, print(object.size(cometh), units = > "auto") < print(object.size(assays(cometh)), units = "auto"), so it seems > that the size of the assays slot isn't counted by object.size(). > 4. Is it possible to store an Rle-type object in the assays slot > of a SummarizedExperiment? 20-80% of the entries in each column of 'counts' > are zero and there are often runs of zeros. So I thought that perhaps an > Rle representation (column-wise) might be more (memory) efficient. But I > can't seem to get an Rle object in the assays slot (I tried via DataFrame); > is it even possible? > 5. Are there matrix-like objects with Rle columns? I found this > thread started by Kasper Hansen ( > https://stat.ethz.ch/pipermail/bioconductor/2012-June/046473.html) > discussing the idea of matrix-like object where the columns are Rle's; I > could imagine using such an object for a CoMeth object containing multiple > samples, i.e. MMM is a matrix-like object with ncol = # of samples, MMU is > matrix-like object with ncol = # of samples, etc. Was anything like this > ever implemented? My reading of the previous thread was to use a DataFrame > but the "matrix API", e.g. rowSums, doesn't work with DataFrames (and see > (4) as to whether it's even possible to store such objects in the assays > slot). > > Many thanks for your help in answering these questions. Any other > suggestions on the design of the CoMeth class are appreciated. > > Thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hic...@wehi.edu.au > http://www.wehi.edu.au > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}} _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel