On 09/10/2013 03:40 AM, Elena Grassi wrote:
You can use summarizeOverlaps with a 'BamFileList' created by something like

   myFiles = dir("/some/dir", pattern="bam$")
   bfl = BamFileList(myFiles, yieldSize=1000000)
   olaps = summarizeOverlaps(features, bfl)

see the example on the help page

   method?"summarizeOverlaps,GRanges,BamFileList"

I see, thanks. Right now I was looking for a solution "prior" to
summarizeOverlaps due to the structure that I've given to my package -
but I will re-think about it and
check if in this way it works (I only have 20 unittest or so to
rewrite, no worries for me :) ).

Right now I've found a way to define the single chromosome GRanges
that works with "which=..." and I would like to offer flexibility
for people with small RAM...therefore running the whole analysis on
one chr at a time seems a reasonable option all the same.

personally, I think it's better to iterate through the file using yieldSize rather than 'by chromosome' (it would handle very large files, for instance, where even a single chromosome has too many alignments to fit in memory).

The following function would be my second choice (not well-tested; hopefully a version will appear in the next GenomicRanges), which divides seqlengths into a GRangesList where each element contains ranges that add to approximately equal sized widths. So in a big bam file you could just make more tiles.

Maybe the next stumbling block is collating results across tiles? The basic strategy will be 'pre-allocate and fill'...

Martin

library(GenomicRanges)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
yy <- seqlengths(BamFile(fl))
yy <- yy[grep("_", names(yy), invert=TRUE)]
tile(yy, 40)


tile <-
    function(seqlengths, n)
{
    if (is.null(names(seqlengths)))
        stop("names(seqlengths) must not be NULL")

    clen <- cumsum(as.numeric(seqlengths))
    breaks0 <- ceiling(seq(1L, clen[length(clen)], length.out=n + 1L))
    breaks <- sort(unique(c(clen, breaks0)))[-1]
    grp <- cumsum(c(TRUE, breaks[-length(breaks)] %in% breaks0))

    idx <- cumsum(c(TRUE, (breaks %in% clen)[-length(breaks)]))
    splt <- split(breaks, names(seqlengths)[idx])
    ends <- Map(`-`, splt, c(0, clen[-length(clen)]))
    starts <- lapply(ends, function(x) c(0L, x[-length(x)]) + 1L)

    grg <- GRanges(rep(names(ends), sapply(ends, length)),
                   IRanges(unlist(starts, use.names=FALSE),
                           unlist(ends, use.names=FALSE)),
                   seqlengths=seqlengths)
    reduce(splitAsList(grg, grp))
}





http://comments.gmane.org/gmane.comp.lang.r.sequencing/755


one small thing that came out of that thread was that

   as(seqinfo(BamFile("/some/file")), "GRanges")

gives a GRanges of all the sequence lengths.

I saw that but I wanted to avoid having to do that for every bam and
merge the results afterwards - working at the "annotation" level
seemed more sensible to me.

I guess that I should have asked my boss to have 1 month to scavenge
around mans and vignettes before starting to write my first
bioconductor-R package :)
E.



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to