On 09/10/2013 03:40 AM, Elena Grassi wrote:
You can use summarizeOverlaps with a 'BamFileList' created by something like
myFiles = dir("/some/dir", pattern="bam$")
bfl = BamFileList(myFiles, yieldSize=1000000)
olaps = summarizeOverlaps(features, bfl)
see the example on the help page
method?"summarizeOverlaps,GRanges,BamFileList"
I see, thanks. Right now I was looking for a solution "prior" to
summarizeOverlaps due to the structure that I've given to my package -
but I will re-think about it and
check if in this way it works (I only have 20 unittest or so to
rewrite, no worries for me :) ).
Right now I've found a way to define the single chromosome GRanges
that works with "which=..." and I would like to offer flexibility
for people with small RAM...therefore running the whole analysis on
one chr at a time seems a reasonable option all the same.
personally, I think it's better to iterate through the file using yieldSize
rather than 'by chromosome' (it would handle very large files, for instance,
where even a single chromosome has too many alignments to fit in memory).
The following function would be my second choice (not well-tested; hopefully a
version will appear in the next GenomicRanges), which divides seqlengths into a
GRangesList where each element contains ranges that add to approximately equal
sized widths. So in a big bam file you could just make more tiles.
Maybe the next stumbling block is collating results across tiles? The basic
strategy will be 'pre-allocate and fill'...
Martin
library(GenomicRanges)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
yy <- seqlengths(BamFile(fl))
yy <- yy[grep("_", names(yy), invert=TRUE)]
tile(yy, 40)
tile <-
function(seqlengths, n)
{
if (is.null(names(seqlengths)))
stop("names(seqlengths) must not be NULL")
clen <- cumsum(as.numeric(seqlengths))
breaks0 <- ceiling(seq(1L, clen[length(clen)], length.out=n + 1L))
breaks <- sort(unique(c(clen, breaks0)))[-1]
grp <- cumsum(c(TRUE, breaks[-length(breaks)] %in% breaks0))
idx <- cumsum(c(TRUE, (breaks %in% clen)[-length(breaks)]))
splt <- split(breaks, names(seqlengths)[idx])
ends <- Map(`-`, splt, c(0, clen[-length(clen)]))
starts <- lapply(ends, function(x) c(0L, x[-length(x)]) + 1L)
grg <- GRanges(rep(names(ends), sapply(ends, length)),
IRanges(unlist(starts, use.names=FALSE),
unlist(ends, use.names=FALSE)),
seqlengths=seqlengths)
reduce(splitAsList(grg, grp))
}
http://comments.gmane.org/gmane.comp.lang.r.sequencing/755
one small thing that came out of that thread was that
as(seqinfo(BamFile("/some/file")), "GRanges")
gives a GRanges of all the sequence lengths.
I saw that but I wanted to avoid having to do that for every bam and
merge the results afterwards - working at the "annotation" level
seemed more sensible to me.
I guess that I should have asked my boss to have 1 month to scavenge
around mans and vignettes before starting to write my first
bioconductor-R package :)
E.
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel