Note that there is already an import,TabixFile that handles indexed restriction via the "which" arg. rtracklayer autodetects tabix files, so the user just calls import("my.bed.bgz", which=gr).
Michael On Thu, Jul 21, 2016 at 1:17 PM, Martin Morgan <martin.mor...@roswellpark.org> wrote: > On 07/21/2016 02:12 PM, Simon Anders <simon.and...@fimm.fi> (by way of Simon > Anders wrote: >> >> Hi Hervé, Martin, Wolfgang, and anybody else who might be interested >> >> this post is stimulated by a discussion Martin Morgan and I had last >> week at the CSAMA course. It is on how to improve in Bioconductor the >> handling of large genomics data files like GFF or BED files with many >> millions of lines. >> >> The 'chipseq' lab of the course was the starting point. (See last >> years's version at [1]). There, a BED file with filtered ChIP-Seq reads >> was read in and then reads were counted in windows tiling the mouse >> genome. In the lab, we only used data from chromosome 6 to keep the >> data small. This, of course prompted the question whether the presented >> approach would work with the full data, too. >> >> I don't think it would, and I am not sure what the best practice would >> be. >> >> We used this work flow: >> - Read in a BED file with 600k lines, using 'import.bed', which >> returns a GRanges object. >> - Extend the ranges to average fragment length using the 'resize' >> method on the GRanges object >> - Use 'tileGenomes' to cut chromosome 6 into bins of width 100 bp, >> which returns another GRanges object. >> - Use 'countOverlaps' to count how many reads from the first GRanges >> object fall into each of the bins in the second GRanges object. >> >> Everything here happens in memory, but a BED file with reads from >> all chromosomes would not fit in memory. There is no facility to only >> read parts of a BED file. >> >> For SAM files, we have at least the 'yieldSize' argument to 'BamFile' >> but 'import.bed' does not offer anything comparable. Same holds for the >> import functions for GFF and the like. >> >> First: Is there anything already in place that I am unaware of? Besides >> using 'scan' and coding everything from scratch, of course. > > > There is > > Rsamtools::TabixFile() with yieldSize for iteration > > Rsamtools::scanTabix() both for iteration and 'restriction' (IRanges / > GRanges() queries into sorted, indexed tabix) > > Rsamtools::bgzip() and indexTabix() for indexing; there seems not to be a > sortTabix... > > and also > > GenomicFiles::reduceByYield() as a framework for iteration > > rtracklayer::import() has a `text=` argument. > > Together, these lead to the following illustration > > suppressPackageStartupMessages({ > library(rtracklayer) > library(Rsamtools) > library(GenomicFiles) > }) > > ## preparation > > ## cat ES_input_filtered_ucsc_chr6.bed | sort -k1,1V -k2,2n > \ > ## ES_input_filtered_ucsc_chr6.sorted.bed > > fl <- "ES_input_filtered_ucsc_chr6.sorted.bed" > bgz <- bgzip(fl) > indexTabix(bgz, format="bed") > > ## restriction > > tbx0 <- TabixFile(bgz) > param <-GRanges("chr6", IRanges(c(3324254, 4324254), width=1000)) > records <- scanTabix(tbx0, param=param) > gr <- import(format="bed", text =unlist(records, use.names=FALSE)) > grl <- split(gr, rep(names(records), lengths(records))) > > ## iteration -- GenomicFiles vignette section 6.2 > > YIELD <- function(X) unlist(scanTabix(X, format="bed")) > MAP <- function(VALUE, ...) import(format="bed", text=VALUE) > REDUCE <- append > tbx1 <- open(TabixFile(bgz, yieldSize=100000)) # about 466k records > xx <- reduceByYield(tbx1, YIELD, MAP, REDUCE) > close(tbx1) > > Martin > > >> >> If not, two proposals how we could improve Bioconductor here. >> >> The small solution would be to add an option like 'yield.size' to >> 'import.bed' and related function for other file types. However, even >> this would require to change the use interface, because we now need >> two functions for each file type, one to open the file and set the block >> size (e.g., 'BamFile'), another to read a block ('scanBam' or >> 'readGAlignments'). >> >> Also, one still needs to write a loop to iterate through the >> yielded blocks. The body of this loop would have to be vectorized to >> deal with a whole block of records, but still, the vectorization does >> not remove the need for an outer loop. Such incomplete >> semi-vectorization is hard to read and not quite in line with the >> spirit R's column-oriented style of programming. >> >> A better way might hence be to have a new class of GRanges-like objects >> that have the same interface as GRanges but have their data on disk. The >> Tabix method [2], for example, offers indexed random access to any >> sorted tab-delimited genomic data table, such as a SAM, GFF or BED >> file. For each file type, we would have a function that takes a block of >> a few thousand rows of the TAB-delimited text file and converts it into >> a GRanges object with extra columns filled with the file format's >> information (or a DataFrame with a GRanges column and further columns). >> This new file-linked GRanges-type objects would be constructed with a >> link to a Tabix-indexed file and such a function, and all methods for >> GRanges are overloaded to use this methods and the Tabix functionality >> to pull data from disk as needed. >> >> I suggest Tabix simply because it is popular, and we already have >> functionality for it in the ShortRead package. Many other indexing >> solutions are available, too. >> >> For the use case described above, the BED file does not even need to be >> sorted. The 'countOverlaps' functionality only needs one GRanges (here: >> the tiles, which reside in memory) to be sorted, and could stream >> through the other. Hence, another version of a file-linked GRanges, >> linked to an unsorted file without index, would be useful, too. It >> would throw an error if used in a context where random access is >> requried. >> >> I think that such blockwise access to file-linked data is quite crucial >> and the lack of it is a reason why Bioconductor has fallen behind in >> certain application areas, such as, e.g., ChIP-Seq analysis. >> >> Any comments or alternative suggestions? Maybe even anybody who feel >> inspired to give it a try to implement something roughly along such >> lines? >> >> Simon >> >> >> >> >> [1] >> >> https://github.com/Bioconductor/CSAMA2015/tree/master/materials/labs/4_Thursday/Epigenetics_and_Chip_seq >> >> [2] H. Li, Bioinformatics (2011) 27:718. >> doi:10.1093/bioinformatics/btq671 >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > > This email message may contain legally privileged and/or...{{dropped:2}} > > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel