On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
<kasperdanielhan...@gmail.com> wrote:
> I don't fully understand "the use case for reducing by range is when the
> entire dataset won't fit into memory". The basic assumption of these
> functions (as far as I can see) is that the output data fits in memory.
> What may not fit in memory is various earlier "iterations" of the data.
> For example, in my use case, if I just read in all the data in all the
> ranges in all the samples it is basically Rle's across 450MB times 38
> files, which is not small. What fits in memory is smaller chunks of this;
> that is true for every application.
>
I was unclear. I meant that, in approach1, you have an object, all.Rle,
which contains Rles for every range over every file. Can you actually run
this approach on the full dataset?

> Reducing by range (or file) only makes sense when the final output
> includes one entity for several ranges/files ... right? So I don't see
> how reduce would help me.

Yes, I think we agree. This is not a good use case for reduce by range as
now implemented. This is a use case which would benefit from the
user-facing function internally calling pack()/unpack() to reduce the
number of import() calls, and then in the end giving back the mean
coverage over the input ranges. I want this too.

https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
(link to the old github repo; the new github repo is named GenomicFiles)

> As I see the pack()/unpack() paradigm, it just re-orders the query ranges
> (which is super nice and matters a lot for speed for some applications).
> As I understand the code (and my understanding is developing), we need an
> extra layer to support processing multiple ranges in one operation.
>
> I am happy to help apart from complaining.
>
> Best,
> Kasper
>
> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love
> <michaelisaiahl...@gmail.com> wrote:
>>
>> Thanks for checking it out and benchmarking. We should be more clear in
>> the docs that the use case for reducing by range is when the entire
>> dataset won't fit into memory. Also, we had some discussion, and Valerie
>> had written up methods for packing up the ranges supplied by the user
>> into a better form for querying files. In your case it would have packed
>> many ranges together, to reduce the number of import() calls, like your
>> naive approach. See the pack/unpack functions, which are not in the
>> vignette but are in the man pages. If I remember correctly, writing code
>> to unpack() the result was not so simple, and development of these
>> functions was set aside for the moment.
>>
>> Mike
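To make the pack()/unpack() idea concrete, here is a rough, untested
sketch of the general point: re-order the query ranges and do one import()
per file for a whole set of ranges, instead of one per (range, file) pair.
This is not what GenomicFiles does internally, and meanCoverageByRange is
a made-up name; it just mirrors the import() idiom from the code quoted
below:

meanCoverageByRange <- function(ranges, file) {
    library(rtracklayer)
    ## "pack": re-order the query ranges by chromosome and position
    o <- order(ranges)
    packed <- ranges[o]
    ## a single import() call for the whole set of (re-ordered) ranges
    rle <- import(file, as = "Rle", which = packed, format = "bw")[packed]
    means <- sapply(rle, mean)
    ## "unpack": return the per-range means in the caller's original order
    out <- numeric(length(ranges))
    out[o] <- means
    out
}
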
>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
>> <kasperdanielhan...@gmail.com> wrote:
>> >
>> > I am testing GenomicFiles.
>> >
>> > My use case: I have 231k ranges of average width 1.9kb and total width
>> > 442 MB. I also have 38 BigWig files. I want to compute the average
>> > coverage of the 38 BigWig files inside each range. This is similar to
>> > wanting to get coverage of - say - all promoters in the human genome.
>> >
>> > My data resides on a file server which is connected to the compute
>> > node through ethernet, so basically I have very slow file access.
>> > Also, the BigWig files are (perhaps unusually) big: ~2 GB apiece.
>> >
>> > Below I have two approaches, one using basically straightforward code
>> > and the other using GenomicFiles (examples are 10 / 100 ranges on 5
>> > files). Basically GenomicFiles is 10-20x slower than the
>> > straightforward approach. This is most likely because
>> > reduceByRange/reduceByFile processes each (range, file) pair
>> > separately.
>> >
>> > It seems natural (to me) to allow some chunking of the mapping of the
>> > ranges. My naive approach is fast (I assume) because I read multiple
>> > ranges through one import() command. I know I have random access to
>> > the BigWig files, but there must still be some overhead of seeking
>> > and, perhaps more importantly, of instantiating/moving stuff back and
>> > forth to R. So basically I would like to be able to write a MAP
>> > function which takes
>> >   ranges, file
>> > instead of just
>> >   range, file
>> > and then chunk over, say, 1,000s of ranges. I could then have an
>> > argument to reduceByXX called something like rangeSize, which is kind
>> > of like yieldSize.
>> >
>> > Perhaps this is what is intended for reduceByYield on BigWig files?
>> >
>> > In a way, this is what is done in the vignette with the
>> > coverage(BAMFILE) example, where tileGenome is essentially constructed
>> > by the user to chunk the coverage computation.
>> >
>> > I think the example of a set of regions I am querying on the files
>> > will be an extremely common use case going forward. The raw data for
>> > all the regions together is "too big", but I do a computation on each
>> > region to reduce the size. In this situation, all the focus is on the
>> > MAP step. I see the need for REDUCE in the case of the t-test example
>> > in the vignette, where the return object is a single "thing" for each
>> > base. But in general, I think we will use these file structures a lot
>> > to construct things without REDUCE (across neither files nor ranges).
>> >
>> > Also, something completely different: it seems like it would be
>> > convenient for stuff like BigWigFileViews to not have to actually
>> > parse the file in the MAP step. Somehow I would envision some kind of
>> > reading function, stored inside the object, which just returns an Rle
>> > when I ask for a (range, file). Perhaps this is better left for later.
>> >
>> > Best,
>> > Kasper
>> >
>> > Examples
>> >
>> > library(rtracklayer)
>> > library(GenomicFiles)
>> > library(BiocParallel)
>> >
>> > approach1 <- function(ranges, files) {
>> >     ## By hand
>> >     all.Rle <- lapply(files, function(file) {
>> >         rle <- import(file, as = "Rle", which = ranges,
>> >                       format = "bw")[ranges]
>> >         rle
>> >     })
>> >     print(object.size(all.Rle), units = "auto")
>> >     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>> >         sapply(xx, mean)
>> >     }))
>> >     invisible(mat)
>> > }
>> >
>> > system.time({
>> >     mat1 <- approach1(all.grs[1:10], all.files[1:5])
>> > })
>> >
>> > 160.9 Kb
>> >    user  system elapsed
>> >   1.109   0.001   1.113
>> >
>> > system.time({
>> >     mat1 <- approach1(all.grs[1:100], all.files[1:5])
>> > })  # less than 4x slower than previous call
>> >
>> > 3 Mb
>> >    user  system elapsed
>> >   4.101   0.019   4.291
>> >
>> > approach2 <- function(ranges, files) {
>> >     gf <- GenomicFiles(rowData = ranges, files = files)
>> >     MAPPER <- function(range, file, ...) {
>> >         library(rtracklayer)
>> >         rle <- import(file, which = range, as = "Rle",
>> >                       format = "bw")[range]
>> >         mean(rle)
>> >     }
>> >     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>> >                           BPPARAM = SerialParam())
>> >     sExp
>> > }
>> >
>> > system.time({
>> >     mat2 <- approach2(all.grs[1:10], all.files[1:5])
>> > })  # 8-9x slower than approach1
>> >
>> >    user  system elapsed
>> >   9.349   0.280   9.581
>> >
>> > system.time({
>> >     mat2 <- approach2(all.grs[1:100], all.files[1:5])
>> > })  # 9x slower than previous call, 20x slower than approach1 on same input
>> >
>> >    user  system elapsed
>> >  89.310   0.627  91.044
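For concreteness, here is a rough, untested sketch of the chunked MAP
interface described above: MAP gets a GRanges of up to chunkSize ranges
plus one file, so there is one import() per (chunk, file) instead of one
per (range, file). The names reduceByRangeChunks, chunkSize and MAPPER2
are made up for illustration; this is not the current GenomicFiles API:

reduceByRangeChunks <- function(ranges, files, MAP, chunkSize = 1000, ...) {
    ## split the query ranges into contiguous chunks of at most 'chunkSize'
    chunks <- split(ranges, ceiling(seq_len(length(ranges)) / chunkSize))
    ## one column per file; one MAP call per (chunk, file)
    do.call(cbind, lapply(files, function(file) {
        unlist(lapply(chunks, MAP, file = file, ...), use.names = FALSE)
    }))
}

## MAP now receives a GRanges of up to 'chunkSize' ranges plus one file
MAPPER2 <- function(ranges, file, ...) {
    library(rtracklayer)
    rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
    sapply(rle, mean)
}

## e.g. mat3 <- reduceByRangeChunks(all.grs, all.files, MAP = MAPPER2)

An argument like the rangeSize you suggest for reduceByXX could then
simply control chunkSize.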