Looks like the test code didn't make it through. Attaching again ...


On 10/27/2014 11:35 AM, Valerie Obenchain wrote:
Hi Kasper and Mike,

I've added 2 new functions to GenomicFiles and deprecated the old
classes. The vignette now has a graphic which (hopefully) clarifies the
MAP / REDUCE mechanics of the different functions.

Below is some performance testing for the new functions and answers to
leftover questions from previous emails.


Major changes to GenomicFiles (in devel):

- *FileViews classes have been deprecated:

The idea is to use the GenomicFiles class to hold any type of file (BAM,
BigWig, a character vector of paths, etc.) instead of having
name-specific classes like BigWigFileViews. Currently GenomicFiles does
not inherit from SummarizedExperiment but it may in the future.
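
As a quick sketch, using the constructor call from approach2 below (the
object names here are only placeholders):

## 'grs' is a GRanges, 'bam_fls' a BamFileList, 'bw_paths' a character
## vector of BigWig paths -- all placeholders for illustration
gf_bam <- GenomicFiles(rowData = grs, files = bam_fls)
gf_bw  <- GenomicFiles(rowData = grs, files = bw_paths)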

- Add reduceFiles() and reduceRanges():

These functions pass all ranges or files to MAP vs the lapply approach
taken in reduceByFile() and reduceByRange().
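
For illustration, a rough sketch of the difference in what MAP receives
(the record-counting logic via countBam()/ScanBamParam() is only an
example, not part of the package):

## reduceByFile(): MAP is lapply'd over ranges, so each call sees ONE range
MAP_by  <- function(range, file, ...)
    countBam(file, param=ScanBamParam(which=range))$records

## reduceFiles(): MAP receives ALL ranges for a file in a single call
MAP_all <- function(ranges, file, ...)
    countBam(file, param=ScanBamParam(which=ranges))$records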


(1) Performance:

When testing with reduceByFile() you noted "GenomicFiles is 10-20x
slower than the straightforward approach". You also noted this was
probably because of the lapply over all ranges - true. (Most likely
there was overhead in creating the SE object as well.) With the new
reduceFiles(), passing all ranges at once, we see performance very
similar to that of the 'by hand' approach.

In the test code I've used BAM instead of BigWig. Both test functions
output lists, have comparable MAP and REDUCE steps, etc.

I used 5 files ('fls') and a GRanges ('grs') of length 100.
 > length(grs)
[1] 100

 > sum(width(grs))
[1] 1000000

FUN1 is the 'by hand' version. These results are similar to what you
saw, not quite a 4x difference between 10 and 100 ranges.

 >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
 >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10

FUN2 is the reduceFiles() approach and the results are very similar to
FUN1.

 >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
 >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10


(2) Request for "chunking of the mapping of ranges":

For now we decided not to add a 'yieldSize' argument for chunking. There
are 2 approaches to chunking through ranges *within* the same file. In
both cases the user splits the ranges, either before calling the function
or in the MAP step.

i) reduceByFile() with a GRangesList:

The user provides a GRangesList as the 'ranges' arg. On each worker,
lapply() applies MAP to the single file and each element of the GRangesList.
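
For example (a sketch only; the chunking factor is arbitrary and the
coverage logic mirrors the attached test code):

chunks <- split(grs, rep(1:10, each=10))   ## GRangesList: 10 chunks of 10 ranges
MAP <- function(range, file, ...) {        ## 'range' is one GRangesList element
    param = ScanBamParam(which=range)
    coverage(file, param=param)[range]
}
## res <- reduceByFile(chunks, fls, MAP)   ## list with one element per file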

ii) reduceFiles() with a MAP that handles chunking:

The user splits the ranges in MAP and uses lapply() or another loop to iterate.
For example,

MAP <- function(range, file, ...) {
     lst = split(range, someFactor)            ## 'someFactor' defines the chunks
     someFUN = function(rngs, file, ...) {
         ## do something with one chunk of ranges
     }
     lapply(lst, FUN=someFUN, file=file, ...)
}

The same ideas apply for chunking through ranges *across* files with
reduceByRange() and reduceRanges().

iii) reduceByRange() with a GRangesList:

Mike has a good example here:
https://gist.github.com/mikelove/deaff999984dc75f125d

iv) reduceRanges():

'ranges' should be a GRangesList. The MAP step will operate on an
element of the GRangesList and all files. Unless you want to operate on
all files at once I'd use reduceByRange() instead.
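
A sketch of what that MAP might look like, assuming it sees (one
GRangesList element, all files) as described above; the coverage logic
again mirrors the attached test code:

## 'range' is one GRangesList element, 'files' is all files
MAP <- function(range, files, ...)
    lapply(files, function(fl)
        coverage(fl, param=ScanBamParam(which=range))[range])
## res <- reduceRanges(chunks, fls, MAP)   ## list with one element per chunk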


(3) Return objects have different shape:

Previous question:

"...
Why is the return object of reduceByFile vs reduceByRange (with
summarize = FALSE) different?  I understand why internally you have
different nesting schemes (files and ranges) for the two functions, but
it is not clear to me that it is desirable to have the return object
depend on how the computation was done.
..."

reduceByFile() and reduceFiles() output a list the same length as the
number of files while reduceByRange() and reduceRanges() output a list
the same length as the number of ranges.

Reduction is different depending on which function is chosen; data are
collapsed either within a file or across files. When REDUCE does
something substantial the outputs are not equivalent.

While it's possible to get the same result (REDUCE simply unlists or
isn't used), the two approaches were not intended to be equivalent ways
of arriving at the same end. The idea was that the user had a specific
use case in mind - they wanted to collapse the data either across or
within files.
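
To make the shapes concrete, with the attached test objects ('grs', 100
ranges; 'fls', 5 files), a suitable MAP and no substantial REDUCE, the
results would look roughly like this (a sketch, not run here):

res_files  <- reduceFiles(grs, fls, MAP)    ## list of length 5:   one element per file
res_ranges <- reduceRanges(grs, fls, MAP)   ## list of length 100: one element per range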


(4) return object from coverage(BigWigFileViews):

Previous comment:

"...
coverage(BigWigFileViews) returns a "wrong" assay object in my opinion,
...
Specifically, each (i,j) entry in the object is an RleList with a single
element with a name equal to the seqnames of the i'th entry in the query
GRanges.  To me, this extra nestedness is unnecessary; I would have
expected an Rle instead of an RleList with 1 element.
..."

The return value from coverage(x) is an RleList with one coverage vector
per seqlevel in 'x'. Even if there is only one seqlevel, the result
still comes back as an RleList. This is just the default behavior.
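
A tiny self-contained illustration of that default (the GRanges here is
made up):

library(GenomicRanges)
gr <- GRanges("chr1", IRanges(c(1, 6), c(10, 15)))
coverage(gr)             ## RleList of length 1, named "chr1"
coverage(gr)[["chr1"]]   ## the plain Rle one might expect instead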


(5) separate the 'read' function from the MAP step

Previous comment:

"...
Also, something completely different, it seems like it would be
convenient for stuff like BigWigFileViews to not have to actually parse
the file in the MAP step.  Somehow I would envision some kind of reading
function, stored inside the object, which just returns an Rle when I ask
for a (range, file).  Perhaps this is better left for later.
..."

The current approach for the reduce* functions is for MAP to both
extract and manipulate data. The idea of separating the extraction step
is actually implemented in reduceByYield(). (This function used to be
yieldReduce() in Rsamtools in past releases.) For reduceByYield() the
user must specify YIELD (a reader function), MAP, REDUCE and DONE
(criteria to stop iteration).
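
For reference, a minimal sketch of that interface (the BAM file name
and the record-counting logic are assumptions for illustration):

library(GenomicFiles)
library(GenomicAlignments)
bf <- BamFile("example.bam", yieldSize=100000)   ## file name is a placeholder
YIELD  <- function(x, ...) readGAlignments(x)    ## reader: next chunk of records
MAP    <- function(value, ...) length(value)     ## work on one chunk
REDUCE <- `+`                                    ## running total across chunks
DONE   <- function(value) length(value) == 0L    ## stop when the reader is exhausted
## total <- reduceByYield(bf, YIELD, MAP, REDUCE, DONE)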

I'm not sure what is best here. I thought the many-argument approach of
reduceByYield() was possibly confusing or burdensome and so didn't use
it in the other GenomicFiles functions. Maybe it's not confusing but
instead makes the individual steps clearer. What do you think:

- Should the reader function be separate from the MAP? What are the
advantages?

- Should READER, MAP, REDUCE be stored inside the GenomicFiles object or
supplied as arguments to the functions?


(6) unnamed assay in SummarizedExperiment

Previous comment:

"...
The return object of reduceByRange / reduceByFile with summarize = TRUE
is a SummarizedExperiment with an unnamed assay.  I was surprised to see
that this is even possible.
..."

There is no default assay name for a SummarizedExperiment in general. I've named
the assay 'data' for lack of a better term. We could also go with
'reducedData' or another suggestion.
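
For example (a hypothetical matrix, using the stand-alone
SummarizedExperiment package, just to show how a named assay looks):

library(SummarizedExperiment)
m  <- matrix(rnorm(6), nrow=3)
se <- SummarizedExperiment(assays=SimpleList(data=m))
names(assays(se))   ## "data"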


Thanks for the feedback.

Valerie



On 10/01/2014 08:30 AM, Michael Love wrote:
hi Kasper,

For a concrete example, I posted an R and Rout file here:

https://gist.github.com/mikelove/deaff999984dc75f125d

Things to note: 'ranges' is a GRangesList, I cbind() the numeric
vectors in the REDUCE, and then rbind() the final list to get the
desired matrix.

Other than the weird column name 'init', does this give you what you
want?

best,

Mike

On Tue, Sep 30, 2014 at 2:08 PM, Michael Love
<michaelisaiahl...@gmail.com> wrote:
hi Kasper and Valerie,

In Kasper's original email:

"I would like to be able to write a MAP function which takes
   ranges, file
instead of just
   range, file
And then chunk over say 1,000s of ranges. I could then have an
argument to reduceByXX called something like rangeSize, which is kind
of yieldSize."

I was thinking, for your BigWig example, we get part of the way just
by splitting the ranges into a GRangesList of your desired chunk size.
Then the parallel iteration is over chunks of GRanges. Within your MAP
function:

import(file, which = ranges, as = "Rle", format = "bw")[ranges]

returns an RleList, and calling mean() on this gives a numeric vector
of the mean of the coverage for each range.
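
In code, that MAP might look roughly like this ('chunk' is one element
of the GRangesList; the name is a placeholder):

library(rtracklayer)
MAP <- function(chunk, file, ...) {
    cvg <- import(file, which=chunk, as="Rle", format="bw")[chunk]
    mean(cvg)    ## numeric vector: mean coverage per range in the chunk
}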

Then the only work needed is how to package the result into something
reasonable. What you would get now is a list (length of GRangesList)
of lists (length of files) of vectors (length of GRanges).

best,

Mike

On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain
<voben...@fhcrc.org> wrote:
Hi Kasper,

The reduceBy* functions were intended to combine data across ranges or
files. If we have 10 ranges and 3 files you can think of it as a 10 x 3
grid where we'll have 30 queries and therefore 30 pieces of information
that can be combined across different dimensions.

The request to 'process ranges in one operation' is a good suggestion
and, as you say, may even be the more common use case.

Some action items and explanations:

1)  treat ranges / files as a group

I'll add functions to treat all ranges / files as a group; essentially
no REDUCER other than concatenation. Chunking (optional) will occur as
defined by 'yieldSize' in the BamFile.
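
For example (the file name is a placeholder):

library(Rsamtools)
bf <- BamFile("example.bam", yieldSize=500000)   ## records read per chunk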

2)  class clean-up

The GenomicFileViews class was the first iteration. It was overly
complicated and the complication didn't gain us much. In my mind the
GenomicFiles class is a stripped-down version that should replace the
*FileViews classes. I plan to deprecate the *FileViews classes.

Ideally the class(es) in GenomicFiles would inherit from
SummarizedExperiment. A stand-alone package for SummarizedExperiment is
in the works for the near future. Until those plans are finalized I'd
rather not change the inheritance structure in GenomicFiles.

3) pack / unpack

I experimented with this and came to the conclusion that packing /
unpacking should be left to the user vs being done behind the scenes
with reduceBy*. The primary difficulty is that you don't have
pre-knowledge of the output format of the user's MAPPER. If MAPPER
outputs an Rle, unpacking may be straightforward. If MAPPER outputs a
NumericList or matrix or vector with no genomic coordinates then things
are more complicated.

I'm open if others have suggestions / prototypes.

4) reduceByYield

This was intended for chunking through ranges in a single file. You can
imagine using bplapply() over files where each file is chunked through
with reduceByYield(). All ranges are chunked by 'yieldSize' defined in
the BamFile unless a 'param' dictates a subset of ranges.
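
A rough sketch of that pattern (the record-counting logic and
SerialParam() are placeholders; assumes 'yieldSize' was set on the
BamFiles):

library(GenomicFiles)
library(GenomicAlignments)
library(BiocParallel)
## 'fls' is a BamFileList; chunk through each file, counting records
res <- bplapply(fls, function(fl)
    reduceByYield(fl,
                  YIELD  = function(x, ...) readGAlignments(x),
                  MAP    = length,
                  REDUCE = `+`),
    BPPARAM = SerialParam())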

5) additional tidy

I'll add a name to the 'assays' when summarize=TRUE, make sure return
types are consistent when summarize=FALSE, etc.

Thanks for the testing and feedback.


Valerie


On 09/29/2014 07:18 AM, Michael Love wrote:

On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
<kasperdanielhan...@gmail.com> wrote:

I don't fully understand "the use case for reducing by range is when
the entire dataset won't fit into memory".  The basic assumption of
these functions (as far as I can see) is that the output data fits in
memory.  What may not fit in memory is various earlier "iterations" of
the data.  For example, in my use case, if I just read in all the data
in all the ranges in all the samples it is basically Rle's across 450MB
times 38 files, which is not small.  What fits in memory is smaller
chunks of this; that is true for every application.


I was unclear. I meant that, in approach1, you have an object,
all.Rle, which contains Rles for every range over every file. Can you
actually run this approach on the full dataset?

Reducing by range (or file) only makes sense when the final output
includes one entity for several ranges/files ... right?  So I don't see
how reduce would help me.


Yes, I think we agree. This is not a good use case for reduce by range
as now implemented.

This is a use case which would benefit from the user-facing function
internally calling pack()/unpack() to reduce the number of import()
calls, and then in the end giving back the mean coverage over the
input ranges. I want this too.


https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456

(link to the old github repo, the new github repo is named
GenomicFiles)

As I see the pack()/unpack() paradigm, it just re-orders the query
ranges (which is super nice and matters a lot for speed for some
applications).  As I understand the code (and my understanding is
developing) we need an extra layer to support processing multiple
ranges in one operation.
I am happy to help apart from complaining.

Best,
Kasper

On Mon, Sep 29, 2014 at 8:55 AM, Michael Love
<michaelisaiahl...@gmail.com>
wrote:


Thanks for checking it out and benchmarking. We should be more clear
in the docs that the use case for reducing by range is when the entire
dataset won't fit into memory. Also, we had some discussion and
Valerie had written up methods for packing up the ranges supplied by
the user into a better form for querying files. In your case it would
have packed many ranges together, to reduce the number of import()
calls like your naive approach. See the pack/unpack functions, which
are not in the vignette but are in the man pages. If I remember,
writing code to unpack() the result was not so simple, and development
of these functions was set aside for the moment.

Mike

On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
<kasperdanielhan...@gmail.com> wrote:


I am testing GenomicFiles.

My use case: I have 231k ranges of average width 1.9kb and total width
442 MB.  I also have 38 BigWig files.  I want to compute the average
coverage of the 38 BigWig files inside each range.  This is similar to
wanting to get coverage of - say - all promoters in the human genome.

My data is residing on a file server which is connected to the compute
node through ethernet, so basically I have very slow file access.
Also, the BigWig files are (perhaps unusually) big: ~2GB / piece.

Below I have two approaches, one using basically straightforward code
and the other using GenomicFiles (examples are 10 / 100 ranges on 5
files).  Basically GenomicFiles is 10-20x slower than the
straightforward approach.  This is most likely because
reduceByRange/reduceByFile processes each (range, file) separately.

It seems natural (to me) to allow some chunking of the mapping of the
ranges.  My naive approach is fast (I assume) because I read multiple
ranges through one import() command.  I know I have random access to
the BigWig files, but there must still be some overhead of seeking and
perhaps more importantly instantiating/moving stuff back and forth to
R.  So basically I would like to be able to write a MAP function which
takes
    ranges, file
instead of just
    range, file
And then chunk over say 1,000s of ranges.  I could then have an
argument to reduceByXX called something like rangeSize, which is kind
of yieldSize.

Perhaps this is what is intended for the reduceByYield on BigWig
files?

In a way, this is what is done in the vignette with the
coverage(BAMFILE) example where tileGenome is essentially constructed
by the user to chunk the coverage computation.

I think the example of a set of regions I am querying on the files will
be an extremely common use case going forward.  The raw data for all
the regions together is "too big" but I do a computation on each region
to reduce the size.  In this situation, all the focus is on the MAP
step.  I see the need for REDUCE in the case of the t-test example in
the vignette, where the return object is a single "thing" for each
base.  But in general, I think we will use these file structures a lot
to construct things without REDUCE (across neither files nor ranges).

Also, something completely different, it seems like it would be
convenient for stuff like BigWigFileViews to not have to actually parse
the file in the MAP step.  Somehow I would envision some kind of
reading function, stored inside the object, which just returns an Rle
when I ask for a (range, file).  Perhaps this is better left for later.

Best,
Kasper

Examples

approach1 <- function(ranges, files) {
      ## By hand
      all.Rle <- lapply(files, function(file) {
          rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
          rle
      })
      print(object.size(all.Rle), units = "auto")
      mat <- do.call(cbind, lapply(all.Rle, function(xx) {
          sapply(xx, mean)
      }))
      invisible(mat)
}

system.time({
mat1 <- approach1(all.grs[1:10], all.files[1:5])
})

160.9 Kb
      user  system elapsed
     1.109   0.001   1.113

system.time({
mat1 <- approach1(all.grs[1:100], all.files[1:5])
}) # less than 4x slower than previous call

3 Mb
      user  system elapsed
     4.101   0.019   4.291


approach2 <- function(ranges, files) {
      gf <- GenomicFiles(rowData = ranges, files = files)
      MAPPER <- function(range, file, ...) {
          library(rtracklayer)
          rle <- import(file, which = range, as = "Rle", format = "bw")[range]
          mean(rle)
      }
      sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
                            BPPARAM = SerialParam())
      sExp
}

system.time({
mat2 <- approach2(all.grs[1:10], all.files[1:5])
})  # 8-9x slower than approach1

     user  system elapsed
     9.349   0.280   9.581

system.time({
mat2 <- approach2(all.grs[1:100], all.files[1:5])
})  # 9x slower than previous call, 20x slower than approach1 on same input

     user  system elapsed
    89.310   0.627  91.044




library(GenomicFiles)
library(microbenchmark)

## 5 bam files ranging from 4e+08 to 8e+08 records:
fls <- BamFileList(c("exp_srx036692.bam", "exp_srx036695.bam",
                     "exp_srx036696.bam", "exp_srx036697.bam",
                     "exp_srx036692.bam")) ## re-use one file

## GRanges with 100 ranges and total width 1e+06:
starts <- sample(1:1e7, 100)
chr <- paste0(rep("chr", 100), 1:22)
grs <- GRanges(chr,  IRanges(starts, width=1e4))


## By hand:
FUN1 <- function(ranges, files) {
    ## equivalent to MAP step
    cvg <- lapply(files,
        FUN = function(file, range) {
            param = ScanBamParam(which=range)
            coverage(file, param=param)[range]
        }, range=ranges)
    ## equivalent to REDUCE step
    do.call(cbind, lapply(cvg, mean))
}

microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
>> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
> Unit: seconds
>                   expr      min       lq     mean   median       uq      max neval
>   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
>  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10

## GenomicFiles:
MAP = function(range, file, ...) {
    param = ScanBamParam(which=range)
    coverage(file, param=param)[range]
}
REDUCE <- function(mapped, ...) do.call(cbind, lapply(mapped, mean))

FUN2 <- function(ranges, files) {
    reduceFiles(ranges, files, MAP, REDUCE, BPPARAM=SerialParam())
}

microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
>> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
> Unit: seconds
>                   expr      min       lq     mean   median       uq      max neval
>   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
>  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
