Hi Nico,

On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
Hej Bioc Core!

There was some discussion last year about implementing a BamStreamer (à la 
FastqStreamer), but I haven't seen anything like it in the current devel. I've 
implemented the following function that should do the job for me - I have many 
very large files, and I need to use a cluster with relatively few RAM per node 
and a restrictive time allocation , so I want to parallelize the reading of the 
BAM file to manage both. The example below is obviously not affecting the RAM 
issue but I streamlined it to point out my issue.

".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){

  ## create a stream
  stopifnot(is(bamFile,"BamFile"))

  ## set the yieldSize if it is not set already
  if(is.na(yieldSize(bamFile))){
    yieldSize(bamFile) <- yieldSize
  }

  ## open it
  open(bamFile)

  ## verb
  if(verbose){
    message(paste("Streaming",basename(path(bamFile))))
  }

  ## create the output
  out <- GappedAlignments()

  ## process it
  while(length(chunk <- readBamGappedAlignments(bamFile))){
    if(verbose){
      message(paste("Processed",length(chunk),"reads"))
    }
    out <- c(out,chunk)
  }

Note that regardless the speed of c() on GappedAlignments objects,
growing an object in a loop is fundamentally inefficient (see Circle 2
of The R Inferno).
Also keeping the chunks in memory kind of defeats the purpose of reading
the file one chunk at a time.


  ## close
  close(bamFile)

  ## return
  return(out)
}

In the method above, the first iteration of combining the GappedAlignments:

out <- c(out,chunk) takes:

system.time(append(out,chunk))

   user  system elapsed
123.704   0.060 124.011

2 minutes! Whaoo, that's really slow. I can't reproduce this on my
machine though:

  library(Rsamtools)
  library(RNAseqData.HNRNPC.bam.chr14)
  bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
  yieldSize(bamfile) <- 100000L
  open(bamfile)
  out <- GappedAlignments()

Then:

  > chunk <- readBamGappedAlignments(bamfile)
  > system.time(out <- append(out, chunk))
     user  system elapsed
    0.284   0.000   0.286

I wonder what's going on on your system. Are you sure it was not running
out of memory when you did this? Try to check the load with uptime or
top in another terminal (e.g. start top right before you call append()).
If the system starts swapping, then your R process will become hundreds
or thousands times slower!


whereas the second iteration (faked here) takes only (still long):

system.time(append(chunk,chunk))

   user  system elapsed
  2.708   0.044   2.758

2nd, 3rd and 4th iterations for me:

  > chunk <- readBamGappedAlignments(bamfile)
  > system.time(out <- append(out, chunk))
     user  system elapsed
    0.516   0.004   0.521

  > chunk <- readBamGappedAlignments(bamfile)
  > system.time(out <- append(out, chunk))
     user  system elapsed
    0.656   0.008   0.663

  > chunk <- readBamGappedAlignments(bamfile)
  > system.time(out <- append(out, chunk))
     user  system elapsed
    0.796   0.004   0.801

As expected, the time is growing (this is why the process
of growing an object in a loop is considered to be quadratic
in time).


I suppose this has to do with the way 
GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the 
objects and all the related sanity checks. For the first iteration, the 
seqlengths are different so I suppose that is what explains the 60X lag 
compared to the second iteration.

The seqinfo of the 2 objects to combine need to be merged together
and set back on each object before the 2 objects can actually
be combined. This operation is cheap and I wouldn't expect this
to slow down the first iteration significantly.

Due to the implementation of GappedAlignments, I can't set the seqlengths 
programmatically in GappedAlignments() which I imagine would have reduced the 
first iteration lag; see the trials below:

out <- GappedAlignments(seqlengths=seqlengths(chunk))

Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
  'names(seqlengths)' incompatible with 'levels(seqnames)'

out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))

Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = 
seqnames(chunk)) :
  'strand' must be specified when 'seqnames' is not empty

out <- 
GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")

Error in validObject(.Object) :
  invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class 
"GappedAlignments": got class "character", should be or extend class "Rle"
invalid class “GappedAlignments” object: 2: number of rows in DataTable 
'mcols(x)' must match length of 'x'

The trick is to create an empty GappedAlignments objects
with non-empty seqlevels so you can put seqlengths on the
seqlevels.

Here are 2 ways to create an empty GappedAlignments objects with
non-empty seqlevels:

  (1) Pass an empty factor with non-empty levels to the seqnames
      arg:

        out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))

  (2) The recommended way:

        out <- GappedAlignments()
        seqinfo(out) <- seqinfo(chunk)

Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
its seqlengths), not only its seqlevels.

(1) could be adapted to also set the seqlengths:

  out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
                          seqlengths=seqlengths(chunk))

but (2) is really the preferred way.


I completely approve of such sanity checks; it seems that I'm just trying to do 
something that it was not designed for :-) All I'm really interested in is a 
way to stream my BAM file and I'm looking forward to any suggestion. I 
especially don't want to re-invent the wheel if you have already planned 
something. If you haven't I'd be glad to get some insight how I can walk around 
that problem.

My sessionInfo:

R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C                 LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12
[4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15
[7] BiocGenerics_0.7.2

loaded via a namespace (and not attached):
[1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0


Looks like you are using Bioc-devel. Did you get all the
warnings about GappedAlignments, readBamGappedAlignments(),
and GappedAlignments() being deprecated?

I thought you were using the release so that's what I used:

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
[3] Biostrings_2.28.0                 GenomicRanges_1.12.4
[5] IRanges_1.18.1                    BiocGenerics_0.6.0

loaded via a namespace (and not attached):
[1] bitops_1.0-5   stats4_3.0.0   zlibbioc_1.6.0


The timings I get with Bioc-devel are pretty much the same though.

Something doesn't seem to be quite right with your cluster. What happens
if you try to rbind() 2 data.frames of 100000 rows each in a fresh
session?

  > df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
  > system.time(df2 <- rbind(df, df))
     user  system elapsed
    0.204   0.000   0.206

Thanks,
H.


Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to