On 05/01/2014 02:05 PM, Ryan wrote:
Hi Valerie,
On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:
I have some concerns about the *ExtraArgs() functions. Passing
flexible args to findOverlaps in the existing mode functions
fundamentally changes the documented behavior. The modes were
created
to capture specific overlap situations pertaining to gene
features
which are graphically depicted in the vignette. Changing
'maxgap' or
'minoverlap' will produce a variety of results inconsistent with
past
behavior and difficult to document (e.g., under what
circumstances
will IntersectionNotEmpty now register a hit).
Well, I wasn't so sure about those functions either. Obviously
you can
pass arguments that break things. They were mostly designed to be
constructors for specific counting modes involving the
minoverlap/maxgap
arguments, but I decided I didn't need those modes after all.
They're
certainly not designed to be exposed to the user. I haven't
carefully
considered the interaction between the counting mode and
maxgap/minoverlap, but I believe that it would be roughly
equivalent to
extending/shrinking the features/reads by the specified amount
(with
some differences for e.g. a feature/read smaller than
2*minoverlap).
For
example, with a read length of 100 and a minoverlap of 10 in Union
counting mode, this would be the same as truncating the first and
last
10 (or mabe 9?) bases and operating in normal Union mode. As I
said,
though, there may be edge cases that I haven't thought of where
unexpected things happen.
I agree that controlling the overlap args is appealing and I
like the
added ability to resize. I've created a 'chipseq' mode that
combines
these ideas and gives what ResizeReads() was doing but now in
'mode'
form. If this implementation gives you the flexibility you were
looking for I'll check it into devel.
This sounds nice, but if I use the 'chipseq' mode, how do I
specify
whether I want Union, IntersectionNotEmpty, or
IntersectionStrict? It
looks like it just does Union? IntersectionStrict would be useful
for
specifying that the read has to occur entirely within the bounds
of a
called peak, for example. This is why I implemented it as a
"wrapper"
that takes another mode as an argument, so that the resizing
logic and
the counting logic were independent.
'chipseq' didn't implement the standard modes b/c I wanted to
avoid the
clash of passing unconventional findOverlaps() args to those
modes. The
assumption was that if the user specified a mode they would
expect a
certain behavior ...
Maybe summarizeOverlaps could
accept an optional "read modification function", and if this is
provided, it will pass the reads through this before passing
them to
the
counting function. The read modification function would have to
take
any
valid reads argument and return another valid reads argument. It
could
be used for modifying the reads as well as filtering them. This
would
allow resizing without the awkward nesting method that I've used.
Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow
massaging
before counting. I'll post back when it's done.
Valerie
A couple of questions:
- Do you want to handle paired-end reads? You coerce to a
GRanges to
resize but don't coerce back.
For paired end reads, there is no need to estimate the fragment
length,
because the pair gives you both ends of the fragment. So if I had
paired-end ChIP-Seq data, I would use it as is with no resizing. I
can't
personally think of a reason to resize a paired-end fragment,
but I
don't know if others might need that.
I corece to GRanges because I know how GRanges work, but I'm
not as
familiar with GAlignments so I don't know how the resize function
works
on GAlignments and other classes. I'm sure you know better than
I do
how
these work. If the coercion is superfluous, feel free to
eliminate it.
- Do you want to require strand info for all reads? Is this
because of
how resize() anchors "*" to 'start'?
Yes, I require strand info for all reads because the reads must be
directionally extended, which requires strand info. Ditto for
counting
the 5-prime and 3-prime ends.
-Ryan
chipseq <- function(features, reads, ignore.strand=FALSE,
inter.feature=TRUE,
type="any", maxgap=0L, minoverlap=1L,
width=NULL, fix="start", use.names=TRUE)
{
reads <- as(reads, "GRanges")
if (any(strand(reads) == "*"))
stop("all reads must have strand")
if (!is.null(width))
reads <- do.call(resize(reads, width, fix=fix,
use.names=use.names,
ignore.strand=ignore.strand))
ov <- findOverlaps(features, reads, type=type,
ignore.strand=ignore.strand,
maxgap=maxgap, minoverlap=minoverlap)
if (inter.feature) {
## Remove reads that overlap multiple features.
reads_to_keep <- which(countSubjectHits(ov) == 1L)
ov <- ov[subjectHits(ov) %in% reads_to_keep]
}
countQueryHits(ov)
}
To count the overlaps of 5' and 3' ends:
summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
summarizeOverlaps(reads, features, chipseq, fix="end", width=1)
Valerie
On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:
No, I forgot to attach the file. Here is the link:
https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R
On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:
Hi Ryan,
These sound like great contributions. I didn't get an
attachment
- did
you send one?
Thanks.
Valerie
On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:
Hi all,
I recently asked about ways to do non-standard read
counting in
summarizeOverlaps, and Martin Morgan directed me toward
writing a
custom
function to pass as the "mode" parameter. I have now
written the
custom
modes that I require for counting my ChIP-Seq reads, and I
figured I
would contribute them back in case there was interest in
merging
them.
The main three functions are "ResizeReads", "FivePrimeEnd",
and
"ThreePrimeEnd". The first allows you to directionally
extend or
shorten
each read to the effective fragment length for the purpose of
determining overlaps. For example, if each read represents the
5-prime
end of a 150-bp fragment and you want to count these fragments
using the
Union mode, you could do:
summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
fix="start"), ...)
Note that ResizeReads takes a mode argument. It returns a
function
(with
a closure storing the passed arguments) that performs the
resizing
(by
coercing reads to GRanges and calling "resize") and then
dispatches to
the provided mode. (It probably needs to add a call to
"match.fun"
somewhere.)
The other two functions are designed to count overlaps of
only the
read
ends. They are implemented internally using "ResizeReads" with
width=1.
The other three counting modes (the "*ExtraArgs" functions)
are
meant to
be used to easily construct new counting modes. Each function
takes
any
number of arguments and returns a counting mode that works
like
the
standard one of the same name, except that those arguments are
passed as
extra args to "findOverlaps". For example, you could do Union
mode
with
a requirement for a minimum overlap of 10:
summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)
Note that these can be combined or "nested". For instance, you
might
want a fragment length of 150 and a min overlap of 10:
myCountingMode <-
ResizeReads(mode=UnionExtraArgs(minoverlap=10),
width=150, fix="start")
summarizeOverlaps(mode=myCountingMode, ...)
Anyway, if you think any of these are worthy of inclusion for
BioConductor, feel free to add them in. I'm not so sure about
the
"nesting" idea, though. Functions that return functions (with
states
saved in closures, which are then passed into another
function)
are
confusing for people who are not programmers by trade. Maybe
summarizeOverlaps should just gain an argument to pass args to
findOverlaps.
-Ryan Thompson
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel