VRanges is supposed to enforce strand. The goal is to use "*" always, for simplicity and consistency with the result of readVcf(). Is there a use case for negative strand variants?
On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo <robert.cast...@upf.edu> wrote: > Michael, > > regarding our email exchange three weeks ago, I found a couple of places in > VariantAnnotation that IMO need to be updated to avoid enforcing strand on > VRanges. > > these places occur when constructing and validating VRanges objects, > concretely at: > > 1. file R/methods-VRanges-class.R at the VRanges class constructor: > > VRanges <- > function(seqnames = Rle(), ranges = IRanges(), > ref = character(), alt = NA_character_, > totalDepth = NA_integer_, refDepth = NA_integer_, > altDepth = NA_integer_, ..., sampleNames = NA_character_, > softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), ncol = > 0L), > FilterRules()), > hardFilters = FilterRules()) > { > gr <- GRanges(seqnames, ranges, > strand = .rleRecycleVector("*", length(ranges)), ...) > [...] > > that precludes setting the strand at construction time: > > library(VariantAnnotation) > VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", strand="-") > Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*", > length(ranges)), : > formal argument "strand" matched by multiple actual arguments > > > 2. R/AllClasses.R at the VRanges class validity function .valid.VRanges(): > > .valid.VRanges.strand <- function(object) { > if (any(object@strand == "-")) > paste("'strand' must always be '+' or '*'") > } > > [...] > > .valid.VRanges <- function(object) > { > c(.valid.VRanges.ref(object), > .valid.VRanges.alt(object), > .valid.VRanges.strand(object), > .valid.VRanges.depth(object)) > } > > that prompts an error when variants annotated on the negative strand are > detected: > > library(VariantAnnotation) > example(VRanges) > strand(vr) <- "-" > c(vr) > Error in validObject(.Object) : > invalid class “VRanges” object: 'strand' must always be '+' or '*' > > > cheers, > > robert. > > On 05/22/2015 09:49 PM, Michael Lawrence wrote: >> >> This changed recently. VariantAnnotation in devel no longer enforces a >> strand on VRanges, or at least it allows the "*" case. >> >> >> On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.cast...@upf.edu >> <mailto:robert.cast...@upf.edu>> wrote: >> >> Hi, >> >> I have encountered myself in a strange situation when using the >> function locateVariants() from VariantAnnotation with an input >> VRanges object. The problem is that some of the expected coding >> annotations are not showing up when using locateVariants() with >> default parameters. >> >> After investigating this situation I think I found the reason, which >> does not look like a bug but I would like that you give me some >> clarification about the logic behind using locateVariants() with >> VRanges objects. >> >> The documentation of the VRanges-class says that in this class of >> objects "The strand is always constrained to be positive (+).". I >> guess there may be a good reason for this but I could not find it in >> the documentation or googling about it. >> >> This means that when you coerce a CollapsedVCF object (obtained, for >> example, from a VCF file via readVcf()) to a VRanges object, even >> though variants in the VCF may have no strand, they get a positive >> strand in the VRanges object. >> >> The problem arises then, when you call locateVariants() with this >> VRanges object, because features on the negative strand are never >> going to overlap with the variants since, by default, the argument >> ignore.strand=FALSE. >> >> Let me illustrate this with a toy example. Consider the SNP >> rs1129038 >> (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at >> chr15:28356859 with allels A/G. It is located on the 3' UTR of the >> gene HERC2 coded on the negative strand of the human reference >> genome. Let's build a toy VRanges object having this variant: >> >> library(VariantAnnotation) >> vr <- VRanges(seqnames="chr15", >> ranges=IRanges(28356859, 28356859), >> ref="A", alt="G", >> refDepth=5, altDepth=7, >> totalDepth=12, sampleNames="A") >> strand(vr) >> factor-Rle of length 1 with 1 run >> Lengths: 1 >> Values : + >> Levels(3): + - * >> >> Let's build now its CollapsedVCF counterpart by using the >> corresponding coercion method and set the strand to "*": >> >> vcf <- asVCF(vr) >> strand(vcf) <- "*" >> >> Now run locateVariants() on both objects with UCSC annotations: >> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> >> locateVariants(vcf, txdb, region=AllVariants()) >> GRanges object with 2 ranges and 9 metadata columns: >> seqnames ranges strand | LOCATION LOCSTART >> LOCEND QUERYID TXID CDSID >> <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> >> <character> <IntegerList> >> [1] chr15 [28356859, 28356859] * | threeUTR 50 50 >> 1 55386 >> [2] chr15 [28356859, 28356859] * | threeUTR 50 50 >> 1 55387 >> GENEID PRECEDEID FOLLOWID >> <character> <CharacterList> <CharacterList> >> [1] 8924 >> [2] 8924 >> ------- >> seqinfo: 1 sequence from an unspecified genome; no seqlengths >> >> locateVariants(vr, txdb, region=AllVariants()) >> GRanges object with 1 range and 9 metadata columns: >> seqnames ranges strand | LOCATION LOCSTART >> LOCEND QUERYID TXID CDSID >> <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> >> <integer> <IntegerList> >> [1] chr15 [28356859, 28356859] + | intergenic <NA> <NA> >> 1 <NA> >> GENEID PRECEDEID FOLLOWID >> <character> <CharacterList> <CharacterList> >> [1] <NA> 100132565,100289656,100616223,... 2567 >> ------- >> seqinfo: 1 sequence from an unspecified genome; no seqlengths >> >> Note that while we get the 3' UTR annotation from the strandless VCF >> object we do not get it from the VRanges object with the positive >> strand. To make my point clear: this positive strand shows up when >> you coerce a strandless VCF object to a VRanges one, because >> positive strandness seems to be the convention for VRanges objects: >> >> as(vcf, VRanges) >> VRanges object with 1 range and 1 metadata column: >> seqnames ranges strand ref >> alt totalDepth refDepth altDepth >> <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> >> <integerOrRle> <integerOrRle> >> [1] chr15 [28356859, 28356859] + A >> G 12 5 7 >> sampleNames softFilterMatrix | QUAL >> <factorOrRle> <matrix> | <numeric> >> [1] A | <NA> >> ------- >> seqinfo: 1 sequence from an unspecified genome; no seqlengths >> hardFilters: NULL >> >> >> Of course, if I run locateVariants() with the argument >> ignore.strand=TRUE, then I get the expected annotation: >> >> locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE) >> GRanges object with 2 ranges and 9 metadata columns: >> seqnames ranges strand | LOCATION LOCSTART >> LOCEND QUERYID TXID CDSID >> <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> >> <character> <IntegerList> >> [1] chr15 [28356859, 28356859] + | threeUTR 677 >> 677 1 55386 >> [2] chr15 [28356859, 28356859] + | threeUTR 677 >> 677 1 55387 >> GENEID PRECEDEID FOLLOWID >> <character> <CharacterList> <CharacterList> >> [1] 8924 >> [2] 8924 >> ------- >> seqinfo: 1 sequence from an unspecified genome; no seqlengths >> >> >> So, my question is, given that VRanges objects are enforced to have >> a positive strand, would not be better to have ignore.strand=TRUE as >> default in locateVariants? >> >> Alternatively, I would recommend that locateVariants() issues a >> warning, or maybe an error, when the input object is VRanges and >> ignore.strand=FALSE. >> >> Finally, out of curiosity, why a VRanges object enforces the >> positive strand in all its genomic ranges? Would not be better just >> taking the strand of the CollapsedVCF object when coercing the >> CollapsedVCF object to VRanges? >> >> >> thanks!! >> >> >> robert. >> >> _______________________________________________ >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing >> list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> > > -- > Robert Castelo, PhD > Associate Professor > Dept. of Experimental and Health Sciences > Universitat Pompeu Fabra (UPF) > Barcelona Biomedical Research Park (PRBB) > Dr Aiguader 88 > E-08003 Barcelona, Spain > telf: +34.933.160.514 > fax: +34.933.160.550 > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel