+1 IMO BioC could adopt the zero-width ranges representation for insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF could also change their representation in the future and it'll be a headache to update the affected packages if we decide to keep using its insertion representation internally to store variant ranges in BioC.
robert. On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote: > There would be a LOT of value in having core packages use exactly the > same representation; I don't have any opinion about which one is > best. > > Kasper > > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès <hpa...@fredhutch.org > <mailto:hpa...@fredhutch.org>> wrote: > > On 03/16/2015 09:22 AM, Michael Lawrence wrote: > > Yes, I think it would make sense for the Xtra package to follow the > established convention in VariantAnnotation/VCF. > > > I don't know. I agree it would be nice to use a more consistent > representation of insertions across the software but I'm not > convinced we should necessarily follow the VCF way, which is to > include the base before the event in the ref and alt alleles as well > as in the reported range. > > Note that there doesn't seem to be any consensus in the broader > Bioinformatics community either. For example dbSNP and HGVS report > the range that corresponds to the 2 flanking nucleotides but they > don't include these nucleotides in the ref or alt alleles. VCF does > the same as GFF3 which says "start equals end and the implied site is > to the right of the indicated base" except that VCF wants to treat > events that occur at position 1 in a special way. In that case VCF > says the base *after* the event should be included (seems like the > VCF authors want to avoid both: empty ranges and also ranges that > start at POS 0). BTW it doesn't seem that > VariantAnnotation::isInsertion() is aware of that special treatment. > > UCSC uses a zero-width range, and so does the XtraSNPlocs.* > packages. The advantage of this representation is its simplicity. > There is no special cases. It also reflects the notion that an > insertion is a replacement of an empty string with a non-empty > string. No need to include flanking nucleotides in the representation > (which is artificial and distorts the "real" alt allele). > > > H. > > > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo > <robert.cast...@upf.edu <mailto:robert.cast...@upf.edu>> wrote: > > dear devel people, specially Val and Michael, > > Hervé has recently added an annotation package that includes > non-SNVs variants from dbSNP, it is called: > > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38) > > if you execute the corresponding example you'll see the kind of > information stored in the package: > > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38) > > > if you pay attention to the following case: > > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns: > seqnames ranges strand | RefSNP_id alleles <Rle> > <IRanges> <Rle> | <character> <character> [1] 22 [10513380, > 10513380] - | rs386831164 -/T [2] 22 [10519678, > 10519677] + | rs71286731 -/TTT ref_allele <DNAStringSet> > [1] T [2] - ------- seqinfo: 25 sequences > (1 circular) from GRCh38 genome > > you'll see the first variant (rs386831164) is a deletion of one > nucleotide and the second (rs71286731) is an insertion of three > nucleotides (TTT). > > it struck me that the ranges representing the insertion had an start > position one nucleotide larger than then and i contacted Hervé > thinking that this was a mistake. however, i've learned from him that > these are so-called "zero-width" ranges and they actually allow to > distinguish insertions from every other type of variant without the > need to know anything about the reference or the alternate allele. > > currently, the VCF specification 4.2 (http://samtools.github.io/ > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the > REF column to help distinguishing insertions by including the > flanking nucleotide of the inserted sequence. As a result, > VariantAnnotation::readVcf() produces ranges that mimic this > standard having identical start and end positions leading to 1-width > ranges: > > fl <- system.file("extdata", "CEUtrio.vcf.bgz", > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19") > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges > and 5 metadata columns: seqnames ranges strand | > paramRangeID <Rle> <IRanges> <Rle> | <factor> > rs11474033 20 [45093330, 45093330] * | <NA> > 20:47592746_G/GGC 20 [47592746, 47592746] * | > <NA> REF ALT QUAL FILTER <DNAStringSet> > <DNAStringSetList> <numeric> <character> rs11474033 C > CTTCT 2901.12 . 20:47592746_G/GGC G > GGC 608.88 . ------- seqinfo: 84 sequences from hg19 > genome > > > table(width(rowRanges(vcf[isInsertion(vcf), ]))) > > 1 78 > > i would like to ask whether it would make sense to harmonize the way > in which dbSNP insertions and variants are imported into Bioconductor > by making VariantAnnotation::readVcf() to produce zero-width ranges > for insertion variants. this not a feature request, i only would like > to know what whether there is a particular reason not to use the > available zero-width ranges that seem to be implemented for this > purpose. > > > cheers, > > robert. > > _______________________________________________ > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing > list https://stat.ethz.ch/mailman/listinfo/bioc-devel > <https://stat.ethz.ch/mailman/listinfo/bioc-devel> > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing > list https://stat.ethz.ch/mailman/listinfo/bioc-devel > <https://stat.ethz.ch/mailman/listinfo/bioc-devel> > > > -- Hervé Pagès > > Program in Computational Biology Division of Public Health Sciences > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org> Phone: > (206) 667-5791 <tel:%28206%29%20667-5791> Fax: (206) 667-1319 > <tel:%28206%29%20667-1319> > > > _______________________________________________ > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing > list https://stat.ethz.ch/mailman/listinfo/bioc-devel > <https://stat.ethz.ch/mailman/listinfo/bioc-devel> > > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel