There would be a LOT of value in having core packages use exactly the same representation; I don't have any opinion about which one is best.
Kasper On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: > On 03/16/2015 09:22 AM, Michael Lawrence wrote: > >> Yes, I think it would make sense for the Xtra package to follow the >> established convention in VariantAnnotation/VCF. >> > > I don't know. I agree it would be nice to use a more consistent > representation of insertions across the software but I'm not convinced > we should necessarily follow the VCF way, which is to include the base > before the event in the ref and alt alleles as well as in the reported > range. > > Note that there doesn't seem to be any consensus in the broader > Bioinformatics community either. For example dbSNP and HGVS report the > range that corresponds to the 2 flanking nucleotides but they don't > include these nucleotides in the ref or alt alleles. VCF does the same > as GFF3 which says "start equals end and the implied site is to the > right of the indicated base" except that VCF wants to treat events that > occur at position 1 in a special way. In that case VCF says the base > *after* the event should be included (seems like the VCF authors want > to avoid both: empty ranges and also ranges that start at POS 0). > BTW it doesn't seem that VariantAnnotation::isInsertion() is aware of > that special treatment. > > UCSC uses a zero-width range, and so does the XtraSNPlocs.* packages. > The advantage of this representation is its simplicity. There is no > special cases. It also reflects the notion that an insertion is > a replacement of an empty string with a non-empty string. No need > to include flanking nucleotides in the representation (which is > artificial and distorts the "real" alt allele). > > > H. > > >> On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo <robert.cast...@upf.edu> >> wrote: >> >> dear devel people, specially Val and Michael, >>> >>> Hervé has recently added an annotation package that includes non-SNVs >>> variants from dbSNP, it is called: >>> >>> library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38) >>> >>> if you execute the corresponding example you'll see the kind of >>> information stored in the package: >>> >>> example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38) >>> >>> >>> if you pay attention to the following case: >>> >>> my_snps1[1:2] >>> GRanges object with 2 ranges and 3 metadata columns: >>> seqnames ranges strand | RefSNP_id alleles >>> <Rle> <IRanges> <Rle> | <character> <character> >>> [1] 22 [10513380, 10513380] - | rs386831164 -/T >>> [2] 22 [10519678, 10519677] + | rs71286731 -/TTT >>> ref_allele >>> <DNAStringSet> >>> [1] T >>> [2] - >>> ------- >>> seqinfo: 25 sequences (1 circular) from GRCh38 genome >>> >>> you'll see the first variant (rs386831164) is a deletion of one >>> nucleotide >>> and the second (rs71286731) is an insertion of three nucleotides (TTT). >>> >>> it struck me that the ranges representing the insertion had an start >>> position one nucleotide larger than then and i contacted Hervé thinking >>> that this was a mistake. however, i've learned from him that these are >>> so-called "zero-width" ranges and they actually allow to distinguish >>> insertions from every other type of variant without the need to know >>> anything about the reference or the alternate allele. >>> >>> currently, the VCF specification 4.2 (http://samtools.github.io/ >>> hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF >>> column to help distinguishing insertions by including the flanking >>> nucleotide of the inserted sequence. As a result, >>> VariantAnnotation::readVcf() produces ranges that mimic this standard >>> having identical start and end positions leading to 1-width ranges: >>> >>> fl <- system.file("extdata", "CEUtrio.vcf.bgz", >>> package="VariantFiltering") >>> vcf <- readVcf(fl, genome="hg19") >>> rowRanges(vcf[isInsertion(vcf), ])[1:2] >>> GRanges object with 2 ranges and 5 metadata columns: >>> seqnames ranges strand | paramRangeID >>> <Rle> <IRanges> <Rle> | <factor> >>> rs11474033 20 [45093330, 45093330] * | <NA> >>> 20:47592746_G/GGC 20 [47592746, 47592746] * | <NA> >>> REF ALT QUAL >>> FILTER >>> <DNAStringSet> <DNAStringSetList> <numeric> >>> <character> >>> rs11474033 C CTTCT 2901.12 >>> . >>> 20:47592746_G/GGC G GGC 608.88 >>> . >>> ------- >>> seqinfo: 84 sequences from hg19 genome >>> >>> >>> table(width(rowRanges(vcf[isInsertion(vcf), ]))) >>> >>> 1 >>> 78 >>> >>> i would like to ask whether it would make sense to harmonize the way in >>> which dbSNP insertions and variants are imported into Bioconductor by >>> making VariantAnnotation::readVcf() to produce zero-width ranges for >>> insertion variants. this not a feature request, i only would like to know >>> what whether there is a particular reason not to use the available >>> zero-width ranges that seem to be implemented for this purpose. >>> >>> >>> cheers, >>> >>> robert. >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>> >>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel