dear devel people, specially Val and Michael,

Hervé has recently added an annotation package that includes non-SNVs variants from dbSNP, it is called:

library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)

if you execute the corresponding example you'll see the kind of information stored in the package:

example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)


if you pay attention to the following case:

my_snps1[1:2]
GRanges object with 2 ranges and 3 metadata columns:
      seqnames               ranges strand |   RefSNP_id     alleles
         <Rle>            <IRanges>  <Rle> | <character> <character>
  [1]       22 [10513380, 10513380]      - | rs386831164         -/T
  [2]       22 [10519678, 10519677]      + |  rs71286731       -/TTT
          ref_allele
      <DNAStringSet>
  [1]              T
  [2]              -
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38 genome

you'll see the first variant (rs386831164) is a deletion of one nucleotide and the second (rs71286731) is an insertion of three nucleotides (TTT).

it struck me that the ranges representing the insertion had an start position one nucleotide larger than then and i contacted Hervé thinking that this was a mistake. however, i've learned from him that these are so-called "zero-width" ranges and they actually allow to distinguish insertions from every other type of variant without the need to know anything about the reference or the alternate allele.

currently, the VCF specification 4.2 (http://samtools.github.io/hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF column to help distinguishing insertions by including the flanking nucleotide of the inserted sequence. As a result, VariantAnnotation::readVcf() produces ranges that mimic this standard having identical start and end positions leading to 1-width ranges:

fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")
vcf <- readVcf(fl, genome="hg19")
rowRanges(vcf[isInsertion(vcf), ])[1:2]
GRanges object with 2 ranges and 5 metadata columns:
                    seqnames               ranges strand | paramRangeID
                       <Rle>            <IRanges>  <Rle> |     <factor>
         rs11474033       20 [45093330, 45093330]      * |         <NA>
  20:47592746_G/GGC       20 [47592746, 47592746]      * |         <NA>
                               REF                ALT      QUAL      FILTER
                    <DNAStringSet> <DNAStringSetList> <numeric> <character>
         rs11474033              C              CTTCT   2901.12           .
  20:47592746_G/GGC              G                GGC    608.88           .
  -------
  seqinfo: 84 sequences from hg19 genome


table(width(rowRanges(vcf[isInsertion(vcf), ])))

 1
78

i would like to ask whether it would make sense to harmonize the way in which dbSNP insertions and variants are imported into Bioconductor by making VariantAnnotation::readVcf() to produce zero-width ranges for insertion variants. this not a feature request, i only would like to know what whether there is a particular reason not to use the available zero-width ranges that seem to be implemented for this purpose.


cheers,

robert.

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to