Re: [Bioc-devel] zero-width ranges representing insertions

Valerie Obenchain Thu, 19 Mar 2015 16:05:29 -0700

On 03/16/2015 05:31 PM, Hervé Pagès wrote:

On 03/16/2015 04:06 PM, Michael Lawrence wrote:



On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo <robert.cast...@upf.edu
<mailto:robert.cast...@upf.edu>> wrote:

    +1 IMO BioC could adopt the zero-width ranges representation for
    insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to
    deal with each corresponding beast, be VCF, dbSNP or the like. Who
    knows, VCF could also change their representation in the future and
    it'll be a headache to update the affected packages if we decide to
    keep using its insertion representation internally to store variant
    ranges in BioC.


That would break just about every tool, so let's hope not. There's a
bunch of code on top of Bioc that currently depends on the current
representation. For example, zero width ranges do not overlap anything,
so they need special treatment to e.g. detect whether an insertion falls
within a gene. There are real benefits to keeping the representation of
indels consistent with the rest of the field (VCF). There was much
thought put into this.


Note that findOverlaps() now handles zero-width ranges.


I've had a chance to take a closer look at how VA handles zero-width ranges.

Previously, both predictCoding() and locateVariants() treated zero-widthranges as width 1 (start decremented to equal end). In VA 1.13.42 thishas been changed for predictCoding() so now zero width are dropped. Thefunction internals expect REF and ALT to conform with the vcf specs andzero width ranges aren't used. So, it seemed wise to drop the zero-widthfor now.

locateVariants() remains the same because this is more general. I thinkit's still useful to identify where a zero width range falls withrespect to gene features.


Straight use of findOverlaps() on the ranges of a VCF object leads to
some subtle problems on insertions. For example predictCoding() (which
I guess uses findOverlaps() internally) reports strange things for
these 2 insertions (1 right before and 1 right after the stop codon):

This output is actually fine. The VARCODON values may be slightlymisleading but the data are correct. predictCoding() only computes aminoacid sequences for snps or indels that conform to the 'groups of 3'idea. The substitution or deletion must result in the sequence beingdivisible by 3 otherwise there is a partial codon at the end that mustbe inferred (consider all possible combinations) and then one must bechosen (consensus). The code does not currently do this and I'm not surethere is common agreement on how to do it.


This GRanges has a snv followed by 1, 2, and 3 base pair insertions:

rowRanges(vcf)

GRanges object with 4 ranges and 5 metadata columns:
          seqnames         ranges strand | paramRangeID            REF
             <Rle>      <IRanges>  <Rle> |    <logical> <DNAStringSet>
      snv    chr20 [77055, 77055]      * |         <NA>              T
  1bp ins    chr20 [77054, 77055]      * |         <NA>             AT
  2bp ins    chr20 [77054, 77055]      * |         <NA>             AT
  3bp ins    chr20 [77054, 77055]      * |         <NA>             AT
                         ALT      QUAL      FILTER
          <DNAStringSetList> <numeric> <character>
      snv                  G        70        PASS
  1bp ins                ATC        70        PASS
  2bp ins               ATCG        70        PASS
  3bp ins              ATCGG        70        PASS
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Coding changes are computed for the snv and 3bp insertion but the othersare marked as 'frameshift'. Previously when an indel couldn't betranslated the VARCODON was the same as the REFCODON which may have beenconfusing (was intended to mean nothing has changed). I've changed thisso VARCODON is now missing (like VARAA) when it can't be translated.

predictCoding(vcf, txdb, Hsapiens)

GRanges object with 4 ranges and 17 metadata columns:
          seqnames         ranges strand | paramRangeID            REF
             <Rle>      <IRanges>  <Rle> |    <logical> <DNAStringSet>
      snv    chr20 [77055, 77055]      + |         <NA>              T
  1bp ins    chr20 [77054, 77055]      + |         <NA>             AT
  2bp ins    chr20 [77054, 77055]      + |         <NA>             AT
  3bp ins    chr20 [77054, 77055]      + |         <NA>             AT
                         ALT      QUAL      FILTER      varAllele     CDSLOC
          <DNAStringSetList> <numeric> <character> <DNAStringSet>  <IRanges>
      snv                  G        70        PASS              G [468, 468]
  1bp ins                ATC        70        PASS            ATC [467, 468]
  2bp ins               ATCG        70        PASS           ATCG [467, 468]
  3bp ins              ATCGG        70        PASS          ATCGG [467, 468]
             PROTEINLOC   QUERYID        TXID         CDSID      GENEID
          <IntegerList> <integer> <character> <IntegerList> <character>
      snv           156         1       70477        206101      245938
  1bp ins           156         2       70477        206101      245938
  2bp ins           156         3       70477        206101      245938
  3bp ins           156         4       70477        206101      245938
            CONSEQUENCE       REFCODON       VARCODON         REFAA
               <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
      snv nonsynonymous            AAT            AAG             N
  1bp ins    frameshift            AAT                            N
  2bp ins    frameshift            AAT                            N
  3bp ins nonsynonymous            AAT         AATCGG             N
                  VARAA
          <AAStringSet>
      snv             K
  1bp ins
  2bp ins
  3bp ins            NR
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

PROTEINLOC is the codon number in the coding sequence. These are the cdsregions:

cds

GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand |    cds_id    cds_name exon_rank
         <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
  [1]    chr20 [68351, 68408]      + |    206100        <NA>         1
  [2]    chr20 [76646, 77058]      + |    206101        <NA>         2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

There are 157 codons, position 77055 falls in the second to last codon,so 156.

sum(width(cds)) / 3

[1] 157

Val


library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

 > cdsBy(txdb, use.names=TRUE)$uc002wcw.3
GRanges object with 2 ranges and 3 metadata columns:
       seqnames         ranges strand |    cds_id    cds_name exon_rank
          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
   [1]    chr20 [68351, 68408]      + |    206100        <NA>         1
   [2]    chr20 [76646, 77058]      + |    206101        <NA>         2
   -------
   seqinfo: 93 sequences (1 circular) from hg19 genome


library(VariantAnnotation)
 > rowRanges(vcf)  # hand-made VCF
GRanges object with 2 ranges and 5 metadata columns:
                         seqnames         ranges strand | paramRangeID
                            <Rle>      <IRanges>  <Rle> |     <factor>
   ins before stop codon    chr20 [77055, 77055]      * |         <NA>
    ins after stop codon    chr20 [77058, 77058]      * |         <NA>
                                    REF                ALT      QUAL
FILTER
                         <DNAStringSet> <DNAStringSetList> <numeric>
<character>
   ins before stop codon              T                 TG        70
PASS
    ins after stop codon              A                 AG        70
PASS
   -------
   seqinfo: 1 sequence from hg19 genome

Calling predictCoding():

 > library(BSgenome.Hsapiens.UCSC.hg19)
 > predictCoding(vcf, txdb, Hsapiens)
GRanges object with 2 ranges and 17 metadata columns:
                         seqnames         ranges strand | paramRangeID
                            <Rle>      <IRanges>  <Rle> |     <factor>
   ins before stop codon    chr20 [77055, 77055]      + |         <NA>
    ins after stop codon    chr20 [77058, 77058]      + |         <NA>
                                    REF                ALT      QUAL
FILTER
                         <DNAStringSet> <DNAStringSetList> <numeric>
<character>
   ins before stop codon              T                 TG        70
PASS
    ins after stop codon              A                 AG        70
PASS
                              varAllele     CDSLOC    PROTEINLOC   QUERYID
                         <DNAStringSet>  <IRanges> <IntegerList> <integer>
   ins before stop codon             TG [468, 468]           156         1
    ins after stop codon             AG [471, 471]           157         2
                                TXID         CDSID      GENEID CONSEQUENCE
                         <character> <IntegerList> <character>    <factor>
   ins before stop codon       70477                    245938  frameshift
    ins after stop codon       70477                    245938  frameshift
                               REFCODON       VARCODON         REFAA
                         <DNAStringSet> <DNAStringSet> <AAStringSet>
   ins before stop codon            AAT            AAT
    ins after stop codon            TAA            TAA
                                 VARAA
                         <AAStringSet>
   ins before stop codon
    ins after stop codon
   -------
   seqinfo: 1 sequence from hg19 genome

PROTEINLOC, REFCODON, VARCODON, and CONSEQUENCE don't seem quite right
to me. Could be that my hand-made vcf is missing some important data
needeed by predictCoding() though...

H.


    robert.

    On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
    > There would be a LOT of value  in having core packages use
exactly the
     > same representation; I don't have any opinion about which one is
     > best.
     >
     > Kasper
     >
     > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès
    <hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
     > <mailto:hpa...@fredhutch.org> <mailto:hpa...@fredhutch.org>>
wrote:
     >
     > On 03/16/2015 09:22 AM, Michael Lawrence wrote:
     >
     > Yes, I think it would make sense for the Xtra package to follow
the
     > established convention in VariantAnnotation/VCF.
     >
     >
     > I don't know. I agree it would be nice to use a more consistent
     > representation of insertions across the software but I'm not
     > convinced we should necessarily follow the VCF way, which is to
     > include the base before the event in the ref and alt alleles as
well
     > as in the reported range.
     >
     > Note that there doesn't seem to be any consensus in the broader
     > Bioinformatics community either. For example dbSNP and HGVS report
     > the range that corresponds to the 2 flanking nucleotides but they
     > don't include these nucleotides in the ref or alt alleles. VCF
does
     > the same as GFF3 which says "start equals end and the implied
site is
     > to the right of the indicated base" except that VCF wants to treat
     > events that occur at position 1 in a special way. In that case VCF
     > says the base *after* the event should be included (seems like the
     > VCF authors want to avoid both: empty ranges and also ranges that
     > start at POS 0). BTW it doesn't seem that
     > VariantAnnotation::isInsertion() is aware of that special
treatment.
     >
     > UCSC uses a zero-width range, and so does the XtraSNPlocs.*
     > packages. The advantage of this representation is its simplicity.
     > There is no special cases. It also reflects the notion that an
     > insertion is a replacement of an empty string with a non-empty
     > string. No need to include flanking nucleotides in the
representation
     > (which is artificial and distorts the "real" alt allele).
     >
     >
     > H.
     >
     >
     > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
     > <robert.cast...@upf.edu <mailto:robert.cast...@upf.edu>
    <mailto:robert.cast...@upf.edu> <mailto:robert.cast...@upf.edu>>
wrote:
     >
     > dear devel people, specially Val and Michael,
     >
     > Hervé has recently added an annotation package that includes
     > non-SNVs variants from dbSNP, it is called:
     >
     > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
     >
     > if you execute the corresponding example you'll see the kind of
     > information stored in the package:
     >
     > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
     >
     >
     > if you pay attention to the following case:
     >
     > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
     > seqnames               ranges strand |   RefSNP_id alleles <Rle>
     > <IRanges>  <Rle> | <character> <character> [1]       22 [10513380,
     > 10513380]      - | rs386831164         -/T [2]       22 [10519678,
     > 10519677]      + |  rs71286731       -/TTT ref_allele
<DNAStringSet>
     > [1]              T [2]              - ------- seqinfo: 25
sequences
     > (1 circular) from GRCh38 genome
     >
     > you'll see the first variant (rs386831164) is a deletion of one
     > nucleotide and the second (rs71286731) is an insertion of three
     > nucleotides (TTT).
     >
     > it struck me that the ranges representing the insertion had an
start
     > position one nucleotide larger than then and i contacted Hervé
     > thinking that this was a mistake. however, i've learned from
him that
     > these are so-called "zero-width" ranges and they actually allow to
     > distinguish insertions from every other type of variant without
the
     > need to know anything about the reference or the alternate allele.
     >
     > currently, the VCF specification 4.2 (http://samtools.github.io/
     > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition
of the
     > REF column to help distinguishing insertions by including the
     > flanking nucleotide of the inserted sequence. As a result,
     > VariantAnnotation::readVcf() produces ranges that mimic this
     > standard having identical start and end positions leading to
1-width
     > ranges:
     >
     > fl <- system.file("extdata", "CEUtrio.vcf.bgz",
     > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
     > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2
ranges
     > and 5 metadata columns: seqnames               ranges strand |
     > paramRangeID <Rle>            <IRanges> <Rle> |     <factor>
     > rs11474033       20 [45093330, 45093330]      * | <NA>
     > 20:47592746_G/GGC       20 [47592746, 47592746]      * |
     > <NA> REF                ALT      QUAL      FILTER <DNAStringSet>
     > <DNAStringSetList> <numeric> <character> rs11474033              C
     > CTTCT   2901.12           . 20:47592746_G/GGC              G
     > GGC    608.88           . ------- seqinfo: 84 sequences from hg19
     > genome
     >
     >
     > table(width(rowRanges(vcf[isInsertion(vcf), ])))
     >
     > 1 78
     >
     > i would like to ask whether it would make sense to harmonize
the way
     > in which dbSNP insertions and variants are imported into
Bioconductor
     > by making VariantAnnotation::readVcf() to produce zero-width
ranges
     > for insertion variants. this not a feature request, i only
would like
     > to know what whether there is a particular reason not to use the
     > available zero-width ranges that seem to be implemented for this
     > purpose.
     >
     >
     > cheers,
     >
     > robert.
     >
     > _______________________________________________
     > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org>
    mailing
     > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
     > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
     >
     >
     > [[alternative HTML version deleted]]
     >
     > _______________________________________________
     > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org>
    mailing
     > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
     > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
     >
     >
     > -- Hervé Pagès
     >
     > Program in Computational Biology Division of Public Health
Sciences
     > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N,
M1-B514
     > P.O. Box 19024 Seattle, WA 98109-1024
     >
     > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
    <mailto:hpa...@fredhutch.org> <mailto:hpa...@fredhutch.org> Phone:
     > (206) 667-5791 <tel:%28206%29%20667-5791>
    <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax:    (206)
    667-1319
     > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
     >
     >
     > _______________________________________________
     > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.org>
    mailing
     > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
     > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
     >
     >


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] zero-width ranges representing insertions

Reply via email to