[Bioc-devel] Fwd: [BioC] Problem reading VCF file using readVcf from package VariantAnnotation

Vincent Carey Thu, 25 Apr 2013 05:51:35 -0700

I am falling behind.  I wanted to have a prototype of the following but I
won't do it for a while.

We need to make it more convenient to get the right seqinfo metadata in
place for a wide variety of artifacts.

i propose that there has to be a softer landing for unmet expectations, and
some "intelligent" revision capabilities.

seqInfo(Homo.sapiens) should succeed and be the default target metadata
representation for any contemporary human GRanges.

if i have a GRanges with as.character(1:22) as autosome seqlevels and i
compare to something with UCSC seqlevels the 1:22 object should be revised
for the computation with a warning and a hint on how to get it into
compliance.  or fail with the hint.  and the tools for revising should be
present.

i realize that perfect implementation may not be possible but let not the
perfect be the enemy of the good.

---------- Forwarded message ----------
From: Ulrich Bodenhofer <bodenho...@bioinf.jku.at>
Date: Thu, Apr 25, 2013 at 4:56 AM
Subject: Re: [BioC] Problem reading VCF file using readVcf from package
VariantAnnotation
To: Valerie Obenchain <voben...@fhcrc.org>, bioconduc...@r-project.org

Hi Valerie,

First of all, thanks a lot for your very quick and helpful reply! Your hint
to the function from GenomicRanges helped me to find the problem. So,
finally, I can safely say that there is no problem in readVcf that caused
this issue. The problem was that I supplied a single character string "b37"
as genome argument to readVcf(). I did not pay attention to whether this
string has a name attribute (which was, unfortunately, the case).
Obviously, readVcf checks whether the names of the 'genome' argument match
the sequence names in the VCF file. Knowing that, the error message makes
perfect sense to me! I now stripped off the names from the string I am
passing as 'genome' argument, and it works great! Interestingly, however,
if I pass a whole vector to the 'genome' argument, the error persists even
if the sequence names match perfectly. Anyway, I don't care, my solution to
pass a single string without a name is absolutely acceptable for me.

Thanks again and best regards,
Ulrich

On 04/24/2013 06:06 PM, Valerie Obenchain wrote:

> Hi Ulrich,
>
> In addition to reading genomic regions with 'which' in ScanVcfParam(), you
> can also use 'yieldSize' in TabixFile() to iterate through a VCF file in
> chunks. See example at the bottom of ?readVcf man page or ?TabixFile.
>
> The error you are seeing is thrown from a function in GenomicRanges,
> specifically .normargSeqlengths(). This function is called when checking
> compatibility of the Seqinfo object(s).
>
> How was the GRanges 'param' created? Did it come from a different genome?
> I'm not clear on why the GRanges has potentially different chromosome
> names, lengths and genome? If the VCF file is 'b37', all data (names,
> lengths, genome if present) in the GRanges must be compatible with that.
>
> Please provide a sample of the code you've tried and the output error.
> Specifically, it would be helpful to see seqinfo(GRanges) and
> meta(header(vcf)).
>
> Valerie
>
> On 04/24/2013 04:53 AM, Ulrich Bodenhofer wrote:
>
>> Hi,
>>
>> I am trying to read genotype data from a large VCF file using the
>> readVcf() function from the VariantAnnotation package. I am not reading
>> the entire file (which would crash my R session because of a lack of
>> memory). Instead, I am reading bunches of SNV data located in 200kbp
>> regions which I specify by passing a GRanges object to ScanVcfParams()
>> first. No matter what I do, I get the following error message:
>>
>>     when the supplied 'genome' vector is named, the names must match the
>> seqnames
>>
>> As far as I can make sense of this message, it seems that there is some
>> mismatch between the genome characteristics in my GRanges object and the
>> genome characteristics in the VCF file. I dissected the R object
>> returned by scanVcfHeader() and indeed found some interesting
>> mismatches: The genome in the VCF file is denoted as "b37" and the
>> sequence names are not 100% compatible with hg19. The lengths of
>> chromosomes 1-22, X, and Y do match, but the lengths of mitochondrial
>> DNA (denoted "M" in gh19 and "MT" in b37) differ by 2. So I forced my
>> GRanges object to be 100% compatible with the information stored in the
>> VCF file (by copying seqlevels, genome, and seqlengths) and restricted
>> my analysis to chromosomes 1-22 and X. However, I still get the same
>> error message.
>>
>> I also tried to locate the error message in the source code of the
>> VariantAnnotation package to understand better what the problem is, but
>> I could not find it. It seems the message is produced by a function that
>> VariantAnnotation calls from another package.
>>
>> Any idea?
>>
>> Thanks in advance and best regards,
>> Ulrich
>>
>
______________________________**_________________
Bioconductor mailing list
bioconduc...@r-project.org
https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
Search the archives: http://news.gmane.org/gmane.**
science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] Fwd: [BioC] Problem reading VCF file using readVcf from package VariantAnnotation

Reply via email to