Re: [R-sig-phylo] read.dna warnings and pitfalls

Andrés Parada Fri, 13 Jul 2012 09:37:47 -0700

Fixed, by building a vector with the names and passing it as seq.names in read.GenBank.


all the best

a



2012/7/12 Andrés Parada <[email protected]>

> Hi all,
>
> I obtained a set of sequences using a vector of accession numbers and
> read.GenBank(namefile). The names are "there", since I can obtain a list via
> attr(namefile, "species"), but I couldn't find a way to use write.dna to save
> a FASTA file with those "species" labels instead of the accession numbers.
>
> I noticed seq.names is no longer used in ape.
> *Could you tell me how to save a FASTA file with species names as labels?*
> Thanks in advance,
>
> a
>
>
> 2012/5/3 Emmanuel Paradis <[email protected]>
>
>> I made some changes in read.dna which, I hope, solve the problems. The
>> taxa names can be of any length and must be separated from the sequences by
>> at least one space (or tab). write.dna() now follows the same rule.
>> Files with fewer than 10 nucleotides can now be read by read.dna (bug fixed).
>>
>> I removed the option 'seq.names' of read.dna since it doesn't seem
>> particularly useful and this helped to clarify the code.
>>
>> The new versions are now on ape's SVN:
>>
>> https://svn.mpl.ird.fr/ape/dev/ape/R/read.dna.R
>> https://svn.mpl.ird.fr/ape/dev/ape/R/write.dna.R
>>
>> Tests welcome!
>>
>>
>> Best,
>>
>> Emmanuel
>>
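
To make the rule described above concrete (names of any length, separated from the sequence by at least one space or tab), here is a minimal Python sketch of a reader for that relaxed sequential layout. It is not ape's implementation. The function name parse_relaxed_phylip is hypothetical, and the sketch assumes one record per line with no wrapped sequences.

```python
def parse_relaxed_phylip(text):
    """Parse relaxed sequential PHYLIP text into a {name: sequence} dict.

    Assumptions (not taken from ape's code): one record per line, taxon
    names contain no whitespace, and sequences are not wrapped.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")

    # split() with no argument accepts spaces and tabs alike, so a
    # tab-separated header line is read the same way as a space-separated one.
    header = lines[0].split()
    if len(header) != 2:
        raise ValueError("the first line must contain the dimensions of the data")
    n_taxa, n_sites = int(header[0]), int(header[1])

    records = {}
    for line in lines[1:]:
        # The name ends at the first run of whitespace; no IUPAC guessing.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"no sequence found on line: {line!r}")
        name = parts[0]
        seq = "".join(parts[1].split())  # drop any spaces inside the sequence
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        if name in records:
            raise ValueError(f"duplicate taxon name: {name}")
        records[name] = seq

    if len(records) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, got {len(records)}")
    return records


example = "2\t10\nmadagascar AAAAAAAAAA\nxxxxx\tAAAAAAAAAA\n"
records = parse_relaxed_phylip(example)
print(records["madagascar"])  # prints AAAAAAAAAA
```

Because the name is taken up to the first whitespace, a name such as "madagascar" cannot be mistaken for sequence data, whatever order the records come in.
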
>> Dan Rabosky wrote on 26/04/2012 22:01:
>>
>>>
>>> Hi Emmanuel-
>>>
>>> Thanks for fixing the whitespace issue. I think this fix will be useful
>>> to many users.
>>>
>>> On the issue of recognizing 10 IUPAC characters: I think this is a real
>>> problem, and may come up again in short order. Maybe it is just that use of
>>> this function has been limited? In the single dataset with a modest number
>>> of sequences that caused me problems yesterday, I had the following species
>>> and/or genus names - all of which constitute 10 character strings drawn
>>> from the set of IUPAC codes:
>>>
>>> brachyurus (x 2)
>>> savannarum
>>> graduacauda
>>> caudacutus
>>> Camarhynchus (x 3)
>>> madagascariensis
>>>
>>> I don't suggest deprecating the phylip sequential, but rather, using
>>> something that is compatible with raxml (surely one of the most widely used
>>> phylogenetics programs today). I think raxml uses a relaxed sequential
>>> version of the phylip format with whitespace delimitation. I could read the
>>> same alignment in raxml with no problems, but I had multiple issues when
>>> reading the same file with read.dna (including the whitespace character on
>>> the first line). My guess is that very few people are using the original
>>> phylip format, with its limit of 10 characters per taxon name, and with dna
>>> seqs beginning immediately after this. So maybe deprecate "sequential
>>> phylip", but you could use what Stamatakis calls "relaxed sequential
>>> PHYLIP", which appears to be: (1) taxon names cannot include spaces but can
>>> be up to 100 characters; and (2) names separated from sequences by
>>> whitespace character (ideally, this should recognize any number of spaces
>>> or tabs to prevent user confusion).
>>>
>>> For users with tab-delimited raxml files (eg each taxon name separated
>>> from its dna sequence by a tab), you can use a regular-expressions enabled
>>> text editor (like textwrangler) to quickly find potential problems. Just
>>> search for
>>>
>>> [ACGTUMRWSYKVHDBN]{10}.+\t
>>>
>>> with grep matching enabled.
>>>
>>> Cheers,
>>> ~Dan
>>>
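
For anyone who would rather script the check than run it in a text editor, here is a minimal Python sketch of the same idea as Dan's search. It flags taxon names that contain 10 consecutive IUPAC-code letters. The function name risky_names is hypothetical, and matching case-insensitively is my assumption, made because the names listed above are lowercase.

```python
import re

# Letter set copied from the search pattern in the thread. IGNORECASE is an
# assumption so that lowercase names such as "brachyurus" are flagged too.
IUPAC_RUN = re.compile(r"[ACGTUMRWSYKVHDBN]{10}", re.IGNORECASE)


def risky_names(names):
    """Return the names containing 10 consecutive IUPAC-code letters."""
    return [name for name in names if IUPAC_RUN.search(name)]


names = ["brachyurus", "savannarum", "madagascariensis", "Geospiza_fortis"]
print(risky_names(names))
# prints ['brachyurus', 'savannarum', 'madagascariensis']
```

Unlike the editor search, this inspects the names only, so it does not depend on a tab following the name.
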
>>>
>>> On Apr 26, 2012, at 2:16 AM, Emmanuel Paradis wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> The reason for this implementation (searching for the first 10 IUPAC-coded
>>>> bases) is that the exact formatting is not consistent among different
>>>> programs. Some files have:
>>>>
>>>> 0123456789acgt.....
>>>>
>>>> that is, a 10-character name with the sequence starting at the 11th
>>>> position. I think this is typical for Phylip. Other software (e.g., PhyML)
>>>> accepts longer taxa names and requires a space before the start of the
>>>> sequence.
>>>>
>>>> About your example: it depends on the order of the data. The following
>>>> file can be read:
>>>>
>>>> 2 10
>>>> xxxxx     AAAAAAAAAA
>>>> madagascarAAAAAAAAAA
>>>>
>>>> But if you invert the two sequence lines, it fails.
>>>>
>>>> It is the first time I have heard about this problem in 9 years, maybe
>>>> because it requires a particular combination of circumstances. Another
>>>> drawback of this implementation is that files with fewer than 10 bases
>>>> cannot be read.
>>>>
>>>> How to solve this? If it were left only to me, I would deprecate the
>>>> interleaved and sequential formats. FASTA is more flexible, more
>>>> widespread, easier to parse, can store exactly the same information, and
>>>> labels are only constrained to be on a single line (but can contain any
>>>> characters including \n, \t, ...) But I guess many programs use the Phylip
>>>> formats, so I'd be glad to read other suggestions.
>>>>
>>>> As for your 2nd problem, it is now fixed in ape.
>>>>
>>>> Best,
>>>>
>>>> Emmanuel
>>>> -----Original Message-----
>>>> From: Dan Rabosky<[email protected]>
>>>> Sender: [email protected]
>>>> Date: Wed, 25 Apr 2012 17:51:35
>>>> To:<[email protected]>
>>>> Subject: [R-sig-phylo] read.dna warnings and pitfalls
>>>>
>>>>
>>>> Hi All-
>>>>
>>>> I have spent an inordinate and embarrassing amount of time tracking
>>>> down an excruciatingly cryptic issue with read.dna, which I rarely use.
>>>> Here are two key problems:
>>>>
>>>> 1) The function automatically assumes it is reading DNA sequences when
>>>> it encounters a string of 10 continuous "DNA-like" characters. This
>>>> includes all characters in the set (ACGTUMRWSYKVHDBN-). This function,
>>>> unlike the phylip original, does not have limits on taxon name lengths.
>>>> Hence, I had - in the middle of a large alignment - a species whose name
>>>> included the string "MADAGASCAR", which caused a failure.  To be fair, the
>>>> documentation warns of this, but I think this is extremely easy to
>>>> overlook, and - moreover - it seems unfortunate to have to parse all your
>>>> taxon names for a potential IUPAC match before trying to use the function.
>>>> Presumably, most users who specify sequential spacing will be using
>>>> whitespace to separate taxon names from DNA sequences, and perhaps it is
>>>> better to exploit this rather than IUPAC matching.
>>>>
>>>> 2) The function is whitespace-sensitive. If you tab-separate the
>>>> numbers on the first line (number of taxa, number of sites), you'll
>>>> receive an error with the message: "the first line of the file must
>>>> contain the dimensions of the data". It appears that spaces are OK,
>>>> however.
>>>>
>>>> Hopefully this post will be useful to someone in the future with a
>>>> similar issue. Perhaps these can be addressed in a future update to ape?
>>>>
>>>> -Dan Rabosky
>>>>
>>>> _______________________________________________
>>>> R-sig-phylo mailing list
>>>> [email protected]
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>> --
>> Emmanuel Paradis
>> IRD, Jakarta, Indonesia
>> http://ape.mpl.ird.fr/
>>
>>
>
>
>
> --
>
> Andrés Parada
> PhD student
> Departamento de Ecología
> Pontificia Universidad Católica de Chile
>



-- 

Andrés Parada
PhD student
Departamento de Ecología
Pontificia Universidad Católica de Chile


_______________________________________________
R-sig-phylo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
