Re: [R-sig-phylo] read.dna warnings and pitfalls

Andrés Parada Thu, 12 Jul 2012 19:18:50 -0700

Hi all,

I obtained a set of sequences using a vector of accession numbers and
read.GenBank(namefile). The names are "there", since I can obtain a list via
attr(namefile, "species"), but I couldn't find a way to use write.dna to save a
FASTA file with those "species" labels instead of accession numbers.


I noticed that seq.names is no longer used in "ape".
*Could you tell me how to save a fasta with species names as labels?*
Thanks in advance,

a
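
One possible approach, sketched under the assumption that the object returned
by read.GenBank is a list of sequences labelled by its names() and carrying a
"species" attribute with one entry per sequence: copy that attribute onto the
names, then call write.dna with format = "fasta". relabel_by_species is a
hypothetical helper, not part of ape, and `accessions` stands for your own
vector of accession numbers.

```r
# Hypothetical helper (not an ape function): use the "species" attribute
# as sequence labels in place of the accession numbers.
relabel_by_species <- function(x) {
  sp <- attr(x, "species")
  stopifnot(!is.null(sp), length(sp) == length(x))
  # Replace spaces with underscores and make repeated species names unique,
  # since several accessions may belong to the same species.
  names(x) <- make.unique(gsub(" ", "_", sp))
  x
}

# Usage with ape (read.GenBank needs network access, so it is not run here):
# library(ape)
# seqs <- read.GenBank(accessions)
# write.dna(relabel_by_species(seqs), "seqs.fas", format = "fasta")
```

make.unique appends ".1", ".2", ... to repeated names; drop it if duplicated
labels are acceptable in your downstream program.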

2012/5/3 Emmanuel Paradis <[email protected]>

> I made some changes in read.dna which, I hope, solve the problems. The
> taxa names can be of any length and must be separated from the sequences by
> at least one space (or tabulation). write.dna() now follows the same rule.
> Files with fewer than 10 nucleotides can now be read by read.dna (bug fixed).
>
> I removed the option 'seq.names' of read.dna since it doesn't seem
> particularly useful and this helped to clarify the code.
>
> The new versions are now on ape's SVN:
>
> https://svn.mpl.ird.fr/ape/dev/ape/R/read.dna.R
> https://svn.mpl.ird.fr/ape/dev/ape/R/write.dna.R
>
> Tests welcome!
>
>
> Best,
>
> Emmanuel
>
> Dan Rabosky wrote on 26/04/2012 22:01:
>
>>
>> Hi Emmanuel-
>>
>> Thanks for fixing the whitespace issue. I think this fix will be useful
>> to many users.
>>
>> On the issue of recognizing 10 IUPAC characters: I think this is a real
>> problem, and may come up again in short order. Maybe it is just that use of
>> this function has been limited? In the single dataset with a modest number
>> of sequences that caused me problems yesterday, I had the following species
>> and/or genus names - all of which constitute 10 character strings drawn
>> from the set of IUPAC codes:
>>
>> brachyurus (x 2)
>> savannarum
>> graduacauda
>> caudacutus
>> Camarhynchus (x 3)
>> madagascariensis
>>
>> I don't suggest deprecating the phylip sequential, but rather, using
>> something that is compatible with raxml (surely one of the most widely used
>> phylogenetics programs today). I think raxml uses a relaxed sequential
>> version of the phylip format with whitespace delimitation. I could read the
>> same alignment in raxml with no problems, but I had multiple issues when
>> reading the same file with read.dna (including the whitespace character on
>> the first line). My guess is that very few people are using the original
>> phylip format, with its limit of 10 characters per taxon name, and with dna
>> seqs beginning immediately after this. So maybe deprecate "sequential
>> phylip", but you could use what Stamatakis calls "relaxed sequential
>> PHYLIP", which appears to be: (1) taxon names cannot include spaces but can
>> be up to 100 characters; and (2) names separated from sequences by
>> whitespace character (ideally, this should recognize any number of spaces
>> or tabs to prevent user confusion).
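
The two rules above can be sketched as a small base-R reader. This is only an
illustration of whitespace-delimited parsing, not the code used by ape or
RAxML: read_relaxed_phylip is a hypothetical name, and the sketch assumes each
taxon occupies exactly one line (name, whitespace, sequence).

```r
# Hypothetical reader for "relaxed sequential PHYLIP" given as a character
# vector of lines (e.g. from readLines()). Assumes one line per taxon.
read_relaxed_phylip <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  # Header: two integers separated by any mix of spaces and tabs.
  dims <- scan(text = lines[1], what = integer(), quiet = TRUE)
  if (length(dims) != 2L)
    stop("the first line must contain the number of taxa and sites")
  body <- trimws(lines[-1])
  if (length(body) != dims[1])
    stop("number of sequence lines does not match the header")
  # The name is everything before the first run of whitespace.
  name <- sub("^(\\S+)\\s+.*$", "\\1", body, perl = TRUE)
  # The sequence is the remainder, with internal whitespace removed.
  sequence <- gsub("\\s+", "", sub("^\\S+\\s+", "", body, perl = TRUE),
                   perl = TRUE)
  if (any(nchar(sequence) != dims[2]))
    stop("sequence length does not match the header")
  stats::setNames(toupper(sequence), name)
}

aln <- read_relaxed_phylip(c("2\t10",
                             "madagascariensis AAAAAAAAAA",
                             "xxxxx\tCCCCCCCCCC"))
```

Because the split relies only on whitespace, a name made entirely of IUPAC
letters is read without trouble, whatever its position in the file.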
>>
>> For users with tab-delimited raxml files (eg each taxon name separated
>> from its dna sequence by a tab), you can use a regular-expressions enabled
>> text editor (like textwrangler) to quickly find potential problems. Just
>> search for
>>
>> [ACGTUMRWSYKVHDBN]{10}.+\t
>>
>> with grep matching enabled.
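
The same check can be run from R on the taxon names alone. This is a sketch
under two assumptions: the names are already in a character vector, and
matching should ignore case (the names listed above are lower case).
flag_risky_names is a hypothetical helper, and "sapiens" is added only as a
control that should not be flagged.

```r
# Runs of 10 or more IUPAC nucleotide codes inside a taxon name.
iupac_run <- "[ACGTUMRWSYKVHDBN]{10}"

# Hypothetical helper: return the names that contain such a run.
flag_risky_names <- function(taxa, ignore_case = TRUE) {
  taxa[grepl(iupac_run, taxa, ignore.case = ignore_case)]
}

taxa <- c("brachyurus", "savannarum", "graduacauda", "caudacutus",
          "Camarhynchus", "madagascariensis", "sapiens")
risky <- flag_risky_names(taxa)
```

Unlike the editor search, this pattern omits the trailing `.+\t`, since it is
applied to names that have already been separated from their sequences.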
>>
>> Cheers,
>> ~Dan
>>
>>
>> On Apr 26, 2012, at 2:16 AM, Emmanuel Paradis wrote:
>>
>>> Hi Dan,
>>>
>>> The reason for this implementation (searching for the first 10 IUPAC-coded
>>> bases) is that the exact formatting is not consistent among different
>>> programs. Some files have:
>>>
>>> 0123456789acgt.....
>>>
>>> that is, a 10-character name with the sequence starting at the 11th
>>> position. I think this is typical for Phylip. Other software (e.g., PhyML)
>>> accepts longer taxon names and requires a space before the start of the
>>> sequence.
>>>
>>> About your example: it depends on the order of the data. The following
>>> file can be read:
>>>
>>> 2 10
>>> xxxxx     AAAAAAAAAA
>>> madagascarAAAAAAAAAA
>>>
>>> But if you invert the two sequence lines, it fails.
>>>
>>> This is the first time I have heard about this problem in 9 years, maybe
>>> because it requires a particular combination of circumstances. Another
>>> drawback of this implementation is that files with fewer than 10 bases
>>> cannot be read.
>>>
>>> How to solve this? If it were left only to me, I would deprecate the
>>> interleaved and sequential formats. FASTA is more flexible, more
>>> widespread, easier to parse, can store exactly the same information, and
>>> labels are only constrained to be on a single line (but can contain any
>>> characters including \n, \t, ...) But I guess many programs use the Phylip
>>> formats, so I'd be glad to read other suggestions.
>>>
>>> As for your 2nd problem, it is now fixed in ape.
>>>
>>> Best,
>>>
>>> Emmanuel
>>> -----Original Message-----
>>> From: Dan Rabosky<[email protected]>
>>> Sender: r-sig-phylo-bounces@r-project.org
>>> Date: Wed, 25 Apr 2012 17:51:35
>>> To:<[email protected]>
>>> Subject: [R-sig-phylo] read.dna warnings and pitfalls
>>>
>>>
>>> Hi All-
>>>
>>> I have spent an inordinate and embarrassing amount of time tracking down
>>> an excruciatingly cryptic issue with read.dna, which I rarely use. Here are
>>> two key problems:
>>>
>>> 1) The function automatically assumes it is reading DNA sequences when
>>> it encounters a string of 10 continuous "DNA-like" characters. This
>>> includes all characters in the set (ACGTUMRWSYKVHDBN-). This function,
>>> unlike the phylip original, does not have limits on taxon name lengths.
>>> Hence, I had - in the middle of a large alignment - a species whose name
>>> included the string "MADAGASCAR", which caused a failure.  To be fair, the
>>> documentation warns of this, but I think this is extremely easy to
>>> overlook, and - moreover - it seems unfortunate to have to parse all your
>>> taxon names for a potential IUPAC match before trying to use the function.
>>> Presumably, most users who specify sequential spacing will be using
>>> whitespace to separate taxon names from DNA sequences, and perhaps it is
>>> better to exploit this rather than IUPAC matching.
>>>
>>> 2) The function is whitespace-sensitive. If you tab-separate the numbers
>>> on the first line (numbers of taxa, numbers of sites), you'll receive an
>>> error with the message: "the first line of the file must contain the
>>> dimensions of the data". It appears that spaces are OK, however.
>>>
>>> Hopefully this post will be useful to someone in the future with a
>>> similar issue. Perhaps these can be addressed in a future update to ape?
>>>
>>> -Dan Rabosky
>>>
>>> _______________________________________________
>>> R-sig-phylo mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> --
> Emmanuel Paradis
> IRD, Jakarta, Indonesia
> http://ape.mpl.ird.fr/
>
>



-- 

Andrés Parada
Estudiante de Doctorado
Departamento de Ecología
Pontificia Universidad Católica de Chile


