Re: [R-sig-phylo] read.dna warnings and pitfalls

Nick Matzke Mon, 30 Apr 2012 23:39:52 -0700

I have written several custom mutations of various data-reading
functions to get around some of the common limitations and to read
e.g. ambiguous characters in morphology datasets.


But wouldn't the "best" solution in the long run be to implement the
equivalent of readseq and/or the Nexus Class Library?   I'm not
volunteering ;-)

Cheers, Nick



On Mon, Apr 30, 2012 at 7:19 PM, Emmanuel Paradis wrote:
> Hi Dan,
>
> The original motivation behind read.dna() was to allow users to read their
> DNA alignments stored in the Phylip formats -- support for the Clustal
> format came later. You may be right that this is not so frequent. I presume
> a more common workflow is to first use read.GenBank, then write the sequences
> for analyses with other software. In the situation you describe, that'd
> imply using write.dna. In this function, the rule is (say L is the length of
> the longest taxon name): if L < 11 then the 1st nucleotide is written at the
> 11th column in the output file, otherwise at the (L+2)th column with a space
> at the (L+1)th one.
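
The column rule described above can be sketched as follows. This is a minimal Python illustration of the rule as stated in the message, not ape's R code; the function names `first_nucleotide_column` and `format_row` are hypothetical.

```python
def first_nucleotide_column(taxon_names):
    """Return the 1-based column where the first nucleotide is written.

    Follows the rule stated in the message: with L the length of the
    longest taxon name, use column 11 if L < 11, otherwise column L + 2
    (which leaves a single space at column L + 1).
    """
    longest = max(len(name) for name in taxon_names)
    return 11 if longest < 11 else longest + 2


def format_row(name, sequence, column):
    # Pad the name with spaces so that the sequence starts at `column` (1-based).
    return name.ljust(column - 1) + sequence
```

Under this rule a name of exactly 10 characters abuts its sequence with no separator, which is the situation that makes the name/sequence boundary ambiguous in this thread.
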
>
> In the short term, I can change this in both read.dna and write.dna and
> impose a space (or a tabulation) between the longest taxon name and the
> first nucleotide. This would imply, of course, that taxa names cannot have
> spaces. What do others think?
>
> In the long term, I think we may discuss deprecating the sequential and
> interleaved formats for the reasons I listed below. For instance, Clustal
> can output its alignments in FASTA, Muscle outputs by default in this
> format. This is an open discussion.
>
> Best,
>
> Emmanuel
>
> Dan Rabosky wrote on 26/04/2012 22:01:
>>
>>
>> Hi Emmanuel-
>>
>> Thanks for fixing the whitespace issue. I think this fix will be useful to
>> many users.
>>
>> On the issue of recognizing 10 IUPAC characters: I think this is a real
>> problem, and may come up again in short order. Maybe it is just that use of
>> this function has been limited? In the single dataset with a modest number
>> of sequences that caused me problems yesterday, I had the following species
>> and/or genus names - all of which constitute 10 character strings drawn from
>> the set of IUPAC codes:
>>
>> brachyurus (x 2)
>> savannarum
>> graduacauda
>> caudacutus
>> Camarhynchus (x 3)
>> madagascariensis
>>
>> I don't suggest deprecating the phylip sequential, but rather, using
>> something that is compatible with raxml (surely one of the most widely used
>> phylogenetics programs today). I think raxml uses a relaxed sequential
>> version of the phylip format with whitespace delimitation. I could read the
>> same alignment in raxml with no problems, but I had multiple issues when
>> reading the same file with read.dna (including the whitespace character on
>> the first line). My guess is that very few people are using the original
>> phylip format, with its limit of 10 characters per taxon name, and with dna
>> seqs beginning immediately after this. So maybe deprecate "sequential
>> phylip", but you could use what Stamatakis calls "relaxed sequential
>> PHYLIP", which appears to be: (1) taxon names cannot include spaces but can
>> be up to 100 characters; and (2) names separated from sequences by
>> whitespace character (ideally, this should recognize any number of spaces or
>> tabs to prevent user confusion).
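
The two points above can be sketched as a small parser. This is a hedged Python illustration only, not the parser used by raxml or ape; it assumes one taxon per line, and the name `parse_relaxed_phylip` is hypothetical.

```python
def parse_relaxed_phylip(text):
    """Parse relaxed sequential PHYLIP text into {name: sequence}.

    Assumes one taxon per line: a name without spaces, then any run of
    spaces or tabs, then the sequence. The header holds the number of
    taxa and the number of sites, separated by any whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split())
    records = {}
    for line in lines[1:]:
        # split(None, 1) splits on the first run of whitespace
        name, seq = line.split(None, 1)
        seq = "".join(seq.split())  # drop any whitespace inside the sequence
        if len(name) > 100:
            raise ValueError(f"taxon name longer than 100 characters: {name!r}")
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        records[name] = seq
    if len(records) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, got {len(records)}")
    return records
```

Because the name is delimited by whitespace rather than by content, a name such as "madagascar" is never mistaken for sequence data, and a tab-separated header line is accepted as well.
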
>>
>> For users with tab-delimited raxml files (eg each taxon name separated
>> from its dna sequence by a tab), you can use a regular-expressions enabled
>> text editor (like textwrangler) to quickly find potential problems. Just
>> search for
>>
>> [ACGTUMRWSYKVHDBN]{10}.+\t
>>
>> with grep matching enabled.
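
The same search can be scripted instead of run in a text editor. A minimal Python sketch, assuming a tab-delimited file as described above; the helper name `flag_suspect_lines` is hypothetical, and the case-insensitive flag is an addition not present in the quoted pattern.

```python
import re

# Pattern quoted in the message: a run of 10 IUPAC codes, then at least one
# more character, then a tab. IGNORECASE is an assumption added here so that
# lowercase taxon names are also flagged.
PATTERN = re.compile(r"[ACGTUMRWSYKVHDBN]{10}.+\t", re.IGNORECASE)


def flag_suspect_lines(lines):
    """Return (line_number, line) pairs whose text matches PATTERN."""
    return [(i, ln) for i, ln in enumerate(lines, start=1) if PATTERN.search(ln)]
```

As written, the pattern needs at least one character between the 10-code run and the tab, so a name that ends exactly on such a run is not flagged.
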
>>
>> Cheers,
>> ~Dan
>>
>>
>> On Apr 26, 2012, at 2:16 AM, Emmanuel Paradis wrote:
>>
>>> Hi Dan,
>>>
>>> The reason for this implementation (searching the first 10 IUPAC-coded
>>> bases) is that the exact formatting is not consistent among different
>>> programs. Some files have:
>>>
>>> 0123456789acgt.....
>>>
>>> that is a 10-character name and the sequence starting on the 11th
>>> position. I think this is typical for Phylip. Other software (e.g., PhyML)
>>> accepts longer taxon names and requires a space before the start of the
>>> sequence.
>>>
>>> About your example: it depends on the order of the data. The following
>>> file can be read:
>>>
>>> 2 10
>>> xxxxx     AAAAAAAAAA
>>> madagascarAAAAAAAAAA
>>>
>>> But if you invert the two sequence lines, it fails.
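
The order dependence can be reproduced with a toy reader. This Python sketch is a guess at the mechanism, not ape's code: it assumes the sequence start column is inferred once, from the first taxon line, as the first run of 10 IUPAC characters, and is then reused for every line. The name `read_sequential` is hypothetical.

```python
import re

# Ten consecutive characters from the IUPAC set listed in this thread
# (including the gap character "-"), matched case-insensitively.
IUPAC_RUN = re.compile(r"[ACGTUMRWSYKVHDBN-]{10}", re.IGNORECASE)


def read_sequential(lines):
    """Toy reader: infer the sequence start column from the first taxon line.

    ASSUMPTION: the column found on the first taxon line is reused for all
    lines. This mimics the behaviour described in the message, not ape's code.
    """
    n_taxa, n_sites = (int(x) for x in lines[0].split())
    match = IUPAC_RUN.search(lines[1])
    if match is None:
        raise ValueError("no run of 10 IUPAC characters found")
    start = match.start()
    records = {}
    for line in lines[1:1 + n_taxa]:
        name, seq = line[:start].strip(), line[start:]
        if len(seq) != n_sites:
            raise ValueError(f"{name!r}: expected {n_sites} sites, got {len(seq)}")
        records[name] = seq
    return records
```

With "xxxxx" first, the run is found at the 11th position and both lines are split correctly. With "madagascar" first, the run is found at the 1st position, the whole line is taken as sequence, and the site count no longer matches.
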
>>>
>>> This is the first time I have heard about this problem in 9 years, maybe
>>> because it requires a particular combination of circumstances. Another
>>> drawback of this implementation is that files with fewer than 10 bases
>>> cannot be read.
>>>
>>> How to solve this? If it were left only to me, I would deprecate the
>>> interleaved and sequential formats. FASTA is more flexible, more widespread,
>>> easier to parse, can store exactly the same information, and labels are only
>>> constrained to be on a single line (but can contain any characters including
>>> \n, \t, ...) But I guess many programs use the Phylip formats, so I'd be
>>> glad to read other suggestions.
>>>
>>> As for your 2nd problem, it is now fixed in ape.
>>>
>>> Best,
>>>
>>> Emmanuel
>>> -----Original Message-----
>>> From: Dan Rabosky
>>> Date: Wed, 25 Apr 2012 17:51:35
>>> Subject: [R-sig-phylo] read.dna warnings and pitfalls
>>>
>>>
>>> Hi All-
>>>
>>> I have spent an inordinate and embarrassing amount of time tracking down
>>> an excruciatingly cryptic issue with read.dna, which I rarely use. Here are
>>> two key problems:
>>>
>>> 1) The function automatically assumes it is reading DNA sequences when it
>>> encounters a string of 10 continuous "DNA-like" characters. This includes
>>> all characters in the set (ACGTUMRWSYKVHDBN-). This function, unlike the
>>> phylip original, does not have limits on taxon name lengths. Hence, I had -
>>> in the middle of a large alignment - a species whose name included the
>>> string "MADAGASCAR", which caused a failure.  To be fair, the documentation
>>> warns of this, but I think this is extremely easy to overlook, and -
>>> moreover - it seems unfortunate to have to parse all your taxon names for a
>>> potential IUPAC match before trying to use the function. Presumably, most
>>> users who specify sequential spacing will be using whitespace to separate
>>> taxon names from DNA sequences, and perhaps it is better to exploit this
>>> rather than IUPAC matching.
>>>
>>> 2) The function is whitespace-sensitive. If you tab-separate the numbers
>>> on the first line (number of taxa, number of sites), you'll receive an
>>> error with the message: "the first line of the file must contain the
>>> dimensions of the data". It appears that spaces are OK, however.
>>>
>>> Hopefully this post will be useful to someone in the future with a
>>> similar issue. Perhaps these can be addressed in a future update to ape?
>>>
>>> -Dan Rabosky
>>>
>>> _______________________________________________
>>> R-sig-phylo mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>
> --
> Emmanuel Paradis
> IRD, Jakarta, Indonesia
> http://ape.mpl.ird.fr/
>