Hi all,

I obtained a set of sequences using a vector of accession numbers and "read.GenBank(namefile)". The names are "there", since I can obtain a list via attr(namefile, "species"), but I couldn't find a way to use write.dna to save a fasta file with those "species" labels instead of accession numbers.
I noticed that seq.names is no longer used in "ape". *Could you tell me how to save a fasta with species names as labels?*

Thanks in advance,
a

2012/5/3 Emmanuel Paradis <[email protected]>

> I made some changes in read.dna which, I hope, solve the problems. The
> taxa names can be of any length and must be separated from the sequences by
> at least one space (or tabulation). write.dna() now follows the same rule.
> Files with fewer than 10 nucleotides can now be read by read.dna (bug fixed).
>
> I removed the option 'seq.names' of read.dna since it doesn't seem
> particularly useful and this helped to clarify the code.
>
> The new versions are now on ape's SVN:
>
> https://svn.mpl.ird.fr/ape/dev/ape/R/read.dna.R
> https://svn.mpl.ird.fr/ape/dev/ape/R/write.dna.R
>
> Tests welcome!
>
> Best,
>
> Emmanuel
>
> Dan Rabosky wrote on 26/04/2012 22:01:
>
>> Hi Emmanuel-
>>
>> Thanks for fixing the whitespace issue. I think this fix will be useful
>> to many users.
>>
>> On the issue of recognizing 10 IUPAC characters: I think this is a real
>> problem, and may come up again in short order. Maybe it is just that use of
>> this function has been limited? In the single dataset with a modest number
>> of sequences that caused me problems yesterday, I had the following species
>> and/or genus names - all of which contain 10-character strings drawn
>> from the set of IUPAC codes:
>>
>> brachyurus (x 2)
>> savannarum
>> graduacauda
>> caudacutus
>> Camarhynchus (x 3)
>> madagascariensis
>>
>> I don't suggest deprecating the phylip sequential, but rather, using
>> something that is compatible with raxml (surely one of the most widely used
>> phylogenetics programs today). I think raxml uses a relaxed sequential
>> version of the phylip format with whitespace delimitation. I could read the
>> same alignment in raxml with no problems, but I had multiple issues when
>> reading the same file with read.dna (including the whitespace character on
>> the first line). My guess is that very few people are using the original
>> phylip format, with its limit of 10 characters per taxon name, and with dna
>> seqs beginning immediately after this. So maybe deprecate "sequential
>> phylip", but you could use what Stamatakis calls "relaxed sequential
>> PHYLIP", which appears to be: (1) taxon names cannot include spaces but can
>> be up to 100 characters; and (2) names separated from sequences by
>> whitespace character (ideally, this should recognize any number of spaces
>> or tabs to prevent user confusion).
>>
>> For users with tab-delimited raxml files (e.g. each taxon name separated
>> from its dna sequence by a tab), you can use a regular-expressions enabled
>> text editor (like textwrangler) to quickly find potential problems. Just
>> search for
>>
>> [ACGTUMRWSYKVHDBN]{10}.+\t
>>
>> with grep matching enabled.
>>
>> Cheers,
>> ~Dan
>>
>> On Apr 26, 2012, at 2:16 AM, Emmanuel Paradis wrote:
>>
>>> Hi Dan,
>>>
>>> The reason for this implementation (searching the first 10 IUPAC-coded
>>> bases) is because the exact formatting is not consistent among different
>>> programs. Some files have:
>>>
>>> 0123456789acgt.....
>>>
>>> that is a 10-character name and the sequence starting on the 11th
>>> position. I think this is typical for Phylip. Other software (e.g., PhyML)
>>> accepts longer taxa names and requires a space before the start of the
>>> sequence.
>>>
>>> About your example: it depends on the order of the data. The following
>>> file can be read:
>>>
>>> 2 10
>>> xxxxx AAAAAAAAAA
>>> madagascarAAAAAAAAAA
>>>
>>> But if you invert the two sequence lines, it fails.
>>>
>>> It is the first time I hear about this problem in 9 years, maybe because
>>> it requires a particular combination of circumstances. Another drawback of
>>> this implementation is that files with fewer than 10 bases cannot be read.
>>>
>>> How to solve this? If it were left only to me, I would deprecate the
>>> interleaved and sequential formats. FASTA is more flexible, more
>>> widespread, easier to parse, can store exactly the same information, and
>>> labels are only constrained to be on a single line (but can contain any
>>> characters including \n, \t, ...) But I guess many programs use the Phylip
>>> formats, so I'd be glad to read other suggestions.
>>>
>>> As for your 2nd problem, it is now fixed in ape.
>>>
>>> Best,
>>>
>>> Emmanuel
>>>
>>> -----Original Message-----
>>> From: Dan Rabosky <[email protected]>
>>> Sender: r-sig-phylo-bounces@r-project.org
>>> Date: Wed, 25 Apr 2012 17:51:35
>>> To: <[email protected]>
>>> Subject: [R-sig-phylo] read.dna warnings and pitfalls
>>>
>>> Hi All-
>>>
>>> I have spent an inordinate and embarrassing amount of time tracking down
>>> an excruciatingly cryptic issue with read.dna, which I rarely use. Here are
>>> two key problems:
>>>
>>> 1) The function automatically assumes it is reading DNA sequences when
>>> it encounters a string of 10 continuous "DNA-like" characters. This
>>> includes all characters in the set (ACGTUMRWSYKVHDBN-). This function,
>>> unlike the phylip original, does not have limits on taxon name lengths.
>>> Hence, I had - in the middle of a large alignment - a species whose name
>>> included the string "MADAGASCAR", which caused a failure. To be fair, the
>>> documentation warns of this, but I think this is extremely easy to
>>> overlook, and - moreover - it seems unfortunate to have to parse all your
>>> taxon names for a potential IUPAC match before trying to use the function.
>>> Presumably, most users who specify sequential spacing will be using
>>> whitespace to separate taxon names from DNA sequences, and perhaps it is
>>> better to exploit this rather than IUPAC matching.
>>>
>>> 2) The function is whitespace-sensitive. If you tab-separate the numbers
>>> on the first line (numbers of taxa, numbers of sites), you'll receive an
>>> error with the message: "the first line of the file must contain the
>>> dimensions of the data". It appears that spaces are OK, however.
>>>
>>> Hopefully this post will be useful to someone in the future with a
>>> similar issue. Perhaps these can be addressed in a future update to ape?
>>>
>>> -Dan Rabosky
>
> --
> Emmanuel Paradis
> IRD, Jakarta, Indonesia
> http://ape.mpl.ird.fr/

--
Andrés Parada
PhD Student
Department of Ecology
Pontificia Universidad Católica de Chile
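
For anyone who wants to check a file before reading it, here is a minimal sketch of the two ideas discussed in the quoted thread: splitting each record on whitespace (the "relaxed sequential PHYLIP" rules as Dan describes them) and flagging taxon names that contain 10 consecutive IUPAC characters. It is written in Python, not R, and is not how read.dna is implemented. The function names parse_relaxed_phylip and risky_names are hypothetical, and the sketch assumes each sequence sits on a single line, names contain no whitespace, and the character match is case-insensitive.

```python
import re

# Character set quoted in the thread (IUPAC nucleotide codes plus the gap
# character). Case-insensitive matching is an assumption of this sketch.
IUPAC_RUN = re.compile(r"[ACGTUMRWSYKVHDBN-]{10}", re.IGNORECASE)


def parse_relaxed_phylip(text):
    """Parse relaxed sequential PHYLIP text into a {name: sequence} dict.

    Assumes one record per line: a name without whitespace, then any
    number of spaces or tabs, then the sequence.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The header holds the number of taxa and sites; split() accepts both
    # spaces and tabs, so a tab-separated first line is fine here.
    ntax, nsites = (int(tok) for tok in lines[0].split())
    records = {}
    for ln in lines[1:]:
        # Split on the first run of whitespace only: name, then the rest.
        name, seq = ln.split(None, 1)
        seq = "".join(seq.split())  # drop any spaces inside the sequence
        if len(seq) != nsites:
            raise ValueError(f"{name}: expected {nsites} sites, got {len(seq)}")
        records[name] = seq
    if len(records) != ntax:
        raise ValueError(f"expected {ntax} taxa, got {len(records)}")
    return records


def risky_names(names):
    """Return the names containing 10 consecutive IUPAC-like characters."""
    return [name for name in names if IUPAC_RUN.search(name)]


if __name__ == "__main__":
    example = "2\t10\nmadagascariensis\tAAAAAAAAAA\nxxxxx  ACGTACGTAC\n"
    alignment = parse_relaxed_phylip(example)
    print(risky_names(alignment))  # ['madagascariensis']
```

Because the name is taken as everything before the first whitespace, the order of the records no longer matters, and a name such as "madagascariensis" is only reported, not mistaken for sequence data.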
_______________________________________________
R-sig-phylo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
