Fixed: I passed a vector with the names via seq.names when calling read.GenBank.
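
For anyone landing on this thread later, here is a minimal sketch of one way to get species names as FASTA labels. It assumes, as the question quoted in this thread describes, that read.GenBank returns a DNAbin list named by accession number, with the species names in its "species" attribute. The accession numbers, sequences and species names are made up so that the example runs without network access.

```r
library(ape)

# Stand-in for the result of read.GenBank(): a DNAbin list named by
# accession number, with species names stored in the "species" attribute.
# The accessions, sequences and species names are hypothetical.
seqs <- as.DNAbin(list(
  AB000001 = c("a", "c", "g", "t", "a", "c"),
  AB000002 = c("a", "c", "g", "a", "a", "c")
))
attr(seqs, "species") <- c("Genus_one", "Genus_two")

# Relabel the sequences with the species names, then write them as FASTA.
relabelled <- seqs
names(relabelled) <- attr(seqs, "species")
out <- tempfile(fileext = ".fasta")
write.dna(relabelled, out, format = "fasta")
```

With a real download, the same labels can presumably be set in one step by passing them as seq.names in the read.GenBank call, which is the fix mentioned above.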
All the best,

a

2012/7/12 Andrés Parada <[email protected]>

> Hi all,
>
> I obtained a set of sequences using a vector with accession numbers and
> read.GenBank(namefile). The names are "there", since I can obtain a list via
> attr(namefile, "species"), but I couldn't find a way to use write.dna to
> save a FASTA file with those "species" labels instead of accession numbers.
>
> I noticed that seq.names is no longer used in ape.
> *Could you tell me how to save a FASTA file with species names as labels?*
>
> Thanks in advance,
>
> a
>
>
> 2012/5/3 Emmanuel Paradis <[email protected]>
>
>> I made some changes in read.dna which, I hope, solve the problems. The
>> taxa names can be of any length and must be separated from the sequences
>> by at least one space (or tab). write.dna() now follows the same rule.
>> Files with fewer than 10 nucleotides can now be read by read.dna (bug
>> fixed).
>>
>> I removed the option 'seq.names' of read.dna since it doesn't seem
>> particularly useful, and this helped to clarify the code.
>>
>> The new versions are now on ape's SVN:
>>
>> https://svn.mpl.ird.fr/ape/dev/ape/R/read.dna.R
>> https://svn.mpl.ird.fr/ape/dev/ape/R/write.dna.R
>>
>> Tests welcome!
>>
>> Best,
>>
>> Emmanuel
>>
>> Dan Rabosky wrote on 26/04/2012 22:01:
>>
>>> Hi Emmanuel-
>>>
>>> Thanks for fixing the whitespace issue. I think this fix will be useful
>>> to many users.
>>>
>>> On the issue of recognizing 10 IUPAC characters: I think this is a real
>>> problem, and may come up again in short order. Maybe it is just that use
>>> of this function has been limited?
>>> In the single dataset with a modest number of sequences that caused me
>>> problems yesterday, I had the following species and/or genus names, all
>>> of which contain strings of at least 10 characters drawn from the set of
>>> IUPAC codes:
>>>
>>> brachyurus (x 2)
>>> savannarum
>>> graduacauda
>>> caudacutus
>>> Camarhynchus (x 3)
>>> madagascariensis
>>>
>>> I don't suggest deprecating the phylip sequential format, but rather
>>> using something that is compatible with RAxML (surely one of the most
>>> widely used phylogenetics programs today). I think RAxML uses a relaxed
>>> sequential version of the phylip format with whitespace delimitation. I
>>> could read the same alignment in RAxML with no problems, but I had
>>> multiple issues when reading the same file with read.dna (including the
>>> whitespace character on the first line). My guess is that very few people
>>> are using the original phylip format, with its limit of 10 characters per
>>> taxon name, and with DNA seqs beginning immediately after this. So maybe
>>> deprecate "sequential phylip", but you could use what Stamatakis calls
>>> "relaxed sequential PHYLIP", which appears to be: (1) taxon names cannot
>>> include spaces but can be up to 100 characters; and (2) names are
>>> separated from sequences by a whitespace character (ideally, this should
>>> recognize any number of spaces or tabs to prevent user confusion).
>>>
>>> For users with tab-delimited RAxML files (e.g. each taxon name separated
>>> from its DNA sequence by a tab), you can use a regular-expressions
>>> enabled text editor (like TextWrangler) to quickly find potential
>>> problems. Just search for
>>>
>>> [ACGTUMRWSYKVHDBN]{10}.+\t
>>>
>>> with grep matching enabled.
>>>
>>> Cheers,
>>> ~Dan
>>>
>>>
>>> On Apr 26, 2012, at 2:16 AM, Emmanuel Paradis wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> The reason for this implementation (searching the first 10 IUPAC-coded
>>>> bases) is that the exact formatting is not consistent among different
>>>> programs.
>>>> Some files have:
>>>>
>>>> 0123456789acgt.....
>>>>
>>>> that is, a 10-character name with the sequence starting at the 11th
>>>> position. I think this is typical for Phylip. Other software (e.g.,
>>>> PhyML) accepts longer taxa names and requires a space before the start
>>>> of the sequence.
>>>>
>>>> About your example: it depends on the order of the data. The following
>>>> file can be read:
>>>>
>>>> 2 10
>>>> xxxxx AAAAAAAAAA
>>>> madagascarAAAAAAAAAA
>>>>
>>>> But if you invert the two sequence lines, it fails.
>>>>
>>>> This is the first time in 9 years that I have heard about this problem,
>>>> maybe because it requires a particular combination of circumstances.
>>>> Another drawback of this implementation is that files with fewer than 10
>>>> bases cannot be read.
>>>>
>>>> How to solve this? If it were left only to me, I would deprecate the
>>>> interleaved and sequential formats. FASTA is more flexible, more
>>>> widespread, easier to parse, can store exactly the same information, and
>>>> labels are only constrained to be on a single line (but can contain any
>>>> characters including \n, \t, ...). But I guess many programs use the
>>>> Phylip formats, so I'd be glad to read other suggestions.
>>>>
>>>> As for your 2nd problem, it is now fixed in ape.
>>>>
>>>> Best,
>>>>
>>>> Emmanuel
>>>>
>>>> -----Original Message-----
>>>> From: Dan Rabosky <[email protected]>
>>>> Sender: r-sig-phylo-bounces@r-project.org
>>>> Date: Wed, 25 Apr 2012 17:51:35
>>>> To: <[email protected]>
>>>> Subject: [R-sig-phylo] read.dna warnings and pitfalls
>>>>
>>>>
>>>> Hi All-
>>>>
>>>> I have spent an inordinate and embarrassing amount of time tracking
>>>> down an excruciatingly cryptic issue with read.dna, which I rarely use.
>>>> Here are two key problems:
>>>>
>>>> 1) The function automatically assumes it is reading DNA sequences when
>>>> it encounters a string of 10 continuous "DNA-like" characters. This
>>>> includes all characters in the set (ACGTUMRWSYKVHDBN-).
>>>> This function, unlike the phylip original, does not have limits on taxon
>>>> name lengths. Hence, I had - in the middle of a large alignment - a
>>>> species whose name included the string "MADAGASCAR", which caused a
>>>> failure. To be fair, the documentation warns of this, but I think this is
>>>> extremely easy to overlook, and - moreover - it seems unfortunate to have
>>>> to parse all your taxon names for a potential IUPAC match before trying
>>>> to use the function. Presumably, most users who specify sequential
>>>> spacing will be using whitespace to separate taxon names from DNA
>>>> sequences, and perhaps it is better to exploit this rather than IUPAC
>>>> matching.
>>>>
>>>> 2) The function is whitespace-sensitive. If you tab-separate the numbers
>>>> on the first line (number of taxa, number of sites), you'll receive an
>>>> error with the message: "the first line of the file must contain the
>>>> dimensions of the data". It appears that spaces are OK, however.
>>>>
>>>> Hopefully this post will be useful to someone in the future with a
>>>> similar issue. Perhaps these can be addressed in a future update to ape?
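
To illustrate problem 1, here is a rough sketch in base R for flagging taxon names that contain 10 consecutive "DNA-like" characters from the set given in this message (ACGTUMRWSYKVHDBN-). The helper name has_iupac_run is hypothetical, and the case-insensitive matching is an assumption inferred from the lower-case names reported elsewhere in this thread. It only matters for versions of read.dna that still use the IUPAC-matching rule.

```r
# Ten consecutive characters from the "DNA-like" set quoted in this thread.
iupac_run <- "[ACGTUMRWSYKVHDBN-]{10}"

# Hypothetical helper: TRUE for each name that contains such a run.
# Matching ignores case (an assumption, see the note above).
has_iupac_run <- function(taxon_names) {
  grepl(iupac_run, taxon_names, ignore.case = TRUE)
}

# Example names; the first and third contain a run of 10 DNA-like characters.
taxa <- c("Camarhynchus_parvulus", "Geospiza_fortis", "madagascariensis")
flagged <- taxa[has_iupac_run(taxa)]
```

Running a check like this over the labels before calling read.dna is one way to avoid scanning the names by eye.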
>>>>
>>>> -Dan Rabosky
>>>>
>>>> _______________________________________________
>>>> R-sig-phylo mailing list
>>>> [email protected]
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>
>> --
>> Emmanuel Paradis
>> IRD, Jakarta, Indonesia
>> http://ape.mpl.ird.fr/
>
> --
> Andrés Parada
> PhD Student
> Department of Ecology
> Pontificia Universidad Católica de Chile

--
Andrés Parada
PhD Student
Department of Ecology
Pontificia Universidad Católica de Chile
_______________________________________________ R-sig-phylo mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
