Well, thanks for sending me the files, but I'm sorry to be rather pessimistic for now...
That's exactly what I suspected after a first look at the data in your first email...

The short answer is: the files use an obsolete IPA transcription system, so the student should rework the data. The long answer follows...

To summarize, there are 3 ways of displaying phonetic transcriptions, of which only 2 are still conventional (well, there are several other systems, but no need to complicate things, and conceptually they all fall into one of these 3 categories):

- the old one consists in using specific fonts that display phonetic characters in the same range (256 positions in the font table) as regular fonts. So one had to change font in order to display phonetics AND, if one didn't own the specific font used by the original author, one could never be sure that a replacement font would do the job, as the same character may sit at different positions in the encoding from one font to another...

- the current one (for at least 10 years, I would say) consists in using Unicode fonts and taking advantage of the IPA range, for which several fonts provide glyphs (among them SIL Doulos and DejaVu, which respectively provide serif and sans-serif IPA glyphs along with the "standard" (= lots of) characters).

- an alternative (especially good for computer manipulation) is SAMPA and X-SAMPA (http://www.phon.ucl.ac.uk/home/sampa/), two related systems that use only characters in the ASCII range and, provided one knows the coding conventions, let anyone transcribe phonetics even with a typewriter! This is often a good choice for analysing data by computer, as one does not need to know the Unicode hexadecimal number to type when manipulating the characters.
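To make the last two options concrete, here is a minimal sketch of how an ASCII X-SAMPA string relates to Unicode IPA code points. It is written in Python rather than R, the table covers only a handful of X-SAMPA symbols, and the names `XSAMPA_TO_IPA`, `xsampa_to_ipa` and `code_points` are purely illustrative, not part of any existing package.

```python
# Illustrative subset of the X-SAMPA conventions; NOT the full table.
XSAMPA_TO_IPA = {
    "S": "\u0283",    # ʃ
    "Z": "\u0292",    # ʒ
    "N": "\u014b",    # ŋ
    "@": "\u0259",    # ə
    "E": "\u025b",    # ɛ
    "O": "\u0254",    # ɔ
    "R": "\u0281",    # ʁ
    "r\\": "\u0279",  # ɹ (two ASCII characters, one IPA symbol)
    ":": "\u02d0",    # ː length mark
    "~": "\u0303",    # combining tilde (nasalisation)
}

# Try longer X-SAMPA symbols first so that "r\" is not read as "r" + "\".
KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Convert an X-SAMPA string to IPA using the small table above."""
    out = []
    i = 0
    while i < len(text):
        for key in KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # Anything not in the table is passed through unchanged
            # (assumption of this sketch; fine for p, t, a, i, ...).
            out.append(text[i])
            i += 1
    return "".join(out)


def code_points(text):
    """List the Unicode code points of a string as U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]


ipa = xsampa_to_ipa("bO~")   # French "bon"
print(code_points(ipa))      # ['U+0062', 'U+0254', 'U+0303']
```

The ASCII form is what one types and manipulates; the code points are what a Unicode font needs in order to display the IPA glyphs, whatever font that happens to be.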
But it is sometimes desirable to have both SAMPA and Unicode coding in the same file (automatic generation from one to the other is rather easy), as SAMPA is easier to use when manipulating character strings on the keyboard, but IPA Unicode glyphs are easier to interpret for most linguists when reading / looking at the data. Depending on what you plan to do with the phonetic transcripts in the analysis process, there may be arguments in favor of either SAMPA/X-SAMPA or IPA or both.

So... Apart from the fact that the tabulated data will be a real pain to organize, because the data coding seems incoherent from a statistical-analysis point of view (but that was not part of the question), I see that the font used to display phonetic characters is: "Ipa-samd Uclphon1 SILDoulosL" (no technical relationship at all with the SIL Doulos mentioned above). Here, LibreOffice displays nothing but "squares". Though obviously I haven't got this font on my computer, I can read the expected font name, so I had a quick look on the net and found this page:

http://www.phon.ucl.ac.uk/shop/fonts.php

(which is obviously where it comes from, as this was, years ago, a font disseminated by the speech community at UCL, as its name may imply), which states, with clear warnings:

"Please note: These fonts are now "legacy fonts": obsolete, symbol-encoded fonts. Their use in new documents is discouraged. If you decide to download and use these, please note there is no user support for them. If your university or organization requires the use of these fonts, please request they change their requirement to one of the Unicode-encoded font which contains the complete IPA repertoire. Many such fonts are now available, and several are supplied with all new computers. Others are available from SIL."
Unfortunately, this clearly corresponds to the first case mentioned above: use of an obsolete IPA transcription system that requires a specific font and, above all, makes data transfer particularly difficult if not impossible, due to discrepancies between positions in the font encoding and "standard" glyph (or shape) representations. I'm certain that this message has been on the UCL web site for several years now! Though one may debate the wisdom of keeping such fonts available for download, one cannot say their web page fails to make clear that they should not be used.

So, first step... tell the student to use "state-of-the-art" coding for phonetic transcriptions (either IPA with Unicode encoding or SAMPA), which means that he/she must rework all the transcriptions in his/her files. Perhaps, while doing that, tell him/her to think about a better solution for storing data than these tables where 90% of the cells are empty...

Sorry to be of no help here, but I really see no point in trying to solve issues when obsolete solutions are the main cause of these issues... Of course, someone on the list may be more optimistic than I am. Anyway, once the student has come back with either SAMPA or Unicode encoding, I would happily provide advice on working with IPA characters within R.

Yours sincerely.

Olivier.

-- 
Olivier Crouzet, PhD
Laboratoire de Linguistique -- EA3827
Université de Nantes
Chemin de la Censive du Tertre - BP 81227
44312 Nantes cedex 3
France

phone: (+33) 02 40 14 14 05 (lab.)
       (+33) 02 40 14 14 36 (office)
fax:   (+33) 02 40 14 13 27
e-mail: olivier.crou...@univ-nantes.fr
http://www.lling.univ-nantes.fr/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.