Gérald Jean wrote:
> Hello,
>
> I use:
>
> R version 2.9.2 (2009-08-24)
> Copyright (C) 2009 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
>
> on Ubuntu 9.10. I usually run R from ESS (5.4 on current Ubuntu) from
> Emacs-22.2.1, but I also tried the following from the console and it
> gave the same results.
>
> I have a data file containing lots of European characters: French,
> German, Italian and so on. I can read it OK in R but I can't display
> the characters correctly.
>
> I searched the archives and, following Professor Ripley's advice, I read
> my data the following way:
>
>> con <- file("/home/gerald/Vins/ListeVin091123.csv", open = "r",
>              encoding = "UTF-8")
>> isOpen(con)
> [1] TRUE
>> ttt <- read.table(file = con, header = TRUE, sep = ";", quote = "\"'",
> +                  dec = ",", # row.names, col.names,
> +                  na.strings = "", colClasses = NA, nrows = -1,
> +                  skip = 0, check.names = TRUE,
> +                  strip.white = FALSE, blank.lines.skip = TRUE,
> +                  comment.char = "#",
> +                  allowEscapes = FALSE, flush = FALSE,
> +                  stringsAsFactors = FALSE)
>> close(con)
>
> It seems that R does recognize the locale, since it tries to report
> errors in French. Here is a simple example:
>
>> ttt.g <- "gérald"
> Erreur : caractères multioctets incorrects dans l'analyse de code
> (parser) à la ligne 1
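
The French message above corresponds to R's "invalid multibyte character in parser" error: the bytes that reached R were not valid in the encoding R expects. A minimal sketch for checking what R itself expects, using only base R (the variable names are illustrative and not from the thread):

```r
# Ask R which character encoding it believes the session is using.
ctype <- Sys.getlocale("LC_CTYPE")   # a locale string such as "fr_CA.UTF-8"
info  <- l10n_info()                 # list of flags, including MBCS and `UTF-8`

# TRUE when R considers the current locale to be a UTF-8 one.
r_expects_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:        ", ctype, "\n")
cat("R expects UTF-8: ", r_expects_utf8, "\n")

# If r_expects_utf8 is TRUE and typing an accented string still triggers
# the multibyte parser error, the bytes arriving from the terminal or the
# Emacs buffer are probably in some other encoding.
```
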

Looks like R is speaking UTF-8 and you're not. Or rather, your console
isn't. You may need to poke around to change that -- I think most
terminal emulators these days allow you to set the encoding from their
menu bar.

However, the printout below doesn't quite look like UTF-8, more like one
of the older ISO646 mechanisms, so you may still have some work to do.
Then again, if OO can read the original file, maybe I am worrying too
soon....

        -p

> outputting the colnames of my data set I get:
>
>> names(ttt)
>  [1] "ID"           "Domaine"      "Nom"          "MillÃÆÃ.sime" "Pays"
>  [6] "RÃÆÃ.gion"    "Appellation"  "Vignoble"     "Couleur"      "Alcool"
> [11] "Classement"   "Cuve"         "mois"         "Bio"          "CÃÆÃ.page..1"
> [16] "X."           "CÃÆÃ.page..2" "X..1"         "CÃÆÃ.page..3" "X..2"
> [21] "CÃÆÃ.page..4" "X..3"         "CÃÆÃ.page..5" "X..4"         "Prix"
> [26] "QuantitÃÆÃ."  "Internet"
>
> sessionInfo yields the following:
>
>> sessionInfo()
> R version 2.9.2 (2009-08-24)
> i486-pc-linux-gnu
>
> locale:
> LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=C;
> LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=fr_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;
> LC_MEASUREMENT=fr_CA.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Revobase_0.2-1
>
> I tried to play with Emacs' coding systems with no luck! Any idea on
> how to handle this?
>
> My ultimate goal is to clean up and sort this data set and then export
> it in a LaTeX compatible format.
>
> By the way, if I open the file with OpenOffice Calc it asks me to
> confirm that the encoding is Unicode UTF-8, I do, change the default
> delimiter to ";" and press enter. All the accented characters display
> OK.
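
Garbled names full of "Ã" characters, like those quoted above, resemble what typically appears when UTF-8 bytes are interpreted in a single-byte encoding somewhere between the file and the display. As an illustration only (an assumption about the mechanism, not a diagnosis of this particular file), the following base-R sketch shows how reading UTF-8 bytes as Latin-1 turns one accented letter into two junk characters, and how that step can be undone. The names `good`, `garbled` and `fixed` are hypothetical:

```r
# The intended column name, written with a \u escape so this script is pure ASCII.
good <- enc2utf8("Mill\u00e9sime")

# In UTF-8 the e-acute is stored as two bytes (c3 a9): 9 characters, 10 bytes.
charToRaw(good)

# Misinterpret those UTF-8 bytes as Latin-1 and convert them to UTF-8 again.
# Each of the two bytes is now treated as a character of its own.
garbled <- iconv(good, from = "latin1", to = "UTF-8")
charToRaw(garbled)   # the two bytes c3 a9 have become four: c3 83 c2 a9

# Undo the damage: convert back to Latin-1 bytes, then declare them to be UTF-8.
fixed <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"

# TRUE if the round trip restored the original bytes.
identical(charToRaw(fixed), charToRaw(good))
```

Repairing strings this way only helps if the data really were converted twice. If the only problem is the terminal's encoding, the data inside R are fine and the fix belongs in the terminal or Emacs settings.
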
>
> Thanks for any insights,
>
> Gérald Jean
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907