Dear list,
        
I have to read a not-so-small bunch of not-so-small Excel files, which 
seem to have traversed the Windows 3.1, Windows 95 and Windows NT 
versions of the thing (with maybe a Mac or two thrown in for good 
measure...). The problem is that 1) I need to read strings, and 2) those 
strings may have various encodings. In the same sheet of the same file, 
some cells may be latin1, some UTF-8 and some CP437 (!).

read.xls() allows me to read those things into a set of data frames. My 
problem is to convert the encodings to UTF-8 without clobbering the 
strings that already are (or at least look like) valid UTF-8.
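The trick I rely on is that iconv() returns NA for any string that is not 
valid in the declared source encoding, so iconv(x, "UTF-8", "UTF-8") gives 
a cheap test for which cells still need converting. A small self-contained 
illustration (the byte values are made up for the example):

```r
# latin1 bytes for "ete" with accents (0xe9 = e-acute in latin1);
# this byte sequence is NOT valid UTF-8
bad  <- rawToChar(as.raw(c(0xe9, 0x74, 0xe9)))
good <- "d\u00e9j\u00e0"                 # already valid UTF-8
x <- c(good, bad)

is.na(iconv(x, "UTF-8", "UTF-8"))        # FALSE TRUE: only 'bad' needs work
iconv(bad, "latin1", "UTF-8")            # reinterpret the latin1 bytes
```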

I came up with the following solution:

foo <- function(d, from = "latin1", to = "UTF-8") {
  # Semi-smart conversion of a data frame between charsets.
  # Needed to ease use of those [EMAIL PROTECTED] Excel files
  # that have survived the Win3.1 --> Win95 --> NT transition,
  # usually in poor shape...
  conv1 <- function(v, from, to) {
    condconv <- function(v, from, to) {
      # iconv(v, to, to) is NA exactly where v is not already valid
      # in the target encoding: convert only those elements
      cnv <- is.na(iconv(v, to, to))
      v[cnv] <- iconv(v[cnv], from, to)
      return(v)
    }
    if (is.factor(v)) {
      # for factors, convert the levels, not the underlying codes
      levels(v) <- condconv(levels(v), from, to)
      return(v)
    }
    else if (is.character(v)) return(condconv(v, from, to))
    else return(v)
  }
  for (i in names(d)) d[, i] <- conv1(d[, i], from, to)
  return(d)
}
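For anyone who wants to try this without one of the offending files, here 
is a self-contained check of the approach on a toy column (the helper is 
the same conditional iconv() as in foo above, just inlined; the data are 
made up):

```r
# Toy stand-in for one read.xls() column: one ASCII cell, one latin1 cell
bad <- rawToChar(as.raw(c(0xe9, 0x74, 0xe9)))   # latin1 bytes, invalid UTF-8
d <- data.frame(txt = c("plain ASCII", bad), stringsAsFactors = FALSE)

fix <- function(v, from, to = "UTF-8") {
  cnv <- is.na(iconv(v, to, to))     # cells not yet valid UTF-8
  v[cnv] <- iconv(v[cnv], from, to)
  v
}

# first pass: latin1 -> UTF-8; a second pass with from = "CP437"
# would catch the remaining DOS-era stragglers
d$txt <- fix(d$txt, from = "latin1")
is.na(iconv(d$txt, "UTF-8", "UTF-8"))           # all FALSE now
```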

Any advice for enhancement is welcome...

Sincerely yours,
        
                                        Emmanuel Charpentier

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.