Re: [R] Umlaut read from csv-file

Heinz Tuechler Sun, 09 Nov 2008 01:56:00 -0800

At 06:25 09.11.2008, Prof Brian Ripley wrote:

On Sat, 8 Nov 2008, Heinz Tuechler wrote:
At 08:01 08.11.2008, Prof Brian Ripley wrote:
We have no idea what you understood (you didn't tell us), but the help says
encoding: character vector.  The encoding(s) to be assumed when 'file'
          is a character string: see 'file'.  A possible value is
          '"unknown"': see the 'Details'.
...
     This paragraph applies if 'file' is a filename (rather than a
     connection).  If 'encoding = "unknown"', an attempt is made to
     guess the encoding.  The result of 'localeToCharset()' is used as
     a guide.  If 'encoding' has two or more elements, they are tried
     in turn until the file/URL can be read without error in the trial
     encoding.
So source(encoding="latin1") says the file is encoded in Latin-1 and should be re-encoded if necessary (e.g. in a UTF-8 locale).
Setting the Encoding of parsed character strings is not mentioned.
You could have written out a data frame with write.csv() and re-read it with read.csv(encoding = "latin1"): that was the workaround you were given earlier (not to use source).
Thank you for this explanation. I felt that I did not understand the help page of source() and I hoped encoding='latin1' would have the same effect as in read.csv(), but rethinking it, I see that it would conflict with the primary functionality of source().
Earlier I tried writing the data.frame with write.csv and re-reading it. This works, but additional information like labels() I have to transfer in a second step.
The best way I could imagine would be some function which marks every character string in the whole structure of a data.frame, including all attributes, as latin1.
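A minimal sketch of such a marking function, for illustration only. The name mark_latin1 is hypothetical and not part of base R. It assumes that every non-ASCII string in the object really holds Latin-1 bytes; strings that are already UTF-8 would be mislabelled.

```r
# Hypothetical helper, not part of base R. It assumes every non-ASCII
# string in 'x' really holds Latin-1 bytes.
mark_latin1 <- function(x) {
  if (is.character(x)) {
    # Only the declared encoding changes, not the bytes.
    # Pure ASCII strings keep the mark "unknown".
    Encoding(x) <- "latin1"
  } else if (is.list(x)) {
    # Lists and data frames: descend into each component.
    for (i in seq_along(x)) {
      if (!is.null(x[[i]])) x[[i]] <- mark_latin1(x[[i]])
    }
  }
  # Attributes (names, levels, labels, ...) may carry strings as well.
  att <- attributes(x)
  for (nm in names(att)) {
    attr(x, nm) <- mark_latin1(att[[nm]])
  }
  x
}

# Example with Latin-1 bytes built from raw values, so it does not depend
# on the encoding of this message.
a_uml <- rawToChar(as.raw(0xe4))                      # a-umlaut
df <- data.frame(txt = c("a", a_uml), stringsAsFactors = FALSE)
attr(df, "label") <- rawToChar(as.raw(c(0x47, 0x72, 0xf6, 0xdf, 0x65)))

df2 <- mark_latin1(df)
Encoding(df2$txt)
Encoding(attr(df2, "label"))
```

The function only sets the encoding mark and never converts bytes, so it is appropriate only when the data are known to be Latin-1 but arrived unmarked.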
I think it is possible that

con <- file("foo")
source(con, encoding="latin1")
close(foo)

will also do what you want, although that's an undocumented side effect.


You are right. It does work in my real data problem. Thank you.

(minor remark: I think close(foo) should be close(con))

But all of this should be unnecessary in R-patched (although it is possible that there are other quirks with unmarked strings lurking in the shadows, there are no other obvious changes from 2.7.2).
On Sat, 8 Nov 2008, Heinz Tuechler wrote:
At 16:52 07.11.2008, Prof Brian Ripley wrote:
On Fri, 7 Nov 2008, Peter Dalgaard wrote:
Heinz Tuechler wrote:
Dear Prof. Ripley!
Thank you very much for your attention. In the given example, Encoding()
or the encoding parameter of read.csv solves the problem. I hope your
patch will also solve the problem when I read an SPSS file with
spss.get(), since this function has no encoding parameter and my real
problem originated there.
read.spss() (package foreign) does have a reencode argument, though; and
this is called by spss.get(), so it looks like an easy hack to add it
there.
Yes, older software like spss.get needs to get updated for the internationalization age. Modifying it to have a ... argument passed to read.spss would be a good idea (and future-proofing).
In cases like this it is likely that the SPSS file does contain its encoding (although sometimes it does not and occasionally it is wrong), so it is helpful to make use of the info if it is there. However, the default is read.spss(reencode=NA) because the problems of assuming that the info is correct when it is not are worse.
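A sketch of the '...' forwarding idea, not the actual Hmisc code. The wrapper name spss_get_sketch and the file name in the usage comment are hypothetical; the only real call used is foreign::read.spss() with its 'to.data.frame' and 'reencode' arguments.

```r
# Hypothetical wrapper, not Hmisc's spss.get(): extra arguments such as
# 'reencode' are handed on to foreign::read.spss() through '...'.
spss_get_sketch <- function(file, ...) {
  if (!requireNamespace("foreign", quietly = TRUE)) {
    stop("package 'foreign' is required")
  }
  foreign::read.spss(file, to.data.frame = TRUE, ...)
}

# Usage (the file name is a placeholder):
# dat <- spss_get_sketch("survey.sav", reencode = "latin1")
```

With this shape the caller decides whether to trust the encoding recorded in the file, and the wrapper needs no change if read.spss() gains further arguments.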
The reason I tried the example below was to solve the encoding problem by dumping and then re-sourcing a data.frame with the encoding parameter set to latin1. As you can see, source(x, encoding='latin1') does not have the effect I expected. Unfortunately I do not have any idea what I misunderstood regarding the meaning of encoding='latin1'.
Heinz Tüchler

us <- c("a", "b", "c", "ä", "ö", "ü")
Encoding(us)
[1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
dump('us', 'us_dump.txt')
rm(us)
source('us_dump.txt', encoding='latin1')
us
[1] "a" "b" "c" "ä" "ö" "ü"
Encoding(us)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
unlink('us_dump.txt')
--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


