Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
________________________________
From: Tal Galili [mailto:[email protected]]
Sent: Friday, March 19, 2010 12:36 AM
To: William Dunlap; [email protected]
Cc: [email protected]
Subject: Re: [R] How to read.table with “Hebrew” column names (in R)?
Hello William, Ista and other R-help members,
The code you suggested:
read.table("http://www.talgalili.com/files/aa.txt", encoding = "UTF-8",
           check.names = FALSE, header = TRUE, sep = "\t")
works for me the same way it does for you: I can read the data in
(finally!), but some ways of using it fail (such as printing, and
the attempt to use the column names in "lm").
So first thanks for the help!
Second, could you please supply your sessionInfo() ?
I wonder how your locale compares to Ista's, since it looks as
if Ista has no problem with the Hebrew.
I was on Windows XP (American/English edition, if that makes
any difference) using a precompiled copy of R 2.11.0 downloaded
from CRAN (the Simon Fraser mirror). sessionInfo() and
l10n_info() say:
> sessionInfo()
R version 2.11.0 Under development (unstable) (2010-03-07 r51225)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tcltk_2.11.0
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
I cannot set the locale to "Hebrew" (nor to "en_US" or
"en_US.utf8").
> Sys.setlocale("LC_ALL", "Hebrew")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "Hebrew") :
OS reports request to set locale to "Hebrew" cannot be honored
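Since Sys.setlocale() signals failure by returning "" together with a warning, the locale names can be probed in a loop instead of one at a time. This is only a sketch: first_usable_locale() is a made-up helper name, and the candidate locale names are guesses whose availability depends on the operating system.

```r
# Hypothetical helper: try several locale names for one category and return
# the first one the OS accepts, or NA if none is accepted. The previous
# setting is restored on exit, so this only probes; it does not switch.
first_usable_locale <- function(candidates, category = "LC_CTYPE") {
  old <- Sys.getlocale(category)
  on.exit(suppressWarnings(Sys.setlocale(category, old)), add = TRUE)
  for (loc in candidates) {
    # A failed request returns "" and raises a warning, as in the
    # transcript above; the warning is silenced here.
    res <- suppressWarnings(Sys.setlocale(category, loc))
    if (!is.null(res) && nzchar(res)) return(loc)
  }
  NA_character_
}

# Candidate names are assumptions; which of them exist is OS-dependent.
first_usable_locale(c("Hebrew", "he_IL.UTF-8", "en_US.UTF-8", "C"))

# Current character-set information, for comparison with the output above.
str(l10n_info())
```

The helper uses LC_CTYPE rather than LC_ALL because character handling is the category in question here.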
I'd like to learn more about the issue since we've had problems
reading UTF-8 encoded XML files and using the results in R on
Windows.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Thanks for helping!
Tal
----------------Contact
Details:-------------------------------------------------------
Contact me: [email protected] | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew)
| www.r-statistics.com (English)
----------------------------------------------------------------------------------------------
On Fri, Mar 19, 2010 at 12:42 AM, William Dunlap <[email protected]>
wrote:
I tried this on R 2.11.0 unstable (2010-03-07 r51225) using
encoding="UTF-8" and check.names=FALSE in read.table().
It seemed to basically work, except that the data.frame/matrix
printing routine wants to print the Unicode codes for the
characters in the names:
> data1 <- read.table("http://www.talgalili.com/files/aa.txt",
header = TRUE, sep = "\t", encoding="UTF-8",
check.names=FALSE)
> data1 # I see Unicode codes, presumably the correct ones
  <U+05D0><U+05D7><U+05EA> <U+05E9><U+05EA><U+05D9><U+05D9><U+05DD>
1                       12                                       97
2                      123                                      354
3                        6                                        1
  <U+05E9><U+05DC><U+05D5><U+05E9>
1                                6
2                               44
3                                3
> colnames(data1) # I see Hebrew strings (in R the first starts with aleph)
[1] "אחת" "שתיים" "שלוש"
> colnames(data1)[1]
[1] "אחת"
> strsplit(colnames(data1)[1], "")[[1]][1]
[1] "א"
> data1[,"שתיים"]
[1] 97 354 1
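The session above depends on the remote file. For anyone who wants to try this without it, here is a minimal self-contained sketch. It assumes the file is tab-separated UTF-8 with the three Hebrew headers and the values shown above; the temporary file and the names hdr, tmp and dat are made up for illustration.

```r
# Hebrew column names written as \u escapes so the script itself is ASCII.
hdr <- c("\u05d0\u05d7\u05ea",               # first column name
         "\u05e9\u05ea\u05d9\u05d9\u05dd",   # second column name
         "\u05e9\u05dc\u05d5\u05e9")         # third column name

# Write a small UTF-8, tab-separated file. useBytes = TRUE writes the UTF-8
# bytes as they are instead of re-encoding them into the native code page.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(c(paste(enc2utf8(hdr), collapse = "\t"),
             "12\t97\t6", "123\t354\t44", "6\t1\t3"),
           con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" declares the strings read in to be UTF-8;
# check.names = FALSE stops read.table() from passing the headers
# through make.names().
dat <- read.table(tmp, header = TRUE, sep = "\t",
                  encoding = "UTF-8", check.names = FALSE)

Encoding(names(dat))   # how R has marked each column name
dat[[hdr[2]]]          # select a column by its Hebrew name
```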
I'm writing this in Outlook in the English (American) locale,
and copying the Hebrew letters from the R GUI window into the
Outlook window reversed the whole line (both the characters in
each name and the order of the names in the line), which is
why I showed a subset of the names and a substring of the first
name.
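One way to check the names without depending on how a mail client or console renders right-to-left text is to look at the code points. A small sketch follows; code_points() is a hypothetical helper name, not something in base R.

```r
# Return the Unicode code points of one string as "U+XXXX" labels.
# enc2utf8() makes sure utf8ToInt() is handed UTF-8 bytes.
code_points <- function(x) sprintf("U+%04X", utf8ToInt(enc2utf8(x)))

first_name <- "\u05d0\u05d7\u05ea"   # the first column name, as \u escapes
code_points(first_name)
# [1] "U+05D0" "U+05D7" "U+05EA"
```

These are the same three codes that appear in the first column header of the data.frame printout, so lapply(colnames(data1), code_points) should give an unambiguous view of all the names.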
However, when I try to use lm() with this data.frame I run into
trouble, which is probably the same problem as I see in the
data.frame printing:
> lm(`שתיים` ~ `שלוש`)
Error: \uxxxx sequences not supported inside backticks (line 1)
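A possible workaround, sketched here under assumptions rather than tested on the Windows setup described, is to keep non-ASCII names out of the formula altogether: model with ASCII column names and keep the Hebrew names in a lookup vector for reporting. The data frame below is a stand-in for data1 built from the values printed above, and the names "one", "two", "three" and heb_labels are made up for illustration.

```r
# Stand-in for data1, using the values from the printout above.
dat <- data.frame(c(12, 123, 6), c(97, 354, 1), c(6, 44, 3))
hebrew <- c("\u05d0\u05d7\u05ea",
            "\u05e9\u05ea\u05d9\u05d9\u05dd",
            "\u05e9\u05dc\u05d5\u05e9")
names(dat) <- hebrew

# Lookup from ASCII working names (hypothetical) to the Hebrew labels.
heb_labels <- setNames(hebrew, c("one", "two", "three"))
names(dat) <- names(heb_labels)

# The formula now contains only ASCII symbols, so no backticks are needed.
fit <- lm(two ~ three, data = dat)
coef(fit)

# Translate back to the Hebrew label when reporting.
heb_labels[["two"]]
```

The cost is that printed model output shows the ASCII names, so any table meant for readers needs the lookup applied at the end.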
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Tal Galili
> Sent: Thursday, March 18, 2010 2:41 PM
> To: [email protected]
> Subject: [R] How to read.table with “Hebrew” column names (in R)?
>
> (I am reposting this question after a few months without a
> solution...)
>
>
> Hi all,
>
> I am trying to read a .txt file with Hebrew column names, but
> without success.
>
> I uploaded an example file to: http://www.talgalili.com/files/aa.txt
>
> And tried the command:
>
> read.table("http://www.talgalili.com/files/aa.txt", header = TRUE, sep = "\t")
>
> This returns:
>
> X.....ª X...ª...... X...œ....
> 1 12 97 6
> 2 123 354 44
> 3 6 1 3
>
> Instead of:
>
> אחת שתיים שלוש
> 12 97 6
> 123 354 44
> 6 1 3
>
>
> Trying to use something like:
>
> read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8")
>
> Has resulted in:
>
> V1
> 1 ?
> Warning messages:
> 1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
>   invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
> 2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
>   incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
>
> While also trying this:
>
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
>
> Or this:
>
> Sys.setlocale("LC_ALL",
> "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
>
> Gets me this:
>
> [1] ""
> Warning message:
> In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
>   OS reports request to set locale to "en_US.UTF-8" cannot be honored
>
>
>
> My output for:
>
> l10n_info()
>
> Is:
>
> $MBCS
> [1] FALSE
>
> $`UTF-8`
> [1] FALSE
>
> $`Latin-1`
> [1] TRUE
>
> $codepage
> [1] 1252
>
> And for:
>
> Sys.getlocale()
>
> Is:
>
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> Finally, here is the sessionInfo():
>
> R version 2.10.1 (2009-12-14)
>
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1255  LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252  LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] tools_2.10.1
>
>
> Any suggestion or clarification will be appreciated.
>
>
>
> Best,
>
> Tal
>
>
>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.