On 2011-08-15 at 22:24, Duncan Murdoch wrote:
> On 11-08-15 7:48 PM, Denis Chabot wrote:
>>
>> On 2011-08-15 at 19:06, Duncan Murdoch wrote:
>>
>>> On 11-08-15 2:42 PM, Denis Chabot wrote:
>>>> Hi,
>>>>
>>>> I usually do not give a second thought to accented vowels, and R handles
>>>> everything fine thanks to UTF-8 being used in my R scripts. But today I
>>>> have a problem: accented vowels do not behave properly when they are
>>>> imported into R using list.files.
>>>>
>>>> Maybe this is because OS X (I'm using 10.6.8) still uses MacRoman for
>>>> file names, though visually the names seem to have been read correctly
>>>> into R.
>>>>
>>>> An example is better than words:
>>>>
>>>> sessionInfo()
>>>> R version 2.13.1 (2011-07-08)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>
>>>> locale:
>>>> [1] fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>>
>>>> This does not cause a problem:
>>>> a = c("1_MO2 crevettes po2crit.Rda", "1_MO2 soles Sète sda.Rda",
>>>>       "1_MO2 turbots po2crit.Rda"); a
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2
>>>> turbots po2crit.Rda"
>>>>
>>>> a2 = gsub(" Sète", "S", a); a2
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda" "1_MO2
>>>> turbots po2crit.Rda"
>>>>
>>>>
>>>> but if instead of creating the vector within the R script, I read it as a
>>>> series of file names, the substitution does not work. I am sorry that I
>>>> cannot make this a reproducible example as it requires the 3 files to
>>>> exist on your computer, but you could create 3 dummy files having the same
>>>> names in the directory of your choice.
>>>>
>>>> don = file.path("données/")
>>>> b = list.files(path = don, pattern = "1_MO2"); b
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2
>>>> turbots po2crit.Rda"
>>>>
>>>> b2 = gsub(" Sète", "S", b); b2
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2
>>>> turbots po2crit.Rda"
>>>>
>>>> I am puzzled and also "stuck". For now I'll modify the file name, but I
>>>> need to be able to handle such names at some point.
>>>>
>>>> Any advice?
>>>
>>>
>>> Possibly your system really is using MacRoman or some other local encoding;
>>> in that case, iconv(x, "", "UTF-8") should convert from the local encoding
>>> to UTF-8.
>>>
>>> I think declaring everything to be UTF-8 may be sufficient. When I use
>>> list.files(), I see the encoding listed as "unknown", but
>>>
>>> x <- list.files()
>>> Encoding(x) <- "UTF-8"
>>>
>>> works. However, the iconv() method should be safer.
>>>
>>> Duncan Murdoch
>>
>> Hi Duncan,
>>
>> iconv() confirmed what I suspected: there was no problem with the encoding
>> of the result of list.files, and if there had been one, the "è" would not
>> have looked like a "è". Therefore, I got nonsense when treating this "è" as
>> MacRoman to be converted into UTF-8:
>>
>> iconv(b, from="MacRoman", to="UTF-8")
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles SeÃÄte sda.Rda" "1_MO2
>> turbots po2crit.Rda"
>>
>> It is not clear, however, that R considered b to be UTF-8:
>> Encoding(b)
>> [1] "unknown" "unknown" "unknown"
>>
>> so I followed your suggestion:
>>
>> Encoding(b)<- "UTF-8"
>> Encoding(b)
>> [1] "unknown" "UTF-8" "unknown"
>>
>> but gsub still did not work:
>> b2 = gsub(" Sète", "S", b); b2
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2
>> turbots po2crit.Rda"
>>
>> I do not know why gsub worked with example "a" but not "b" in the example
>> shown in my original message. Strange and frustrating.
>
> Unicode sometimes gives different ways to encode what is rendered as the same
> character (e.g. letter + accent versus accented letter). I think (see below)
> the OS uses one convention, but R chooses the other when it parses your text.
>
> Cut and paste did just work for me, in a version of R 2.13.0 Patched which
> predates 2.13.1 by a few weeks; I'm not up to date on my Mac:
>
>
> > x <- list.files()
> > x
> [1] "1_MO2 soles Sète sda.Rda"
> > gsub("Sète", "XXXX", x)
> [1] "1_MO2 soles XXXX sda.Rda"
>
>
>
> In the second line, I didn't try to type the pattern containing Sète, I just
> cut and pasted it from the printed version of x.
>
> One other possibility (and perhaps it's the best one, if your substitutions
> are all so simple) is to use the useBytes=TRUE option to gsub. You can use
> charToRaw to see the bytes in a string, to make sure they are what you expect.
>
> When I do that, I see that the è really is handled differently in the two
> cases:
>
> > charToRaw("Sète") # cut and paste from list.files() output
> [1] 53 65 cc 80 74 65
> > charToRaw("Sète") # entered on the keyboard
> [1] 53 c3 a8 74 65
>
> So your solution is ugly: you'll need to code all your substitutions twice
> (or more!) to handle all the possible ways the same letter could be encoded.
> Or maybe iconv() or some other function has an option to normalize the
> encoding. (I've just read some more about the issue in
> http://en.wikipedia.org/wiki/Unicode_equivalence; normalization is what you
> want to do, but I don't know how to do it.)
>
> Duncan Murdoch
Hi again Duncan,

The "Errors due to normalization differences" part of the article you referred
to seems to confirm your suspicion.
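
The mismatch can be reproduced without creating any files by writing the two forms of "Sète" with \u escapes. This is only a minimal sketch: the variable names are illustrative and do not appear in the thread, and the byte values are the ones Duncan reported with charToRaw.

```r
# Illustrative names; neither variable appears in the thread.
file_form <- "Se\u0300te"  # decomposed: "e" followed by U+0300 (combining grave)
kb_form   <- "S\u00e8te"   # precomposed: single code point U+00E8

charToRaw(file_form)  # bytes 53 65 cc 80 74 65, as in the list.files() case
charToRaw(kb_form)    # bytes 53 c3 a8 74 65, as in the keyboard case

# The two strings render alike but are not the same sequence of bytes,
# so a pattern written in one form does not match text in the other.
same  <- identical(file_form, kb_form)             # FALSE
found <- grepl(kb_form, file_form, fixed = TRUE)   # FALSE
```
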
I can get this to work, but it is messy:

# bytes of "Sète" as returned by list.files() (e + combining accent)
Sètefileraw = charToRaw(substr(b[2],13,17))
Sètefile = rawToChar(Sètefileraw)
# bytes of "Sète" as typed on the keyboard (precomposed è)
Sètekbraw = charToRaw(substr(a[2],13,16))
Sètekb = rawToChar(Sètekbraw)
# replace the file-name form with the keyboard form
c = b
c = gsub(Sètefile, Sètekb, c)

At this point, Sète has become the "keyboard" version and the rest of the
script can work:

c2 = gsub(" Sète", "S", c); c2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda" "1_MO2 turbots
po2crit.Rda"
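
The manual replacement above can be generalized a little. The following is a rough sketch, not a full Unicode normalizer: compose_accents() is a hypothetical helper (it is not from the thread or from any package), and it only covers the handful of letter + accent pairs listed in it. If the stringi package is installed, stri_trans_nfc() performs full NFC normalization and would be the more general route; it is not used here so that the sketch needs only base R.

```r
# Hypothetical helper (not from the thread): replace a few decomposed
# sequences (base letter + combining accent) with the precomposed letter.
compose_accents <- function(x) {
  decomposed  <- c("e\u0300", "e\u0301", "e\u0302", "a\u0300", "u\u0300")
  precomposed <- c("\u00e8",  "\u00e9",  "\u00ea",  "\u00e0",  "\u00f9")
  x <- enc2utf8(x)
  for (i in seq_along(decomposed)) {
    # fixed = TRUE: treat the pattern as a literal string, not a regex
    x <- gsub(decomposed[i], precomposed[i], x, fixed = TRUE)
  }
  x
}

# Stand-in for the list.files() result, with the decomposed form of "è".
b <- c("1_MO2 crevettes po2crit.Rda",
       "1_MO2 soles Se\u0300te sda.Rda",
       "1_MO2 turbots po2crit.Rda")

# After composing, the pattern written in precomposed form matches.
c2 <- gsub(" S\u00e8te", "S", compose_accents(b), fixed = TRUE)
c2
```

Writing the accented letters as \u escapes keeps the script itself independent of how the editor or the keyboard happens to encode them.
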
I'll keep accented vowels out of file names for this project whenever I have
to use gsub on them!

Thanks again,

Denis
_______________________________________________
R-SIG-Mac mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mac