Hi all,

Four questions regarding Unicode.
Three Windows questions. I am using

- a PC with Windows XP (Build 20600.xpsp080413-2111, Service Pack 3);
- the following R version:

> R.version
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          7.0
year           2008
month          04
day            22
svn rev        45424
language       R
version.string R version 2.7.0 (2008-04-22)

- the following locale:

> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# I loaded the file
# <http://www.linguistics.ucsb.edu/faculty/stgries/teaching/russ_corp.txt>
# into R, and this works fine.
corpus.file<-scan(choose.files(), what="char", sep="\n", quote="", comment.char="", encoding="UTF-8")

# My problems are the following:

# 1 strsplit
# This does not work:
words.1<-unlist(strsplit(corpus.file, "[-!;:\'\"\\?\\. ]+", perl=T))
# - words.1[173] should be "фирме", as in corpus.file[6]
#   but it is "фирме"
# - words.1[208] should be "Торговли", as in corpus.file[13]
#   but it is "Торговли"
# - words.1[214] should be "клиентов", as in corpus.file[14]
#   but it is "Торговли"

# 2 entering Unicode characters into R: I want to search for,
# say, "для". So I try to define it as follows,
# but this doesn't work:
(x123<-"\u0434\u043b\u044F")
# I can define each individual character
(x1<-"\u0434"); (x2<-"\u043b"); (x3<-"\u044F")
# and each pair of characters
(x12<-"\u0434\u043b")
(x13<-"\u0434\u044F")
(x23<-"\u043b\u044F")
# but not all three ... the last one gets skipped.
# Why is that, and how do I do it?
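A minimal diagnostic sketch for question 2, assuming the problem lies in how the three-character string is parsed or marked rather than in how it is printed. The lines below build the string from the same escapes and inspect its declared encoding, its length in characters, and its code points. Encoding(), nchar() and utf8ToInt() are base R functions; no output from the Windows setup described above is reported here.

```r
# the three-character string from question 2, built from \u escapes
x123 <- "\u0434\u043b\u044F"

# declared encoding of the string ("UTF-8", "latin1" or "unknown")
Encoding(x123)

# number of characters; 3 if nothing was dropped
nchar(x123, type = "chars")

# code points of the string; 1076 1083 1103 (i.e. U+0434, U+043B, U+044F)
# if all three characters survived parsing
utf8ToInt(x123)
```

If the last code point is missing here as well, the character is lost when the string is created, not when it is displayed.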
# 3 defining Unicode character ranges: in each of the following,
# the last bracket does not get included (even if it gets defined
# as a Unicode character, too):
russ.char.yes<-"[\u0401\u0410-\u044F\u0451]" # all Russian Cyrillics
russ.char.no<-"[^\u0401\u0410-\u044F\u0451]" # other characters
russ.char.capit<-"[\u0410-\u042F\u0401]" # capital Russian Cyrillics
russ.char.small<-"[\u0430-\u044F\u0451]" # small Russian Cyrillics

# I can do all of that on Linux, but this arises in a context where
# many other character processing issues are explained for Mac,
# Linux, *and* Windows, and I'd hate to have to say "this one
# thing, you can't do on Windows"

One Linux question. I am using Ubuntu Hardy Heron:

> sessionInfo()
R version 2.7.0 (2008-04-22)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

# strange(?) behavior of word boundary characters:
# I understand why these work ...
grep("\\bмолод", "а молодость", perl=F, value=T) # OK
# [1] "а молодость"
gsub("\\bмолод", ">XX<", "а молодость", perl=F) # OK
# [1] "а >XX<ость"

# but why does "\\b" not work with perl=T?
grep("\\bмолод", "а молодость", perl=T, value=T) # FAIL
# character(0)
gsub("\\bмолод", ">XX<", "а молодость", perl=T) # FAIL
# [1] "а молодость"

Any pointers would be much appreciated and acknowledged ...
STG