Hello everyone, I am reading an HTML table from a website with readHTMLTable() from the XML package:

> library(XML)
> moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html", header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
> moose
          V1                               V2        V3
1 07.02.2010  SWITCH to Long Bonds\n (BTTRX)  $880,370
2 05.07.2010             Switch to Gold (GLD)  $878,736
3 03.05.2010 Switch to US Small-cap Equities (IWM)  $895,676
4 01.22.2010            Switch to Cash (3moT)  $895,572
..... truncated by me!

I am interested in the values in the third column:

> as.character(moose$V3)
 [1] "$880,370 " "$878,736 " "$895,676 " "$895,572 " "$932,139 " "$932,131 " "$1,013,505 " "$817,451 " "$817,082 " "$848,133 "
[11] "$904,527 " " $903,981 " "$902,582 " "$896,170 " "$809,853 " " $808,852 " " $807,409 " "$802,658 " "$747,629 " "$672,465 "
[21] " $671,826 " "$645,352 " "$615,174 " "$609,415 " " $590,664 " " $586,785 " "$561,056 " "$537,307 " " $535,744 " " $552,712 "
[31] "$551,615 " " $508,790 " "$501,161 " "$499,023 " " $446,568 " "$423,727 " "$421,967 " "$396,007 " "$395,943 " " $270,011 "
[41] "$264,386 " "$278,513 " "$251,855 " "$251,685 " " $129,198 " "$127,541 " "$117,381 " "$100,000 " " " " $275,417"
[51] "$266,459" " $214,552" "$207,312" "$173,557" "$167,647" "$150,516" "$135,842" "$126,667" "$131,642" "$113,804"
[61] "$107,364" "$108,242" " $102,881" " $100,000"

Notice the leading and trailing spaces in some of the values.

I want the values as numbers, so I try to get rid of the $ characters and commas with gsub() and a regular expression:

> gsub("[$,]", "", as.character(moose$V3))
 [1] "880370 " "878736 " "895676 " "895572 " "932139 " "932131 " "1013505 " "817451 " "817082 " "848133 " "904527 " " 903981 " "902582 "
[14] "896170 " "809853 " " 808852 " " 807409 " "802658 " "747629 " "672465 " " 671826 " "645352 " "615174 " "609415 " " 590664 " " 586785 "
[27] "561056 " "537307 " " 535744 " " 552712 " "551615 " " 508790 " "501161 " "499023 " " 446568 " "423727 " "421967 " "396007 " "395943 "
[40] " 270011 " "264386 " "278513 " "251855 " "251685 " " 129198 " "127541 " "117381 " "100000 " " " " 275417" "266459" " 214552"
[53] "207312" "173557" "167647" "150516" "135842" "126667" "131642" "113804" "107364" "108242" " 102881" " 100000"

Looks fine to me. Now I can use as.numeric() to convert to numbers (leading and trailing spaces should not be a problem):

> as.numeric(gsub("[$,]", "", as.character(moose$V3)))
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[21] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[41] NA NA NA NA NA NA NA NA NA NA 266459 NA 207312 173557 167647 150516 135842 126667 131642 113804
[61] 107364 108242 NA NA
Warning message:
NAs introduced by coercion

Something is wrong here! Let's have a look at one specific value:

> gsub("[$,]", "", as.character(moose$V3))[1]
[1] "880370 "
> as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
[1] NA
Warning message:
NAs introduced by coercion

If the last character in the string were a regular space, it would not be a problem for as.numeric():

> as.numeric("880370 ")
[1] 880370

But it looks like it is not a regular space character:

> substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
[1] FALSE

It looks to me as if the spaces in some of the cells are not regular spaces. In the original HTML table they are defined as "non breaking spaces", i.e. &nbsp;. So my question is: WHAT ARE THEY?
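
For anyone who wants to reproduce the comparison without fetching the page, here is a minimal sketch. It assumes, and this is only a guess based on the &nbsp; entities in the HTML, that the stray character is the no-break space U+00A0; the variable names are made up for the example.

```r
# Assumption: the odd trailing character is U+00A0 (no-break space).
nbsp <- intToUtf8(160L)

with_space <- "880370 "                 # regular trailing space
with_nbsp  <- paste0("880370", nbsp)    # looks identical when printed

# A regular trailing space is ignored by as.numeric().
as.numeric(with_space)

# The 7th character is a plain space in one string but not in the other.
substr(with_space, 7, 7) == " "         # TRUE
substr(with_nbsp, 7, 7) == " "          # FALSE

# In the session above, values like this one came back as NA.
suppressWarnings(as.numeric(with_nbsp))
```

Both strings print the same way at the console, which is why the problem is invisible in the output above.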

Is there a way to show the binary (hex) values of these characters?

Here is my environment:

> sessionInfo()
R version 2.11.1 (2010-05-31)
i486-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 LC_MONETARY=C
 [6] LC_MESSAGES=en_US.utf8 LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] XML_3.1-0

loaded via a namespace (and not attached):
[1] tools_2.11.1

Thanks,
-Mark-
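
One possible way to look at the underlying values, sketched with base R only. charToRaw() shows the bytes and utf8ToInt() shows the Unicode code points. The sample string is hypothetical: it stands in for one cell of moose$V3 and assumes the trailing character is U+00A0, which has not been confirmed against the real data.

```r
# Hypothetical stand-in for one table cell; U+00A0 is an assumption.
nbsp <- intToUtf8(160L)
x <- paste0("$880,370", nbsp)

# Raw bytes in hex. In UTF-8, U+00A0 is stored as the two bytes c2 a0.
raw_bytes <- charToRaw(x)
raw_bytes

# Unicode code point of the last character, formatted as U+XXXX.
last <- substr(x, nchar(x), nchar(x))
sprintf("U+%04X", utf8ToInt(last))      # "U+00A0"

# Remove the no-break space literally, then the "$", commas and any
# ordinary whitespace, and convert what is left.
cleaned <- gsub(nbsp, "", x, fixed = TRUE)
cleaned <- gsub("[$,[:space:]]", "", cleaned)
as.numeric(cleaned)                     # 880370
```

If the real data turns out to hold a different character, the code point printed by utf8ToInt() is what would go into intToUtf8() in place of 160L.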