Hello everyone,
I am reading a HTML table from a website with readHTMLTable() from the XML
package:
library(XML)
moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
moose
V1 V2 V3
1 07.02.2010 SWITCH to Long Bonds\n (BTTRX) $880,370
2 05.07.2010 Switch to Gold (GLD) $878,736
3 03.05.2010 Switch to US Small-cap Equities (IWM) $895,676
4 01.22.2010 Switch to Cash (3moT) $895,572
..... truncated by me!
I am interested in the values in the third column:
as.character(moose$V3)
[1] "$880,370 " "$878,736 " "$895,676 " "$895,572 " "$932,139 "
"$932,131 " "$1,013,505 " "$817,451 " "$817,082 " "$848,133"
[11] "$904,527 " " $903,981 " "$902,582 " "$896,170 " "$809,853 " "
$808,852 " " $807,409 " "$802,658 " "$747,629 " "$672,465 "
[21] " $671,826 " "$645,352 " "$615,174 " "$609,415 " " $590,664 " "
$586,785 " "$561,056 " "$537,307 " " $535,744 " " $552,712 "
[31] "$551,615 " " $508,790 " "$501,161 " "$499,023 " " $446,568 "
"$423,727 " "$421,967 " "$396,007 " "$395,943 " " $270,011 "
[41] "$264,386 " "$278,513 " "$251,855 " "$251,685 " " $129,198 "
"$127,541 " "$117,381 " "$100,000 " " " " $275,417"
[51] "$266,459" " $214,552" "$207,312" "$173,557" "$167,647"
"$150,516" "$135,842" "$126,667" "$131,642" "$113,804"
[61] "$107,364" "$108,242" " $102,881" " $100,000"
Notice the spaces leading and lagging some of the values.
I want to get the values as numeric values, so I try to get rid of the
$-character and comma's with gsub() and a regular expression:
gsub("[$,]", "", as.character(moose$V3))
[1] "880370 " "878736 " "895676 " "895572 " "932139 " "932131 "
"1013505 " "817451 " "817082 " "848133 " "904527 " " 903981 " "902582
"
[14] "896170 " "809853 " " 808852 " " 807409 " "802658 " "747629 "
"672465 " " 671826 " "645352 " "615174 " "609415 " " 590664 " " 586785
"
[27] "561056 " "537307 " " 535744 " " 552712 " "551615 " " 508790 "
"501161 " "499023 " " 446568 " "423727 " "421967 " "396007 " "395943"
[40] " 270011 " "264386 " "278513 " "251855 " "251685 " " 129198 "
"127541 " "117381 " "100000 " " " " 275417" "266459" " 214552"
[53] "207312" "173557" "167647" "150516" "135842" "126667"
"131642" "113804" "107364" "108242" " 102881" " 100000"
Looks fine to me. Now I can use as.numeric() to convert to numbers (leading
and lagging spaces should not be a problem):
as.numeric(gsub("[$,]", "", as.character(moose$V3)))
[1] NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA
[21] NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA
[41] NA NA NA NA NA NA NA NA NA NA
266459 NA 207312 173557 167647 150516 135842 126667 131642 113804
[61] 107364 108242 NA NA
Warning message:
NAs introduced by coercion
Something is wrong here! Let's have a look at one specific value:
gsub("[$,]", "", as.character(moose$V3))[1]
[1] "880370 "
as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
[1] NA
Warning message:
NAs introduced by coercion
If the last character in the string would be a regular space it would not be
a problem for as.numeric():
as.numeric("880370 ")
[1] 880370
But it looks like it's not a regular space character:
substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
[1] FALSE
It looks to me the spaces in some of the cells are not regular spaces. In
the original HTML table they are defined as "non breaking spaces" i.e.
So my question is WHAT ARE THEY?
Is there a way to show the binary (hex) values of these characters?