Hello everyone,

I am reading an HTML table from a website with readHTMLTable() from the XML
package:

> library(XML)
> moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
> moose
            V1                                         V2          V3
1   07.02.2010  SWITCH to Long Bonds\n            (BTTRX)   $880,370
2   05.07.2010                       Switch to Gold (GLD)   $878,736
3   03.05.2010      Switch to US Small-cap Equities (IWM)   $895,676
4   01.22.2010                      Switch to Cash (3moT)   $895,572
..... truncated by me!

I am interested in the values in the third column:

> as.character(moose$V3)
 [1] "$880,370 "   "$878,736 "   "$895,676 "   "$895,572 "   "$932,139 "
"$932,131 "   "$1,013,505 " "$817,451 "   "$817,082 "   "$848,133 "
[11] "$904,527 "   " $903,981 "  "$902,582 "   "$896,170 "   "$809,853 "
" $808,852 "  " $807,409 "  "$802,658 "   "$747,629 "   "$672,465 "
[21] " $671,826 "  "$645,352 "   "$615,174 "   "$609,415 "   " $590,664 "
" $586,785 "  "$561,056 "   "$537,307 "   " $535,744 "  " $552,712 "
[31] "$551,615 "   " $508,790 "  "$501,161 "   "$499,023 "   " $446,568 "
 "$423,727 "   "$421,967 "   "$396,007 "   "$395,943 "   " $270,011 "
[41] "$264,386 "   "$278,513 "   "$251,855 "   "$251,685 "   " $129,198 "
 "$127,541 "   "$117,381 "   "$100,000 "   " "           " $275,417"
[51] "$266,459"    " $214,552"   "$207,312"    "$173,557"    "$167,647"
 "$150,516"    "$135,842"    "$126,667"    "$131,642"    "$113,804"
[61] "$107,364"    "$108,242"    " $102,881"   " $100,000"

Notice the leading and trailing spaces around some of the values.

I want the values as numbers, so I try to strip the $ characters and
commas with gsub() and a regular expression:

> gsub("[$,]", "", as.character(moose$V3))
 [1] "880370 "  "878736 "  "895676 "  "895572 "  "932139 "  "932131 "
 "1013505 " "817451 "  "817082 "  "848133 "  "904527 "  " 903981 " "902582 "
[14] "896170 "  "809853 "  " 808852 " " 807409 " "802658 "  "747629 "
 "672465 "  " 671826 " "645352 "  "615174 "  "609415 "  " 590664 " " 586785 "
[27] "561056 "  "537307 "  " 535744 " " 552712 " "551615 "  " 508790 "
"501161 "  "499023 "  " 446568 " "423727 "  "421967 "  "396007 "  "395943 "
[40] " 270011 " "264386 "  "278513 "  "251855 "  "251685 "  " 129198 "
"127541 "  "117381 "  "100000 "  " "        " 275417"  "266459"   " 214552"
[53] "207312"   "173557"   "167647"   "150516"   "135842"   "126667"
"131642"   "113804"   "107364"   "108242"   " 102881"  " 100000"

Looks fine to me. Now I can use as.numeric() to convert to numbers (leading
and trailing spaces should not be a problem):

> as.numeric(gsub("[$,]", "", as.character(moose$V3)))
 [1]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
  NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
[21]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
  NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
[41]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
266459     NA 207312 173557 167647 150516 135842 126667 131642 113804
[61] 107364 108242     NA     NA
Warning message:
NAs introduced by coercion

Something is wrong here! Let's have a look at one specific value:

> gsub("[$,]", "", as.character(moose$V3))[1]
[1] "880370 "
> as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
[1] NA
Warning message:
NAs introduced by coercion

If the last character in the string were a regular space, it would not be
a problem for as.numeric():

> as.numeric("880370 ")
[1] 880370
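
As a small illustration of that point, the following sketch feeds as.numeric() a few made-up strings padded with ordinary ASCII spaces only. The vector name padded is hypothetical and the values are not taken from the table.

```r
# Hypothetical test strings: the same number with regular (ASCII 32)
# spaces before it, after it, or on both sides.
padded <- c("880370", " 880370", "880370 ", " 880370 ")

# as.numeric() ignores leading and trailing regular spaces,
# so every element should convert without a warning.
vals <- as.numeric(padded)
print(vals)
```

If all four elements convert cleanly, the NA values seen earlier point at the character itself rather than at whitespace handling in general.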

But it looks like it's not a regular space character:

> substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
[1] FALSE

It looks to me as if the spaces in some of the cells are not regular spaces.
In the original HTML table they are defined as "non-breaking spaces", i.e.
&nbsp;

So my question is WHAT ARE THEY?
Is there a way to show the binary (hex) values of these characters?
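
A minimal sketch of how such characters could be inspected with base R's utf8ToInt() and charToRaw() follows. It does not read the real table: the names nbsp and x are made up for illustration, and the sketch assumes the odd character is U+00A0 (no-break space), which is not confirmed for the actual data.

```r
# Assumption: the trailing character is a no-break space, U+00A0.
nbsp <- intToUtf8(160L)

# Hypothetical test string: digits followed by the no-break space.
x <- paste("880370", nbsp, sep = "")

# Unicode code point of each character; a regular space would be 32.
codes <- utf8ToInt(x)
print(codes)

# Raw bytes shown in hex; in UTF-8, U+00A0 is the two bytes c2 a0.
bytes <- charToRaw(x)
print(bytes)

# Remove the no-break space literally (fixed = TRUE), then convert.
cleaned <- gsub(nbsp, "", x, fixed = TRUE)
value <- as.numeric(cleaned)
print(value)
```

Applied to the real column, the same calls on a single element, for example as.character(moose$V3)[1], should show whether the last code point is 160 or something else.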

Here is my environment:

> sessionInfo()
R version 2.11.1 (2010-05-31)
i486-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C              LC_TIME=en_US.utf8
       LC_COLLATE=en_US.utf8     LC_MONETARY=C
 [6] LC_MESSAGES=en_US.utf8    LC_PAPER=en_US.utf8       LC_NAME=C
      LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] XML_3.1-0

loaded via a namespace (and not attached):
[1] tools_2.11.1

Thanks,

-Mark-
