[R] Strange space characters in character strings

Mark Breman Mon, 23 Aug 2010 08:39:23 -0700

Hello everyone,

I am reading an HTML table from a website with readHTMLTable() from the XML
package:


> library(XML)
> moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
> moose
            V1                                         V2          V3
1   07.02.2010  SWITCH to Long Bonds\n            (BTTRX)   $880,370
2   05.07.2010                       Switch to Gold (GLD)   $878,736
3   03.05.2010      Switch to US Small-cap Equities (IWM)   $895,676
4   01.22.2010                      Switch to Cash (3moT)   $895,572
..... truncated by me!

I am interested in the values in the third column:

> as.character(moose$V3)
 [1] "$880,370 "   "$878,736 "   "$895,676 "   "$895,572 "   "$932,139 "
"$932,131 "   "$1,013,505 " "$817,451 "   "$817,082 "   "$848,133 "
[11] "$904,527 "   " $903,981 "  "$902,582 "   "$896,170 "   "$809,853 "   "
$808,852 "  " $807,409 "  "$802,658 "   "$747,629 "   "$672,465 "
[21] " $671,826 "  "$645,352 "   "$615,174 "   "$609,415 "   " $590,664 "  "
$586,785 "  "$561,056 "   "$537,307 "   " $535,744 "  " $552,712 "
[31] "$551,615 "   " $508,790 "  "$501,161 "   "$499,023 "   " $446,568 "
 "$423,727 "   "$421,967 "   "$396,007 "   "$395,943 "   " $270,011 "
[41] "$264,386 "   "$278,513 "   "$251,855 "   "$251,685 "   " $129,198 "
 "$127,541 "   "$117,381 "   "$100,000 "   " "           " $275,417"
[51] "$266,459"    " $214,552"   "$207,312"    "$173,557"    "$167,647"
 "$150,516"    "$135,842"    "$126,667"    "$131,642"    "$113,804"
[61] "$107,364"    "$108,242"    " $102,881"   " $100,000"

Notice the leading and trailing spaces on some of the values.

I want to get the values as numeric values, so I try to get rid of the
$-character and commas with gsub() and a regular expression:

> gsub("[$,]", "", as.character(moose$V3))
 [1] "880370 "  "878736 "  "895676 "  "895572 "  "932139 "  "932131 "
 "1013505 " "817451 "  "817082 "  "848133 "  "904527 "  " 903981 " "902582
"
[14] "896170 "  "809853 "  " 808852 " " 807409 " "802658 "  "747629 "
 "672465 "  " 671826 " "645352 "  "615174 "  "609415 "  " 590664 " " 586785
"
[27] "561056 "  "537307 "  " 535744 " " 552712 " "551615 "  " 508790 "
"501161 "  "499023 "  " 446568 " "423727 "  "421967 "  "396007 "  "395943 "
[40] " 270011 " "264386 "  "278513 "  "251855 "  "251685 "  " 129198 "
"127541 "  "117381 "  "100000 "  " "        " 275417"  "266459"   " 214552"
[53] "207312"   "173557"   "167647"   "150516"   "135842"   "126667"
"131642"   "113804"   "107364"   "108242"   " 102881"  " 100000"

Looks fine to me. Now I can use as.numeric() to convert to numbers (leading
and trailing spaces should not be a problem):

> as.numeric(gsub("[$,]", "", as.character(moose$V3)))
 [1]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
  NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
[21]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
  NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
[41]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
266459     NA 207312 173557 167647 150516 135842 126667 131642 113804
[61] 107364 108242     NA     NA
Warning message:
NAs introduced by coercion

Something is wrong here! Let's have a look at one specific value:

> gsub("[$,]", "", as.character(moose$V3))[1]
[1] "880370 "
> as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
[1] NA
Warning message:
NAs introduced by coercion

If the last character in the string were a regular space, it would not be
a problem for as.numeric():

> as.numeric("880370 ")
[1] 880370

But it looks like it's not a regular space character:

> substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
[1] FALSE

It looks to me as if the spaces in some of the cells are not regular spaces.
In the original HTML table they are defined as non-breaking spaces, i.e.
&nbsp;

So my question is WHAT ARE THEY?
Is there a way to show the binary (hex) values of these characters?
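
For reference, here is a minimal sketch of how such a character could be
inspected. It assumes the odd character is a UTF-8 encoded non-breaking space
(U+00A0), which I have not confirmed for this table. The sample string x is
made up for illustration and is not taken from the moose data.

```r
# Hypothetical sample value: "880370" followed by a non-breaking space (U+00A0).
x <- "880370\u00a0"

# Unicode code point of each character; a regular space is 32, U+00A0 is 160.
codepoints <- utf8ToInt(x)
print(codepoints)

# Raw bytes in hex; in UTF-8 a non-breaking space is the two bytes c2 a0.
bytes <- charToRaw(enc2utf8(x))
print(bytes)

# Remove the non-breaking space literally (fixed = TRUE), then convert.
cleaned <- as.numeric(gsub("\u00a0", "", x, fixed = TRUE))
print(cleaned)
```

If the assumption holds, the same gsub() call could be applied to
as.character(moose$V3) after stripping the $ and commas.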

Here is my environment:

> sessionInfo()
R version 2.11.1 (2010-05-31)
i486-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C              LC_TIME=en_US.utf8
       LC_COLLATE=en_US.utf8     LC_MONETARY=C
 [6] LC_MESSAGES=en_US.utf8    LC_PAPER=en_US.utf8       LC_NAME=C
      LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] XML_3.1-0

loaded via a namespace (and not attached):
[1] tools_2.11.1

Thanks,

-Mark-


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
