On Aug 2, 2009, at 7:02 PM, Noah Silverman wrote:
Hi,
It seems as if the problem was caused by an odd quirk of the "scale"
function.
Some of my data have NA entries.
So, I substitute 0 for any NA with:
rawdata[is.na(rawdata)] <- 0
Perhaps this would have done what you intended:
rawdata[is.na(rawdata), ] <- 0
# But this is added _only_ as a matter of coding behavior. See below.
I then scale the data.
For some reason that I don't understand, I find some NA back in the data
after the scale command.
But, issuing the same 0 substitution AFTER the scale command makes
everything work again.
rawdata[is.na(rawdata)] <- 0
It "works" because rawdata has been converted by scale() to a matrix
which can be accessed as a vector.
The notion of adding zeroes for NA seems "so wrong". And the idea that
you might get the same results of doing so before scale() as after
scale() seems additionally bizarre.
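One possible source of the NA values that reappear after scale(), offered as an assumption because the thread never confirms it: a column whose values are all identical has a standard deviation of 0, so scale() divides 0 by 0 and returns NaN, which is.na() also reports as TRUE. Substituting 0 for NA before scaling can create such constant columns. The sketch below uses a made-up data frame (toy, with columns a and b), not the poster's data.

```r
# Hypothetical toy data: column a has one missing value, column b is constant.
toy <- data.frame(a = c(1, 2, NA, 4), b = c(0, 0, 0, 0))

# scale() converts its input with as.matrix() and returns a numeric matrix.
scaled <- scale(toy)
is.matrix(scaled)

# The NA in column a is left out of the mean and sd and simply stays NA.
# Column b has sd 0, so every entry becomes 0 / 0 = NaN, and is.na(NaN) is TRUE.
na_after <- colSums(is.na(scaled))
na_after

# Constant columns can be found before scaling and handled separately.
constant_cols <- names(toy)[sapply(toy, function(x) sd(x, na.rm = TRUE) == 0)]
constant_cols
```

Since scale() already skips NA when it computes column means and standard deviations, the zero substitution may not be needed before scaling at all.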
VERY strange behavior.
Your behavior might be seen as VERY strange by some.
--
D
-N
On 8/2/09 3:57 PM, J Dougherty wrote:
On Sunday 02 August 2009 02:34:43 pm Noah Silverman wrote:
The column names have to be obfuscated, but here are 10 rows of the data.
label c0 c1 c2 c3 c4 c5 c6 c7 c8
c9 c10 c11 c12 c13
c14 c15 c16 c17 c18 c19 c20 c21 c22 c23
c24 c25 c26 c27
c28 c29 c30 c31 c32 c33 c34 c35 c36 c37
c38 c39 c40 c41
c42 c43 c44 c45 c46 c47 c48 c49 c50 c51
c52 c53 c54 c55
c56 c57 c58 c59 c60 c61 c62 c63 c64 c65
c66
sick 2008-12-28_1 95.609 5 3.3 1.35 0 1 35
9.6666 0 0
0.0833 1 0.0833 1 0.1428 7 3 2.035714286 6.5
94.8481
53.846 12 -4.69 1.25 0.5062 0.0522 0.1808 3 0.5126 0.0694
0.2061 94.9288 8.3125 0.0247 7.5833 9.3 35 9.6666 0 0
0.0833 1 0.0833 1 0.1428 7 3 2.035714286 6.5
94.8481
53.846 12 -4.69 1.25 0.5062 0.0522 0.1808 3 0.5126 0.0694
0.2061 94.9288 8.3125 0.0247 7.5833 9.3
well 2008-12-28_1 95.338 1 11 3.2 3 2 11
7.0277 0.0555 2
0.1666 6 0.1666 5 0.238 18 11 2.541666667 2.022727273
94.7733
38.461 36 6.07 7.5555 0.5928 0.0955 0.2871 0 0.5434 0.0679
0.2283 95.9003 5.1736 0.0847 7.3333 28 11 7.0277 0.0555
2
0.1666 6 0.1666 5 0.238 18 11 2.541666667 2.022727273
94.7733
38.461 36 6.07 7.5555 0.5928 0.0955 0.2871 0 0.5434 0.0679
0.2283 95.9003 5.1736 0.0847 7.3333 28
well 2008-12-28_1 95.204 2 7.4 2.75 4 1 22
8.4545 0 0
0 0 0 0 0 6 4 2.791666667 2.5625
94.8444 61.538 11 2.84
3.0909 0.5693 0.0641 0.2738 0 0.5874 0.1011 0.2803 94.9769
8.1363 0.0467 5.4545 10 22 8.4545 0 0 0 0
0 0 0 6 4
2.791666667 2.5625 94.8444 61.538 11 2.84 3.0909 0.5693
0.0641
0.2738 0 0.5874 0.1011 0.2803 94.9769 8.1363 0.0467 5.4545
10
sick 2008-12-28_1 95.204 14 48
0 3 25 8.7045 0.0909 4 0.2045 9 0.2045
4 0.2666 11 8
4.409090909 0 95.0006 15.384 44 1.76 7.409 0.4475
0.0285
0.1206 0 0.5094 0.058 0.1931 92.9455 7.2613 0.0532 4.5227
82 25 8.7045 0.0909 4 0.2045 9 0.2045 4 0.2666
11 8
4.409090909 0 95.0006 15.384 44 1.76 7.409 0.4475
0.0285
0.1206 0 0.5094 0.058 0.1931 92.9455 7.2613 0.0532 4.5227
82
well 2008-12-28_1 95.07 13 26
1 1 11 8.1 0.0666 2 0.1666 5 0.1666
0 0 21 16
2.571428571 1.984375 94.825 30.769 30 -4.69 -0.7999
0.5166
0.0624 0.2078 0 0.5306 0.0792 0.2398 95.2282 7.575 0.0715
3.4333 44 11 8.1 0.0666 2 0.1666 5 0.1666 0
0 21 16
2.571428571 1.984375 94.825 30.769 30 -4.69 -0.7999
0.5166
0.0624 0.2078 0 0.5306 0.0792 0.2398 95.2282 7.575 0.0715
3.4333 44
well 2008-12-28_1 95.07 9 16
0 4 39 9.4117 0 0 0.0588 1 0.0588
0 0 3 25 3.916666667
2.96 94.8177 30.769 17 -20.84 -15.8234 0.8205 0.3333
0.6666 0
0.6054 0.1287 0.3292 95.3232 6.9117 0.076 2.647 16 39
9.4117 0 0 0.0588 1 0.0588 0 0 3 25
3.916666667 2.96
94.8177 30.769 17 -20.84 -15.8234 0.8205 0.3333 0.6666 0
0.6054 0.1287 0.3292 95.3232 6.9117 0.076 2.647 16
sick 2008-12-28_1 94.936 6 11
4 1 28 7.725 0.075 3 0.125 5 0.125
0 0 6 2 4 1.75
94.7815 46.153 40 6.07 12.5 0.5014 0.0621 0.1972 6
0.523
0.0742 0.2035 95.794 6.0625 0.046 7.25 12 28 7.725 0.075
3
0.125 5 0.125 0 0 6 2 4 1.75 94.7815 46.153 40 6.07
12.5
0.5014 0.0621 0.1972 6 0.523 0.0742 0.2035 95.794 6.0625
0.046 7.25 12
well 2008-12-28_1 94.803 11 13
0 5 35 7.125 0.0937 3 0.1562 5 0.1562
5 0.2 18 17
1.555555556 2.794117647 95.0398 38.461 32 10.38 8.4063
0.5804
0.0871 0.2627 1 0.558 0.0738 0.2324 92.4367 5.289 0.0722
9.125 16 35 7.125 0.0937 3 0.1562 5 0.1562 5
0.2 18 17
1.555555556 2.794117647 95.0398 38.461 32 10.38 8.4063
0.5804
0.0871 0.2627 1 0.558 0.0738 0.2324 92.4367 5.289 0.0722
9.125 16
well 2008-12-28_1 94.67 4 38
5 1 11 8.9642 0.0357 1 0.1428 4 0.1428
4 0.2105 11 13
3.772727273 4.307692308 94.8451 23.076 28 -5.76 -4
0.3269 0
0.0833 0 0.5222 0.0616 0.2079 94.9668 8.6696 0.0663 4.6428
14 11 8.9642 0.0357 1 0.1428 4 0.1428 4 0.2105
11 13
3.772727273 4.307692308 94.8451 23.076 28 -5.76 -4
0.3269 0
0.0833 0 0.5222 0.0616 0.2079 94.9668 8.6696 0.0663 4.6428
14
well 2008-12-28_1 94.537 12 39
0 1 35 9.4444 0 0 0 0 0 0 0 2 7 2.5 2.892857143
94.878
23.076 9 -12.23 -9.6666 0.4428 0 0.0857 0 0.5411 0.0849
0.25
94.54 8.9166 0.0296 6.1111 67 35 9.4444 0 0 0
0 0 0 0
2 7 2.5 2.892857143 94.878 23.076 9 -12.23 -9.6666 0.4428
0
0.0857 0 0.5411 0.0849 0.25 94.54 8.9166 0.0296 6.1111 67
Your initial post mentions 70 columns in your data table, yet the example
shows 67 counting the initial "labels" term in the header. I would suggest
adding "row.names = NULL" to force row numbers and see how that behaves, e.g.
rawdata <- read.table("r_work/train_data.csv", header = TRUE, sep = ",",
                      na.strings = 0, row.names = NULL)
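A side note on the na.strings argument in that call: a value of 0 asks read.table() to treat every field that is exactly "0" as missing, which may be where many of the NA entries come from in the first place. A minimal sketch on a made-up two-column CSV (csv_text is hypothetical, and na.strings is written as the string "0" here):

```r
# Hypothetical two-column CSV, not the poster's data.
csv_text <- "c0,c1\n0,1.5\n2,0\n3,0.25"

# Fields that are exactly "0" are read as NA; "0.25" is left alone.
with_zero_na <- read.table(text = csv_text, header = TRUE, sep = ",",
                           na.strings = "0")

# With the default na.strings = "NA", the zeros stay zeros.
default_na <- read.table(text = csv_text, header = TRUE, sep = ",")

sum(is.na(with_zero_na))
sum(is.na(default_na))
```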
Otherwise, you might want to consult the R Manual where it states:
header: a logical value indicating whether the file contains the names of
the variables as its first line. If missing, the value is determined from
the file format: header is set to TRUE if and only if the first row
contains one fewer field than the number of columns.
So, you might also want to count up your column names in the header line.
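Rather than counting by hand, count.fields() can report the number of fields on each line. The sketch below is only an illustration on a small made-up file (tmp and its contents are hypothetical); it also shows the header rule quoted above and what row.names = NULL changes.

```r
# Hypothetical file: the header names 2 columns, each data row has 3 fields.
tmp <- tempfile(fileext = ".txt")
writeLines(c("c0 c1", "sick 1 2", "well 3 4"), tmp)

# Number of whitespace-separated fields on each line of the file.
n_fields <- count.fields(tmp, sep = "")
n_fields

# header is not supplied: the first line has one fewer field than the rest,
# so it is taken as the header and the first data field becomes the row names.
auto <- read.table(tmp)
rownames(auto)
names(auto)

# row.names = NULL numbers the rows and keeps the first field as a column.
numbered <- read.table(tmp, header = TRUE, row.names = NULL)
ncol(numbered)
```

For a comma-separated file such as the one in this thread, sep = "," would be passed to count.fields() as well.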
JWDougherty
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help