On Aug 9, 2010, at 6:27 PM, Alexander Eggel wrote:
Hello everybody,
I need to know which samples (S1-S6) contain a value that is bigger than
the median + five standard deviations of the column it is in. This is
just an example. The command should be applied to a data frame which is
a lot bigger (over 100 columns). Any solutions? Thank you very much for
your help!!!
s
  Samples A  B  C  E
1      S1 1  2  3  7
2      S2 4 NA  6  6
3      S3 7  8  9 NA
4      S4 4  5 NA  6
5      S5 2  5  6  7
6      S6 2  3  4  5
This loop works fine for a column without NA values. However, it doesn't
work for the other columns. Ideally I would have a loop that I could
apply to all columns in "one command".
o <- data.frame()
for (i in 1:nrow(s)) {
  dd <- s[i, ]
  if (dd$A >= median(s$A, na.rm = TRUE) + 5 * sd(s$A, na.rm = TRUE))
    o <- rbind(o, dd)
}
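A likely reason the loop breaks on the other columns (an assumption, since the post does not show the error message): when the value in a row is NA, the comparison itself evaluates to NA, and `if (NA)` stops with an error. A minimal sketch of one way around this wraps the test in isTRUE(). The helper name `rows_above` and its arguments `col` and `k` are hypothetical, introduced only for illustration.

```r
# Example data from the post, rebuilt by hand.
s <- data.frame(
  Samples = c("S1", "S2", "S3", "S4", "S5", "S6"),
  A = c(1, 4, 7, 4, 2, 2),
  B = c(2, NA, 8, 5, 5, 3),
  C = c(3, 6, 9, NA, 6, 4),
  E = c(7, 6, NA, 6, 7, 5)
)

# Hypothetical helper: rows of `df` whose value in column `col` is at least
# median + k * sd of that column. isTRUE() turns an NA comparison into FALSE,
# so rows with a missing value are skipped instead of raising an error.
rows_above <- function(df, col, k = 5) {
  cutoff <- median(df[[col]], na.rm = TRUE) + k * sd(df[[col]], na.rm = TRUE)
  o <- df[0, ]                      # empty data frame with the same columns
  for (i in seq_len(nrow(df))) {
    dd <- df[i, ]
    if (isTRUE(dd[[col]] >= cutoff)) o <- rbind(o, dd)
  }
  o
}

rows_above(s, "B")          # zero rows: nothing in the example is that extreme
rows_above(s, "B", k = 1)   # the S3 row only
```

This keeps the row-by-row structure of the loop above; it still handles only one column per call.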
Let's look at the more general problem of how to do column-wise
calculations (since I suspect there is not much support in this
neighborhood for the notion that you have a proper definition of
"outlier" and furthermore you have not provided an example where any
such outliers exist). Let's just calculate a set of logical vectors
that signal whether a value is greater than one sd above the median:
apply(s[-1], 2, function(x) {x > median(x, na.rm=TRUE) + sd(x, na.rm=TRUE)})
      A     B     C     E
1 FALSE FALSE FALSE  TRUE
2 FALSE    NA FALSE FALSE
3  TRUE  TRUE  TRUE    NA
4 FALSE FALSE    NA FALSE
5 FALSE FALSE FALSE  TRUE
6 FALSE FALSE FALSE FALSE
Each column is passed in turn to the function (as a vector), and the
function then calculates median() and sd() with that vector as the
first argument. The ">" operator has a vector on the left-hand side and
a scalar on the right-hand side, but that is perfectly fine: the scalar
is recycled across the vector, and we get the expected results in a
logical matrix.
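To get from the logical matrix back to the sample names the original question asked for, one possible sketch (not part of the reply above) counts the TRUE values in each row with rowSums(), treating NA as "not flagged". The function name `flagged_samples` and the multiplier argument `k` are assumptions made for illustration.

```r
# Example data from the post, rebuilt by hand.
s <- data.frame(
  Samples = c("S1", "S2", "S3", "S4", "S5", "S6"),
  A = c(1, 4, 7, 4, 2, 2),
  B = c(2, NA, 8, 5, 5, 3),
  C = c(3, 6, 9, NA, 6, 4),
  E = c(7, 6, NA, 6, 7, 5)
)

# Hypothetical helper: names of samples with at least one value above
# median + k * sd of its column. Assumes the first column holds the names.
flagged_samples <- function(df, k = 5) {
  flags <- apply(df[-1], 2, function(x) {
    x > median(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
  })
  # na.rm = TRUE: a missing value never counts as a flag.
  hits <- rowSums(flags, na.rm = TRUE) > 0
  as.character(df[[1]][hits])
}

flagged_samples(s, k = 1)   # "S1" "S3" "S5"
flagged_samples(s, k = 5)   # character(0): no value is that extreme here
```

Because apply() walks over every column after the first, the same call should work unchanged on a data frame with over 100 numeric columns.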
--
David.
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help