Re: [R] grubbs test to detect all outliers

Rui Barradas Sat, 29 Apr 2023 06:18:53 -0700

At 14:01 on 29/04/2023, AbouEl-Makarim Aboueissa wrote:

Hi Rui:



How about this dataset? I included a few outliers in each column, as you
can see in the printed dataset below.


Once again, thank you very much, and sorry if I bothered you all.

abou

dput(datafortest)

structure(list(factor1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, NA, NA, NA, NA), levels = c("1", "2", "3"), class = "factor"),
     X = c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
     876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
     2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
     NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA), Y = c(76888L,
     333L, 618L, 10L, 344L, NA, 3L, 86999L, 265L, 557L, 77777L,
     383L, NA, NA, 87777L, 287L, 352L, 308L, 999526L, 489L, 2L,
     444L, 9L, 333L, NA, NA, NA, NA), factor2 = structure(c(1L,
     1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
     2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("1",
     "2", "3"), class = "factor"), Z = c(54999L, 475L, 15L, 603L,
     442L, 79486L, 927L, 971L, 388L, 888L, 514L, 409L, 546L, 523L,
     313L, 296L, 320L, 388L, 79999L, 677L, 555L, NA, 479L, 257L,
     313L, 21L, 320L, 4L), U = c(NA, NA, 1.5, 332, 216, 217, 1000,
     10, 9999, 444, NA, 5, 327, 58888, 456, 412, 251, 6, 398,
     438, 428, 15, NA, 406, 334, 465, 180, 88999), V = c(12, 240,
     9000, 265, NA, 99999, 1, 562, 13, 777, 322, NA, 99988, 653,
     450, 576, NA, 396.5, 91888, 5, 219, NA, 321, 417, 409, 999999,
     523, 10)), row.names = c(NA, -28L), class = "data.frame")

datafortest

    factor1          X      Y factor2     Z       U        V
1        1 994455.077  76888       1 54999      NA     12.0
2        1   4348.031    333       1   475      NA    240.0
3        1   9999.789    618       1    15     1.5   9000.0
4        1   3813.139     10       1   603   332.0    265.0
5        1     12.650    344       1   442   216.0       NA
6        1   5642.667     NA       1 79486   217.0  99999.0
7        1 876684.386      3       1   927  1000.0      1.0
8        2   5165.731  86999       1   971    10.0    562.0
9        2         NA    265       1   388  9999.0     13.0
10       2   3259.241    557       2   888   444.0    777.0
11       2      8.383  77777       2   514      NA    322.0
12       2   1997.878    383       2   409     5.0       NA
13       2  99990.608     NA       2   546   327.0  99988.0
14       2   2655.977     NA       2   523 58888.0    653.0
15       3      9.490  87777       2   313   456.0    450.0
16       3   1826.851    287       2   296   412.0    576.0
17       3   4386.002    352       2   320   251.0       NA
18       3 883295.091    308       2   388     6.0    396.5
19       3   2120.902 999526       3 79999   398.0  91888.0
20       3         NA    489       3   677   438.0      5.0
21       3   2056.123      2       3   555   428.0    219.0
22       3      5.088    444       3    NA    15.0       NA
23       3         NA      9       3   479      NA    321.0
24       3  92539.873    333       3   257   406.0    417.0
25    <NA>         NA     NA       3   313   334.0    409.0
26    <NA>         NA     NA       3    21   465.0 999999.0
27    <NA>         NA     NA       3   320   180.0    523.0
28    <NA>         NA     NA       3     4 88999.0     10.0




with many thanks
abou

______________________


*AbouEl-Makarim Aboueissa, PhD*

*Professor, Mathematics and Statistics*
*Graduate Coordinator*

*Department of Mathematics and Statistics*
*University of Southern Maine*



On Sat, Apr 29, 2023 at 8:05 AM Rui Barradas <ruipbarra...@sapo.pt> wrote:

At 14:09 on 28/04/2023, AbouEl-Makarim Aboueissa wrote:

*R:* Grubbs test to detect all outliers per group for all columns in a
data frame



Dear All: good morning

I have a dataset (as an example) with two factor columns (factor1 and
factor2) and 5 numerical columns (X, Y, Z, U, V). The X and Y columns have
the same length as factor1; Z, U, and V have the same length as factor2.
The dataset is copied below. Please note that all dataset columns have NA
values.

*Need help on this:*


Can we use the grubbs.test() function to detect all outliers and replace
them with NA in X and Y per group in factor1, and in Z, U, and V per group
in factor2? Columns in the data frame have different lengths, but when I
read the .csv file, R added NA values for the shorter columns.

If you need the .csv data file, please let me know.


Thank you very much for your help in advance.




install.packages("outliers")
library(outliers)

datafortest<-read.csv("G:/data_for_test.csv", header=TRUE)
datafortest

datafortest<-data.frame(datafortest)

datafortest$factor1<-as.factor(datafortest$factor1)
datafortest$factor2<-as.factor(datafortest$factor2)

str(datafortest)

##### tried to use grubbs.test() on a single column of the dataframe, but
##### still not working
tests.for.outliers.X<- grubbs.test(datafortest$X, na.rm = TRUE, type=11)


####################################

*grubbs.test() on a single dataset: but this can only detect whether the
min and the max are outliers.*


xx999<-c(0.088,1,2,3,4,5,6,7,8,9,88,98,99)
grubbs.test(xx999, type=11)




With many thanks

Abou



factor1          X     Y  factor2     Z    U      V
      1   4455.077   888        1   999   NA    999
      1   4348.031   333        1   475   NA    240
      1   9999.789   618        1   507  252    394
      1   3813.139   417        1   603  332    265
      1    7512.65   344        1   442  216     NA
      1   5642.667    NA        1   486  217    275
      1   6684.386   341        1   927  698    479
      2   5165.731   999        1   971  311    562
      2         NA   265        1   388  999    512
      2   3259.241   557        2   888  444    777
      2   3288.383   234        2   514   NA    322
      2   1997.878   383        2   409  311     NA
      2   99990.61    NA        2   546  327    728
      2   2655.977    NA        2   523  228    653
      3    3189.49  7777        2   313  456    450
      3   1826.851   287        2   296  412    576
      3   4386.002   352        2   320  251     NA
      3   3295.091   308        2   388  888  396.5
      3   2120.902   526        3  9999  398    888
      3         NA   489        3   677  438    307
      3   2056.123   291        3   555  428    219
      3   1995.088   444        3    NA  319     NA
      3         NA   349        3   479   NA    321
      3   2539.873   333        3   257  406    417
                                3   313  334    409
                                3   296  465    546
                                3   320  180    523
                                3   388  999    313




______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hello,

With the data file you attached I cannot reproduce any errors; everything
ran on the first try.


library(outliers)

fl <- "~/data_for_test.csv"
datafortest <- read.csv(fl)

# these are not needed to run the test
datafortest$factor1 <- as.factor(datafortest$factor1)
datafortest$factor2 <- as.factor(datafortest$factor2)
str(datafortest)
#> 'data.frame':    28 obs. of  7 variables:
#>  $ factor1: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
#>  $ X      : num  4455 4348 10000 3813 7513 ...
#>  $ Y      : int  888 333 618 417 344 NA 341 999 265 557 ...
#>  $ factor2: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
#>  $ Z      : int  999 475 507 603 442 486 927 971 388 888 ...
#>  $ U      : int  NA NA 252 332 216 217 698 311 999 444 ...
#>  $ V      : num  999 240 394 265 NA 275 479 562 512 777 ...
head(datafortest)
#>   factor1        X   Y factor2   Z   U   V
#> 1       1 4455.077 888       1 999  NA 999
#> 2       1 4348.031 333       1 475  NA 240
#> 3       1 9999.789 618       1 507 252 394
#> 4       1 3813.139 417       1 603 332 265
#> 5       1 7512.650 344       1 442 216  NA
#> 6       1 5642.667  NA       1 486 217 275

##### tried to use grubbs.test() on a single column of the dataframe, but
##### still not working
grubbs.test(datafortest$X, type = 11)
#>
#>  Grubbs test for two opposite outliers
#>
#> data:  datafortest$X
#> G = 4.6640014, U = 0.0091756, p-value = 0.02867
#> alternative hypothesis: 1826.851 and 99990.608 are outliers



Hope this helps,

Rui Barradas

Hello,

With this data set the problem seems to be what you want to consider an
outlier. Types 10 and 11 give radically different results.

From the help page, section Details:

First test (10) is used to detect if the sample dataset contains one
outlier, statistically different than the other values. Test is based by
calculating score of this outlier G (outlier minus mean and divided by
sd) and comparing it to appropriate critical values. Alternative method
is calculating ratio of variances of two datasets - full dataset and
dataset without outlier. The obtained value called U is bound with G by
simple formula.

Second test (11) is used to check if lowest and highest value are two
outliers on opposite tails of sample. It is based on calculation of
ratio of range to standard deviation of the sample.

Third test (20) calculates ratio of variance of full sample and sample
without two extreme observations. It is used to detect if dataset
contains two outliers on the same tail.
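
To make the three statistics concrete, here is a minimal base R sketch that computes them by hand for the X column of the data posted above. The object names (X, G10, U10, U10_from_G, G11, U20) are only illustrative, and the formulas are a reading of the Details text quoted above, not code copied from the package.

```r
# X column as posted in the dput() output earlier in the thread
X <- c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
       876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
       2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
       NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA)

x <- sort(X[!is.na(X)])   # drop the NAs and sort ascending
n <- length(x)

# Type 10: most extreme value minus the mean, divided by sd
G10 <- max(abs(x - mean(x))) / sd(x)
# U as a ratio of sums of squares, without vs. with the suspect value
# (for this column the suspect is the highest value, x[n])
U10 <- var(x[-n]) / var(x) * (n - 2) / (n - 1)
# The "simple formula" that ties U to G for a single outlier
U10_from_G <- 1 - n * G10^2 / (n - 1)^2

# Type 11: range divided by sd
G11 <- (x[n] - x[1]) / sd(x)

# Type 20: same kind of ratio as U10, without the two highest values
U20 <- var(x[1:(n - 2)]) / var(x) * (n - 3) / (n - 1)

round(c(G10 = G10, U10 = U10, U10_from_G = U10_from_G,
        G11 = G11, U20 = U20), 4)
```

The printed values can be compared with the G and U figures in the grubbs.test() output further down; U10 and U10_from_G should agree up to rounding.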

The results below seem to show that there are two outliers on the right
tail. Do you have reasons to believe this is true? But that's a
statistics question; the code runs fine.




library(outliers)

datafortest$factor1 <- as.factor(datafortest$factor1)
datafortest$factor2 <- as.factor(datafortest$factor2)

grubbs.test(datafortest$X, type = 10)
#>
#>  Grubbs test for one outlier
#>
#> data:  datafortest$X
#> G = 2.6106, U = 0.6422, p-value = 0.04389
#> alternative hypothesis: highest value 994455.077 is an outlier

grubbs.test(datafortest$X, type = 11)
#>
#>  Grubbs test for two opposite outliers
#>
#> data:  datafortest$X
#> G = 3.04754, U = 0.63726, p-value = 1
#> alternative hypothesis: 5.088 and 994455.077 are outliers

grubbs.test(datafortest$X, type = 20)
#>
#>  Grubbs test for two outliers
#>
#> data:  datafortest$X
#> U = 0.33892, p-value < 2.2e-16
#> alternative hypothesis: highest values 883295.091 , 994455.077 are outliers
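
On the original question of flagging more than one value per group: one common workaround is to apply the one-outlier test (type = 10) repeatedly, setting the flagged value to NA each time, until the test is no longer significant. The following is only a sketch under stated assumptions. It assumes the outliers package is installed. grubbs_to_na(), its alpha and min_n arguments and the toy data frame are hypothetical names made up for illustration, and the default min_n = 7 is just a conservative choice. Repeated testing inflates the error rate and neighbouring outliers can mask each other, so the result is better treated as a screening aid than as a verdict.

```r
library(outliers)

# Hypothetical helper (not part of the outliers package): apply the
# one-outlier Grubbs test repeatedly and replace each flagged value
# with NA until the test is no longer significant at level alpha.
grubbs_to_na <- function(x, alpha = 0.05, min_n = 7) {
  repeat {
    ok <- which(!is.na(x))
    # stop when too few values remain or when they are all identical
    if (length(ok) < min_n || sd(x[ok]) == 0) break
    if (grubbs.test(x[ok], type = 10)$p.value >= alpha) break
    # type = 10 tests the value farthest from the mean
    worst <- ok[which.max(abs(x[ok] - mean(x[ok])))]
    x[worst] <- NA
  }
  x
}

# Toy data (made up): group "a" has one gross outlier, group "b" has none
toy <- data.frame(
  g = factor(rep(c("a", "b"), each = 10)),
  x = c(1:9, 1000, 11:20)
)

# ave() applies the helper within each level of the grouping factor
toy$x_clean <- ave(toy$x, toy$g, FUN = grubbs_to_na)
toy
```

For the posted data the analogous call would be something like ave(datafortest$X, datafortest$factor1, FUN = grubbs_to_na) for X and Y, and the same with factor2 for Z, U and V.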



Hope this helps,

Rui Barradas

