Thank you, Kevin. I'm looking forward to trying your function when I get back to the office.

Jerry Floren
Minnesota Department of Agriculture

Kevin Wright-5 wrote:
>
> Here is a simple function I use. It uses Median +/- 5.2 * MAD. If I
> recall, this flags about 1/2000 of values from a true Normal distribution.
>
> is.outlier = function (x) {
>   # See: Davies, P.L. and Gather, U. (1993).
>   # "The identification of multiple outliers" (with discussion)
>   # J. Amer. Statist. Assoc., 88, 782-801.
>
>   x <- na.omit(x)
>   lims <- median(x) + c(-1, 1) * 5.2 * mad(x, constant = 1)
>   x < lims[1] | x > lims[2]
> }
>
> Maybe the function should be called "is.patentable". I definitely agree
> with Bert's comments.
>
> Kevin Wright
>
> On Wed, Dec 30, 2009 at 11:47 AM, Jerry Floren
> <jerry.flo...@state.mn.us> wrote:
>
>> Greetings:
>>
>> I could also use guidance on this topic. I provide manure sample
>> proficiency sets to agricultural labs in the United States and Canada.
>> There are about 65 labs in the program.
>>
>> My data sets are much smaller and typically non-symmetrical with obvious
>> outliers. Usually, there are 30 to 60 sets of data, each with triple
>> replicates (90 to 180 observations).
>>
>> There are definitely outliers caused by the following: reporting in the
>> wrong units, sending in the wrong spreadsheet, entering data in the wrong
>> row, misplacing decimal points, calculation errors, etc. For each
>> analysis, it is common that two to three labs make these types of errors.
>>
>> Since there are replicates, errors like misplaced decimal points are more
>> obvious. However, most of the outlier errors are repeated for all three
>> replicates.
>>
>> I use the median and Median Absolute Deviation (MAD, constant = 1) to
>> flag labs for accuracy. Labs where the average of their three reps
>> deviates more than 2.5 MAD values from the median are flagged for
>> accuracy. With this method, it is not necessary to identify the outliers.
>>
>> A colleague suggested running the data twice.
>> On the first run, outliers more than 4.0 MAD units from the median are
>> removed. On the second run, values exceeding 2.9 times the MAD are
>> flagged for accuracy. I tried this in R with a normally distributed data
>> set of 100,000, and the 4.0 MAD values were nearly identical to the
>> outliers identified with boxplot.
>>
>> With my data set, the flags do not change very much if the data is run
>> one time with the flags set at 2.5 MAD units compared to running the
>> data twice and removing the 4.0 MAD outliers and flagging the second set
>> at 2.9 MAD units. Using either one of these methods might work for you,
>> but I am not sure of the statistical value of these methods.
>>
>> Yours,
>>
>> Jerry Floren
>>
>> Brian G. Peterson wrote:
>> >
>> > John wrote:
>> >> Hello,
>> >>
>> >> I've been searching for a method for identifying outliers for quite
>> >> some time now. The complication is that I cannot assume that my data
>> >> is normally distributed nor symmetrical (i.e. some distributions might
>> >> have one longer tail) so I have not been able to find any good tests.
>> >> The Walsh's Test
>> >> (http://www.statistics4u.info/fundsta...liertest.html#), as I
>> >> understand it, assumes that the data is symmetrical, for example.
>> >>
>> >> Also, while I've found some interesting articles:
>> >> http://tinyurl.com/yc7w4oq ("Missing Values, Outliers, Robust
>> >> Statistics & Non-parametric Methods")
>> >> I don't really know what to use.
>> >>
>> >> Any ideas? Any R packages available for this? Thanks!
>> >>
>> >> PS. My data has 1000's of observations.
>> >
>> > Take a look at package 'robustbase'; it provides most of the standard
>> > robust measures and calculations.
>> >
>> > While you didn't say what kind of data you're trying to identify
>> > outliers in, if it is time series data the function Return.clean in
>> > PerformanceAnalytics may be useful.
>> >
>> > Regards,
>> >
>> > - Brian
>> >
>> > --
>> > Brian G. Peterson
>> > http://braverock.com/brian/
>> > Ph: 773-459-4973
>> > IM: bgpbraverock
>> >
>> > ______________________________________________
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> View this message in context:
>> http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p991062.html
>> Sent from the R help mailing list archive at Nabble.com.
>
> --
> Kevin Wright

--
View this message in context:
http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p1009958.html
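
For anyone wanting to try the MAD-based rules discussed in this thread side by side, here is a minimal sketch in R. It is not anyone's actual code: the names flag_mad, flag_labs and flag_two_pass and the sample numbers are made up for illustration. It assumes the median and MAD are computed over the per-lab means, and that in the two-pass variant all values (including those trimmed in the first pass) are compared against the trimmed median and MAD. Unlike the quoted is.outlier, it keeps NA positions so the result lines up with the input.

```r
# Hypothetical helpers; only median(), mad(), tapply(), qnorm() and
# pnorm() are real base/stats functions.

# Median +/- k * MAD rule (raw MAD, constant = 1). NAs stay NA in the
# result so the output has the same length as x.
flag_mad <- function(x, k = 5.2) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, constant = 1, na.rm = TRUE)
  abs(x - centre) > k * spread
}

# One-pass lab flag: average each lab's replicates, then flag labs whose
# mean is more than k MADs from the median of the lab means (assumption:
# median and MAD are taken over the lab means, not the raw observations).
flag_labs <- function(value, lab, k = 2.5) {
  lab_mean <- tapply(value, lab, mean, na.rm = TRUE)
  flag_mad(lab_mean, k = k)
}

# Two-pass variant: drop values beyond k_trim MADs, recompute median and
# MAD on what is left, then flag every value beyond k_flag trimmed MADs.
flag_two_pass <- function(x, k_trim = 4.0, k_flag = 2.9) {
  keep <- !flag_mad(x, k = k_trim)
  keep[is.na(keep)] <- FALSE
  centre <- median(x[keep])
  spread <- mad(x[keep], constant = 1)
  abs(x - centre) > k_flag * spread
}

# For a Normal distribution the raw MAD is qnorm(0.75) * sigma, so
# 5.2 MAD is about 3.5 sigma; tail_prob is the expected flag rate.
k_sigma   <- 5.2 * qnorm(0.75)
tail_prob <- 2 * pnorm(-k_sigma)

# Made-up example: five labs, three replicates each; lab E looks like it
# reported in the wrong units.
lab   <- rep(c("A", "B", "C", "D", "E"), each = 3)
value <- c( 9.8, 10.0, 10.2,
           10.1, 10.3, 10.5,
            9.9, 10.0, 10.1,
           10.0, 10.2, 10.4,
           98.0, 100.0, 102.0)

one_pass <- flag_labs(value, lab)
two_pass <- flag_two_pass(tapply(value, lab, mean))
```

With these made-up numbers both variants flag only lab E, and tail_prob comes out on the order of 1 in 2,000, in line with the figure quoted for the 5.2 * MAD rule. Real proficiency data may of course behave differently, particularly when the lab means are skewed.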