If those values represent response times in a system, here is what we did when I was responsible for characterizing system behavior for an SLA (service level agreement) with the customers using it: we usually specified that "90% of the transactions would have a response time of --- or less". This took care of most "long tails". So it depends on how you are planning to use this data. We usually monitored the 90th or 95th percentile to see how a system was operating day to day.
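As a rough sketch of that kind of check, assuming the `ml` list from the original post and R's default `quantile()` method (type 7), something like the following gives the percentiles per period. The `sla_limit` value and the names `pct` and `within_sla` are made up for illustration only, not numbers or names from any real agreement.

```r
# Waiting times from the original post (quoted below).
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Median, 90th and 95th percentile for each period
# (quantile() default, i.e. type 7 interpolation).
pct <- sapply(ml, quantile, probs = c(0.5, 0.9, 0.95))
print(pct)

# Share of events that finish within a limit.
# sla_limit is a hypothetical threshold chosen only for illustration.
sla_limit <- 40
within_sla <- sapply(ml, function(x) mean(x <= sla_limit))
print(round(within_sla, 3))
```

The 90% and 95% rows should agree with the my.func output quoted below (37.2 and 41.05 for y1, 45.8 and 70.4 for y2). The last step turns the SLA wording around: instead of asking what the 90th percentile is, it asks what fraction of events met a given limit.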

On Thu, Aug 18, 2011 at 8:52 AM, Petr PIKAL <petr.pi...@precheza.cz> wrote:
> Hallo Jim
>
> Thank you and see within text.
>
> jim holtman <jholt...@gmail.com> wrote on 18.08.2011 14:09:11:
>
>> I am not sure why you say that "lapply(ml, mean)" shows (incorrectly)
>> that the second year has a larger average; it is correct for the data:
>>
>> > lapply(ml, my.func)
>> $y1
>>     Count      Mean        SD       Min    Median       90%       95%       Max       Sum
>>  18.00000  16.83333  12.42980   4.00000  12.50000  37.20000  41.05000  47.00000 303.00000
>>
>> $y2
>>     Count      Mean        SD       Min    Median       90%       95%       Max       Sum
>>  15.00000  20.06667  25.27694   4.00000  11.00000  45.80000  70.40000  97.00000 301.00000
>>
>>
>> You have a larger "outlier" in the second year that causes the mean to
>> be higher. The median is lower, but I usually look at the 90th
>> percentile if I am looking at response time from a system and again
>> the second year has a higher value.
>>
>> So exactly why do you not "trust" your data?
>
> Well. I trust them, however mean is "correct" central value only when data
> are normally distributed or at least symmetrical. As the values are
> heavily distorted I feel that I shall not use mean for comparison of such
> sets. Anyway t.test tells me that there is no difference between y2 and
> y1.
>
>> t.test(ml[[1]], ml[[2]])
>
>         Welch Two Sample t-test
>
> data:  ml[[1]] and ml[[2]]
> t = -0.452, df = 19.557, p-value = 0.6563
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
>  -18.17781  11.71115
> sample estimates:
> mean of x mean of y
>  16.83333  20.06667
>
> So based on this I probably will never get conclusive result as sd due to
> "outliers" will be quite high.
>
> When I do
> plot(ecdf(ml[[2]]))
> plot(ecdf(ml[[1]]), add=T, col=2)
>
> it seems to me that both sets are almost the same and they differ
> substantially only with those "outlier" values.
>
> If I decreased small values of y2 (e.g.)
>
> ml[[2]][ml[[2]]<20] <- ml[[2]][ml[[2]]<20]/2
>
> I get same mean
>
> lapply(ml, mean)
> $y1
> [1] 16.83333
>
> $y2
> [1] 16.1
>
> and t.test tells me that there is no difference between those two sets,
> although I know that most events take half of the time and only few last
> longer so for me such set is better (we improved performance for most of
> the time however there are still scarce events which take a long time).
>
> plot(ecdf(ml[[2]]))
> plot(ecdf(ml[[1]]), add=T, col=2)
>
> So still the question stays - what procedure to use for comparison of two
> or more sets with such long tailed distribution? - Trimmed mean?, Median?,
> ...
>
> Thanks.
>
> Regards
> Petr
>
>>
>> On Thu, Aug 18, 2011 at 7:49 AM, Petr PIKAL <petr.pi...@precheza.cz> wrote:
>> > Hallo all
>> >
>> > I try to find a way how to compare set of waiting times during different
>> > periods. I tried learn something from queueing theory and used also R
>> > search. There is plenty of ways but I need to find the easiest and quite
>> > simple.
>> > Here is a list with actual waiting times.
>> >
>> > ml <- structure(list(y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15,
>> > 18, 36, 5, 24, 15, 40, 10), y2 = c(97, 10, 26, 11, 11, 10, 5,
>> > 13, 19, 5, 5, 59, 4, 16, 10)), .Names = c("y1", "y2"))
>> >
>> > par(mfrow=c(1,2))
>> > lapply(ml, hist)
>> >
>> > shows that in the first year is more longer waiting times
>> >
>> > lapply(ml, mean)
>> >
>> > shows (incorrectly) that in the second year there is longer average
>> > waiting time.
>> >
>> > lapply(ml, mean)
>> >
>> > gives me completely reversed values.
>> >
>> > Can you please give me some hints what to use for "correct" and "simple"
>> > comparison of waiting times in two or more periods.
>> >
>> > Thank you
>> > Petr
>> >
>> > ______________________________________________
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>
>

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?