On Fri, Jun 10, 2011 at 10:15:36PM +0200, Kairavi Bhakta wrote:
> Thanks for your answer. The reason I want the data to be uniform: It's the
> first step in a machine learning project I am working on. If I know the data
> isn't uniformly distributed, then this means there is probably something
> wrong and the following steps will be biased by the non-uniform input data.
> I'm not checking an assumption for another statistical test.
>
> Actually, the data has been normalized because it is supposed to represent a
> probability distribution. That's why it sums to 1. My assumption is that,
> for a vector of 5, the data at that point should look like 0.20 0.20 0.20
> 0.20 0.20, but of course there is variation, and I would like to test
> whether the data comes close enough or not.

As others told you, this is not the right format for the KS test. The words
"testing uniformity" can mean different things, and the meaning depends on
which statistical model you assume.

If we have a random variable with values in [0, 1], then testing uniformity
means testing to what extent its distribution is close to the uniform
distribution on [0, 1]. Numbers concentrated around 0.2 will not satisfy
this.

If we have a discrete variable with k values, for which we have m independent
observations, and the number of observations of value i is m_i, then it is
possible to test whether the variable has the uniform distribution on
{1, ..., k} using the Chi-squared test. Note that this test needs the
original counts, not their normalized values, which sum to 1.

For example, if we have 20 observations and the counts (m_1, ..., m_5) are
(4, 3, 5, 2, 6), then this is quite consistent with the assumption of a
uniform distribution. On the other hand, if we have 200 observations and the
counts are (40, 30, 50, 20, 60), then the null hypothesis of a uniform
distribution may be rejected (the uniform distribution is the default, see
argument p in ?chisq.test):

  x <- c(40, 30, 50, 20, 60)
  chisq.test(x)

          Chi-squared test for given probabilities

  data:  x
  X-squared = 25, df = 4, p-value = 5.031e-05

It is not clear whether this is suitable for your application. If you
generate the values in a different way, then another test may be needed.
Can you give more detail on how the numbers are generated?

Petr Savicky.
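
P.S. To illustrate why the test needs the raw counts rather than the
normalized values, here is a minimal sketch using the two count vectors
mentioned in this message. The variable names (small, large, res_small,
res_large) are only for illustration. Both vectors normalize to exactly the
same proportions, yet the test gives very different answers because the
sample sizes differ.

```r
# Two count vectors with identical proportions but different sample sizes
small <- c(4, 3, 5, 2, 6)        # 20 observations in total
large <- c(40, 30, 50, 20, 60)   # 200 observations in total

# After normalization the two vectors are indistinguishable
same_props <- isTRUE(all.equal(small / sum(small), large / sum(large)))
print(same_props)                # TRUE

# chisq.test() compares against equal probabilities by default (argument p).
# For `small` the expected count per cell is 4 (< 5), so R warns that the
# chi-squared approximation may be inaccurate; the warning is suppressed here.
res_small <- suppressWarnings(chisq.test(small))
res_large <- chisq.test(large)

print(res_small$statistic)       # X-squared = 2.5
print(res_small$p.value)         # about 0.64: consistent with uniformity
print(res_large$statistic)       # X-squared = 25
print(res_large$p.value)         # about 5.0e-05: uniformity rejected
```

With expected counts this small, chisq.test(small, simulate.p.value = TRUE)
is one way to avoid relying on the chi-squared approximation.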