OK, that is not the correct format for the KS test (which is expecting data 
ranging from 0 to 1 with a fairly flat histogram).  You could possibly test 
this with a Chi-squared test.  Can you tell us more about how the numbers you 
are looking at are generated?  The Chi-squared test could be used on counts of 
1-5 and compared to the assumption that each is equally likely, but there still 
is the question of power and how close to uniform is uniform enough.  You would 
need huge samples to find a difference if the true distribution is only 
slightly non-uniform.
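
For illustration, the Chi-squared approach described above might look roughly like this in R. It is only a sketch: the data are simulated rather than taken from the files in question, and the alternative probabilities in p_alt are an assumption picked to show how little power 300 observations give against a slightly non-uniform distribution.

```r
# Hypothetical data: 300 draws from the categories 1..5 (not the poster's data).
set.seed(42)
x <- sample(1:5, size = 300, replace = TRUE)

# Tabulate counts, keeping any category that happens to have zero observations.
counts <- table(factor(x, levels = 1:5))

# Goodness-of-fit test against the hypothesis that each value is equally likely.
fit <- chisq.test(counts, p = rep(1/5, 5))
print(fit)

# Rough power check by simulation: how often is a slightly non-uniform
# distribution rejected at the 5% level when n = 300?
p_alt <- c(0.22, 0.20, 0.20, 0.20, 0.18)   # assumed alternative, for illustration
reject <- replicate(2000, {
  y <- sample(1:5, size = 300, replace = TRUE, prob = p_alt)
  chisq.test(table(factor(y, levels = 1:5)))$p.value < 0.05
})
power_est <- mean(reject)   # estimated probability of detecting p_alt
power_est
```

Rerunning the simulation with larger sample sizes gives a feel for how many observations would be needed before a departure of that size is detected reliably.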

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

From: kairavibha...@googlemail.com [mailto:kairavibha...@googlemail.com] On 
Behalf Of Kairavi Bhakta
Sent: Friday, June 10, 2011 2:16 PM
To: Greg Snow; r-help@r-project.org
Subject: RE: [R] Test if data uniformly distributed (newbie)

Thanks for your answer. The reason I want the data to be uniform: It's the 
first step in a machine learning project I am working on. If I know the data 
isn't uniformly distributed, then this means there is probably something wrong 
and the following steps will be biased by the non-uniform input data. I'm not 
checking an assumption for another statistical test.

Actually, the data has been normalized because it is supposed to represent a 
probability distribution. That's why it sums to 1. My assumption is that, for a 
vector of 5, the data at that point should look like 0.20 0.20 0.20 0.20 0.20, 
but of course there is variation, and I would like to test whether the data 
comes close enough or not.

At the moment I am only testing whether there are more a's than b's in the top 
and bottom portion of each file (with a Wilcoxon test, I have 8 reps of the 
model I am trying to build). But that sort of felt like a very ad hoc solution 
and I figured maybe testing for uniformity would be better, or at least an 
important addition. I've also been looking into testing for the randomness of 
the sequence of a's and b's instead of the Wilcoxon test, although that may or 
may not involve R.

Kairavi.


> Yes, punif is the function to use, however the KS test (and the others) are 
> based on an assumption of independence, and if you know that your data points 
> sum to 1, then they are not independent (and not uniform if there are more 
> than 2).  Also note that these tests only rule out distributions (with a 
> given type I error rate), but cannot confirm that the data comes from a given 
> distribution (just that either they do, or there is not enough power to 
> distinguish between the actual and the test distributions).

> What is your ultimate question/goal?  Why do you care if the data is uniform 
> or not?

> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On 
Behalf Of Kairavi Bhakta
Sent: Friday, June 10, 2011 11:24 AM
To: r-help@r-project.org
Subject: [R] Test if data uniformly distributed (newbie)

Hello,

I have a bunch of files containing 300 data points each with values from 0 to 1 
which also sum to 1 (I don't think the last element is relevant though). In 
addition, each data point is annotated as an "a" or a "b".

I would like to know in which files (if any) the data is uniformly distributed.

I used Google and found out that a Kolmogorov-Smirnov or a Chi-square 
goodness-of-fit test could be used. Then I looked up ?kolmogorov and found 
"ks.test", but the example there is for the normal distribution and I am not 
sure how to adapt it for the uniform distribution. I did ?runif and read about 
the uniform distribution but it doesn't say what the "cumulative distribution" 
is. Is it "punif", like "pnorm"? I thought of that because I found a message on 
this list where someone was told to use "pnorm" instead of "dnorm". But the 
help page on the uniform distribution says punif is the "distribution 
function". Are the "cumulative distribution" and the "distribution function" 
the same thing? Having several names for the same thing has always confused me 
very much in statistics.

Also, I am not sure whether I need to specify any parameters for the 
distribution and which. I thought maybe I should specify "min=0" and "max=1" 
but those appear to be the defaults. Do I need to specify q, the vector
of quantiles?
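
As a sketch of how the pieces discussed above fit together: punif is the cumulative distribution function of the uniform distribution, ks.test accepts either the function or its name, and min and max default to 0 and 1. The vectors u and w below are simulated purely for illustration and are not the data from the files. The last step shows why values normalised to sum to 1 are a poor match for this test: 300 such values average 1/300, so they cannot look like independent Uniform(0, 1) draws.

```r
set.seed(1)
u <- runif(300)              # hypothetical independent Uniform(0, 1) values

# punif is the cumulative distribution function: P(U <= q).
punif(0.25)                  # with min = 0 and max = 1 this equals q itself

# The CDF can be given by name or as a function; min/max default to 0 and 1.
ks_u  <- ks.test(u, "punif")
ks_u2 <- ks.test(u, punif, min = 0, max = 1)
ks_u$p.value

# Values rescaled to sum to 1 are all tiny, so the same test rejects
# uniformity on [0, 1] decisively.
w <- u / sum(u)
ks_w <- ks.test(w, "punif")
ks_w$p.value
```

The argument q of punif does not need to be supplied by hand; ks.test evaluates the distribution function at the data values itself.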

So is
ks.test(x, punif)
correct or not for what I am attempting to do?
After this I will also need to find out whether the a's and b's are distributed 
randomly in each file. I would be grateful for any pointers although I have not 
researched this issue yet.
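
One possible pointer for the a/b question is a Wald-Wolfowitz runs test, which asks whether the number of runs of identical labels is consistent with a random ordering. The thread does not name a method, so this is a swapped-in suggestion, not something proposed by either poster. The function runs_test below is a hypothetical helper written for illustration using the usual normal approximation; it is not part of base R, and the labels are made up.

```r
# Wald-Wolfowitz runs test for a two-valued sequence (normal approximation).
# `labels` is a vector containing only "a" and "b".
runs_test <- function(labels) {
  labels <- as.character(labels)
  n1 <- sum(labels == "a")
  n2 <- sum(labels == "b")
  n  <- n1 + n2
  runs <- length(rle(labels)$lengths)        # observed number of runs
  mu <- 2 * n1 * n2 / n + 1                  # expected runs under randomness
  sigma2 <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))
  z <- (runs - mu) / sqrt(sigma2)
  list(runs = runs, expected = mu, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Hypothetical labels: 300 annotations drawn at random.
set.seed(7)
labs <- sample(c("a", "b"), size = 300, replace = TRUE)
res <- runs_test(labs)
res$p.value
```

Too few runs suggests the labels are clustered, and too many suggests they alternate more often than chance would produce.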

Kairavi.

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

