Thank you, Chris! I think it is exactly the problem you mentioned. I did not consider a 1000-point data set to be a large one at first.
I down-sampled the data from 1000 points to 100 points and ran the KS test
again. It worked as expected. Is there a standard method for comparing two
large samples? I also tried KL divergence, but it only gives me a number and
does not tell me how large the distance must be to count as significantly
different.

Regards,
-Monnand

On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris <chri...@med.umich.edu> wrote:
>
> The main issue is that the original distributions are the same, you shift the
> two samples *by different amounts* (about 0.01 SD), and you have a large
> (n=1000) sample size. Thus the new distributions are not the same.
>
> This is a problem with testing for equality of distributions. With large
> samples, even a small deviation is significant.
>
> Chris
>
> -----Original Message-----
> From: Monnand [mailto:monn...@gmail.com]
> Sent: Sunday, January 11, 2015 10:13 PM
> To: r-help@r-project.org
> Subject: [R] two-sample KS test: data becomes significantly different after
> normalization
>
> Hi all,
>
> This question is sort of related to R (I'm not sure if I used an R function
> correctly), but also related to stats in general. I'm sorry if this is
> considered off-topic.
>
> I'm currently working on a data set with two sets of samples. The csv file
> of the data can be found here: http://pastebin.com/200v10py
>
> I would like to use the KS test to see if these two sets of samples are from
> different distributions.
>
> I ran the following R script:
>
> # read data from the file
>> data = read.csv('data.csv')
>> ks.test(data[[1]], data[[2]])
> Two-sample Kolmogorov-Smirnov test
>
> data: data[[1]] and data[[2]]
> D = 0.025, p-value = 0.9132
> alternative hypothesis: two-sided
>
> The KS test shows that these two samples are very similar. (In fact, they
> should come from the same distribution.)
>
> However, for some reasons, instead of the raw values, the actual data
> that I will get will be normalized (zero mean, unit variance).
> So I tried to normalize the raw data I have and run the KS test again:
>
>> ks.test(scale(data[[1]]), scale(data[[2]]))
> Two-sample Kolmogorov-Smirnov test
>
> data: scale(data[[1]]) and scale(data[[2]])
> D = 0.3273, p-value < 2.2e-16
> alternative hypothesis: two-sided
>
> The p-value becomes almost zero after normalization, indicating that these
> two samples are significantly different (from different distributions).
>
> My question is: how could normalization make two similar samples become
> different from each other? I can see that if two samples are different,
> then normalization could make them similar. However, if two sets of data
> are similar, then intuitively, applying the same operation to them should
> keep them similar, or at least not make them too different from each other.
>
> I did some further analysis of the data. I also tried to normalize the
> data into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))),
> but the same thing happened. At first, I thought outliers might have caused
> this problem (I can see that an outlier may cause this problem if I
> normalize the data into the [0,1] range). I deleted all data whose absolute
> value is larger than 4 standard deviations, but it still didn't help.
>
> Plus, I even plotted the eCDFs; they *really* look the same to me even
> after normalization. Is anything wrong with my usage of the R function?
>
> Since the data contains ties, I also tried ks.boot (
> http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
> result.
>
> Could anyone help me explain why this happened? Also, any suggestions about
> hypothesis testing on normalized data? (The data I have right now is
> simulated data. In the real world, I cannot get the raw data, only the
> normalized data.)
>
> Regards,
> -Monnand
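
The large-sample effect Chris describes can be illustrated with a short R sketch. It is not taken from the thread: the data are simulated, the shift of 0.1 SD and the sample size of 50000 are assumptions chosen for illustration, and the sub-sample size of 100 simply mirrors the down-sampling mentioned in the reply. It only shows that the p-value depends heavily on n, while D (the largest gap between the two eCDFs) is the quantity that says how far apart the samples are.

```r
# Simulated data (an assumption, not the poster's data):
# two normal samples with the same shape and a small location shift of 0.1 SD.
set.seed(1)
n <- 50000
x <- rnorm(n)
y <- rnorm(n, mean = 0.1)

# Full samples: with this many points the test has enough power
# to flag even a small shift as significant.
full <- ks.test(x, y)

# Random sub-samples of 100 points each: the same shift is
# much harder to detect because the test has far less power.
small <- ks.test(sample(x, 100), sample(y, 100))

# Report D alongside the p-value. D measures the size of the difference;
# the p-value mixes that size with the sample size.
print(c(D_full = unname(full$statistic), p_full = full$p.value))
print(c(D_small = unname(small$statistic), p_small = small$p.value))
```

One possible way to use this with large samples is to decide in advance how large a D would matter in practice and compare the observed D with that threshold, rather than relying on the p-value alone. The threshold itself is a domain judgment and is not something the test provides.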