Dear R-help,

Chi-Square Test for Goodness of Fit

Problem faced:

I have discrete data, given below as an R vector:

No_of_Frauds <- c(
1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

I am trying to fit a Poisson distribution to this data using R.

When I run the script in the R console, I get a chi-square statistic
as high as 6.95753e+37.

When I did the same calculation in Excel, I got a chi-square statistic
of 138.34.

Although it is clear that the sample data do not follow a Poisson
distribution, and I will have to look for another discrete distribution,
my problem is the HIGH value of the chi-square test statistic. When I
analyzed further, I understood the problem.
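
To see why a single sparse class can dominate the statistic, here is a
minimal sketch. The values `lambda_demo = 2.3` and `n_demo = 100` are
illustrative assumptions on roughly the scale of this data, not figures
taken from the script.

```r
# Illustrative values only (assumptions, not taken from the script)
lambda_demo <- 2.3   # assumed Poisson mean
n_demo      <- 100   # assumed sample size

# Expected count for a class far out in the tail, e.g. the observed value 44
expected_44 <- n_demo * dpois(44, lambda_demo)

# Contribution of that class to the chi-square statistic when it holds
# exactly one observation
observed_44     <- 1
contribution_44 <- (observed_44 - expected_44)^2 / expected_44

print(expected_44)      # vanishingly small
print(contribution_44)  # astronomically large
```

Because the expected count sits in the denominator, one observation in
a class with a near-zero expected count is enough to swamp every other
term of the sum.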

(A) By convention, if the expected frequency of a class is less than 5,
we pool such classes into a new class whose expected frequency is
greater than 5, and combine the observed frequencies accordingly.

X       Oi    Ei    ((Oi - Ei)^2)/Ei
0        0    10       9.96
1       72    23     103.79
2       17    27       3.54
3        5    21      11.85
4        3    12       6.71
5        4     9       2.51
Total  101   101     138.34

When I apply this logic in Excel, I get a reasonable result (138.34).
However, if I do not apply this logic, the chi-square statistic in
Excel is also as high as 4.70043E+37.
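
The pooled calculation can be sketched in R as follows. The observed
counts are copied from the table; treating the last row as "5 and
above", and taking the four values in that class to be 5, 14, 38 and 44
(the large values visible in the data), are assumptions. All variable
names are illustrative.

```r
# Observed counts from the table; the last class is assumed to mean "5 and above"
observed <- c("0" = 0, "1" = 72, "2" = 17, "3" = 5, "4" = 3, "5+" = 4)
n        <- sum(observed)   # 101

# Sample mean implied by the table, with the four values >= 5
# assumed to be 5, 14, 38 and 44
lambda_hat <- (1 * 72 + 2 * 17 + 3 * 5 + 4 * 3 + 5 + 14 + 38 + 44) / n

# Poisson probabilities for 0..4, plus the upper tail P(X >= 5)
# for the pooled class
probs    <- c(dpois(0:4, lambda_hat),
              ppois(4, lambda_hat, lower.tail = FALSE))
expected <- n * probs

# Pearson chi-square statistic over the six pooled classes
chi_sq <- sum((observed - expected)^2 / expected)

print(round(expected, 2))
print(round(chi_sq, 2))
```

Using `ppois(..., lower.tail = FALSE)` for the last class keeps the
probabilities summing to 1, so no separate "everything else" class is
needed.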

My question: how do I modify my R script so that the logic in (A) is
applied, i.e. classes are pooled until the expected frequency of each
class is greater than 5 (with the observed frequencies pooled
accordingly), giving a reasonable value of the chi-square test
statistic?
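
One possible way to automate rule (A) is sketched below; it is an
illustration under stated assumptions, not a definitive answer.
`pool_classes()` is a hypothetical helper (not part of base R) that
merges adjacent classes from left to right until each pooled class has
an expected frequency of at least 5. The sample `x` is rebuilt from the
class counts in the table above rather than copied from the raw vector.

```r
# Hypothetical helper: pool adjacent classes until every pooled class
# has an expected frequency of at least `min_expected`.
pool_classes <- function(observed, expected, min_expected = 5) {
  stopifnot(length(observed) == length(expected))
  pooled_obs <- numeric(0)
  pooled_exp <- numeric(0)
  acc_obs <- 0
  acc_exp <- 0
  for (k in seq_along(expected)) {
    acc_obs <- acc_obs + observed[k]
    acc_exp <- acc_exp + expected[k]
    if (acc_exp >= min_expected) {      # class is large enough: close it
      pooled_obs <- c(pooled_obs, acc_obs)
      pooled_exp <- c(pooled_exp, acc_exp)
      acc_obs <- 0
      acc_exp <- 0
    }
  }
  # Leftover tail mass is merged into the last pooled class
  if (acc_exp > 0 || acc_obs > 0) {
    m <- length(pooled_exp)
    if (m == 0) {
      pooled_obs <- acc_obs
      pooled_exp <- acc_exp
    } else {
      pooled_obs[m] <- pooled_obs[m] + acc_obs
      pooled_exp[m] <- pooled_exp[m] + acc_exp
    }
  }
  list(observed = pooled_obs, expected = pooled_exp)
}

# Sample rebuilt from the table's class counts (an assumption)
x          <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)
lambda_hat <- mean(x)
k_max      <- max(x)

# Classes 0, 1, ..., k_max, plus a final class for everything above k_max
obs      <- c(tabulate(x + 1, nbins = k_max + 1), 0)
exp_freq <- length(x) * c(dpois(0:k_max, lambda_hat),
                          ppois(k_max, lambda_hat, lower.tail = FALSE))

pooled <- pool_classes(obs, exp_freq)
chi_sq <- sum((pooled$observed - pooled$expected)^2 / pooled$expected)

# The same statistic via chisq.test(), using the pooled probabilities
tst <- chisq.test(pooled$observed,
                  p = pooled$expected / sum(pooled$expected))

print(pooled)
print(chi_sq)
```

Pooling left to right and folding the leftover tail into the last class
is only one convention; other groupings give somewhat different
statistics, so the grouping rule should be reported with the result.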

My R script is given below:
________________________________________________________

# R SCRIPT for Fitting Poisson Distribution

No_of_Frauds <- c(
1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

N       <- length(No_of_Frauds)
Average <- mean(No_of_Frauds)
Lambda  <- Average
i       <- c(0:(N-1))
pmf     <- dpois(i, Lambda, log = FALSE)

# ----------------------------------------------------------------------------

# Ho: The data follow Poisson Distribution Vs H1: Not Ho

# observed frequencies (Oi)

variable.cnts     <- table(No_of_Frauds)
variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), Lambda)

# extra class for all values that do not occur in the sample
variable.cnts     <- c(variable.cnts, 0)
variable.cnts.prs <- c(variable.cnts.prs, 1 - sum(variable.cnts.prs))

tst               <- chisq.test(variable.cnts, p = variable.cnts.prs)

chi_squared <- as.numeric(unclass(tst)$statistic)
p_value     <- as.numeric(unclass(tst)$p.value)
df          <- tst[2]$parameter

# critical values at the 1%, 5% and 10% levels of significance
cv1 <- qchisq(p = .01, df = tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)
cv2 <- qchisq(p = .05, df = tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)
cv3 <- qchisq(p = .1,  df = tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)

#-----------------------------------------------------------------------------

# Expected value

# variable.cnts.prs * sum(variable.cnts)

# if chi_squared > cv, reject Ho at the alpha level of significance

#-----------------------------------------------------------------------------

if (chi_squared > cv1) {
  Conclusion1 <- 'Sample does not come from the postulated probability distribution at 1% los'
} else {
  Conclusion1 <- 'Sample comes from the postulated probability distribution at 1% los'
}

if (chi_squared > cv2) {
  Conclusion2 <- 'Sample does not come from the postulated probability distribution at 5% los'
} else {
  Conclusion2 <- 'Sample comes from the postulated probability distribution at 5% los'
}

if (chi_squared > cv3) {
  Conclusion3 <- 'Sample does not come from the postulated probability distribution at 10% los'
} else {
  Conclusion3 <- 'Sample comes from the postulated probability distribution at 10% los'
}

#-----------------------------------------------------------------------------

# Printing RESULTS

print(chi_squared)
print(p_value)
print(df)
print(cv1)
print(cv2)
print(cv3)
print(Conclusion1)
print(Conclusion2)
print(Conclusion3)

##### End of R Script ########

________________________________________________________
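
One related point, offered as a hedged sketch rather than a fix to the
script above: for a goodness-of-fit test with given probabilities,
`chisq.test()` uses k - 1 degrees of freedom for k classes, whereas the
usual textbook test subtracts one further degree of freedom for each
parameter estimated from the data (here, the Poisson mean). The numbers
below (six classes, statistic 138.34) come from the table earlier in
this message; the variable names are illustrative.

```r
chi_sq      <- 138.34  # statistic from the pooled table
n_classes   <- 6       # classes 0, 1, 2, 3, 4 and the pooled upper class
n_estimated <- 1       # lambda is estimated from the sample

# Degrees of freedom adjusted for the estimated parameter
df_adjusted <- n_classes - 1 - n_estimated

p_value    <- pchisq(chi_sq, df = df_adjusted, lower.tail = FALSE)
critical_5 <- qchisq(0.05, df = df_adjusted, lower.tail = FALSE)

print(df_adjusted)
print(p_value)
print(chi_sq > critical_5)   # TRUE means Ho is rejected at the 5% level
```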

I sincerely apologize for taking the liberty of writing such a long
mail. Since I am very new to the R language, could someone please help
me out?

Thanking you in advance for your kind co-operation.

Ashok (Mumbai, India)
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.