# Today we studied comparing counts. We found there were 644 wepages that had cacophony mosquito *546 where * was any digit form 1-9 (Actually I forgot to write down the actual three numbers we used - sorry!). Our counts for each digit were our observations: Obs=c(229,113,73,56,41,37,37,30,28) Tot=sum(Obs) Tot #which we could view as a table Cats=1:9 rbind(Cats,Obs) # Before looking at the data we decided that it would be natural to test the hypothesis that the each digit would be equally likely. H0: P(*=k)=1/9. Under the hypothesis our expected counts would be: Exp=(Tot/9)*rep(1,9) # and we can now compute our Chi^2 value and its associated P-Vlaue. Chi2=sum((Obs-Exp)^2/Exp) Chi2 df=length(Obs)-1 X=seq(0,3*df,by=0.01) plot(X,dchisq(X,df),type="l",main=paste("The Chi-squared distribution, df=",df),xlab="Values",ylab="Probability Density") P=1-pchisq(Chi2,df) P # How so our Null was REALLY wrong. Here's the null the IRS would use H0: P(*=k)= (log(k+1)-log(k))/log(10), called the Benford distribution. Let us test it: pBen=(log((1:9)+1) -log(1:9))/log(10) Exp=pBen*Tot Chi2=sum((Obs-Exp)^2/Exp) Chi2 df=length(Obs)-1 X=seq(0,max(3*df,Chi2+df),by=0.01) plot(X,dchisq(X,df),type="l",main=paste("The Chi-squared distribution, df=",df),xlab="Values",ylab="Probability Density") points(c(Chi2,Chi2),c(0,max(dchisq(X,df))),type="l",col="red") P=1-pchisq(Chi2,df) P # This H0 is completely consistent with our data, though we cannot prove it. ##### #Let us now verify that the Chi^2 is doing the right thing. Let us assume that our first Null H0: P(*=k)=1/9 is true fake a Chi^2 value: N=644 Data=round(runif(N,.5,9.5)) Obs=as.vector(table(Data)) Obs Exp=(N/9)*rep(1,9) Exp Chi2=sum((Obs-Exp)^2/Exp) Chi2 # Now let us do this M times and compare a histogram of the results to the Chi^2 distribution: H<-c() M=1000 N=644 for (i in 1:M){ Data=round(runif(N,.5,9.5)) Obs=as.vector(table(Data)) Exp=(N/9)*rep(1,9) Chi2=sum((Obs-Exp)^2/Exp) H<-c(Chi2,H) } hist(H,freq=FALSE,main=paste("Simulating Chi^2 via Count Data, df=",df),xlab="Values",ylab="Probability Density",col="yellow") X=seq(0,max(3*df,max(H)),by=0.01) points(X,dchisq(X,df),type="l") # Not bad, eh?