# Today we studied comparing counts. We found there were  644 wepages that had cacophony mosquito *546 where * was any digit form 1-9 (Actually I forgot to write down the actual three numbers we used - sorry!).  Our counts for each digit were our observations: 

 Obs=c(229,113,73,56,41,37,37,30,28)
Tot=sum(Obs)
Tot

#which we could view as a table 

Cats=1:9
rbind(Cats,Obs)


# Before looking at the data we decided that it would be natural to test the hypothesis that the each digit would be equally likely.  H0: P(*=k)=1/9.  Under the hypothesis our expected counts would be:

Exp=(Tot/9)*rep(1,9)

# and we can now compute our Chi^2 value and its associated P-Vlaue. 

Chi2=sum((Obs-Exp)^2/Exp)
Chi2
df=length(Obs)-1
X=seq(0,3*df,by=0.01)
plot(X,dchisq(X,df),type="l",main=paste("The Chi-squared distribution, df=",df),xlab="Values",ylab="Probability Density")
P=1-pchisq(Chi2,df)
P

# How so our Null was REALLY wrong.  Here's the null the IRS would use H0: P(*=k)= (log(k+1)-log(k))/log(10), called the Benford distribution.  Let us test it:

pBen=(log((1:9)+1) -log(1:9))/log(10)
Exp=pBen*Tot
Chi2=sum((Obs-Exp)^2/Exp)
Chi2
df=length(Obs)-1
X=seq(0,max(3*df,Chi2+df),by=0.01)
plot(X,dchisq(X,df),type="l",main=paste("The Chi-squared distribution, df=",df),xlab="Values",ylab="Probability Density")
points(c(Chi2,Chi2),c(0,max(dchisq(X,df))),type="l",col="red")
P=1-pchisq(Chi2,df)
P

# This H0 is  completely consistent with our data, though we cannot prove it. 

#####

#Let us now verify that the Chi^2 is doing the right thing.  Let us assume that our first Null  H0: P(*=k)=1/9 is true fake a Chi^2 value: 

N=644
Data=round(runif(N,.5,9.5))
Obs=as.vector(table(Data))
Obs
Exp=(N/9)*rep(1,9)
Exp
Chi2=sum((Obs-Exp)^2/Exp)
Chi2

# Now let us do this M times and  compare a histogram of the results to the Chi^2 distribution:

H<-c()
M=1000
N=644

for (i in 1:M){
Data=round(runif(N,.5,9.5))
Obs=as.vector(table(Data))
Exp=(N/9)*rep(1,9)
Chi2=sum((Obs-Exp)^2/Exp)
H<-c(Chi2,H)
}
hist(H,freq=FALSE,main=paste("Simulating Chi^2 via Count Data, df=",df),xlab="Values",ylab="Probability Density",col="yellow")
X=seq(0,max(3*df,max(H)),by=0.01)
points(X,dchisq(X,df),type="l")

# Not bad, eh?