# Inferences from a regression analysis.
# Here we explore how to make inferences from a regression analysis. Namely, we are
# exploring a situation where we believe that Y = beta1*X + beta0 + E for some unknown
# beta1 and beta0 and a normal error E whose value is generated independently of X's value.

# Let us play the lord and choose for our simulations:
beta1=2
beta0=-42

# Here is some data:
N=100
X=rnorm(N,37.5,5)
E=rnorm(N,0,5)
Y=beta1*X+beta0+E
plot(X,Y,main="The Scatterplot of our simulated data")

# The name of the game is to guess beta1 from the data and then express our confidence
# in this guess. Recall that the guesses are called b1 and b0, taken from the line of
# best fit:
L=lm(Y~X)
b1=L[[1]][2]
b0=L[[1]][1]

# In practice, before deciding that inferences are reasonable to make, we would get the
# residuals to explore the error's independence and normality:
Yhat=b1*X+b0
Res<-Y-Yhat
plot(X,Res,main="Our X data versus the residuals",ylab="The Residuals",
     ylim=c(-max(abs(Res)),max(abs(Res))))
points(c(min(X),max(X)),c(0,0),col="red",type="l")
hist(Res,main="Histogram of the residuals",freq=FALSE)
se=sqrt(sum(Res^2)/(N-2))
xgrid=seq(-4*se,4*se,by=0.01)   # a fresh name, so we do not overwrite our X data
points(xgrid,dnorm(xgrid,0,se),col="blue",type="l")

# If we believe that this justifies our faith that E is independent of X and normal,
# then we construct a 100C% confidence interval for beta1 as follows:
C=.9
se=sqrt(sum(Res^2)/(N-2))
sx=sd(X)
SEb1=se/(sx*sqrt(N-1))
df=N-2
tstar=abs(qt((1-C)/2,df))
CI=c(b1-tstar*SEb1,b1,b1+tstar*SEb1)
CI

# We can verify that roughly 90% of the time the true value of beta1 is in our randomly
# produced confidence interval.
Trials=100
Confs<-c()
for (i in 1:Trials) {
  X=rnorm(N,37.5,5)
  E=rnorm(N,0,5)
  Y=beta1*X+beta0+E
  L=lm(Y~X)
  b1=L[[1]][2]
  b0=L[[1]][1]
  Yhat=b1*X+b0
  Res<-Y-Yhat
  se=sqrt(sum(Res^2)/(N-2))
  sx=sd(X)
  SEb1=se/(sqrt(N-1)*sx)
  df=N-2
  tstar=abs(qt((1-C)/2,df))
  CI=c(b1-tstar*SEb1,b1,b1+tstar*SEb1)
  Confs<-rbind(Confs,CI)
}
left=min(Confs[,1])
right=max(Confs[,3])
Count=0
plot(c(left,left,right,right),c(Trials+1,0,0,Trials+1),type="l",
     main=paste(Trials,"Confidence",C),ylab="Trial",xlab="Slope")
for (j in 1:Trials){
  points(Confs[j,c(1,3)],c(j,j),type="l",col="red")
  points(Confs[j,1],c(j),pch=18)
  points(Confs[j,3],c(j),pch=18)
  if (Confs[j,1]>beta1) Count=Count+1   # interval lies entirely above beta1
  if (Confs[j,3]<beta1) Count=Count+1   # interval lies entirely below beta1
}
(Trials-Count)/Trials   # the fraction of intervals that captured beta1

# Now we test the null hypothesis H0: beta1=0 against the one-sided alternative
# Ha: beta1>0, this time with a very small sample.
N=5
beta1=2
beta0=-42
X=rnorm(N,37.5,5)
E=rnorm(N,0,5)
Y=beta1*X+beta0+E
L=lm(Y~X)
b1=L[[1]][2]
b0=L[[1]][1]
Yhat=b1*X+b0
Res<-Y-Yhat
se=sqrt(sum(Res^2)/(N-2))
sx=sd(X)
SEb1=se/(sx*sqrt(N-1))
df=N-2
t=(b1-0)/SEb1
P=1-pt(t,df)
P

# By running this a few times, we see that even for N=5 we often find ourselves in a
# good position to reject the null hypothesis (which we would like to do, since in
# truth beta1=2).
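
# As a sanity check on our hand-rolled formulas, we can compare against R's built-in
# summary() and confint(), which compute the same standard error, t statistic, and
# interval for the slope. (This check is a supplementary sketch added to the walkthrough
# above; note that summary() reports a two-sided P-value, so we halve it to match our
# one-sided alternative, which is valid here whenever t>0.)
summary(L)$coefficients                      # row "X": estimate, SE (our SEb1), t, two-sided P
summary(L)$coefficients["X","Pr(>|t|)"]/2    # one-sided P-value; matches our P when t>0
confint(L,"X",level=C)                       # matches c(b1-tstar*SEb1, b1+tstar*SEb1)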
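
# To make "often" precise, we can repeat the whole N=5 experiment many times and record
# how frequently P drops below a cutoff. (This replication loop is a supplementary
# sketch; the 0.05 cutoff and the 1000 repetitions are our own choices, not part of the
# walkthrough above.)
Reps=1000
Rejections=0
for (i in 1:Reps) {
  X=rnorm(N,37.5,5)
  E=rnorm(N,0,5)
  Y=beta1*X+beta0+E
  L=lm(Y~X)
  b1=L[[1]][2]
  b0=L[[1]][1]
  Res<-Y-(b1*X+b0)
  se=sqrt(sum(Res^2)/(N-2))
  SEb1=se/(sd(X)*sqrt(N-1))
  t=(b1-0)/SEb1
  if (1-pt(t,N-2)<0.05) Rejections=Rejections+1
}
Rejections/Reps   # the fraction of trials in which we reject H0 at the 0.05 level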