# Inferences from a regression analysis.
# Here we explore how to make inferences from a regression analysis. Namely, we are
# exploring a situation where we believe that Y = beta1*X + beta0 + E for some unknown
# beta1 and beta0 and a normal error E whose value is generated independently of X's value.

# Let us play the lord and choose for our simulations:
beta1=2
beta0=-42

# Here is some data:
N=100
X=rnorm(N,37.5,5)
E=rnorm(N,0,5)
Y=beta1*X+beta0+E
plot(X,Y,main="The Scatterplot of our simulated data")

# The name of the game is to guess beta1 from the data and then express our confidence
# in this guess. Recall that the guesses are called b1 and b0, taken from the line of
# best fit:
L=lm(Y~X)
b1=L[[1]][2]
b0=L[[1]][1]

# In practice, before deciding that inferences are reasonable to make, we would get the
# residuals to explore the error's independence and normality:
Yhat=b1*X+b0
Res<-Y-Yhat
plot(X,Res,main="Our X data versus the residuals",ylab="The Residuals",
     ylim=c(-max(abs(Res)),max(abs(Res))))
points(c(min(X),max(X)),c(0,0),col="red",type="l")
hist(Res,main="Histogram of the residuals",freq=FALSE)
se=sqrt(sum(Res^2)/(N-2))
xgrid=seq(-4*se,4*se,by=0.01)   # a fresh name, so we do not overwrite our X data
points(xgrid,dnorm(xgrid,0,se),col="blue",type="l")

# If we believe that this justifies our faith that E is independent of X and normal,
# then we construct a 100C% confidence interval for beta1 as follows:
C=.9
se=sqrt(sum(Res^2)/(N-2))
sx=sd(X)
SEb1=se/(sx*sqrt(N-1))
df=N-2
tstar=abs(qt((1-C)/2,df))
CI=c(b1-tstar*SEb1,b1,b1+tstar*SEb1)
CI

# We can verify that roughly 90% of the time the true value of beta1 is in our randomly
# produced confidence interval.
Trials=100
Confs<-c()
for (i in 1:Trials) {
  X=rnorm(N,37.5,5)
  E=rnorm(N,0,5)
  Y=beta1*X+beta0+E
  L=lm(Y~X)
  b1=L[[1]][2]
  b0=L[[1]][1]
  Yhat=b1*X+b0
  Res<-Y-Yhat
  se=sqrt(sum(Res^2)/(N-2))
  sx=sd(X)
  SEb1=se/(sqrt(N-1)*sx)
  df=N-2
  tstar=abs(qt((1-C)/2,df))
  CI=c(b1-tstar*SEb1,b1,b1+tstar*SEb1)
  Confs<-rbind(Confs,CI)
}
left=min(Confs[,1])
right=max(Confs[,3])
Count=0
plot(c(left,left,right,right),c(Trials+1,0,0,Trials+1),type="l",
     main=paste(Trials,"Confidence",C),ylab="Trial",xlab="Slope")
for (j in 1:Trials){
  points(Confs[j,c(1,3)],c(j,j),type="l",col="red")
  points(Confs[j,1],c(j),pch=18)
  points(Confs[j,3],c(j),pch=18)
  if (Confs[j,1]>beta1) Count=Count+1   # interval lies entirely above beta1
  if (Confs[j,3]<beta1) Count=Count+1   # interval lies entirely below beta1
}
(Trials-Count)/Trials   # the fraction of intervals that captured beta1

# Now we test the null hypothesis H0: beta1=0 against the one-sided alternative
# Ha: beta1>0, this time with a very small sample.
N=5
beta1=2
beta0=-42
X=rnorm(N,37.5,5)
E=rnorm(N,0,5)
Y=beta1*X+beta0+E
L=lm(Y~X)
b1=L[[1]][2]
b0=L[[1]][1]
Yhat=b1*X+b0
Res<-Y-Yhat
se=sqrt(sum(Res^2)/(N-2))
sx=sd(X)
SEb1=se/(sx*sqrt(N-1))
df=N-2
t=(b1-0)/SEb1
P=1-pt(t,df)
P

# By running this a few times, we see that even for N=5 we often find ourselves in a
# good position to reject the null hypothesis (which we would like to do, since in
# truth beta1=2).
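
# As a sanity check on our hand-rolled formulas, we can compare against R's built-in
# summary() and confint(), which compute the same standard error, t statistic, and
# interval for the slope. (This check is a supplementary sketch added to the walkthrough
# above; note that summary() reports a two-sided P-value, so we halve it to match our
# one-sided alternative, which is valid here whenever t>0.)
summary(L)$coefficients                      # row "X": estimate, SE (our SEb1), t, two-sided P
summary(L)$coefficients["X","Pr(>|t|)"]/2    # one-sided P-value; matches our P when t>0
confint(L,"X",level=C)                       # matches c(b1-tstar*SEb1, b1+tstar*SEb1)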
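
# To make "often" precise, we can repeat the whole N=5 experiment many times and record
# how frequently P drops below a cutoff. (This replication loop is a supplementary
# sketch; the 0.05 cutoff and the 1000 repetitions are our own choices, not part of the
# walkthrough above.)
Reps=1000
Rejections=0
for (i in 1:Reps) {
  X=rnorm(N,37.5,5)
  E=rnorm(N,0,5)
  Y=beta1*X+beta0+E
  L=lm(Y~X)
  b1=L[[1]][2]
  b0=L[[1]][1]
  Res<-Y-(b1*X+b0)
  se=sqrt(sum(Res^2)/(N-2))
  SEb1=se/(sd(X)*sqrt(N-1))
  t=(b1-0)/SEb1
  if (1-pt(t,N-2)<0.05) Rejections=Rejections+1
}
Rejections/Reps   # the fraction of trials in which we reject H0 at the 0.05 level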