Solutions for Homework Sheet 2

Problem 1

Part (a)

In [1]:
# Read the data
from read_data_fx import read_tb_data
data = read_tb_data(1, 27)
# Age is X (predictor variable)
X = data[:, 1]
# Muscle mass is Y (response variable)
Y = data[:, 0]
In [2]:
# import the packages required 
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt 


barX = np.mean(X); barY = np.mean(Y)
XminusbarX = X - barX; YminusbarY = Y - barY
b1 = sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0 = barY - b1*barX
Yhat = b0 + b1*X
e_i = Y - Yhat  # residuals
sum_of_residuals = np.sum(e_i)
sum_of_squares_of_residuals = np.sum(e_i**2)


print('check all the six properties for the fitted line')

print('sum of residuals = %4.4f' % sum_of_residuals, ' ---property 1')
print('sum of squares of residuals %4.4f' % sum_of_squares_of_residuals, '  ---property 2')
print('sum of Y_i %4.4f' % np.sum(Y), '  ---property 3')
print('sum of Yhat %4.4f' % np.sum(Yhat), '  ---property 3')
print('sum of X_ie_i %4.4f' % np.sum(X*e_i), '  ---property 4')
print('sum of Yhat e_i %4.4f' % np.sum(Yhat*e_i), '   ---property 5')




%matplotlib inline
plt.figure(figsize=(20, 10))
# scatter plot
plt.scatter(X, Y, c='g', s=220, marker='s')
# x-label of the scatter plot
plt.xlabel('Age', fontsize=25)
# y-label of the scatter plot
plt.ylabel('Muscle Mass', fontsize=25)
# title of the scatter plot
plt.title('Age vs. Muscle Mass', fontsize=25)
# point (barX, barY)
plt.plot(barX, barY, marker='*', markersize=70, color='b')
# regression line
plt.plot(X, Yhat, 'r-', linewidth=2)
print('Blue point represents the point xbar and ybar -- property 6')
check all the six properties for the fitted line
sum of residuals = 0.0000  ---property 1
sum of squares of residuals 3874.4475   ---property 2
sum of Y_i 5098.0000   ---property 3
sum of Yhat 5098.0000   ---property 3
sum of X_ie_i 0.0000   ---property 4
sum of Yhat e_i 0.0000    ---property 5
Blue point represents the point xbar and ybar -- property 6
Estimated parameters for the line
In [3]:
print('b0=%4.5f' % b0, 'and', 'b1=%4.5f' % b1)
b0=156.34656 and b1=-1.19000
Does a linear regression function appear to give a good fit here? Does your plot support the anticipation that muscle mass decreases with age?

The linear regression function does appear to give a good fit: we observe a roughly linear decrease in muscle mass with age, and the negative value of $b_1$ supports the anticipation that muscle mass decreases with age.
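As a sanity check (not required by the problem), the hand-computed estimates can be compared against an OLS fit from statsmodels, which was already imported above; the names X_design and ols_fit are illustrative.

# Cross-check of b0 and b1 with statsmodels OLS; the fitted parameters
# should match the hand-computed values up to rounding.
X_design = sm.add_constant(X)        # design matrix with an intercept column
ols_fit = sm.OLS(Y, X_design).fit()
print('statsmodels intercept = %.5f, slope = %.5f' % (ols_fit.params[0], ols_fit.params[1]))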

Part (b)

(1) A point estimate of the difference in the mean muscle mass for women differing in age by one year.

We know $\hat{Y}_i= b_0+b_1{X}_i$ and $\hat{Y}_{i+1}-\hat{Y}_i = b_1({X}_{i+1}-X_i)$. We are given the age difference to be one year, i.e., ${X}_{i+1}-X_i=1$, so in this case $\hat{Y}_{i+1}-\hat{Y}_i = b_1$. Therefore, the point estimate of the difference in mean muscle mass for women differing in age by one year is $b_1=-1.19$.
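As a quick check (the ages 50 and 51 below are hypothetical, not part of the data), the difference of the fitted values one year apart reproduces $b_1$:

# Illustrative check with a hypothetical pair of ages one year apart (50 and 51)
print('Yhat(51) - Yhat(50) = %.5f' % ((b0 + b1*51) - (b0 + b1*50)))
print('b1                  = %.5f' % b1)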

(2) Point estimate of the mean muscle mass for women aged X=60 years.

This will be given by $b_0+b_1(60)$.

In [4]:
Yhat_60 = b0 + 60*b1
print('Point estimate of the mean muscle mass for women aged X=60 years is %4.2f' % Yhat_60)
Point estimate of the mean muscle mass for women aged X=60 years is 84.95
(3) The value of the residual for the eighth case
In [5]:
print('value of the residual for the eighth case=%4.3f' % e_i[7])
value of the residual for the eighth case=4.443
(4) A point estimate of $\sigma^2$

Point estimate for $\sigma^2$ is given by

$s^2=MSE=\frac{SSE}{n-2}=\frac{\sum{(Y_i-\hat{Y_i})^2}}{n-2}$

In [6]:
n = len(X); SSE = sum_of_squares_of_residuals; MSE = SSE/(n-2)
print(r'point estimate of \sigma^2 =%4.1f' % MSE)
point estimate of \sigma^2 =66.8

Problem 2

In [7]:
# read the data file
with open('temp1.txt', 'r') as f:
    lines = f.readlines()

# Copy the data into a structured numpy array: strip the trailing '\n' from
# each line and split it on tabs into columns
xdata = np.zeros(len(lines)-1, dtype={'names': ['temp', 'lat', 'long'], 'formats': ['i4', 'f8', 'f8']})
for j in range(1, len(lines)):
    line = lines[j].strip('\n').split('\t')
    xdata[j-1] = (line[1], line[2], line[3])

# latitude is X (predictor variable), temperature is Y (response variable) and Z is longitude (color)
X = xdata['lat']; Y = xdata['temp']; Z = xdata['long']
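An equivalent, more compact read is sketched below, assuming temp1.txt is tab-delimited with a one-line header and the same column order used in the loop above (column 0 skipped, then temperature, latitude, longitude); the names raw, X_alt, Y_alt, Z_alt are illustrative.

# Alternative read with np.genfromtxt (same assumptions as the loop above)
raw = np.genfromtxt('temp1.txt', delimiter='\t', skip_header=1, usecols=(1, 2, 3))
X_alt, Y_alt, Z_alt = raw[:, 1], raw[:, 0], raw[:, 2]   # latitude, temperature, longitude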
In [8]:
# Estimate the regression function 

barX = np.mean(X); barY = np.mean(Y)
XminusbarX = X - barX; YminusbarY = Y - barY
b1 = sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0 = barY - b1*barX
Yhat=b0+b1*X

print('b0=%4.5f' % b0, 'and', 'b1=%4.5f' % b1)
b0=108.72774 and b1=-2.10959
a. Estimate the regression function and plot it on a scatter plot of X, Y. The color of each marker should represent the longitude.
c. Show that your fitted line goes through the point $(\overline{X},\overline{Y})$.
In [9]:
plt.figure(figsize=(20,10)) 
plt.scatter(X,Y,c =Z,s = 150,cmap='winter')


plt.xlabel('Latitude', fontsize=25)
plt.ylabel('Temperature',fontsize=25)
plt.title('Regression of latitude vs temperature',fontsize=25)
cb=plt.colorbar()
cb.set_label('Longitude',fontsize=25)
# regression line
plt.plot(X, Yhat, 'k-', linewidth=2)
# point (barX, barY)
plt.plot(barX, barY, marker='*', markersize=40, color='r')
Out[9]:
[<matplotlib.lines.Line2D at 0x10f5de590>]
b. Plot the distribution of the residuals
In [10]:
# residuals
e_i = Y - Yhat

# distribution of the residuals

n, bins, patches = plt.hist(e_i, 100, density=True, color='cyan')
plt.xlabel('Residuals', fontsize=25)
plt.ylabel('Density', fontsize=25)
Out[10]:
<matplotlib.text.Text at 0x10f98aa50>

Problem 3

Part (a)

Derivation of $b_0$ and $b_1$:

To find the estimates of the regression parameters $\beta_0$ and $\beta_1$ we use the method of least squares, where we minimize the quantity:

\begin{equation} Q=\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2 \end{equation}

Say the minimum of $Q$ over $\beta_0$ and $\beta_1$ occurs at $b_0$ and $b_1$, respectively. Then

$\frac{\partial{Q}}{\partial{\beta_0}} \bigg|_{\beta_0=b_0} =0$ and $\frac{\partial{Q}}{\partial{\beta_1}} \bigg|_{\beta_1=b_1} =0$ and we know that

\begin{equation} \frac{\partial{Q}}{\partial{\beta_0}}=-2\sum(Y_i-\beta_0-\beta_1X_i) \end{equation}

\begin{equation} \frac{\partial{Q}}{\partial{\beta_1}}=-2\sum X_i(Y_i-\beta_0-\beta_1X_i). \end{equation}

Therefore,

\begin{equation} -2\sum(Y_i-b_0-b_1X_i)=0
\tag{1} \end{equation}

\begin{equation} -2\sum X_i(Y_i-b_0-b_1X_i)=0
\tag{2} \end{equation}

The above two equations can be simplified into the normal equations

\begin{equation} \sum Y_i= nb_0+b_1\sum X_i
\tag{3} \end{equation}

\begin{equation} \sum X_iY_i=b_0\sum X_i+b_1\sum {X_i}^2
\tag{4} \end{equation}

From equation (3) we obtain

\begin{equation} b_0=\frac{1}{n}\bigg(\sum Y_i-b_1\sum X_i\bigg)
\end{equation}

That is

\begin{equation} \boxed{b_0=\overline{Y}-b_1\overline{X}} \end{equation}

From equation (4), on substituting $b_0$, we obtain

$b_1=\frac{\sum{X_i}{Y_i}- \overline{Y}\sum{X_i}}{\sum {X_i}^2-\overline{X} \sum{X}_i} = \frac{\sum{X_i}{Y_i}-n\overline{X}\overline{Y}}{\sum {X_i}^2-n\overline{X}^2} = \frac{\sum({X_i}{Y_i}-X_i\overline{Y})+ \sum\overline{X}\overline{Y} -\sum Y_i\overline{X} }{\sum ({X_i}^2-X_i\overline{X})+ \sum\overline{X}^2-\sum X_i\overline{X} } $

This expression can be further simplified into

\begin{equation} \boxed{b_1=\frac{\sum(X_i-\overline{X})(Y_i-\overline{Y})}{\sum(X_i-\overline{X})^2} } \end{equation}
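As a minimal numerical sketch (reusing whichever predictor and response arrays X, Y are in scope from the cells above), the raw-sum and centered forms of $b_1$ derived here agree:

# The two algebraically equivalent expressions for b1 give the same value
num_raw = np.sum(X*Y) - len(X)*np.mean(X)*np.mean(Y)
den_raw = np.sum(X**2) - len(X)*np.mean(X)**2
b1_raw = num_raw/den_raw
b1_centered = np.sum((X - np.mean(X))*(Y - np.mean(Y)))/np.sum((X - np.mean(X))**2)
print('raw form: %.6f, centered form: %.6f' % (b1_raw, b1_centered))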

Residuals

Residuals are defined as $e_i=Y_i-\hat{Y_i}$, where $\hat{Y_i}=b_0+b_1X_i$.

Property 1

\begin{equation} \sum{e_i}=0 \end{equation}

Proof:

$\sum{e_i}=\sum(Y_i-\hat{Y_i})=\sum(Y_i-b_0-b_1X_i)=n\overline{Y}-nb_0-nb_1\overline{X}=n\overline{Y}-n(\overline{Y}-b_1\overline{X})-nb_1\overline{X}=0 $

Property 2

$\sum{e_i}^2$ is a minimum

Proof:

$b_0$ and $b_1$ are those values of $\beta_0$ and $\beta_1$, respectively, that minimize $\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2$. Therefore, $\sum_{i=1}^{n} (Y_i-b_0-b_1X_i)^2$ is a minimum, i.e., $\sum{e_i}^2$ is a minimum.
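A minimal numerical sketch of this minimality, reusing the arrays X, Y and the fitted b0, b1 from the earlier cells (the perturbations 1.0 and 0.1 are arbitrary illustrative values):

# Perturbing (b0, b1) away from the least-squares values increases the SSE
def sse(beta0, beta1):
    return np.sum((Y - beta0 - beta1*X)**2)

print('SSE at (b0, b1):       %.4f' % sse(b0, b1))
print('SSE at (b0 + 1, b1):   %.4f' % sse(b0 + 1.0, b1))
print('SSE at (b0, b1 + 0.1): %.4f' % sse(b0, b1 + 0.1))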

Property 3

\begin{equation} \sum Y_i=\sum \hat{Y}_i \end{equation}

Proof:

From property 1 we know $\sum{e_i}=0$, or $\sum{(Y_i-\hat{Y_i})}=0$, implying that

$\sum Y_i=\sum \hat{Y}_i$

Property 4

\begin{equation} \sum{X_ie_i}=0 \end{equation}

Proof:

$\sum{X_ie_i}=\sum{X_i(Y_i-\hat{Y_i})}= \sum{X_i (Y_i-b_0-b_1X_i)}$

We see from Eq. (2) above that $\sum{X_i (Y_i-b_0-b_1X_i)}=0$, therefore we can conclude that $\sum{X_ie_i}=0$

Property 5

\begin{equation} \sum{\hat{Y_i}e_i}=0 \end{equation}

Proof:

$\sum{\hat{Y_i}e_i}=\sum{(b_0+b_1X_i)e_i } =b_0\sum {e_i} + b_1 \sum {X_ie_i} $

From property 1 and property 4 we know that $\sum {e_i}=0$ and $\sum {X_ie_i}=0$, hence

$\sum{\hat{Y_i}e_i}=0$

Property 6

The regression line always goes through the point $(\overline{X},\overline{Y})$

Proof:

$\hat{Y_i}=b_0+b_1X_i$; after substituting for $b_0$ we have $\hat{Y_i}=\overline{Y}-b_1 \overline{X} +b_1X_i$, i.e.,

$\hat{Y_i}=\overline{Y}+b_1(X_i-\overline{X})$

This equation is satisfied at the point $(\overline{X},\overline{Y})$: taking $X_i=\overline{X}$ gives $\hat{Y_i}=\overline{Y}$. Therefore we can conclude that the regression line $\hat{Y_i}=b_0+b_1X_i$ always goes through $(\overline{X},\overline{Y})$.
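As a quick numerical illustration, reusing whichever fitted b0, b1, barX, barY are in scope from the earlier cells, the fitted value at $X=\overline{X}$ equals $\overline{Y}$ up to floating-point rounding:

# Fitted value at the mean of X equals the mean of Y
print('Yhat at X = Xbar: %.5f' % (b0 + b1*barX))
print('Ybar:             %.5f' % barY)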

Part (b)

The normal error regression model is as follows:

\begin{equation} Y_i=\beta_0+\beta_1X_i+\epsilon_i \end{equation}

where $Y_i$ is the observed response in the $i$th trial

$X_i$ is the level of the predictor variable in the $i$th trial

$\beta_0$ and $\beta_1$ are parameters

$\epsilon_i$ are independent $N(0,\sigma^2)$

$i=1,\dots,n$

Maximum likelihood estimator for $\sigma^2$ is

$\hat{\sigma}^2 =\frac{\sum{(Y_i-\hat{Y_i})^2}}{n}$, which divides the error sum of squares by $n$ rather than by its $n-2$ degrees of freedom; $\hat{\sigma}^2$ is therefore a biased estimator of $\sigma^2$.

The unbiased estimator for $\sigma^2$ based on the least squares fit is

$s^2=\frac{\sum{(Y_i-\hat{Y_i})^2}}{n-2}$, which divides the error sum of squares by its $n-2$ degrees of freedom; $s^2$ is an unbiased estimator of $\sigma^2$.
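The bias can also be seen numerically. Below is a small simulation sketch under the normal error model, reusing numpy imported earlier; the values of $\beta_0$, $\beta_1$, $\sigma$, $n$ and the number of replications are illustrative, not taken from the homework data. Averaging over many simulated datasets, $SSE/n$ comes out below the true $\sigma^2$ while $SSE/(n-2)$ does not.

# Simulation sketch: compare the averages of SSE/n and SSE/(n-2) to the true
# sigma^2 under the normal error regression model (illustrative parameters)
rng = np.random.RandomState(0)
beta0_true, beta1_true, sigma_true, n_sim = 10.0, 2.0, 3.0, 20
x_sim = np.linspace(0.0, 10.0, n_sim)

mle_vals, s2_vals = [], []
for _ in range(5000):
    y_sim = beta0_true + beta1_true*x_sim + rng.normal(0.0, sigma_true, n_sim)
    b1_sim = np.sum((x_sim - x_sim.mean())*(y_sim - y_sim.mean()))/np.sum((x_sim - x_sim.mean())**2)
    b0_sim = y_sim.mean() - b1_sim*x_sim.mean()
    sse_sim = np.sum((y_sim - b0_sim - b1_sim*x_sim)**2)
    mle_vals.append(sse_sim/n_sim)
    s2_vals.append(sse_sim/(n_sim - 2))

print('true sigma^2         = %.2f' % (sigma_true**2))
print('average of SSE/n     = %.2f  (biased low)' % np.mean(mle_vals))
print('average of SSE/(n-2) = %.2f  (approximately unbiased)' % np.mean(s2_vals))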