# import the required packages
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt
# read the data
from read_data_fx import read_tb_data
data = read_tb_data(1, 27)
# Age is X (predictor variable)
X = data[:, 1]
# Muscle mass is Y (response variable)
Y = data[:, 0]
barX=np.mean(X); barY=np.mean(Y)
XminusbarX=X-barX; YminusbarY=Y-barY
b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0=barY-b1*barX
Yhat=b0+b1*X
e_i=Y-Yhat # Residuals
sum_of_residuals = np.sum(e_i)
sum_of_squares_of_residuals = np.sum(e_i**2)
print('check all six properties of the fitted line')
print('sum of residuals = %4.4f' % sum_of_residuals, ' ---property 1')
print('sum of squares of residuals = %4.4f' % sum_of_squares_of_residuals, ' ---property 2')
print('sum of Y_i = %4.4f' % np.sum(Y), ' ---property 3')
print('sum of Yhat = %4.4f' % np.sum(Yhat), ' ---property 3')
print('sum of X_i e_i = %4.4f' % np.sum(X*e_i), ' ---property 4')
print('sum of Yhat e_i = %4.4f' % np.sum(Yhat*e_i), ' ---property 5')
%matplotlib inline
plt.figure(figsize=(20,10))
# scatter plot
plt.scatter(X,Y,c='g',s=220,marker='s')
# xlabel of the scatter plot
plt.xlabel('Age',fontsize=25)
# ylabel of the scatter plot
plt.ylabel('Muscle Mass',fontsize=25)
# title of the scatter plot
plt.title('Age vs. Muscle Mass',fontsize=25)
# regression line
# point barX and barY
plt.plot(barX,barY,marker='*',markersize=70,color='b')
#regression line
plt.plot(X,Yhat,'r-',linewidth=2)
print('Blue point represents the point (Xbar, Ybar) -- property 6')
print('b0=%4.5f' % b0, 'and', 'b1=%4.5f' % b1)
The linear regression function appears to be a good fit: we observe a linear decrease in muscle mass with age. The negative value of $b_1$ supports the conclusion that muscle mass decreases with age.
We know $\hat{Y}_i= b_0+b_1{X}_i$ and $\hat{Y}_{i+1}-\hat{Y}_i = b_1({X}_{i+1}-X_i)$. The age difference is given as one year, i.e., ${X}_{i+1}-X_i=1$, so in this case $\hat{Y}_{i+1}-\hat{Y}_i = b_1$. Therefore, the difference in mean muscle mass for women differing in age by one year is $b_1=-1.19$.
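As a quick numerical confirmation (a sketch using the estimates computed above; the age of 50 below is an arbitrary choice), the fitted values at any two ages one year apart differ by exactly $b_1$:
# fitted difference for a one-year age gap equals the slope b1
age = 50  # any age works; the difference along a line is constant
diff = (b0 + b1*(age + 1)) - (b0 + b1*age)
print('fitted difference for a one-year age gap = %4.5f' % diff)  # equals b1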
This is given by $b_0 + 60\,b_1$:
Yhat_60 = b0 + 60*b1
print('Point estimate of the mean muscle mass for women aged X=60 years is %4.2f' % Yhat_60)
print('value of the residual for the eighth case = %4.3f' % e_i[7])
The point estimate for $\sigma^2$ is given by
$s^2=MSE=\frac{SSE}{n-2}=\frac{\sum{(Y_i-\hat{Y_i})^2}}{n-2}$
n = len(X); SSE = sum_of_squares_of_residuals; MSE = SSE/(n-2)
print('point estimate of sigma^2 = %4.1f' % MSE)
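As a cross-check, the same quantities can be obtained from the statsmodels package imported above. This is a sketch, not part of the original hand computation, but the numbers should agree:
# Cross-check b0, b1, and MSE with statsmodels' ordinary least squares
ols_fit = sm.OLS(Y, sm.add_constant(X)).fit()  # add_constant supplies the intercept column
print('statsmodels [b0, b1] =', ols_fit.params)  # should match b0 and b1 above
print('statsmodels MSE =', ols_fit.mse_resid)    # SSE/(n-2), matches MSE above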
#read the data file
with open('temp1.txt','r') as f:
    lines = f.readlines()
# Copy the data into a structured numpy array, skipping the header row;
# split each line on tabs and cast the fields explicitly
xdata = np.zeros(len(lines)-1, dtype={'names':['temp', 'lat', 'long'], 'formats':['i4', 'f8', 'f8']})
for j in range(1, len(lines)):
    line = lines[j].strip('\n').split('\t')
    xdata[j-1] = (int(line[1]), float(line[2]), float(line[3]))
# latitude is X (predictor variable), temperature is Y (response variable), and Z is longitude (color)
X = xdata['lat']; Y = xdata['temp']; Z= xdata['long']
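# Aside: the same file could be read in one call with np.genfromtxt (a sketch;
# this assumes temp1.txt is tab-separated with a single header row, and
# usecols skips the leading index column, mirroring the loop above)
raw = np.genfromtxt('temp1.txt', delimiter='\t', skip_header=1, usecols=(1, 2, 3))
# raw[:, 0] is temperature, raw[:, 1] latitude, raw[:, 2] longitude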
# Estimate the regression function
barX = np.mean(X); barY = np.mean(Y)
XminusbarX = X - barX; YminusbarY = Y - barY
b1 = sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0 = barY - b1*barX
Yhat=b0+b1*X
print('b0=%4.5f' % b0, 'and', 'b1=%4.5f' % b1)
plt.figure(figsize=(20,10))
plt.scatter(X,Y,c =Z,s = 150,cmap='winter')
plt.xlabel('Latitude', fontsize=25)
plt.ylabel('Temperature',fontsize=25)
plt.title('Regression of temperature on latitude',fontsize=25)
cb=plt.colorbar()
cb.set_label('Longitude',fontsize=25)
# point barX and barY
#regression line
plt.plot(X,Yhat,'k-',linewidth=2)
plt.plot(barX,barY,marker='*',markersize=40,color='r')
# residuals
e_i = Y - Yhat
# Distribution of residuals (density-normalized histogram)
n_counts, bins, patches = plt.hist(e_i, 100, density=True, color='cyan')
plt.xlabel('Residuals',fontsize=25)
plt.ylabel('Density',fontsize=25)
Derivation of $b_0$ and $b_1$:
To find the estimates of the regression parameters $\beta_0$ and $\beta_1$ we use the method of least squares, minimizing the quantity
\begin{equation} Q=\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2. \end{equation}Suppose the minimum of $Q$ over $\beta_0$ and $\beta_1$ occurs at $b_0$ and $b_1$ respectively; then
$\frac{\partial{Q}}{\partial{\beta_0}} \bigg|_{\beta_0=b_0} =0$ and $\frac{\partial{Q}}{\partial{\beta_1}} \bigg|_{\beta_1=b_1} =0$ and we know that
\begin{equation} \frac{\partial{Q}}{\partial{\beta_0}}=-2\sum(Y_i-\beta_0-\beta_1X_i) \end{equation}\begin{equation} \frac{\partial{Q}}{\partial{\beta_1}}=-2\sum X_i(Y_i-\beta_0-\beta_1X_i). \end{equation}
Therefore,
\begin{equation}
-2\sum(Y_i-b_0-b_1X_i)=0
\tag{1}
\end{equation}
\begin{equation}
-2\sum X_i(Y_i-b_0-b_1X_i)=0
\tag{2}
\end{equation}
The two equations above can be simplified into
\begin{equation}
\sum Y_i= nb_0+b_1\sum X_i
\tag{3}
\end{equation}
\begin{equation}
\sum X_iY_i=b_0\sum X_i+b_1\sum {X_i}^2
\tag{4}
\end{equation}
From equation (3) we obtain
\begin{equation}
b_0=\frac{1}{n}\bigg(\sum Y_i-b_1\sum X_i\bigg)
\end{equation}
That is
\begin{equation} \boxed{b_0=\overline{Y}-b_1\overline{X}} \end{equation}
Substituting $b_0$ into equation (4), we obtain
$b_1=\frac{\sum{X_i}{Y_i}- \overline{Y}\sum{X_i}}{\sum {X_i}^2-\overline{X} \sum{X}_i} = \frac{\sum{X_i}{Y_i}-n\overline{X}\overline{Y}}{\sum {X_i}^2-n\overline{X}^2} = \frac{\sum({X_i}{Y_i}-X_i\overline{Y})+ \sum\overline{X}\overline{Y} -\sum Y_i\overline{X} }{\sum ({X_i}^2-X_i\overline{X})+ \sum\overline{X}^2-\sum X_i\overline{X} } $
This expression can be further simplified into
\begin{equation} \boxed{b_1=\frac{\sum(X_i-\overline{X})(Y_i-\overline{Y})}{\sum(X_i-\overline{X})^2} } \end{equation}
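A quick numerical sanity check of the boxed formulas (a sketch; np.polyfit performs a least squares fit, so its degree-1 coefficients should reproduce $b_0$ and $b_1$):
# Verify the closed-form estimates against numpy's least squares fit
slope, intercept = np.polyfit(X, Y, 1)  # coefficients are returned highest degree first
print('closed form : b0=%4.5f, b1=%4.5f' % (b0, b1))
print('np.polyfit  : b0=%4.5f, b1=%4.5f' % (intercept, slope))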
Residuals are defined as $e_i=Y_i-\hat{Y_i}$, where $\hat{Y_i}=b_0+b_1X_i$.
\begin{equation} \sum{e_i}=0 \end{equation}
Proof:
$\sum{e_i}=\sum(Y_i-\hat{Y_i})=\sum(Y_i-b_0-b_1X_i)=n\overline{Y}-nb_0-nb_1\overline{X}=n\overline{Y}-n(\overline{Y}-b_1\overline{X})-nb_1\overline{X}=0 $
$\sum e_i^2$ is a minimum
Proof:
$b_0$ and $b_1$ are those values of $\beta_0$ and $\beta_1$ respectively that minimize $\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2$. Therefore, $\sum_{i=1}^{n} (Y_i-b_0-b_1X_i)^2$ is a minimum, i.e., $\sum e_i^2$ is a minimum.
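This minimality can also be seen numerically: perturbing either coefficient away from $(b_0, b_1)$ can only increase the sum of squared residuals (a small illustrative sketch):
# SSE as a function of candidate coefficients; minimized at (b0, b1)
def sse(beta0, beta1):
    return np.sum((Y - beta0 - beta1*X)**2)

for d in (-0.5, -0.1, 0.1, 0.5):
    # perturb each coefficient separately; both perturbed SSEs exceed the minimum
    print(d, sse(b0 + d, b1) >= sse(b0, b1), sse(b0, b1 + d) >= sse(b0, b1))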
\begin{equation} \sum Y_i=\sum \hat{Y}_i \end{equation}
Proof:
From property 1 we know $\sum{e_i}=0$, i.e., $\sum{(Y_i-\hat{Y_i})}=0$, implying that
$\sum Y_i=\sum \hat{Y}_i$
\begin{equation} \sum{X_ie_i}=0 \end{equation}
Proof:
$\sum{X_ie_i}=\sum{X_i(Y_i-\hat{Y_i})}= \sum{X_i (Y_i-b_0-b_1X_i)}$
We see from Eq. (2) above that $\sum{X_i (Y_i-b_0-b_1X_i)}=0$, therefore we can conclude that $\sum{X_ie_i}=0$
\begin{equation} \sum{\hat{Y_i}e_i}=0 \end{equation}
Proof:
$\sum{\hat{Y_i}e_i}=\sum{(b_0+b_1X_i)e_i } =b_0\sum {e_i} + b_1 \sum {X_ie_i} $
From property 1 and property 4 we know that $\sum {e_i}=0$ and $\sum {X_ie_i}=0$, hence
$\sum{\hat{Y_i}e_i}=0$
The regression line always goes through the point $(\overline{X},\overline{Y})$
Proof:
$\hat{Y_i}=b_0+b_1X_i$; substituting for $b_0$ gives $\hat{Y_i}=\overline{Y}-b_1 \overline{X} +b_1X_i$, i.e.,
$\hat{Y_i}=\overline{Y}+b_1(X_i-\overline{X})$
This equation is satisfied by $(\overline{X},\overline{Y})$; therefore the regression line $\hat{Y_i}=b_0+b_1X_i$ always goes through $(\overline{X},\overline{Y})$.
The normal error regression model is as follows:
\begin{equation} Y_i=\beta_0+\beta_1X_i+\epsilon_i \end{equation}where $Y_i$ is the observed response in the $i$th trial,
$X_i$ is the level of the predictor variable in the $i$th trial,
$\beta_0$ and $\beta_1$ are parameters,
$\epsilon_i$ are independent $N(0,\sigma^2)$ errors, and
$i=1,\dots,n$.
The maximum likelihood estimator of $\sigma^2$ is
$\hat{\sigma}^2 =\frac{\sum{(Y_i-\hat{Y_i})^2}}{n}$, which is biased: it divides the sum of squared residuals, which carries only $n-2$ degrees of freedom, by $n$.
The unbiased estimator of $\sigma^2$, based on the same least squares residuals, is
$s^2=\frac{\sum{(Y_i-\hat{Y_i})^2}}{n-2}$, with $n-2$ degrees of freedom.
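A small simulation sketch under the normal error model (with hypothetical values $\beta_0=2$, $\beta_1=0.5$, $\sigma^2=4$, $n=15$) illustrates the bias: averaged over many simulated samples, $\hat{\sigma}^2$ systematically underestimates $\sigma^2$, while $s^2$ does not.
# Simulate the normal error model repeatedly and compare the two estimators
rng = np.random.default_rng(0)
beta0, beta1, sigma2, n_sim = 2.0, 0.5, 4.0, 15  # hypothetical parameters
x_sim = np.linspace(0, 10, n_sim)
mle_vals, s2_vals = [], []
for _ in range(5000):
    y_sim = beta0 + beta1*x_sim + rng.normal(0, np.sqrt(sigma2), n_sim)
    s1, c1 = np.polyfit(x_sim, y_sim, 1)   # least squares fit
    sse_sim = np.sum((y_sim - c1 - s1*x_sim)**2)
    mle_vals.append(sse_sim/n_sim)         # biased MLE: divides by n
    s2_vals.append(sse_sim/(n_sim - 2))    # unbiased: divides by n-2
print('true sigma^2 =', sigma2)
print('mean MLE estimate =', np.mean(mle_vals))  # about sigma2*(n-2)/n
print('mean s^2 estimate =', np.mean(s2_vals))   # about sigma2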