--- title: "Math 50 Fall 2017, Homework #6" ---
__NOTE: For your homework download and use the template__ (https://math.dartmouth.edu/~m50f17/HW6.Rmd) __Read the green comments in the rmd file to see where your answers should go.__


#### Lets first look at the scatter plot for the windmill data, and visually check the straight line fit. ```{r} windmill <- read.table("https://math.dartmouth.edu/~m50f17/windmill.csv", header=T) plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current") fit <- lm(DC~velocity, data = windmill) abline(fit$coefficients, col="red") ``` #### The summary statistics are below, $R^2$ is about 0.87. The residual plot below suggests that the relation might be non-linear. When you look at the above scatter diagram one might think the straight line model seems OK, however the residual plot below amplifies the nonlinearity. Why? Can we also see this by carefully looking at the scatter plot above? ```{r} summary(fit) plot(fitted.values(fit), rstudent(fit), xlab = "y", ylab = "R-Student residuals", main = "Windmill - Residual Plot") abline(c(0,0), col="red") ``` #### Also note that it looks like that there is a potential outlier and however this might change when we fix the model. It seems consistent with the rest (visually). Start with fitting a quadratic model. ```{r} fit2 <- lm(DC~poly(velocity, degree = 2), data = windmill) summary(fit2) plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current") lines(sort(windmill$velocity), fitted(fit2)[order(windmill$velocity)], col='red') ``` #### This seems to fix the curved nature of the data, however the application domain suggests to use a model of the form $$ y = \beta_0 + \beta_1 \frac{1}{x} + \varepsilon $$ Note that there doesn't seem a potential outlier in the new model. ```{r} velRep = 1/windmill$velocity DC <- windmill$DC plot(velRep, windmill$DC, xlab = "1/velocity", ylab = "DC current") fit3 <- lm(DC~velRep) abline(fit3$coefficients, col="red") summary(fit3) plot(fitted.values(fit), rstudent(fit3), xlab = "fitted values", ylab = "Studentized residuals", main = "Residuals - reciprocal model") abline(c(0,0), col="red") ```


## Question-1 Recall the phytoplankton population data is given at : https://math.dartmouth.edu/~m50f17/phytoplankton.csv where headers are - pop : population of phytoplankton ($y$) - subs2 : concentration of substance-2 ($x$) (a) Plot the scatter diagram for pop ~ subs2. Do you think a straight line model is adequate? Fit a straight line model and support your argument with summary statistics. (b) Do you suggest to use Box-Cox method? If not explain, if so apply the method and demonstrate the improvement. (c) An analyst suggests to use the following model $$ y = \beta_0 + \beta_1 (x-4.5)^2 $$ Using transformations, fit a simple linear regression model. Plot the scatter diagram and fitted curve (Note: it is not a straight line in this case). Compare $MS_{res}$, $R^2$ and the R-student residual plots with the model in part a. (d) Construct the probability plot for part (c). Is there a problem with the normality assumption? If so determine the problem (heavy tailed, light tailed, or something else) ### Answer: ```{r} pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",") pop <- pData$pop subs1 <- pData$subs1 subs2 <- pData$subs2 fitted=lm(pop~subs2) plot (subs2, pop) abline(fitted$coefficients) qqnorm(rstudent(fitted), datax = TRUE , main="Normal Probability Plot") qqline(rstudent(fitted), datax = TRUE ) summary(fitted) cat("Part a. Straight line model doesn't seem to be adequate which can be seen visually, also verly low R2.") library(MASS) bc <- boxcox(pop ~ subs2, lambda = seq(1,6,0.1)) lambda <- bc$x[which.max(bc$y)] fittedBC=lm(pop^lambda ~subs2) summary(fittedBC) plot (subs2, pop^lambda) abline(fittedBC$coefficients) plot(predict(fitted), rstudent(fitted) ) plot(predict(fittedBC), rstudent(fittedBC) ) cat("Part b. Both normality and constant variance seems violated from the plots, however these might be due to fitting a straight line model to a nonlinear relation. After trying Box-Cox, even though R2 improved significantly we can't see improvement in normality, also constant variance assumption seems worse after boxcox, and still not helpful for the nonlinear relation.") s2Rep = (subs2-4.5)^2 popT <- pop fit2 <- lm ( popT ~ s2Rep^2 ) summary(fit2) plot (subs2, popT) lines(sort(subs2), fitted(fit2)[order(subs2)], col='red') plot(fitted.values(fit2), rstudent(fit2), xlab = "y", ylab = "R-Student residuals") abline(c(0,0), col="red") summary(fit2) cat("Part c. With the new model, plots suggests improvement in the fitting, significant increase in R2 and improvement in constant variance issues. However normality is still a problem. See below") qqnorm(rstudent(fit2), datax = TRUE , main="Normal Probability Plot") qqline(rstudent(fit2), datax = TRUE ) cat("Part d. Normality assumption seems to be violated. It is heavy tailed on the left and light tailed on the right. ") ```


## Question-2 Chapter 5, Problem 2 all parts. ### Answer: ```{r} library(MPV) data(p5.2) pressure <-p5.2$vapor temperature <- p5.2$temp plot(temperature, pressure) cat("Part a. A Clear nonlinear relations.") fitT <- lm( pressure ~ temperature) plot(temperature,pressure) abline(fitT$coefficients) plot( predict(fitT), rstudent(fitT), main = "R Student Residuals vs y") abline(c(0,0), col="red") summary(fitT) cat("Part b. We can see nonlinear relation in the residuals also, normality assumption doesn't seem to be valid. Straight line model is inadequate overall.") tInv <- (1/temperature) logPressure <- log(pressure) fitTInv <- lm( logPressure ~ tInv) plot(tInv,logPressure) abline(fitTInv$coefficients) plot( predict(fitTInv), rstudent(fitTInv), main = "R Student Residuals vs y (Temp Inv)") abline(c(0,0), col="red") summary(fitTInv) cat("Part c. Fitting seems better with this model, significant improvement in R2 and overall fit visually. Non linear relation is now seems to be tranformed into a linear relation. However residuals doesn't seem to be normal and independently distributed, seems correllated (however this might be result of the nonlinear relation even after the transformation). ") ```


## Question-3 Chapter 5, Problem 3 all parts. Note: In part (c) consider natural log of the ~~minutes~~ number of bacteria. ### Answer: ```{r} data(p5.3) numBacteria <- p5.3$bact minutes <- p5.3$min plot(minutes, numBacteria) cat("Part a. There might be a slightly nonlinear relation between minutes and number of bacteria. Which will be more apparent below. ") fit <- lm( numBacteria ~ minutes ) plot(minutes , numBacteria) abline(fit$coefficients) plot( predict(fit), rstudent(fit), main = "R Student Residuals vs y") abline(c(0,0), col="red") summary(fit) qqnorm(rstudent(fit), datax = TRUE , main="Normal Probability Plot") qqline(rstudent(fit), datax = TRUE ) cat("Part b. There is a clear outlier. Normality assumption seems violated, nonlinear pattern more visible in residual plot. ") fitted <- lm( log(numBacteria) ~ (minutes) ) plot( predict(fitted), rstudent(fitted), main = "R Student Residuals vs y") abline(c(0,0), col="red") summary(fitted) plot((minutes), log(numBacteria)) abline(fitted$coefficients) qqnorm(rstudent(fitted), datax = TRUE , main="Normal Probability Plot") qqline(rstudent(fitted), datax = TRUE ) cat("Part c. The plots and summary shows much better fitting with high R2. Residual plot seems fine and fitting is satisfactory. Normality assumption also seems fine.") ```