- No meetup next week.
- There will not be a lab or homework for the Bayesian unit.
- Project proposals have been graded. Final projects are due May 15th.
April 17, 2019
data(allbacks, package='DAAG') head(allbacks)
## volume area weight cover ## 1 885 382 800 hb ## 2 1016 468 950 hb ## 3 1125 387 1050 hb ## 4 239 371 350 hb ## 5 701 371 750 hb ## 6 641 367 600 hb
From: Maindonald, J.H. & Braun, W.J. (2007). Data Analysis and Graphics Using R, 2nd ed.
lm.out <- lm(weight ~ volume, data=allbacks)
\[ \hat{weight} = 108 + 0.71 volume \] \[ R^2 = 80\% \]
summary(lm.out)
## ## Call: ## lm(formula = weight ~ volume, data = allbacks) ## ## Residuals: ## Min 1Q Median 3Q Max ## -189.97 -109.86 38.08 109.73 145.57 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 107.67931 88.37758 1.218 0.245 ## volume 0.70864 0.09746 7.271 6.26e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 123.9 on 13 degrees of freedom ## Multiple R-squared: 0.8026, Adjusted R-squared: 0.7875 ## F-statistic: 52.87 on 1 and 13 DF, p-value: 6.262e-06
lm.out2 <- lm(weight ~ volume + cover, data=allbacks) summary(lm.out2)
## ## Call: ## lm(formula = weight ~ volume + cover, data = allbacks) ## ## Residuals: ## Min 1Q Median 3Q Max ## -110.10 -32.32 -16.10 28.93 210.95 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 197.96284 59.19274 3.344 0.005841 ** ## volume 0.71795 0.06153 11.669 6.6e-08 *** ## coverpb -184.04727 40.49420 -4.545 0.000672 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 78.2 on 12 degrees of freedom ## Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154 ## F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
\[ \hat{weight} = 198 + 0.72 volume - 184 coverpb \]
For hardcover books: plug in 0 for cover.
\[ \hat{weight} = 197.96 + 0.72 volume - 184.05 \times 0 = 197.96 + 0.72 volume \]
For paperback books: plut in 1 for cover. \[ \hat{weight} = 197.96 + 0.72 volume - 184.05 \times 1 \]
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 197.9628 | 59.1927 | 3.34 | 0.0058 |
volume | 0.7180 | 0.0615 | 11.67 | 0.0000 |
coverpb | -184.0473 | 40.4942 | -4.55 | 0.0007 |
poverty <- read.table("../Data/poverty.txt", h = T, sep = "\t") names(poverty) <- c("state", "metro_res", "white", "hs_grad", "poverty", "female_house") poverty <- poverty[,c(1,5,2,3,4,6)] head(poverty)
## state poverty metro_res white hs_grad female_house ## 1 Alabama 14.6 55.4 71.3 79.9 14.2 ## 2 Alaska 8.3 65.6 70.8 90.6 10.8 ## 3 Arizona 13.3 88.2 87.7 83.8 11.1 ## 4 Arkansas 18.0 52.5 81.0 80.9 12.1 ## 5 California 12.8 94.4 77.5 81.1 12.6 ## 6 Colorado 9.4 84.5 90.2 88.7 9.6
From: Gelman, H. (2007). Data Analysis using Regression and Multilevel/Hierarchial Models. Cambridge University Press.
lm.poverty <- lm(poverty ~ female_house, data=poverty) summary(lm.poverty)
## ## Call: ## lm(formula = poverty ~ female_house, data = poverty) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.7537 -1.8252 -0.0375 1.5565 6.3285 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.3094 1.8970 1.745 0.0873 . ## female_house 0.6911 0.1599 4.322 7.53e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.664 on 49 degrees of freedom ## Multiple R-squared: 0.276, Adjusted R-squared: 0.2613 ## F-statistic: 18.68 on 1 and 49 DF, p-value: 7.534e-05
\(R^2\) can be calculated in three ways:
Using ANOVA we can calculate the explained variability and total variability in y.
anova.poverty <- anova(lm.poverty) print(xtable(anova.poverty, digits = 2), type='html')
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
female_house | 1.00 | 132.57 | 132.57 | 18.68 | 0.00 |
Residuals | 49.00 | 347.68 | 7.10 |
Sum of squares of y: \({ SS }_{ Total }=\sum { { \left( y-\bar { y } \right) }^{ 2 } } =480.25\) → total variability
Sum of squares of residuals: \({ SS }_{ Error }=\sum { { e }_{ i }^{ 2 } } =347.68\) → unexplained variability
Sum of squares of x: \({ SS }_{ Model }={ SS }_{ Total }-{ SS }_{ Error } = 132.57\) → explained variability
\[ R^2 = \frac{explained \quad variability \quad in \quad y}{total \quad variability \quad in \quad y} = \frac{132.57}{480.25} = 0.28 \]
lm.poverty2 <- lm(poverty ~ female_house + white, data=poverty) print(xtable(lm.poverty2), type='html')
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -2.5789 | 5.7849 | -0.45 | 0.6577 |
female_house | 0.8869 | 0.2419 | 3.67 | 0.0006 |
white | 0.0442 | 0.0410 | 1.08 | 0.2868 |
anova.poverty2 <- anova(lm.poverty2) print(xtable(anova.poverty2, digits = 2), type='html')
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
female_house | 1.00 | 132.57 | 132.57 | 18.74 | 0.00 |
white | 1.00 | 8.21 | 8.21 | 1.16 | 0.29 |
Residuals | 48.00 | 339.47 | 7.07 |
\[ R^2 = \frac{explained \quad variability \quad in \quad y}{total \quad variability \quad in \quad y} = \frac{132.57 + 8.21}{480.25} = 0.29 \]
Does adding the variable white
to the model add valuable information that wasn’t provided by female_house
?
poverty vs % female head of household
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 3.3094 | 1.8970 | 1.74 | 0.0873 |
female_house | 0.6911 | 0.1599 | 4.32 | 0.0001 |
poverty vs % female head of household and % female household
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -2.5789 | 5.7849 | -0.45 | 0.6577 |
female_house | 0.8869 | 0.2419 | 3.67 | 0.0006 |
white | 0.0442 | 0.0410 | 1.08 | 0.2868 |
Note the difference in the estimate for female_house
.
Two predictor variables are said to be collinear when they are correlated, and this collinearity complicates model estimation.
Remember: Predictors are also called explanatory or independent variables. Ideally, they would be independent of each other.
We don’t like adding predictors that are associated with each other to the model, because often times the addition of such variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. parsimonious model.
While it’s impossible to avoid collinearity from arising in observational data, experiments are usually designed to prevent correlation among predictors
Model | \(R^2\) | Adjusted \(R^2\) |
---|---|---|
Model 1 (Single-pridictor) | 0.28 | 0.26 |
Model 2 (Multiple) | 0.29 | 0.26 |
\[ { R }_{ adj }^{ 2 }={ 1-\left( \frac { { SS }_{ error } }{ { SS }_{ total } } \times \frac { n-1 }{ n-p-1 } \right) } \]
where n is the number of cases and p is the number of predictors (explanatory variables) in the model.
\[ \sigma \left( t \right) =\frac { { e }^{ t } }{ { e }^{ t }+1 } =\frac { 1 }{ 1+{ e }^{ -t } } \]
logistic <- function(t) { return(1 / (1 + exp(-t))) } df <- data.frame(x=seq(-4, 4, by=0.01)) df$sigma_t <- logistic(df$x) plot(df$x, df$sigma_t)
\[ t = \beta_0 + \beta_1 x \]
The logistic function can now be rewritten as
\[ F\left( x \right) =\frac { 1 }{ 1+{ e }^{ -\left( { \beta }_{ 0 }+\beta _{ 1 }x \right) } } \]
We use the same ordinary least squares procedure we used before. That is, we wish to minimize the sum of squared residuals.
study <- data.frame( Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50), Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1) ) lr.out <- glm(Pass ~ Hours, data=study, family=binomial) lr.out
## ## Call: glm(formula = Pass ~ Hours, family = binomial, data = study) ## ## Coefficients: ## (Intercept) Hours ## -4.078 1.505 ## ## Degrees of Freedom: 19 Total (i.e. Null); 18 Residual ## Null Deviance: 27.73 ## Residual Deviance: 16.06 AIC: 20.06
study[1,]
## Hours Pass ## 1 0.5 0
logistic <- function(x, b0, b1) { return(1 / (1 + exp(-1 * (b0 + b1 * x)) )) } logistic(.5, b0=-4.078, b1=1.505)
## [1] 0.03470667
Of course, the fitted
function will do the same:
study$fitted <- fitted(lr.out) study[1,]
## Hours Pass fitted ## 1 0.5 0 0.03471034
binomial_smooth <- function(...) { geom_smooth(method = "glm", method.args = list(family = "binomial"), ...) } ggplot(study, aes(x=Hours, y=Pass)) + geom_point() + binomial_smooth(se=FALSE)