RStudio Project Questions:
Project Task:
The dataset caschool.csv contains data on test performance,
school characteristics and student demographic backgrounds. In this project,
your goal is to understand the relationship between test scores and average
income across districts.
(a) Load the caschool.csv data set in R. Plot a scatterplot
of test scores (testscr) against average income (avginc). Calculate the
correlation between the two variables.
(b) Estimate a simple regression of testscr on avginc.
Report the estimated regression, including standard errors of estimated
coefficients and measures of fit. Are the coefficients significant? Interpret
the coefficients of the regression. Does the intercept in this regression
have a meaningful real-life interpretation?
(c) By looking at the scatterplot, do you think that the
assumption of linear relationship between testscr and avginc is appropriate?
Estimate a quadratic regression of testscr on avginc. Report the estimated
regression and test for significance of the coefficients. Interpret the
coefficient on avginc.
(d) Is the regression in part (c) better than the regression
in part (b)? Support your answer with a formal test procedure. Using the
preferred model, predict the increase in test scores for two cases: when the
average income increases from 10 to 12 and when it increases from 20 to 22.
Compare the two predicted values and comment.
(e) Many factors that may determine test scores are omitted
in regressions in parts (b) and (c). Might this be a problem, given that you
are only interested in the relationship between testscr and avginc? Discuss. You
may, if you wish, support your discussion by calculating correlations between
suitable pairs of variables.
(f) Run a least squares regression of testscr on avginc,
avginc^2, str, el_pct, calw_pct and meal_pct. Report the estimated regression.
Are the coefficients of the latter four variables jointly significant? Has the
value of estimated coefficients on avginc and avginc^2 changed compared to
regression in part (c)? What does this indicate? Relate your answers to your
discussion in part (e).
(g) Summarize briefly what you learned about the relationship of testscr and avginc.
Project 2:
a)
The correlation coefficient between average income (data$avginc) and test
scores (data$testscr) is 0.7124308. A value this far above 0 and this close to
1 indicates a strong positive linear association: districts with higher average
income tend to have higher test scores. The correlation measures the direction
and strength of the linear relationship between the two variables, but it does
not by itself show that higher income causes higher scores.
B)
The estimated intercept
of 625.3836 is the predicted value of testscr when avginc is zero.
This has no practical real-life interpretation, because no district has an
average income of zero and the value lies outside the range of the data. The
estimated slope coefficient of 1.8785 indicates that a one-unit increase in
avginc (which this dataset records in thousands of dollars) is associated with
an increase of 1.8785 points in testscr on average.
The standard errors of
the estimates allow us to check whether the coefficients are statistically
significant. The t-values and extremely low p-values show that both coefficients
are highly statistically significant. The measures of fit (residual standard
error and R-squared) indicate that the linear regression fits the data
reasonably well: the R-squared of 0.5076 means that approximately 50.76% of the
variation in testscr is explained by the model.
Overall, this linear
regression models the relationship between average income and test scores. Both
the intercept and slope coefficients are statistically significant. The
positive slope shows that higher income is associated with higher test scores
on average based on this data.
c)
First, a scatterplot of
the data is created. Visual inspection suggests that while there is an overall
positive relationship between avginc and testscr, the assumption of a linear
relationship may not be entirely appropriate. There is some curvature in the
pattern of points: test scores rise with income but flatten out at higher
income levels, so a quadratic relationship may describe the data better.
To test this, a quadratic
regression model was estimated with testscr as the dependent variable and
avginc and its square (avginc^2) as independent variables.
The results show that both avginc and avginc^2 are highly statistically significant based on very small reported p-values. This suggests that including the quadratic term improves the fit of the model compared to a simple linear regression.
The positive coefficient
on avginc indicates that test scores increase with income, while the negative
coefficient on avginc^2 means that they do so at a declining rate: each
additional unit of income raises the predicted score by less than the previous
one, so the relationship levels off at higher income levels. Because of the
squared term, the coefficient on avginc is no longer the effect of a one-unit
increase at every income level. It is the slope of the fitted curve at
avginc = 0, and the slope at any other level is the coefficient on avginc plus
twice the coefficient on avginc^2 times avginc. Compared to the linear
regression, this quadratic model explains slightly more variation in test
scores, as evidenced by the higher adjusted R-squared of 0.554. The F-statistic
further confirms that the overall regression fits significantly better than
the intercept-only model.
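The declining slope of a quadratic fit can be made concrete with a small sketch. It does not use caschool.csv: the data are simulated with made-up coefficients, and marginal_effect is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: slope of b0 + b1*x + b2*x^2 evaluated at x.
marginal_effect <- function(b1, b2, x) b1 + 2 * b2 * x

# Simulated stand-in data (made-up coefficients, NOT the caschool estimates).
set.seed(1)
avginc  <- runif(200, min = 5, max = 55)
testscr <- 600 + 4 * avginc - 0.04 * avginc^2 + rnorm(200, sd = 5)

fit <- lm(testscr ~ avginc + I(avginc^2))
b   <- coef(fit)

# Slope of the fitted curve at three income levels.
slopes <- marginal_effect(b[["avginc"]], b[["I(avginc^2)"]], c(10, 20, 40))
slopes
```

With a negative coefficient on the squared term, the slope shrinks as avginc grows, which is the "declining rate" described above.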
d)
To determine which model
(linear or quadratic) is a better fit for the relationship between testscr and
avginc, an ANOVA was performed to formally compare the two nested models. The
ANOVA results show the quadratic model provides a significantly better fit
compared to the linear model based on the extremely small p-value. This
suggests allowing for a nonlinear term via the quadratic specification improves
the fit over the simple linear regression.
Using the preferred
quadratic model, we can predict the change in test scores for two example
increases in average income. When income increases from 10 to 12, the predicted
increase in test scores is 5.8405 points (647.4213 - 641.5808). For an increase
from 20 to 22, the predicted increase is smaller, at 4.1481 points (671.5463 -
667.3982). This makes intuitive sense given the quadratic nature of the
relationship: additional income has diminishing returns at higher income
levels. Overall, the quadratic model fits the data significantly better
according to the ANOVA results, and its prediction of smaller test score gains
at higher income levels is consistent with the curvature seen in the
scatterplot.
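The two predicted gains can be checked directly from the four predicted scores reported in this answer. The sketch below only redoes that subtraction; the variable names are illustrative.

```r
# Predicted test scores at avginc = 10, 12, 20 and 22, as reported above.
pred <- c("10" = 641.5808, "12" = 647.4213, "20" = 667.3982, "22" = 671.5463)

gain_low  <- pred[["12"]] - pred[["10"]]   # income rises from 10 to 12
gain_high <- pred[["22"]] - pred[["20"]]   # income rises from 20 to 22

round(c(low = gain_low, high = gain_high), 4)
```

The first difference is 5.8405 and the second is 4.1481, so the same two-unit rise in income is predicted to add fewer points at the higher income level.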
e)
Yes, omitting other
relevant factors that influence test scores could potentially be a problem for
the regressions in parts (b) and (c). While they successfully model the
relationship between test scores and average income, other variables excluded
from the model may partially or fully explain some of the observed variation in
scores.
For example, calculating
correlations shows that test scores are also strongly correlated with the
percentage of English learners (el_pct), the percentage of students qualifying
for CalWorks income assistance (calw_pct), and the percentage qualifying for
reduced-price lunch (meal_pct), not only with average income.
This suggests factors
like English proficiency and household poverty likely also affect test
performance. Omitting a factor biases the income coefficient only if that
factor both influences test scores and is correlated with average income, and
poverty measures such as calw_pct and meal_pct plausibly satisfy both
conditions. Excluding them means the regressions may overstate the influence of
income, because some of the effect currently attributed to income may actually
be due to the excluded factors. This is omitted variable bias. To get a fuller
picture, a multiple regression that includes the student-teacher ratio, the
share of English learners and the poverty measures alongside income could
provide a more accurate model of the influences on achievement.
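The mechanism can be illustrated with a small simulation. This is a sketch on made-up data, not on caschool.csv: poverty, income and score are hypothetical variables, and the true income effect is set to 1 by construction.

```r
# Simulated illustration of omitted variable bias (all values are made up).
set.seed(42)
n <- 500
poverty <- rnorm(n)                                   # factor left out of the short model
income  <- 20 - 3 * poverty + rnorm(n)                # income is lower where poverty is higher
score   <- 650 + 1 * income - 5 * poverty + rnorm(n, sd = 3)

# Short regression omits poverty; long regression controls for it.
short <- coef(lm(score ~ income))[["income"]]
long  <- coef(lm(score ~ income + poverty))[["income"]]

c(short = short, long = long)
```

Because poverty lowers scores and is negatively correlated with income, the short regression credits income with part of poverty's effect, so its coefficient comes out well above the true value of 1, while the long regression recovers a value close to 1.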
f)
In this multiple
regression model, testscr is regressed on avginc, avginc^2, str, el_pct, calw_pct,
and meal_pct. The coefficients for el_pct, str, and meal_pct are statistically
significant based on their reported p-values. However, the coefficients for
avginc, avginc^2, and calw_pct are not significant. A joint F-test shows that
the coefficients for the latter four variables (str, el_pct, calw_pct,
meal_pct) are highly statistically significant together.
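A joint F-test of this kind compares a restricted model (without the tested variables) to the unrestricted one. The sketch below uses simulated data with hypothetical variables x1, x2, x3 rather than the caschool columns, and checks that the F-statistic built by hand from the two residual sums of squares matches the one anova() reports.

```r
# Simulated data: x2 and x3 play the role of the jointly tested regressors.
set.seed(7)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + 0.5 * x3 + rnorm(n)

restricted   <- lm(y ~ x1)
unrestricted <- lm(y ~ x1 + x2 + x3)

# Built-in comparison of the nested models.
cmp <- anova(restricted, unrestricted)

# Same statistic by hand: F = ((RSS_r - RSS_u) / q) / (RSS_u / df_u)
rss_r <- sum(resid(restricted)^2)
rss_u <- sum(resid(unrestricted)^2)
q     <- 2                                   # number of restrictions tested
f_manual <- ((rss_r - rss_u) / q) / (rss_u / df.residual(unrestricted))

c(manual = f_manual, anova = cmp$F[2], p_value = cmp[["Pr(>F)"]][2])
```

Note that the overall F-statistic printed by summary() tests all slope coefficients at once, which is a different hypothesis from the joint test of a subset.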
Compared to the quadratic
regression in part (c), the estimated coefficients on avginc and avginc^2 have
changed and are no longer statistically significant at the 1% significance
level. This suggests that once the additional covariates are controlled for,
the apparent relationship between test scores and average income is
substantially weakened.
As discussed in part (e),
omitting these explanatory variables could lead to overstating the effect of
income alone due to potential omitted variable bias. The multiple regression
addresses this limitation by controlling for the student-teacher ratio, the
share of English learners and the poverty measures simultaneously. This
provides a more accurate and comprehensive understanding of the determinants of
test scores than the bivariate regressions. The results support the argument
that a simplistic bivariate analysis could mischaracterize the relationship.
g)
Initially, a simple
linear regression suggested a positive statistically significant relationship
between test scores and average income, with higher income associated with
higher scores. However, this model did not control for any other factors. Examining
a scatterplot revealed the relationship may be better characterized as
quadratic. Estimating a quadratic regression showed income has diminishing
returns on scores at higher levels, and provided a better statistical fit. However,
correlation analysis illustrated other variables like socioeconomic status are
also related to both test scores and income. Failing to account for these could
bias the apparent effect of income found in earlier regressions.
Indeed, when a multiple
regression controlled for the student-teacher ratio, the share of English
learners, CalWorks eligibility and reduced-price lunch eligibility, the
estimated effects of income were no longer statistically significant. This
indicates the simple bivariate relationship was misleading once confounding was
addressed. Overall, while higher income was initially linked to higher test
scores, the true relationship is more complex, as other socioeconomic factors
play a role. Average income alone does not fully determine test performance. A
multiple regression approach provided the clearest picture of how income
relates to scores after controlling for relevant covariates.
R code:
# A
# Load the caschool.csv dataset
data <- read.csv("caschool.csv")
# Plot scatterplot of test scores against average income
plot(data$avginc, data$testscr, xlab = "Average Income", ylab = "Test Scores")
# Calculate correlation
correlation <- cor(data$avginc, data$testscr)
correlation
# B
# Simple regression of testscr on avginc
regression <- lm(testscr ~ avginc, data = data)
# Report estimated regression (coefficients, standard errors, R-squared)
summary(regression)
# C
# Scatterplot
plot(data$avginc, data$testscr, xlab = "Average Income", ylab = "Test Scores")
# Quadratic regression of testscr on avginc
quadratic_regression <- lm(testscr ~ avginc + I(avginc^2), data = data)
# Report estimated regression
summary(quadratic_regression)
# D
# Comparing the two nested models using ANOVA (F-test)
anova_result <- anova(regression, quadratic_regression)
anova_result
# Predicting test scores at avginc = 10, 12, 20 and 22
new_data <- data.frame(avginc = c(10, 12, 20, 22))
predicted_scores <- predict(quadratic_regression, newdata = new_data)
predicted_scores
# Predicted increases: 10 -> 12 and 20 -> 22
predicted_scores[c(2, 4)] - predicted_scores[c(1, 3)]
# E
# Calculate correlations between testscr, avginc, and other variables
correlations <- cor(data[, c("testscr", "avginc", "str", "el_pct", "calw_pct", "meal_pct")])
correlations
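# F
# Least squares regression of testscr on avginc, avginc^2, str, el_pct, calw_pct, and meal_pct
multiple_regression <- lm(testscr ~ avginc + I(avginc^2) + str + el_pct + calw_pct + meal_pct, data = data)
# Report estimated regression
summary(multiple_regression)
# Joint significance of str, el_pct, calw_pct and meal_pct: F-test of the
# restricted quadratic model against the full model. The fstatistic element of
# summary() tests all slopes at once, not only these four.
restricted_regression <- lm(testscr ~ avginc + I(avginc^2), data = data)
joint_significance <- anova(restricted_regression, multiple_regression)
joint_significance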







