RStudio Project Questions:


Project Task:

Dataset caschool.csv contains data on test performance, school characteristics and student demographic backgrounds. In this project, your goal is to understand the relationship between test scores and average income across districts.

(a) Load the caschool.csv data set in R. Plot a scatterplot of test scores (testscr) against average income (avginc). Calculate the correlation between the two variables.

(b) Estimate a simple regression of testscr on avginc. Report the estimated regression, including standard errors of estimated coefficients and measures of fit. Are the coefficients significant? Interpret the coefficients of the regression. Does the intercept in this regression have a meaningful real-life interpretation?

(c) By looking at the scatterplot, do you think that the assumption of linear relationship between testscr and avginc is appropriate? Estimate a quadratic regression of testscr on avginc. Report the estimated regression and test for significance of the coefficients. Interpret the coefficient on avginc.

(d) Is the regression in part (c) better than the regression in part (b)? Support your answer with a formal test procedure. Using the preferred model, predict the increase in test scores for two cases: when the average income increases from 10 to 12 and when it increases from 20 to 22. Compare the two predicted values and comment.

(e) Many factors that may determine test scores are omitted in regressions in parts (b) and (c). Might this be a problem, given that you are only interested in the relationship between testscr and avginc? Discuss. You may, if you wish, support your discussion by calculating correlations between suitable pairs of variables.

(f) Run a least squares regression of testscr on avginc, avginc^2, str, el_pct, calw_pct and meal_pct. Report the estimated regression. Are the coefficients of the latter four variables jointly significant? Has the value of estimated coefficients on avginc and avginc^2 changed compared to regression in part (c)? What does this indicate? Relate your answers to your discussion in part (e).

(g) Summarize briefly what you learned about the relationship of testscr and avginc.


Project 2:

a)




The correlation coefficient between average income (avginc) and test scores (testscr) is 0.7124308. Because it is positive and fairly close to 1, it indicates a strong positive linear association: districts with higher average income tend to have higher test scores. The correlation describes only the direction and strength of the linear relationship between the two variables in the data.

b)



The estimated intercept of 625.3836 is the predicted value of testscr when avginc is zero. This has no practical real-life interpretation, because no district has an average income of zero and that value lies outside the range of the data. The estimated slope coefficient of 1.8785 indicates that a one-unit increase in avginc is associated with an increase in testscr of 1.8785 points on average.

The standard errors of the estimates are used to form a t-statistic for each coefficient. The large t-values and extremely low p-values show that both coefficients are statistically significant at conventional levels. As for measures of fit, the R-squared shows that approximately 50.76% of the variation in testscr is explained by avginc, so the linear model fits the data moderately well, while the residual standard error measures the typical size of the deviations of test scores from the fitted line.

Overall, this linear regression models the relationship between average income and test scores. Both the intercept and slope coefficients are statistically significant. The positive slope shows that higher income is associated with higher test scores on average based on this data.
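As a quick consistency check on part (b): in a regression with a single regressor and an intercept, the R-squared equals the square of the sample correlation between the two variables. The short sketch below only uses the correlation reported in part (a); the variable names are chosen for illustration.

```r
# Correlation between avginc and testscr reported in part (a)
r <- 0.7124308

# With one regressor plus an intercept, R-squared equals r squared
r_squared <- r^2

round(r_squared, 4)
# [1] 0.5076
```

The result agrees with the 50.76% of variation explained that is reported for the simple regression.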

c)


First, a scatterplot of the data is created. Upon visual inspection, it shows an overall positive relationship between avginc and testscr, but the trend is not perfectly linear: there is some curvature in the pattern of points. The assumption of a linear relationship between the two variables is therefore not entirely appropriate, and a quadratic relationship may describe the data better.




To test this, a quadratic regression model was estimated with testscr as the dependent variable and avginc and its square (avginc^2) as independent variables.

The results show that the coefficients on both avginc and avginc^2 are highly statistically significant, based on the very small reported p-values. This suggests that including the quadratic term improves the fit of the model compared to a simple linear regression.

In a quadratic model the coefficient on avginc cannot be interpreted on its own: the effect of a one-unit increase in avginc on testscr is approximately the coefficient on avginc plus twice the coefficient on avginc^2 multiplied by the level of avginc, so it depends on the income level. The positive coefficient on avginc together with the negative coefficient on avginc^2 means that test scores increase with income, but at a declining rate, so the relationship levels off at higher income levels. Compared to the linear regression, the quadratic model explains somewhat more of the variation in test scores, as shown by the higher adjusted R-squared of 0.554. The overall F-statistic confirms that the regressors are jointly significant relative to an intercept-only model.
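The income-dependent slope of a quadratic regression can be sketched as follows. This is a minimal illustration on simulated data, not the caschool data: the data frame sim, the coefficient values used to generate it, and the helper marginal_effect() are all hypothetical.

```r
set.seed(1)

# Simulated data for illustration only (NOT the caschool data)
sim <- data.frame(avginc = runif(200, min = 5, max = 55))
sim$testscr <- 600 + 4 * sim$avginc - 0.04 * sim$avginc^2 + rnorm(200, sd = 10)

# Quadratic regression, same form as in part (c)
fit <- lm(testscr ~ avginc + I(avginc^2), data = sim)
b <- coef(fit)

# Slope of the fitted curve at income level x: b1 + 2 * b2 * x
marginal_effect <- function(x) {
  unname(b["avginc"] + 2 * b["I(avginc^2)"] * x)
}

# The slope shrinks as income rises when the squared term is negative
marginal_effect(c(10, 20, 40))
```

Applying the same helper to the quadratic model fitted on the real data would show how the estimated effect of income changes across districts.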

 

d)




To determine whether the linear or the quadratic model better describes the relationship between testscr and avginc, the two nested models were formally compared with an F-test using anova(). The null hypothesis is that the coefficient on avginc^2 is zero, in which case the quadratic model reduces to the linear one. The extremely small p-value rejects this null hypothesis, so the quadratic model provides a significantly better fit than the simple linear regression.

Using the preferred quadratic model, we can predict the change in test scores for the two increases in average income. When income increases from 10 to 12, the predicted increase in test scores is 5.8405 points (647.4213 - 641.5808). For an increase from 20 to 22, the predicted increase is smaller at 4.1481 points (671.5463 - 667.3982). This makes intuitive sense given the quadratic nature of the relationship: the same increase in income has diminishing returns at higher income levels as the effect levels off. Overall, the quadratic model fits the data significantly better according to the F-test, and its prediction of smaller test score increases at higher income levels is consistent with the curvature seen in the scatterplot.
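The two predicted increases can be checked directly from the four fitted values quoted in part (d). The sketch below only redoes that arithmetic; the vector and element names are chosen for illustration.

```r
# Fitted values from the quadratic model, as reported in part (d)
predicted <- c(inc10 = 641.5808, inc12 = 647.4213,
               inc20 = 667.3982, inc22 = 671.5463)

# Predicted change in test scores for each two-unit rise in income
increase_10_12 <- predicted[["inc12"]] - predicted[["inc10"]]
increase_20_22 <- predicted[["inc22"]] - predicted[["inc20"]]

round(c(increase_10_12, increase_20_22), 4)
# [1] 5.8405 4.1481
```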

 

e)



Yes, omitting other relevant factors that influence test scores could potentially be a problem for the regressions in parts (b) and (c). While they successfully model the relationship between test scores and average income, other variables excluded from the model may partially or fully explain some of the observed variation in scores.

For example, the correlation matrix shows that test scores are also strongly correlated with the percentage of English learners (el_pct), the percentage of students qualifying for CalWORKs income assistance (calw_pct), and the percentage qualifying for reduced-price lunch (meal_pct), and that these variables are in turn correlated with average income.

This suggests that factors such as English proficiency and economic disadvantage also affect test performance. The analysis so far focused only on the relationship between income and test scores, but in reality several factors influence outcomes simultaneously. Because the omitted factors are correlated with average income as well as with test scores, leaving them out creates omitted variable bias: some of the effect currently attributed to income may actually be due to the excluded factors, so the regressions in parts (b) and (c) may overstate the influence of income. A multiple regression that includes these variables alongside income should give a more accurate estimate of the relationship between income and test scores.
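The mechanism of omitted variable bias can be illustrated with a small simulation. This is a sketch on made-up data, not the caschool data: x, z, y and the coefficient values are hypothetical, with z standing in for an omitted factor that is correlated with the regressor and also affects the outcome.

```r
set.seed(42)
n <- 1000

# Simulated data for illustration only (NOT the caschool data)
z <- rnorm(n)                  # omitted factor
x <- 0.8 * z + rnorm(n)        # regressor of interest, correlated with z
y <- 1 * x + 2 * z + rnorm(n)  # true effect of x on y is 1

short_model <- lm(y ~ x)       # omits z
long_model  <- lm(y ~ x + z)   # controls for z

# The slope on x is overstated when z is left out
slopes <- c(short = unname(coef(short_model)["x"]),
            long  = unname(coef(long_model)["x"]))
slopes
```

In this setup the short regression attributes part of the effect of z to x, while the long regression recovers a slope close to the true value of 1.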

 

f)


 



In this multiple regression model, testscr is regressed on avginc, avginc^2, str, el_pct, calw_pct, and meal_pct. The coefficients on str, el_pct, and meal_pct are individually statistically significant based on their reported p-values, whereas the coefficients on avginc, avginc^2, and calw_pct are not. A joint F-test of the hypothesis that the coefficients on the latter four variables (str, el_pct, calw_pct, meal_pct) are all zero, which compares this model with the quadratic model from part (c), shows that they are highly statistically significant together.

Compared to the quadratic regression in part (c), the estimated coefficients on avginc and avginc^2 have changed and are no longer statistically significant at the 1% significance level. This suggests that once the additional covariates are controlled for, the apparent relationship between test scores and average income is considerably weakened.

As discussed in part (e), omitting these explanatory variables can overstate the effect of income because of omitted variable bias. The multiple regression addresses this limitation by controlling simultaneously for the student-teacher ratio (str), the share of English learners (el_pct), and two measures of economic disadvantage (calw_pct and meal_pct). This gives a more accurate and comprehensive picture of the determinants of test scores than the regressions on income alone. The results support the argument that a regression on income alone can mischaracterize the relationship.
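A joint F-test of this kind compares a restricted model (without the extra regressors) to an unrestricted one. The sketch below uses simulated data, not the caschool data, to show that the F-statistic from anova() matches the textbook formula based on residual sums of squares; all variable names are hypothetical, and the test assumes homoskedastic errors.

```r
set.seed(7)
n <- 300

# Simulated data for illustration only (NOT the caschool data)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 + 0.4 * sim$x2 - 0.3 * sim$x3 + rnorm(n)

restricted   <- lm(y ~ x1, data = sim)            # excludes x2 and x3
unrestricted <- lm(y ~ x1 + x2 + x3, data = sim)  # includes them

# F = ((RSS_r - RSS_u) / q) / (RSS_u / residual df of unrestricted model)
q     <- 2  # number of restrictions being tested
rss_r <- sum(resid(restricted)^2)
rss_u <- sum(resid(unrestricted)^2)
f_manual <- ((rss_r - rss_u) / q) / (rss_u / df.residual(unrestricted))

# The same statistic from the built-in nested-model comparison
f_anova <- anova(restricted, unrestricted)$F[2]

c(manual = f_manual, anova = f_anova)
```

Note that the F-statistic printed at the bottom of summary() for the unrestricted model tests all slope coefficients at once, which is a different hypothesis from the joint test of a subset.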

 

g)

Initially, a simple linear regression suggested a positive statistically significant relationship between test scores and average income, with higher income associated with higher scores. However, this model did not control for any other factors. Examining a scatterplot revealed the relationship may be better characterized as quadratic. Estimating a quadratic regression showed income has diminishing returns on scores at higher levels, and provided a better statistical fit. However, correlation analysis illustrated other variables like socioeconomic status are also related to both test scores and income. Failing to account for these could bias the apparent effect of income found in earlier regressions.

Indeed, when a multiple regression controlled for the student-teacher ratio, the share of English learners, CalWORKs eligibility, and reduced-price lunch eligibility, the estimated coefficients on income were no longer statistically significant. This indicates that the relationship found in the regressions on income alone was partly driven by omitted factors. Overall, while higher income was initially linked to higher test scores, the true relationship is more complex because other socioeconomic factors play a role. Average income alone does not fully determine test performance. The multiple regression provided the clearest picture of how income is related to scores after controlling for relevant covariates.

 


 

R code:

# A

# Load the caschool.csv dataset

data <- read.csv("caschool.csv")

 

# Plot scatterplot

plot(data$avginc, data$testscr, xlab = "Average Income", ylab = "Test Scores")

 

# Calculate correlation

correlation <- cor(data$avginc, data$testscr)

correlation

# B

# Simple regression of testscr on avginc

regression <- lm(testscr ~ avginc, data = data)

 

# Report estimated regression

summary(regression)

 

# C

# Scatterplot

plot(data$avginc, data$testscr, xlab = "Average Income", ylab = "Test Scores")

 

# Quadratic regression of testscr on avginc

quadratic_regression <- lm(testscr ~ avginc + I(avginc^2), data = data)

 

# Report estimated regression

summary(quadratic_regression)

 

# D

# Comparing the two models using ANOVA

anova_result <- anova(regression, quadratic_regression)

anova_result

 

# Predicting increase in test scores

new_data <- data.frame(avginc = c(10, 12, 20, 22))

predicted_scores <- predict(quadratic_regression, newdata = new_data)

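predicted_scores

# Predicted increases in test scores: income 10 -> 12 and 20 -> 22

predicted_scores[c(2, 4)] - predicted_scores[c(1, 3)]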

 

# E

# Calculate correlations between testscr, avginc, and other variables

correlations <- cor(data[, c("testscr", "avginc", "str", "el_pct", "calw_pct", "meal_pct")])

correlations

 

# F

 

# Least squares regression of testscr on avginc, avginc^2, str, el_pct, calw_pct, and meal_pct

multiple_regression <- lm(testscr ~ avginc + I(avginc^2) + str + el_pct + calw_pct + meal_pct, data = data)

 

# Report estimated regression

summary(multiple_regression)

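# Joint F-test of str, el_pct, calw_pct and meal_pct: compare with the quadratic model from part C
# (summary()$fstatistic would instead test all six slope coefficients at once)

joint_significance <- anova(quadratic_regression, multiple_regression)

joint_significance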