An Inquiry Into Student Attendance Rates

Kevin Macias-Matsuura
Jul 8, 2020

As a former NYC public school teacher, I can say from personal experience that poor attendance is a tricky problem to solve. Some students’ guardians work night shifts and get to bed right before a student is about to wake up, or they leave for work before a student even wakes up to get ready. Some students may be in a neglectful situation, may be living in a shelter, or may live far away and have difficulty finding transportation. With such a gamut of challenges in getting students into their seats, it is troubling that many schools receive state funding based on the number of students in attendance each day. There is also a body of studies and literature suggesting that time spent in the classroom is connected to educational outcomes. Using NYC educational data, I sought to discover which variables in the NYC School Quality Report (SQR) for the 2018–19 school year most affect student attendance rates for NYC’s elementary and middle public schools.

The Data

The data for my analysis came from the NYC School Quality Report (SQR). This yearly report surveys various aspects of NYC schools, from qualitative indicators, like a sliding scale for effective school leadership, to quantitative indicators, like the percentage of the student population in temporary housing. During data cleaning I eliminated charter schools because many choose not to be evaluated by the SQR and are instead evaluated by means authorized by the Chancellor, the State University of New York Trustees, and/or the NY State Education Board of Regents. The vast majority of my features are percentages, so I used a linear regression model to analyze the independent variables’ relationships to the target variable, attendance rate. Some variables, such as majority White/Asian/Black/Hispanic and upper 75% of the Economic Need Index, were feature engineered to see their impact on attendance rates.
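
The feature engineering step can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the `School` record, its field names, the 0.5 "majority" threshold, and the `need_cutoff` parameter are all assumptions, since the exact column names and cutoffs are not given here.

```python
from dataclasses import dataclass


@dataclass
class School:
    """One row of a hypothetical, simplified SQR extract (shares are 0-1)."""
    name: str
    pct_asian: float
    pct_black: float
    pct_hispanic: float
    pct_white: float
    economic_need_index: float


def engineer_features(school: School, need_cutoff: float) -> dict:
    """Return binary flags for majority group and high economic need.

    Assumptions: "majority" means a share above 0.5, and `need_cutoff`
    is a caller-supplied threshold on the Economic Need Index.
    """
    shares = {
        "asian": school.pct_asian,
        "black": school.pct_black,
        "hispanic": school.pct_hispanic,
        "white": school.pct_white,
    }
    flags = {f"majority_{group}": int(share > 0.5) for group, share in shares.items()}
    flags["high_economic_need"] = int(school.economic_need_index >= need_cutoff)
    return flags


# Made-up example school, for illustration only.
example = School("Example School", 0.10, 0.20, 0.60, 0.10, 0.80)
example_flags = engineer_features(example, need_cutoff=0.5)
print(example_flags)
```

Binary flags like these let a linear model estimate a separate shift in attendance for each group of schools, at the cost of discarding the finer detail in the underlying percentages.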

Questions & EDA (Exploratory Data Analysis)

I began my EDA with the following questions in mind:

  • Is there a statistically significant relationship between the percentage of teachers with 3 or more years of experience and student attendance rates?
  • Does the Economic Need Index have a statistically significant impact on the attendance rates of NYC public schools?
  • Is there a strong negative correlation between the percentage of chronically absent students and attendance rates?
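
One way to check questions like these for statistical significance is a permutation test on the Pearson correlation. The sketch below is a swapped-in technique chosen because it needs only NumPy. It is not necessarily the test used in this analysis, and the data is synthetic and made up purely for illustration.

```python
import numpy as np


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between x and y.
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return float(observed), (hits + 1) / (n_permutations + 1)


# Synthetic stand-ins for two SQR columns (not real data).
data_rng = np.random.default_rng(42)
economic_need = data_rng.uniform(0.2, 0.95, size=150)
attendance = 0.97 - 0.05 * economic_need + data_rng.normal(0, 0.01, size=150)

r, p = permutation_p_value(economic_need, attendance)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A small p-value here means that a correlation this strong rarely appears when the pairing between the two columns is random.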

I found that both the percentage of teachers with 3 or more years of experience and the Economic Need Index have a statistically significant relationship with student attendance rates. I also confirmed that there is a strong negative correlation between the percentage of chronically absent students and attendance rates. Through further EDA I found that many of the questionnaire features were highly correlated with each other, so they needed to be dropped, along with HRA eligible, which was too correlated with the Economic Need Index (both measure the need for economic assistance). EDA also revealed a couple of outliers in the target variable. On further investigation, the two outliers turned out to be schools in the process of closing. I set a cap to remove the outliers, which gave me a more normal distribution, as shown in the charts below.
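
The two cleaning steps described above, flagging highly correlated features and capping the target to drop outliers, might look something like this sketch. The column names, the 0.8 correlation threshold, and the 0.80 attendance cap are hypothetical placeholders, and the data is synthetic.

```python
import numpy as np


def highly_correlated_pairs(columns, threshold=0.8):
    """List pairs of named columns whose absolute Pearson correlation meets the threshold."""
    names = list(columns)
    corr = np.corrcoef(np.array([columns[name] for name in names]))
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if abs(corr[i, j]) >= threshold
    ]


def cap_mask(target, lower_cap):
    """Boolean mask that keeps rows whose target value is at or above the cap."""
    return np.asarray(target) >= lower_cap


# Synthetic data standing in for SQR columns (not real data).
rng = np.random.default_rng(0)
n = 200
economic_need = rng.uniform(0.2, 0.95, n)
hra_eligible = 0.9 * economic_need + rng.normal(0, 0.02, n)
chronically_absent = rng.uniform(0.05, 0.5, n)
attendance = 0.98 - 0.2 * chronically_absent + rng.normal(0, 0.005, n)
attendance[:2] = [0.60, 0.62]  # two artificial outliers, like the closing schools

keep = cap_mask(attendance, lower_cap=0.80)  # hypothetical cap value
pairs = highly_correlated_pairs({
    "economic_need_index": economic_need[keep],
    "hra_eligible": hra_eligible[keep],
    "chronically_absent": chronically_absent[keep],
})
r_absent = float(np.corrcoef(chronically_absent[keep], attendance[keep])[0, 1])

print("rows kept:", int(keep.sum()))
print("pairs to review:", pairs)
print(f"chronic absence vs attendance: r = {r_absent:.3f}")
```

From each flagged pair, only one feature would be kept so that the regression coefficients are not distorted by near-duplicate inputs.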

I found another interesting point when I charted the 4 racial categories given in the School Quality Report. From the distributions in the graphs below, we can see that for Asian, Black, and White students there is a positive skew, indicating that many schools have small percentages of these students. The graphs also show a fair number of schools with highly concentrated populations of Black and Hispanic students (the percentage of Hispanic students is more evenly distributed in general). This lends support to the idea that NYC schools tend to be very racially segregated.
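
Positive skew can also be checked numerically rather than only by eye. The sketch below computes the moment-based (Fisher-Pearson) skewness coefficient on a small made-up list of school-level percentages. The numbers are illustrative only and are not taken from the SQR.

```python
import numpy as np


def sample_skewness(values):
    """Fisher-Pearson skewness: third central moment over variance to the 3/2 power."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)


# Made-up shares of one student group across eight hypothetical schools:
# most schools have a small share, a few have a very large one.
pct_group = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.40, 0.90]
skew = sample_skewness(pct_group)
print(f"skewness = {skew:.2f}")
```

A value above zero indicates a long right tail, which matches the pattern of many schools with low percentages and a handful with very high ones.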

Modeling Results

I ran an initial model without much cleaning or adjusting to get a very rough baseline. That model contained 28 features and had a training RMSE of 0.000908, a testing RMSE of 0.015365, and an adjusted R² of 0.945. The high adjusted R², along with a testing RMSE far larger than the training RMSE, indicated that the model was probably overfitting, so I removed features with high p-values for the next iteration. The second iteration was whittled down to 15 features. Its RMSE values were low but more even and its adjusted R² was higher (0.966), which suggested that the removed features didn’t have much of an effect on the target variable. The coefficient values were very low except for chronically absent, which appeared to have a strong negative effect on the target variable of -18%. After a couple more iterations, the final model I settled upon is the second iteration minus the teacher collaboration feature. It was a matter of removing high p-values and limiting the number of features, since many of my features did not seem to have much of an effect. The three features that had the biggest impact on the dependent variable were chronically absent, Asian, and Hispanic. Asian and Hispanic had small impacts of +2.5% and +1.6% respectively. Chronically absent had the largest effect, driving attendance rates down by 18.9%.
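
For readers who want to see how the reported metrics fit together, here is a minimal sketch of an ordinary least squares fit with a train/test RMSE comparison and adjusted R². It uses plain NumPy on synthetic data. The feature names, coefficients, noise level, and 80/20 split are all assumptions for illustration, and p-value based feature selection is not covered.

```python
import numpy as np


def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; returns [intercept, *slopes]."""
    design = np.column_stack([np.ones(len(X)), X])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients


def predict(coefficients, X):
    return coefficients[0] + X @ coefficients[1:]


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def adjusted_r2(y_true, y_pred, n_features):
    """R² penalized for the number of features in the model."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot
    return float(1 - (1 - r2) * (n - 1) / (n - n_features - 1))


# Synthetic data: columns are chronically absent, pct Asian, pct Hispanic.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([
    rng.uniform(0.05, 0.5, n),
    rng.uniform(0.0, 1.0, n),
    rng.uniform(0.0, 1.0, n),
])
y = 0.96 - 0.2 * X[:, 0] + 0.03 * X[:, 1] + 0.02 * X[:, 2] + rng.normal(0, 0.005, n)

# Simple 80/20 split; the rows are already in random order.
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

coef = fit_ols(X_train, y_train)
train_rmse = rmse(y_train, predict(coef, X_train))
test_rmse = rmse(y_test, predict(coef, X_test))
adj_r2 = adjusted_r2(y_train, predict(coef, X_train), n_features=X.shape[1])

print(f"train RMSE = {train_rmse:.4f}, test RMSE = {test_rmse:.4f}")
print(f"adjusted R² = {adj_r2:.3f}, chronically absent coefficient = {coef[1]:.3f}")
```

A testing RMSE much larger than the training RMSE is the overfitting signal described above. When the two are close, the model is generalizing about as well as it fits the training data.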

Takeaways

While I was able to show that chronic absences are a strong factor in driving attendance down, the final model itself fit well but had low interpretability because the vast majority of the features had little impact on the target. Many of the factors that I assumed would be statistically impactful, like temporary housing and teacher collaboration, ended up not having much of an impact at all. While I was able to draw some insights from the data, the model itself is not at all at production level. Because money is allocated by student attendance, I would recommend adding questions to the SQR to give a model like this more predictive power. For example, one may ask, “How many times a month are you in contact with the school and/or teachers?” or “How many times a month are you in the school building?” as a measurement of strong family/community ties.

Kevin Macias-Matsuura

Former English teacher turned Data Scientist/Analyst interested in data, design, and storytelling.