Creating a Data Table to Explore Factors that may Correlate with COVID-19 Cases at U.S. Colleges

Snippet of the data table, which contains 1,765 rows and 55 columns drawn from over 10 different sources.

In the fall, I was tasked with creating a novel, usable and relevant dataset as part of a competition in Duke Fuqua’s MQM Program. I constructed a data table that could be used to obtain insights about factors that may have influenced the number of COVID-19 cases at universities across the U.S. during the fall semester (or to attempt to identify which colleges did a better job of limiting the spread of the virus). I was motivated to help students, their families and policymakers uncover which schools may pose the greatest risks (at least for the spring semester), and to help other interested stakeholders understand why certain institutions may have been in a better position to succeed than others.

When the pandemic arrived in the U.S. last March, I was living near its epicenter in the New York City area. I was already looking toward the summer, when I could come to North Carolina, where outbreaks were few and far between. I hypothesized that the spread of the virus was mostly related to population density (and that it was potentially exacerbated by the cold of the Northeast). Come the middle of the summer, New York had its positivity rate under control, while the South was reaching new peaks. It seemed clear that more than one or two factors accounted for the spread of the virus in various locations. Information about this pandemic has changed as rapidly as the virus has spread over the last year, and capturing data has been the driving force behind this.

However, I have not found datasets that specifically concern COVID-19 cases at colleges and also contain an extensive list of variables that may affect the accumulation of these cases. Students, university administrators and staff, local government officials, the NCAA and surrounding community members would all likely have great interest in determining the relative COVID-19 risk for a given college.

Of course, the policies that each university implements (whether that be regarding in-person classes, living on campus, limiting gatherings, wearing masks, etc.) will have a large impact on the number of positive cases. But for this data table I focused on elements that universities are not going to suddenly adjust. It is also difficult to collect data on each school’s individual policies. So I particularly focused on finding data related to the characteristics of each institution and its corresponding county.

Constructing the Dataset

I used the New York Times’ data on COVID-19 cases at U.S. colleges as my starting point. (See the full list of data sources at the bottom of this post.) From there, I joined information from another table in that dataset that contained the number of cases for each county. I then searched for more datasets with relevant university and county data. In total, I accumulated 55 variables from 13 different sources in my final data table.

I used the dplyr package in R to left join each additional table onto my existing data table. Using left joins ensured that I would not lose any records from my table. I mostly joined the tables on ipeds_id (a unique code for each college) and FIPS (a unique code for each county). However, I had to join the first two tables on county and state (which required a lot of revision), I later had to join two tables on college name, and I made two joins on state name. After every step in the process, I checked that my data still looked reasonable, which allowed me to catch a few mistakes, either in the data itself or in my own selections and joins. In total, there are 1,765 college records in the final data table.
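To make the joining step concrete, here is a minimal sketch of a left join followed by the kind of row-count check described above. It is written in plain Python rather than the dplyr code I actually used, and the `left_join` helper, the toy rows, the `county_cases` column and all values are hypothetical, made up purely for illustration.

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key`, keeping every left row."""
    # Index the right table by key; this sketch assumes keys are unique there.
    right_index = {}
    for row in right_rows:
        if row[key] in right_index:
            raise ValueError(f"duplicate key on right side: {row[key]!r}")
        right_index[row[key]] = row

    # Collect the right-hand columns (other than the key), preserving order.
    right_columns = list(
        dict.fromkeys(c for row in right_rows for c in row if c != key)
    )

    joined = []
    for row in left_rows:
        match = right_index.get(row[key], {})
        merged = dict(row)
        for col in right_columns:
            merged[col] = match.get(col)  # None when the left row has no match
        joined.append(merged)
    return joined


# Hypothetical toy rows; ids and counts are invented for this example.
colleges = [
    {"ipeds_id": "A1", "college": "College A", "fips": "00001"},
    {"ipeds_id": "B2", "college": "College B", "fips": "00002"},
    {"ipeds_id": "C3", "college": "College C", "fips": "00003"},
]
counties = [
    {"fips": "00001", "county_cases": 1200},
    {"fips": "00002", "county_cases": 450},
]

table = left_join(colleges, counties, key="fips")

# Sanity check: a left join on a unique key must not add or drop college records.
assert len(table) == len(colleges)

# Colleges whose county was not found are kept, with missing county values.
unmatched = [r["college"] for r in table if r["county_cases"] is None]
print(unmatched)
```

Listing the unmatched rows after each join is one way to spot key mismatches (for example, inconsistent county names) before they propagate into later joins.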


I manually entered two of my variables: fall2020_football and ugrads_fully_online. I did this because I could not find publicly available datasets on which schools played football in fall 2020 or which schools conducted all undergraduate classes remotely. I believe playing football is a signal for both general athletic activity and the prioritization of athletic activity at a college, both of which could contribute to clusters among athletes and thus a rise in cases on campus. (It could also increase the likelihood of gatherings at or around games.) Meanwhile, ugrads_fully_online is the only policy-oriented variable, but I felt it had to be included because operating remotely clearly restricts the spread of the virus on campus (as seen in the lower case counts at these schools).

I included undergraduate enrollment data because the undergraduates are most likely to live and intermingle on campus and a higher on-campus population density should increase the risk of spread. (And if positivity rates are equal between schools, the one with more students will have more cases.) I included demographic data at both the college and county level because the positivity rate seems to be higher among groups with lower socioeconomic status. I also included tuition/endowment/aid/admissions statistics because I hypothesize that these may be signals for the resources of each university or the average wealth of its students.

Because students tend to intermingle with the surrounding community, I think similar data about the county is also relevant. I included political data because more liberal-leaning areas have tended to take stricter precautions during the pandemic, while conservative-leaning areas have been more skeptical of public health advice. Longitude and latitude can be used to determine the impact of geographical location, and latitude specifically can be used as a proxy for temperature. Finally, I included some variables about how workers commute to work in each county. Counties which have a greater reliance on public transit may be at a greater risk for spreading the virus.

Using the Data

I have updated my dataset since the fall to include case numbers from the latest New York Times update: December 11, 2020. Thus the table paints a picture of the entire fall semester. (Many schools ended their on-campus fall semesters at the start of Thanksgiving break.) Because there are only cumulative case numbers from two dates, this data cannot really be used to track the rate of spread. But even though using the data was not part of this project, I ran a couple of linear regressions to see if I could gain any initial insights.

I used the December 11 college case counts as my dependent variable. Since I was using absolute case counts, undergraduate enrollment was unsurprisingly the most significant factor. Because enrollment was accounted for in the regression, I was able to get a better sense of the importance of the other variables.

A Notre Dame football celebration from this fall.

The next most significant variable (with by far the lowest p-value of the remaining variables) was the binary football variable that I created. If a school played football during the fall, it would be expected to report an average of 691 more cases, holding all other variables constant. Other significant factors associated with more cases included being a private school, having a lower percentage of Asian enrollment, a higher tuition price, a lower population density and being in a county that is more reliant on public transportation for commuting.
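To illustrate how a coefficient on a binary variable is read, here is a minimal sketch of ordinary least squares in plain Python (not the R code behind my results). The toy data are invented and constructed so that the football indicator adds exactly 300 cases, which is not the 691 from my regression. The `solve` and `ols` helpers are hypothetical names, and the sketch does not compute p-values.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented data: (undergraduate enrollment in thousands, played football 0/1).
predictors = [(5, 0), (10, 1), (20, 0), (30, 1), (8, 0), (25, 1)]
# Cases built as 10 + 20 * enrollment + 300 * football, with no noise.
cases = [10 + 20 * e + 300 * f for e, f in predictors]

intercept, per_thousand_students, football_effect = ols(predictors, cases)

# The football coefficient is the expected difference in cases between a school
# that played and one that did not, at the same enrollment.
print(round(intercept, 6), round(per_thousand_students, 6), round(football_effect, 6))
```

With real data the fit is noisy, so the coefficient is an average difference after adjusting for the other variables, not an exact amount for any single school.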

Aside from the last result, these findings are not necessarily expected and can be especially confusing if you try to relate them to one another. But I deduce from these correlations that COVID spread was influenced much more by student behavior than by the location of the college. If students (or their parents) paid a lot to travel to a university set in a small-to-medium-sized college town, then they wanted a true college experience. And in many cases, more urban schools, or those better known for academics than for a football team, may have had less restrictive policies in place.

While it is difficult to single out a few colleges that best limited virus spread (because many schools reported zero or very few cases), my regression did reveal a handful of schools that did poorly when taking the other variables into account.

Here are the bottom 10, all of which reported between 200 and 500 more cases than the regression would predict:

10. Texas A&M University

9. University of Iowa

8. University of Kentucky

7. Texas Tech University

6. University of Alabama

5. Purdue University

4. Brigham Young University

3. University of Georgia

2. Clemson University

Their QB tested positive as well.

And the honor for worst school at containing COVID goes to…

1. University of Florida
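A ranking like the one above comes from comparing each school's reported cases with the number the regression predicts, then sorting by the gap (the residual). The sketch below shows the idea in plain Python with a single predictor for brevity. My actual regression used many variables, and the school names and numbers here are invented.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by least squares (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented schools: (name, enrollment in thousands, reported cases).
schools = [
    ("School A", 10, 200),
    ("School B", 20, 400),
    ("School C", 30, 900),
    ("School D", 40, 800),
    ("School E", 50, 1000),
]

enrollment = [s[1] for s in schools]
reported = [s[2] for s in schools]
intercept, slope = fit_line(enrollment, reported)

# Residual = reported cases minus the cases predicted from enrollment alone.
residuals = [
    (name, cases - (intercept + slope * enroll))
    for name, enroll, cases in schools
]

# Largest positive residual first: the school furthest above its prediction.
ranked = sorted(residuals, key=lambda item: item[1], reverse=True)
print(ranked[0])
```

A large positive residual only says that a school reported more cases than this particular model expects. It can reflect omitted factors, such as testing volume or school policies, as much as poor containment.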

Of course, there are several limitations to running linear regressions like these, especially since I was not able to include school-level policy decisions. However, I do believe this dataset can still be useful well into the future. There will likely still be COVID risks to gauge in the fall, and there may also be other public health crises in the not-so-distant future. And it will never be too late to learn from this one.

Data Sources

New York Times

Data World/Chronicle of Higher Education

Sporting News

The Chronicle of Higher Education

Tuition Tracker

National Center for Education Statistics

Opportunity Insights

Harvard Dataverse

Civil Service USA

USDA Economic Research Service

U.S. Census Bureau

And once again, here is the csv file.

Check out my GitHub for the code behind each post.


