By Navin Vembar
A few days ago, Google released its COVID-19 Public Forecasts, which leverages a machine learning approach to predict COVID-19’s impact over the next 14 days. They also dropped a paper about how they approach prediction, with granular categories and down to the county level. This is a pretty fantastic release of public information that can be useful for government and public health officials, and we wanted to call out a few points of interest:
- What impacts are they predicting?
- What are their data inputs?
- What are some interesting facets of their approach?
What Impacts Are They Predicting?
First, it should be noted that Google did a nice bit of work in their User Guide to the data, so I won’t go into the individual fields. Broadly, they are looking at hospitalizations, the number of people in the ICU, the number on ventilators, cases, and deaths. All of this allows states, counties, and hospitals to plan capacity for each scenario.
The really interesting thing they’ve done is build a fairly explainable model of a complex environment. Explainability is really important: the application of machine learning, especially neural networks, can often produce results that are accurate but not easily explained. In Google’s white paper detailing their modeling approach, they describe a flow between each of these conditions (a case, in the ICU, on a ventilator, deceased, recovered). I found that completely fascinating; for other nerds like me, it’s a state machine with probabilistic transitions.

This is an augmentation of a standard epidemiological SEIR model that uses machine learning to predict the transition probabilities. SEIR stands for “Susceptible-Exposed-Infectious-Recovered” and describes the states a person goes through when infected with a long-incubating disease like COVID-19. Google expanded the model by including cases that are not documented (that is, people who have the disease but don’t know it) and by accounting for the different disease states that occur within the hospitalization period. Their model first predicts the parameters behind those transition probabilities and then uses that definition to determine the outcomes on a given day in the future.
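To make the state-machine idea concrete, here is a minimal sketch of a discrete-time compartmental model with probabilistic transitions. This is not Google’s actual model: the states are simplified and the transition probabilities are made-up, illustrative numbers. In Google’s approach, those probabilities are the quantities the machine learning layer predicts.

```python
import numpy as np

# States in a simplified disease-progression state machine.
STATES = ["exposed", "infectious", "hospitalized", "icu",
          "ventilator", "recovered", "deceased"]

# transitions[i][j] = daily probability of moving from state i to state j.
# Each row sums to 1; the diagonal is the probability of staying put.
# These numbers are illustrative only, NOT fitted or real values.
transitions = np.array([
    # exposed infect. hosp.  icu    vent.  recov. deceased
    [0.80,    0.20,   0.00,  0.00,  0.00,  0.00,  0.00],  # exposed
    [0.00,    0.85,   0.05,  0.00,  0.00,  0.10,  0.00],  # infectious
    [0.00,    0.00,   0.80,  0.08,  0.00,  0.11,  0.01],  # hospitalized
    [0.00,    0.00,   0.10,  0.75,  0.10,  0.00,  0.05],  # icu
    [0.00,    0.00,   0.00,  0.10,  0.75,  0.05,  0.10],  # ventilator
    [0.00,    0.00,   0.00,  0.00,  0.00,  1.00,  0.00],  # recovered
    [0.00,    0.00,   0.00,  0.00,  0.00,  0.00,  1.00],  # deceased
])

def step(counts: np.ndarray) -> np.ndarray:
    """Advance the expected population counts in each state by one day."""
    return counts @ transitions

# Start with 1,000 exposed people and run the 14-day forecast horizon.
counts = np.array([1000.0, 0, 0, 0, 0, 0, 0])
for _ in range(14):
    counts = step(counts)
print(dict(zip(STATES, counts.round(1).tolist())))
```

Because the whole simulation is just repeated multiplication by a transition matrix, every parameter has a direct interpretation, which is exactly the explainability property described above.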
Because of the explicit nature of their modeling approach, those parameters are well defined, have meaning, and can be used in other applications. For example, a side effect of this work is a prediction for R_e, the effective reproduction rate of COVID-19, a vital metric for understanding, at the population level, when an area is safe.
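For readers unfamiliar with R_e, here is the textbook relationship in a basic SEIR model. This is the standard epidemiological definition, not Google’s exact computation; in their model the underlying rates are learned parameters rather than fixed inputs.

```python
def effective_reproduction_number(beta: float, gamma: float,
                                  susceptible: float, population: float) -> float:
    """Textbook SEIR relationship: R_e = (beta / gamma) * (S / N).

    beta:  transmission rate (new infections per infectious person per day)
    gamma: recovery rate (1 / average infectious period in days)
    """
    return (beta / gamma) * (susceptible / population)

# Example: beta = 0.3/day, a 10-day infectious period (gamma = 0.1/day),
# and 80% of a population of 1 million still susceptible.
print(effective_reproduction_number(0.3, 0.1, 800_000, 1_000_000))  # ≈ 2.4
```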
It should also be noted that the predictions currently extend only two weeks into the future. In a lot of ways, assessing case and death counts alone over two weeks doesn’t require these mechanisms for significant accuracy, but the results are certainly improved by adding in other data, and the different categories (like people on ventilators) are extremely useful.
What Are Their Data Inputs?
Google used mobility information, air quality data, base information about cases, deaths, and hospitalizations as available, and underlying census information, all of which are publicly available data sets. We’d note that they use a relatively naive network graph of counties to underpin their mobility modeling, whereas Camber Systems’ mobility data includes transition matrices that are explicit (but private) about how many people move between any two counties.
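To illustrate what a transition matrix buys you over a simple adjacency graph, here is a hedged sketch; the counties and fractions are invented (Camber Systems’ actual matrices are private).

```python
import numpy as np

# mobility[i][j] = fraction of people in county i who spend time in
# county j on a given day (each row sums to 1). Hypothetical values.
counties = ["County A", "County B", "County C"]
mobility = np.array([
    [0.90, 0.07, 0.03],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])

# Applying the matrix to active infections spreads infection pressure
# along actual travel volumes, rather than treating neighboring counties
# as uniformly connected the way a simple adjacency graph does.
infections = np.array([100.0, 10.0, 0.0])
exposure = infections @ mobility
print(dict(zip(counties, exposure.round(1).tolist())))
# {'County A': 90.5, 'County B': 16.0, 'County C': 3.5}
```

An adjacency graph only says that two counties are connected; the matrix says how much movement actually flows between them, which weights the spread accordingly.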
Interesting Facets of Their Approach
By keeping their inputs relatively simple, we again see a useful approach to making the results explainable. But there’s another place this shines: accounting for inequities.
A big chunk of the paper is actually about measures of fairness. That is, given that BIPOC communities are more severely impacted by COVID-19 in the US, predictions that don’t account for that are inaccurate and, if used by people deciding where resources should be targeted, could further exacerbate inequitable care. The work laid out here is explicit and makes some comparisons to other studies in this domain. All of that makes for better, more accurate predictions.
Doing this gives not only Google but also others an understanding of approaches that account for systemic inequities. The granular nature of the categories also means that near-term planning can be targeted, so that allocation accounts for those underlying societal concerns.
It’s exciting to see this work. It avoids some of the flaws that machine learning approaches often have, like opacity, while still making it possible to infer secondary information, like reproduction rates.
Read more about how mobility data can be used for the public good here.