Vaccines are among the most powerful weapons we have for preventing infectious disease. In the 1950s, hundreds of thousands of Americans were infected by measles every year. But by 2015, after decades of vaccination, a mere 191 cases were reported. Unfortunately, most vaccines take years to develop, and in the midst of a pandemic, society can’t wait. One promising approach to accelerate this process is to use machine learning, a form of artificial intelligence, to guide vaccine design.
What does it mean to design a vaccine? Vaccines work by exposing you to parts of a pathogen so that your immune system will recognize it more easily in the future and mount a quicker, more robust response. The oldest vaccines were composed either of dead viruses, which are relatively safe but sometimes ineffective, or of live, weakened viruses, which pose greater safety risks. More recent vaccines tend to contain specific components of a virus (such as the surface protein for hepatitis B vaccines) that are judged to be safe and effective. Future vaccines might even include specific viral protein fragments. However a vaccine is composed, the design goal is always to include viral components that are highly immunogenic: visible to your immune system and able to elicit an immune response.
In recent years, researchers in immunology and machine learning have studied and modeled many of the properties of viruses that make them immunogenic. One key property is which parts of a virus can be targeted by antibodies, proteins produced by B-cells that can prevent viral entry into cells and inhibit the spread of a virus throughout your body. Another key property is which viral protein fragments will be presented on a human cell’s surface, marking the cell as infected so that it can be killed by T-cells. We and other researchers have trained machine learning models to predict the strength of these properties for any viral fragment. Using such models, we can better choose which parts of a virus are most likely to be immunogenic and should be included in a vaccine.
Machine learning models learn to recognize patterns from a large number of training examples, often in ways that humans would find very difficult to replicate. For example, immunologists have identified nearly one million protein fragments that are presented on a cell’s surface and visible to T-cells. However, no human could tell by eye whether this is true of SYGFQPTNGVGYQPY, a fragment from the novel coronavirus. A machine learning model, on the other hand, can learn to answer this question from those million other examples, building an understanding of which patterns among the letters representing amino acids lead to a high likelihood of presentation. Last year, we published a model dubbed MARIA, trained to make these kinds of predictions, in the journal Nature Biotechnology. Many research labs have created similar models that can be applied to other kinds of immune response, including antibody binding.
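To give a sense of what "patterns among the letters" means to a model, the sketch below converts a fragment into numbers using one-hot encoding, a common generic way to feed sequences to machine learning models. It is an illustration only and is not meant to describe the actual inputs of MARIA or any other published model; the names `AMINO_ACIDS`, `INDEX`, and `one_hot` are our own for this example.

```python
from typing import List

# The 20 standard amino acids, each written as a single letter.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(fragment: str) -> List[List[int]]:
    """Encode a protein fragment as one row of 20 zeros and ones per residue.

    Each row has a single 1 in the column of the amino acid at that position.
    This is a generic encoding, assumed here purely for illustration.
    """
    rows = []
    for aa in fragment:
        if aa not in INDEX:
            raise ValueError(f"Unknown amino acid letter: {aa!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows


# The fragment mentioned in the text becomes a 15 x 20 grid of numbers.
encoded = one_hot("SYGFQPTNGVGYQPY")
print(len(encoded), len(encoded[0]))
```

A model trained on many such grids, each labeled as presented or not presented, can pick up on which letters at which positions tend to go with presentation.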
As COVID-19 began to spread globally in late January, we used several of these machine learning tools to search for immunogenic components of the virus that would make good vaccine candidates. We scanned each viral protein from SARS-CoV-2, the virus that causes COVID-19, to identify regions with strong antibody targets and a high likelihood of cell presentation. We were immediately struck by the fact that the SARS-CoV-2 spike protein can be targeted by antibodies, as other researchers had begun speculating that the spike protein was essential for viral entry into lung cells. We further identified hundreds of viral protein fragments presentable on human cells. The fragment we mentioned earlier, SYGFQPTNGVGYQPY, is potentially both an antibody target and presentable. In our online preprint, we listed our other top immunogenic candidates in addition to this fragment. Taken as a whole, our work suggests it is possible to develop an effective vaccine against COVID-19 and provides some preliminary guidance for doing so.
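Scanning a protein for candidate fragments can be pictured as sliding a fixed-length window along its sequence and scoring each window. The sketch below is a simplified illustration under stated assumptions, not the pipeline used in this work: `toy_score` is a hypothetical stand-in for a trained model and has no biological meaning, the window length of 15 is chosen only to match the fragment in the text, and the demo sequence is made up by padding that fragment with alanines.

```python
from typing import Callable, List, Tuple


def sliding_fragments(protein: str, length: int = 15) -> List[Tuple[int, str]]:
    """Return (start_index, fragment) for every window of `length` residues."""
    return [(i, protein[i:i + length]) for i in range(len(protein) - length + 1)]


def toy_score(fragment: str) -> float:
    """Hypothetical placeholder for a trained model's prediction.

    Returns the fraction of aromatic residues (F, W, Y). This is NOT a real
    immunogenicity measure; a real predictor would be swapped in here.
    """
    return sum(fragment.count(aa) for aa in "FWY") / len(fragment)


def top_candidates(
    protein: str,
    score_fragment: Callable[[str], float] = toy_score,
    length: int = 15,
    k: int = 3,
) -> List[Tuple[float, int, str]]:
    """Score every window and return the k best as (score, start, fragment)."""
    scored = [
        (score_fragment(frag), start, frag)
        for start, frag in sliding_fragments(protein, length)
    ]
    # Highest score first; ties are broken by the earlier start position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:k]


# Made-up demo sequence: the fragment from the text padded with alanines.
demo_protein = "AAAA" + "SYGFQPTNGVGYQPY" + "AAAA"
for score, start, fragment in top_candidates(demo_protein):
    print(f"{start:>3}  {fragment}  {score:.2f}")
```

In practice, the scoring function would be a trained predictor of antibody binding or cell-surface presentation, and the ranked fragments would be only a starting point for laboratory validation.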
We have discussed our findings with several companies designing COVID-19 vaccines, and our results are good news for the vaccines currently under development. The most promising candidate protein fragments we discovered are located on the spike protein, which is currently the main focus of several vaccine developers, such as the NIH and CanSino. Other researchers have confirmed that antibodies from patients who have recovered from COVID-19 bind to regions we predicted on the spike protein. It will still be 12 to 18 months, however, before we can confirm the clinical efficacy of any COVID-19 vaccine.
It is still early in the era of applying machine learning to vaccine design. A machine learning model is only as good as its training data, and current immunology models are trained on much smaller datasets than models for voice recognition or facial recognition, areas in which artificial intelligence has excelled. Generating larger and more diverse datasets will improve the reliability of immunology models, and their potential impact on the field is enormous. If we can be more confident in the efficacy of vaccines even before they go to trials, we can dramatically accelerate what is otherwise a very slow feedback loop.