Mapping Metabolic Changes for Diabetes Prediction with Machine Learning
Understanding the risk factors that give rise to human disease is essential to early detection and development of effective treatments for disease prevention. Human disease risk represents an interaction between underlying genetic predisposition, which is largely set from the moment of conception, and the varied and changing exposures that occur over an individual’s lifetime, including from diet, lifestyle, the environment, internal organs, and the microbiome.
The degree of risk that originates from genetic vs. non-genetic factors has been modeled using a variety of tools, including the study of monozygotic and dizygotic twins (Rappaport 2016). These studies have found that the majority of lifetime attributable disease risk is related to non-genetic causes, particularly for complex, heterogeneous diseases such as diabetes.
Figure 1. Population attributable fractions for multiple disease phenotypes, including diabetes, estimated from studies of mono- and dizygotic twins. Rappaport PLoS. 2016.
Metabolic biomarker profiling: elucidating non-genetic disease risk
Effective mapping of disease risk across populations requires measurement of non-genetic factors which are not encoded in genetic sequence alone. These non-genetic factors can be captured in part through circulating small molecule biomarkers. Small molecules can be produced endogenously in human cells, in both healthy and diseased tissue, and cross cellular and biological barriers into central circulation. Small molecules are also introduced into blood from the world around us – what we eat, drink, smell, smoke, from the microbes that inhabit our gut, from external environmental exposures, and more. These molecules circulate throughout the body and cause effects over time, making them ideal biomarkers of disease risk and early disease detection.
The mapping of small molecule chemistries, often referred to as metabolomics, is not a new concept, yet only a portion of small molecule biomarkers have been characterized to date, and in general, across relatively small populations. The challenge has been in achieving analytical scale and enabling approaches that allow us to capture data from tens of thousands of individuals.
Measuring metabolic changes at scale and over time
For the last decade, Sapient’s team has been tackling the technical challenges that have limited our ability to measure diverse small molecule biomarkers across human populations. This work has resulted in the development and optimization of our rapid liquid chromatography-mass spectrometry (rLC-MS) systems: next-generation, proprietary technologies for nontargeted mass spectrometry analysis of human biosamples. These tools enable ultra-high throughput, agnostic profiling of thousands of circulating small molecule factors across multiple chemical domains. The vast majority of the biomarkers discovered via rLC-MS are uncharacterized, though they can be accurately quantified and statistically analyzed – amplifying discovery potential to a new order of magnitude.
In contrast to genomics, circulating small molecule biomarkers are highly dynamic and can fluctuate in response to changes in internal physiology or external exposures. To limit the potential for reverse causality, we leverage the discovery potential of longitudinal studies. These approaches sample a healthy individual’s blood and then follow that person over several decades as they move to a diabetic state. In assaying the ‘pre-disease’ blood samples, we can identify specific biomarkers that reveal individuals who will go on to develop diabetes years prior to disease onset. Early studies by other groups in relatively small populations with relatively few measures have demonstrated the potential for these longitudinal discoveries (T. Wang et al.); Sapient’s approach aims to greatly expand upon and extend these landmark studies by measuring tens of thousands of circulating biomarkers across tens of thousands of individuals.
A longitudinal study of tens of thousands of individuals: predicting diabetes by looking back
Figure 2 below represents an rLC-MS analysis of over 40,000 circulating small molecule factors in tens of thousands of individuals, drawn from many different studies and collected at dozens of sites around the world. These individuals have diverse socioeconomic backgrounds, diets, and lifestyles, and have been followed for up to two decades.
Figure 2. Across a study of tens of thousands of diverse individuals, we find hundreds of metabolites that associate with incident diabetes development >10 years in advance.
Through time-to-event analysis, we found hundreds of biomarkers present in ‘pre-disease’ states that predict the development of incident diabetes more than 10 years in advance. Each blue dot on the plot represents a small molecule that cross-validates across different studies and diverse populations, confirming the robust nature of the discovery and consistency in findings despite the heterogeneity of the underlying populations.
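The source does not say which time-to-event model was used. One common choice for this kind of per-biomarker screen is a Cox proportional hazards model fitted to one biomarker at a time, and the sketch below illustrates that idea under stated assumptions. The function names, the simulated cohort, and every parameter value (300 people, 20 years of follow-up, a true log hazard ratio of 0.7) are hypothetical and are not Sapient's data or pipeline.

```python
import math
import random


def fit_cox_univariate(times, events, x, max_iter=50, tol=1e-8):
    """Fit a one-covariate Cox model by Newton-Raphson.

    times  : follow-up time per person (e.g. years since the blood draw)
    events : 1 if diabetes was diagnosed at that time, 0 if censored
    x      : standardized baseline level of a single biomarker
    Returns (beta, standard_error); exp(beta) is the hazard ratio per unit of x.
    Tied event times are handled with the Breslow approximation.
    """
    n = len(times)
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        grad = 0.0
        info = 0.0
        for i in range(n):
            if not events[i]:
                continue  # censored people contribute only through risk sets
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # person j is still at risk
                    w = math.exp(beta * x[j])
                    s0 += w
                    s1 += w * x[j]
                    s2 += w * x[j] * x[j]
            mean = s1 / s0
            grad += x[i] - mean              # score contribution
            info += s2 / s0 - mean * mean    # observed information
        if info <= 0.0:
            break
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info) if info > 0.0 else float("inf")
    return beta, se


def simulate_cohort(n, true_beta, seed=0, follow_up=20.0, base_rate=0.05):
    """Synthetic cohort (assumption): exponential event times, fixed censoring."""
    rng = random.Random(seed)
    times, events, x = [], [], []
    for _ in range(n):
        level = rng.gauss(0.0, 1.0)
        t = rng.expovariate(base_rate * math.exp(true_beta * level))
        x.append(level)
        times.append(min(t, follow_up))
        events.append(1 if t <= follow_up else 0)
    return times, events, x


times, events, x = simulate_cohort(n=300, true_beta=0.7)
beta, se = fit_cox_univariate(times, events, x)
print(f"hazard ratio per SD: {math.exp(beta):.2f} (beta={beta:.2f}, se={se:.2f})")
```

In a real screen this fit would be repeated for each measured small molecule, typically with adjustment for covariates and for multiple testing, and the doubly nested loop would be replaced by an established survival analysis library.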
Machine learning: bridging the gap from data to insight
While it is now possible to generate large-scale small molecule biomarker data very rapidly, the ability to interpret that data for actionable insight and knowledge has not scaled at the same rate. Machine learning (ML) now provides the means to unite voluminous, multi-dimensional data assets such as small molecule measures and genomics, and to run rapid, accurate classifier analyses that identify associations with incident disease development.
When we look at long-term disease risk over a ten-year period, as shown in Figure 3, we find that ML-based classifiers using metabolic risk scores (MRS) have strong predictive power, clearly differentiating the top 25% of individuals at the highest risk of developing diabetes over a decade. This analysis also allows us to understand how risk increases over time, and particularly how common risk factors such as obesity interact with MRS to influence disease risk.
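To make the top-25% comparison concrete, the sketch below splits a cohort at the 75th percentile of a risk score and estimates the 10-year cumulative incidence of diabetes in each group with the standard Kaplan-Meier estimator. The source does not describe how the MRS is constructed, so the weighted-sum score here is only a placeholder assumption, and all function names and toy values are hypothetical.

```python
def metabolic_risk_score(levels, weights):
    """Placeholder MRS (assumption): weighted sum of standardized metabolite levels."""
    return sum(w * v for w, v in zip(weights, levels))


def top_quartile_cutoff(scores):
    """Score at or above which a person falls in the top 25% of the cohort."""
    ranked = sorted(scores)
    return ranked[int(0.75 * len(ranked))]


def cumulative_incidence(times, events, horizon):
    """Kaplan-Meier estimate of the fraction diagnosed by `horizon` (years)."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        diagnosed = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - diagnosed / at_risk
    return 1.0 - survival


def incidence_by_risk_group(scores, times, events, horizon=10.0):
    """Compare cumulative incidence in the top quartile with the rest of the cohort."""
    cutoff = top_quartile_cutoff(scores)
    groups = {"top_25_percent": ([], []), "remaining_75_percent": ([], [])}
    for s, t, e in zip(scores, times, events):
        key = "top_25_percent" if s >= cutoff else "remaining_75_percent"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return {k: cumulative_incidence(ts, es, horizon) for k, (ts, es) in groups.items()}


# Toy cohort of eight people (synthetic values, for illustration only).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
times = [12, 12, 12, 9, 12, 12, 4, 6]   # years of follow-up
events = [0, 0, 0, 1, 0, 0, 1, 1]       # 1 = diagnosed, 0 = censored
print(incidence_by_risk_group(scores, times, events))
```

Repeating the same split within BMI categories would be one simple way to examine how obesity and the risk score combine, although a regression model with an interaction term is the more usual tool for that question.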
Figure 3. ML-based classification of long-term diabetes risk and impact of BMI on MRS to influence disease risk.
ML-aided diabetes risk assessment with small molecule biomarkers
Metabolic diseases such as diabetes typically manifest with significant interpersonal variability, causing challenges for disease prediction, prevention, diagnosis, and treatment. Small molecule biomarkers can help to identify the sources of disease heterogeneity as readouts of the unique exposures an individual experiences throughout life. We now have the analytical technologies to efficiently probe this largely uncharacterized chemical space in humans, and machine learning will enable us to take these large datasets and derive actionable knowledge from the information, including by identifying early biomarkers of disease.