Article | October 28, 2025
Mind the Gaps: The Real Impact of Missing Data and Outliers in Biocomputational Analysis
Authored by Shreya Gaddam, Data Scientist II and Cameron Lamoureux, PhD, Senior Data Scientist, Sapient
When it comes to data analytics for drug development, what is missing and what is extreme can completely change the story. Whether analyzing biomarker data or clinical trial results, missing values and outliers have the potential to skew insights, dilute impact, and mislead decision-making.
Missing data is a persistent challenge in data analysis that stems from a variety of sources: unrecorded measurements, clinical trial participant drop-outs, or incomplete survey responses. Similarly, outliers (extreme values that deviate markedly from the rest of the data) may arise from inherent or external factors. In some cases, missingness and outliers reflect genuine characteristics of the data rather than anomalies or errors. However, if not handled properly, they can introduce systematic bias, reduce statistical power, and sometimes completely alter the conclusions drawn from biocomputational analyses.
Impact of Missingness on Linear Regression
In LC-MS data, missingness can arise from variation in sample preparation, instrument fluctuation, or peak misalignment, all of which are considered missing completely at random. It can also arise when low-abundance compounds fall below the limit of detection (LoD), in which case the missingness depends on the unobserved value itself and is not at random. Both linear regression and correlation analysis require complete, non-missing observations. Excluding samples with missing values reduces sample size and therefore statistical power; it may also introduce bias if the missingness is not at random.
Imputation rescues incomplete records by filling in missing values with informed estimates rather than discarding them. This technique helps to preserve sample size and statistical power, provided that assumptions about the source of missingness are carefully considered. For LC-MS data, missing intensity values should be imputed at or below the minimum observed non-missing intensity value, as missingness is largely due to certain molecules falling below the LoD of the instrument.
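As a minimal sketch of the minimum-value imputation described above, the following Python example fills each feature's missing intensities with a fraction of that feature's minimum observed value. The function name impute_below_lod, the fraction parameter, and the toy matrix are illustrative assumptions, not part of any specific pipeline, and the sketch assumes positive intensities so that a fraction of 1.0 or less lands at or below the observed minimum.

```python
import numpy as np

def impute_below_lod(intensities, fraction=1.0):
    """Fill NaNs in each column with `fraction` times that column's minimum observed value.

    Hypothetical helper: rows are samples, columns are proteins/metabolites.
    Assumes positive intensities, so fraction <= 1.0 imputes at or below the minimum.
    """
    x = np.asarray(intensities, dtype=float)
    if x.ndim != 2:
        raise ValueError("expected a 2-D samples-by-features matrix")
    filled = x.copy()
    for j in range(x.shape[1]):
        col = x[:, j]
        observed = col[~np.isnan(col)]
        if observed.size == 0:
            continue  # nothing was observed for this feature; leave it missing
        filled[np.isnan(col), j] = fraction * observed.min()
    return filled

# Toy matrix: 3 samples x 2 features, with NaN marking missing intensities.
demo = np.array([
    [10.0, np.nan],
    [np.nan, 4.0],
    [30.0, 8.0],
])
imputed = impute_below_lod(demo, fraction=0.5)
print(imputed)
```

Imputing per feature rather than with one global minimum reflects that each molecule has its own observed intensity range; whether to use the minimum itself or a fraction of it is a choice this sketch leaves to the analyst.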
When proteins or metabolites have a high percentage of missingness (>75%), presence-absence analysis can be used as an alternative approach, focusing on whether a protein/metabolite is detected or not rather than on its quantitative abundance.
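A hedged sketch of how features might be routed to presence-absence analysis is shown below. It converts an intensity matrix into detected/not-detected indicators and flags features whose missing fraction exceeds the 75% threshold mentioned above. The function name presence_absence, the toy data, and the two-group comparison are assumptions made for illustration; a real analysis would follow this with an appropriate test on the detection counts.

```python
import numpy as np

def presence_absence(intensities, max_missing=0.75):
    """Return detection indicators, per-feature missing fractions, and a flag
    marking features whose missingness exceeds `max_missing`.

    Hypothetical helper: rows are samples, columns are proteins/metabolites.
    """
    x = np.asarray(intensities, dtype=float)
    detected = ~np.isnan(x)                      # True where a value was measured
    missing_fraction = 1.0 - detected.mean(axis=0)
    use_presence_absence = missing_fraction > max_missing
    return detected.astype(int), missing_fraction, use_presence_absence

# Toy data: 8 samples x 2 features. Feature 0 is detected in one sample only.
nan = np.nan
demo = np.array([
    [nan, 5.1], [nan, 4.8], [nan, 5.5], [nan, nan],
    [nan, 6.0], [nan, 5.2], [nan, 4.9], [7.3, 5.7],
])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])      # assumed binary clinical outcome

detected, missing_fraction, flagged = presence_absence(demo)

# For a flagged feature, compare detection rates between the two outcome groups.
for j in np.flatnonzero(flagged):
    rate_0 = detected[groups == 0, j].mean()
    rate_1 = detected[groups == 1, j].mean()
    print(f"feature {j}: detected in {rate_0:.0%} of group 0, {rate_1:.0%} of group 1")
```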
To illustrate these concepts, Figure 1 shows the regression lines of a protein against a clinical outcome. The teal line shows the observed relationship, while the other two lines show the effect of median imputation and of LoD imputation using the minimum of non-missing values. This comparison demonstrates how the choice of imputation method can change regression coefficients. While these changes are sometimes subtle for an individual protein or metabolite, they compound across thousands of measured proteins and metabolites and can meaningfully alter the conclusions of a large-scale proteomics or metabolomics study.
Figure 1. Demonstrating how imputation can impact regression coefficients.
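The kind of comparison shown in Figure 1 can be sketched on simulated data. The example below is not the data behind the figure: the simulated protein, the outcome, the detection limit of 9.5, and the true slope of 0.5 are all assumptions chosen for illustration. It fits a simple linear regression three ways (complete cases only, median imputation, and minimum-value imputation) and prints the resulting slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (assumed) data: a protein intensity and an outcome that depends on it.
protein = rng.normal(loc=10.0, scale=1.0, size=n)
outcome = 0.5 * protein + rng.normal(scale=1.0, size=n)

# Left-censor the protein: values under an assumed detection limit become missing.
lod = 9.5
observed = np.where(protein < lod, np.nan, protein)
missing = np.isnan(observed)

def slope(x, y):
    """Slope of the least-squares line y ~ x."""
    return float(np.polyfit(x, y, deg=1)[0])

slopes = {
    # Drop samples whose protein value is missing.
    "complete_case": slope(observed[~missing], outcome[~missing]),
    # Replace missing values with the median of the observed values.
    "median": slope(np.where(missing, np.nanmedian(observed), observed), outcome),
    # Replace missing values with the minimum of the observed values.
    "min": slope(np.where(missing, np.nanmin(observed), observed), outcome),
}

for name, value in slopes.items():
    print(f"{name:>13}: slope = {value:.3f}")
```

Running the same comparison across every feature in a dataset is one way to gauge how sensitive downstream conclusions are to the imputation choice.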
Impact of Outliers on Linear Regression
Outliers may reflect technical artifacts or rare biological variations. Regardless of source, these outliers increase variance in data and inflate standard errors, reducing statistical power. Additionally, they can disproportionately skew regression coefficients, leading to over- or under-estimation of effects. These consequences can result in misleading interpretations of the relationships in data. Figure 2 illustrates this effect, showing how the presence of outliers can dramatically distort the regression line, pulling it away from the likely underlying relationship.
Figure 2. Demonstrating how the presence of outliers can dramatically distort the regression line.
Therefore, identifying outliers is a crucial step before and during statistical analysis. Visual inspection with boxplots or scatterplots, or techniques such as principal component analysis (PCA), can help identify outliers for removal before analysis. Leverage and influence measures derived from the linear regression hat matrix can help detect observations that disproportionately affect regression results. A trend that does not survive the removal of high-leverage outlier data points may be spurious.
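The hat-matrix diagnostics mentioned above can be sketched with NumPy alone. The example below computes each observation's leverage (the diagonal of the hat matrix) and Cook's distance for a simple linear regression, then refits without the flagged point. The function name, the toy data, and the 2p/n leverage cutoff (a common rule of thumb, not a threshold taken from this article) are assumptions for illustration.

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Leverage (hat-matrix diagonal) and Cook's distance for the model y ~ 1 + x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    # Diagonal of H = X (X'X)^-1 X' without forming the full n x n matrix.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    cooks = (resid ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2
    return leverage, cooks

# Toy data: five points on a clear upward trend plus one extreme observation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0])

leverage, cooks = leverage_and_cooks(x, y)
n, p = len(x), 2
flagged = leverage > 2 * p / n                   # assumed rule-of-thumb cutoff

slope_all = np.polyfit(x, y, deg=1)[0]
slope_without = np.polyfit(x[~flagged], y[~flagged], deg=1)[0]
print("flagged observations:", np.flatnonzero(flagged))
print(f"slope with all points: {slope_all:.3f}")
print(f"slope without flagged points: {slope_without:.3f}")
```

Comparing the two slopes is a direct check of whether a trend survives the removal of high-leverage points. High leverage alone does not make a point wrong, so flagged observations are candidates for review rather than automatic exclusions.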
Impact of Missingness and Outliers on Correlation
Correlation analysis is also affected by both missingness and outliers. Pearson correlation – a linear method – is affected in similar ways to linear regression. Spearman correlation – a monotonic, rank-based method – is often a more robust choice in the presence of outliers. However, Spearman has its own disadvantages when applied to data with missingness. Specifically, when below-LoD values are imputed, they can introduce random rank differences that Spearman correlation overemphasizes, leading to inflated estimates of correlation between imputed features with higher missingness. Therefore, it is critical when performing Spearman correlation analysis (or any rank-based analysis) to avoid using imputed data; in this particular case, considering only non-missing data is the more robust approach.
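To make the Spearman caveat concrete, the sketch below compares a Spearman correlation computed on minimum-imputed data with one computed only on samples where both features were measured. The scenario is an assumption chosen for illustration, not the authors' data: two features whose detected values are independent, but whose missing values co-occur in the same samples. The helpers average_ranks and spearman are hypothetical names written in plain NumPy.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="mergesort")
    sorted_v = v[order]
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and sorted_v[j + 1] == sorted_v[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # mean of positions i+1..j+1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

rng = np.random.default_rng(1)
n = 200

# Assumed scenario: two independent features with shared missingness.
feat_a = rng.normal(size=n)
feat_b = rng.normal(size=n)
shared_missing = rng.random(n) < 0.6
feat_a[shared_missing] = np.nan
feat_b[shared_missing] = np.nan

# Minimum-value imputation creates a large block of tied low ranks in both features.
imp_a = np.where(np.isnan(feat_a), np.nanmin(feat_a), feat_a)
imp_b = np.where(np.isnan(feat_b), np.nanmin(feat_b), feat_b)
rho_imputed = spearman(imp_a, imp_b)

# Restrict to samples where both features were actually measured.
both = ~np.isnan(feat_a) & ~np.isnan(feat_b)
rho_complete = spearman(feat_a[both], feat_b[both])

print(f"Spearman on imputed data: {rho_imputed:.2f}")
print(f"Spearman on non-missing pairs: {rho_complete:.2f}")
```

In this setup the imputed values occupy the same low ranks in both features, so the rank correlation is driven by the pattern of missingness rather than by the measured abundances.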
Turning Data Challenges into Analytical Strengths
Missing data and outliers are more than technical hurdles in biocomputational analysis. Left unaddressed, they can obscure true signals, distort statistical estimates, and undermine the reliability of insights generated from omics datasets to inform drug development and biomarker research. However, thoughtful analytical strategies such as targeted imputation, presence-absence analysis, and robust outlier detection make it possible to recognize their impact, address them with rigor, and ensure that the story the data tells is both truthful and actionable.