Article | May 20, 2025

Illuminating the Dark Proteome

Implications for discovery, disease understanding, and therapeutic development

For decades, scientists have worked under the assumption that genes encode proteins through open reading frames (ORFs): long sequences of DNA that begin with a canonical start codon (typically ATG, read as “AUG” in the mRNA) and continue uninterrupted until reaching a stop codon. These canonical ORFs have long served as the blueprint for how we define a protein and, in turn, how we approach everything from disease modeling to therapeutic development. Perturbations in canonical proteins can often be linked back to mutations in one of the approximately 20,000 genes in the human genome, giving insight into disease pathogenesis and providing predictable drug targets.
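The canonical ORF definition above can be illustrated with a minimal Python sketch. It scans only the forward strand, treats ATG as the sole start codon, and applies a minimum-length cutoff; the function name find_orfs and the default cutoff of 100 codons are assumptions chosen for illustration, not the rules of any particular annotation pipeline.

```python
# Illustrative sketch only: forward-strand scan for ATG ... stop in each reading frame.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) tuples for ORFs with at least min_codons codons.

    start/end are 0-based, end-exclusive, and the span includes the stop codon.
    min_codons counts codons before the stop; its default is an assumed cutoff.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the candidate and keep scanning
    return orfs

# A toy sequence with one short ORF (ATG AAA TTT TAG) in frame 2.
toy = "CCATGAAATTTTAGGG"
print(find_orfs(toy, min_codons=3))  # [(2, 14, 2)]
print(find_orfs(toy))                # [] because the ORF is far below the cutoff
```

The second call shows why a length cutoff matters for this story: a short ORF that is genuinely present in the sequence is simply never reported.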

Recent advances in proteomic technologies and large-scale analyses, however, have made it clear that the human proteome is far deeper and more complex than initially understood. Beyond the canonical proteome lies an uncharted layer of biology known as the “dark proteome”, and it may hold critical insights into disease mechanisms that have long eluded scientific understanding.

What is the dark proteome?

The dark proteome refers to the vast and largely unexplored region of the proteome composed of “non-canonical” proteins, or proteins that do not follow the traditional gene annotation rules. In fact, there are thousands of unconventional ORFs in the human genome that do not meet the conventional criteria for a “protein-coding gene”, yet they produce proteins all the same. This includes miniproteins (50-100 amino acids) and microproteins (<50 amino acids) derived from non-canonical open reading frames (ncORFs), as well as splice variants, frameshifted isoforms, mutant proteins, and proteins bearing atypical post-translational modifications (PTMs).
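As a small illustration of the size classes named above, the following Python sketch labels a protein sequence by length alone. The function name size_class and the toy sequences are hypothetical, and length by itself does not make a protein non-canonical; the sketch only encodes the two thresholds given in the text.

```python
def size_class(protein_seq):
    """Label a protein sequence using the length thresholds from the text."""
    n = len(protein_seq)
    if n < 50:
        return "microprotein"   # fewer than 50 amino acids
    if n <= 100:
        return "miniprotein"    # 50 to 100 amino acids
    return "above miniprotein range"

# Toy sequences of 31, 75, and 300 residues (invented for illustration).
for seq in ("M" + "A" * 30, "M" + "A" * 74, "M" + "A" * 299):
    print(len(seq), size_class(seq))
```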


The dark proteome is composed of non-canonical proteins derived from diverse DNA, RNA, and protein variants.

Until recently, these proteins had been largely considered molecular noise. Their story mirrors that of non-coding RNAs, such as siRNA, miRNA, and lncRNA, which were also once thought to be transcriptional artifacts. Only later did scientists realize that they play essential roles in regulating gene expression and cellular behavior.

Similarly, mounting evidence suggests that dark proteins also play critical roles in fundamental biological processes, including the manifestation and progression of diseases like cancer.

A shifting perspective, driven by evolving technology

For much of proteomics’ history, researchers relied on affinity-based methods that use antibodies or binding agents to capture and quantify proteins of interest. These tools were, and still are, invaluable for detecting known proteins with relatively high sensitivity and specificity. However, their major limitation is an inherent bias: they can detect only the specific protein or protein isoform they are designed to recognize.

This bias was further compounded by a dependence on reference databases. Even with unbiased approaches such as mass spectrometry, datasets were typically analyzed against curated protein databases built from canonical gene annotations. Protein entries are included based on evidence that the protein’s peptide sequence matches the sequence predicted from a specific gene. Data that did not align with those entries would often be filtered out as artifacts, creating a blind spot in our understanding of the broader proteome.
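The database blind spot described above can be sketched in a few lines of Python. This is a deliberate simplification: real search engines score measured spectra against theoretical ones, whereas this sketch uses plain substring matching. All sequences, identifiers, and the function name match_peptides are invented for illustration.

```python
def match_peptides(observed_peptides, database):
    """Split observed peptides into those found in any database entry and the rest."""
    matched, unmatched = [], []
    for pep in observed_peptides:
        if any(pep in seq for seq in database.values()):
            matched.append(pep)
        else:
            unmatched.append(pep)  # typically filtered out as an artifact
    return matched, unmatched

# Hypothetical reference entries: one canonical protein, one ncORF product.
canonical_db = {"GENE_A": "MKTAYIAKQRQISFVK"}
ncorf_db = {"ncORF_1": "MLLPGWSR"}
observed = ["AYIAK", "LPGWSR"]

# Canonical-only search: the ncORF-derived peptide has nothing to match against.
print(match_peptides(observed, canonical_db))
# Extended search: the same peptide is now identified.
print(match_peptides(observed, {**canonical_db, **ncorf_db}))
```

The data are identical in both searches; only the reference changes, which is the sense in which the database itself determines what can be seen.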

As mass spectrometry technology advanced and gained broader adoption within the scientific community, researchers made a surprising observation: only about 80-90% of the proteins predicted by the Human Genome Project could be detected with experimental evidence. This discrepancy raised the question of what insights might be hiding in proteomic data that had previously been dismissed as noise, collected but never annotated. A single gene can give rise to multiple, functionally distinct protein isoforms, shaped by mechanisms such as splicing and PTMs, which 1) may not be represented in conventional protein reference databases, and 2) are not all equally detectable with standard bioanalytical methods.

Subsequent research has revealed that many of these “missing” proteins are not necessarily absent, but rather have eluded detection because existing tools and reference frameworks were not designed to identify them. Moreover, it appears that the human proteome may contain many more functional proteins than initially predicted.

With this fresh perspective, scientists have since been working to unearth the intricacies of the dark proteome. Leveraging advanced technologies like mass spectrometry and ribosome profiling (ribo-seq), researchers have already uncovered thousands of non-canonical open reading frames (ncORFs) that are translated into peptides. Research is also beginning to show their potential role in cancer and in other conditions such as metabolic disease, neurodegeneration, and immune dysfunction. For example, a 2024 study published in Molecular Cell identified numerous ncORFs in medulloblastoma, a form of childhood brain cancer. These ncORFs were not only translated into peptides but also appeared to support cancer cell survival, suggesting they may serve as oncogenic drivers or therapeutic vulnerabilities. In another study, a microprotein called MP31 was shown to disrupt mitochondrial homeostasis in cancer cells and sensitize glioblastoma cells to current chemotherapy.

As researchers forge ahead in this exciting new field, mass spectrometry will undoubtedly remain a cornerstone technology for uncovering and characterizing dark proteins. It offers unmatched capabilities for unbiased detection of both canonical and non-canonical proteins, using direct peptide sequencing to deliver high-confidence identifications, including of proteoforms and PTMs. Emerging techniques are also expanding mass spectrometry’s reach. For example, nanoparticle enrichment uses engineered particles to bind and concentrate low-abundance proteins, considerably enhancing sensitivity. Complementary approaches such as improved sample preparation methods, AI-powered data analysis algorithms, and transcriptomic data integration are further setting the stage for an unprecedented era of discovery within the non-canonical human proteome.


Proteomics data generated via mass spectrometry supports the identification of novel protein sequences.

A wealth of opportunity ahead in the dark proteome

While much remains to be uncovered, it is clear that the dark proteome represents a critical, underexplored frontier in understanding and potentially treating disease.

At Sapient, we’re excited to be at the forefront of this movement. By combining our next-generation mass spectrometry technologies with the deep biocomputational expertise of our data science team, we offer dark proteome services that enable broader and deeper exploration of the human proteome to include the non-canonical proteins that may serve as novel biomarkers, mechanistic drivers, or therapeutic targets across a range of diseases.

What hidden proteins might be influencing your study system? What insights are buried in your data, waiting to be discovered? Whether you’re working on early disease detection, investigating new targets, or elucidating new disease pathways that can be therapeutically modulated, the dark proteome offers a vast, untapped landscape of opportunity.