Gene Prioritization with Word Embeddings and Protein Networks

Prioritizing candidate genes in ALS by integrating genetics with word embeddings.

Tags: Word Embeddings, Bioinformatics, NLP, Protein Network Analysis

Figure: ALS gene prioritization overview

📅 Research Period: Jul 2025 – Jul 2026

This research is supported by a FAPESP undergraduate fellowship (grant link).

Overview

Amyotrophic lateral sclerosis (ALS) is a fatal and highly heterogeneous neurodegenerative disease with multiple genetic and molecular contributors. Genome-wide association studies (GWAS) and sequencing studies have identified important ALS-associated loci and genes, but a locus typically spans several candidate genes, so translating these signals into clear causal-gene hypotheses remains a major challenge.

This project investigates whether biomedical literature embeddings can provide functional context for ALS gene prioritization. Instead of treating text-derived representations as standalone predictors, the project combines them with tissue-expression features from the Human Protein Atlas (HPA) and compares the resulting functional signal with GWAS-derived evidence.

The main finding is that literature-derived embeddings did not consistently improve locus-level gene ranking, but the Word2Vec embedding space showed biologically interpretable structure. In protein network analyses, the Word2Vec/HPA model converged with GWAS-derived evidence at the level of protein interaction modules, suggesting that text-derived representations may be most useful for systems-level biological interpretation.

Method at a Glance

Figure: ALS gene prioritization and network analysis pipeline

At a high level, the workflow consists of five stages:

1. Literature collection: We construct disease-focused PubMed corpora for neurodegenerative and motor neuron disease contexts using ontology-guided search terms.
2. Gene representation: Each gene is represented using literature-derived embeddings, with Word2Vec used as the main interpretable embedding model and PubMedBERT explored as a complementary contextual representation.
3. Feature integration: Gene embeddings are reduced with PCA and combined with HPA brain and muscle expression features to build a functional gene-prioritization model.
4. Locus-level evaluation: Logistic-regression models are evaluated on ALS-associated GWAS loci to test whether text-derived features improve candidate-gene ranking.
5. Network interpretation: Functional scores from the Word2Vec/HPA model are compared with GWAS-derived evidence through protein-network propagation, module enrichment, and Gene Ontology analysis.
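
To make the locus-level evaluation stage more concrete, the following is a minimal sketch of how per-gene model scores could be turned into within-locus ranks and summarized. It is an illustration under stated assumptions, not the project's actual evaluation code: the gene and locus names, the scores, the reference genes, and the choice of top-1 accuracy and mean reciprocal rank (MRR) as metrics are all hypothetical.

```python
from collections import defaultdict


def rank_within_loci(scores, locus_of):
    """Rank genes inside each locus by descending score (1 = best)."""
    genes_by_locus = defaultdict(list)
    for gene, locus in locus_of.items():
        genes_by_locus[locus].append(gene)

    ranks = {}
    for genes in genes_by_locus.values():
        ordered = sorted(genes, key=lambda g: scores[g], reverse=True)
        for position, gene in enumerate(ordered, start=1):
            ranks[gene] = position
    return ranks


def locus_metrics(ranks, reference_genes):
    """Top-1 accuracy and mean reciprocal rank over the reference genes."""
    n = len(reference_genes)
    top1 = sum(ranks[g] == 1 for g in reference_genes) / n
    mrr = sum(1.0 / ranks[g] for g in reference_genes) / n
    return top1, mrr


# Hypothetical toy input: two loci with placeholder gene names.
locus_of = {
    "GENE_A": "locus_1", "GENE_B": "locus_1", "GENE_C": "locus_1",
    "GENE_D": "locus_2", "GENE_E": "locus_2",
}
# Hypothetical model scores (e.g. predicted probabilities), not real results.
scores = {"GENE_A": 0.82, "GENE_B": 0.35, "GENE_C": 0.10,
          "GENE_D": 0.40, "GENE_E": 0.55}
# Assumed "known" gene per locus, used only to illustrate the metrics.
reference_genes = ["GENE_A", "GENE_D"]

ranks = rank_within_loci(scores, locus_of)
top1, mrr = locus_metrics(ranks, reference_genes)
print(f"top-1 accuracy = {top1:.2f}, MRR = {mrr:.2f}")
# prints: top-1 accuracy = 0.50, MRR = 0.75
```

Running the same evaluation once with and once without the text-derived features would give a direct comparison of their effect on candidate-gene ranking.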

Figure: ALS network analysis output
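
The network-interpretation stage relies on propagating gene-level scores over a protein interaction network. The propagation algorithm is not specified here, so the sketch below assumes one common choice, random walk with restart, on a tiny hypothetical graph. The node names, the seed scores, and the `restart` value are placeholders for illustration only.

```python
def random_walk_with_restart(adjacency, seed_scores, restart=0.5,
                             tol=1e-9, max_iter=1000):
    """Propagate seed scores over an undirected graph.

    adjacency:   dict mapping node -> set of neighbours (symmetric,
                 every node has at least one neighbour)
    seed_scores: dict mapping node -> non-negative starting score
    Returns a dict of propagated scores that sum to 1.
    """
    nodes = sorted(adjacency)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed score must be positive")

    # Normalize the seeds into a restart distribution.
    p0 = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    p = dict(p0)

    for _ in range(max_iter):
        nxt = {}
        for n in nodes:
            # Each neighbour m spreads its mass evenly over its own edges.
            incoming = sum(p[m] / len(adjacency[m]) for m in adjacency[n])
            nxt[n] = (1.0 - restart) * incoming + restart * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: a chain of four proteins P1 - P2 - P3 - P4.
adjacency = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2", "P4"},
    "P4": {"P3"},
}
# Assumed seed: all starting evidence sits on P1.
propagated = random_walk_with_restart(adjacency, {"P1": 1.0})

for node in sorted(propagated, key=propagated.get, reverse=True):
    print(node, round(propagated[node], 4))
```

In this toy chain the propagated score decays with distance from the seed. Propagating functional scores and GWAS-derived scores separately over the same network would yield two score vectors that can then be compared per module.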