PathLUPI: Genome-Anchored Foundation Model for Molecular Prediction from Histology Images

Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Cheong Kin Chan, Herui Yao, and Hao Chen

Abstract

Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSI) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. Through extensive evaluation across 49 molecular oncology tasks using 11,257 cases among 20 cohorts, PathLUPI demonstrated superior performance compared to conventional methods trained solely on WSIs.

Key Results:

AUC ≥ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks
C-index ≥ 0.70 in survival cohorts of 5 major cancer types
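
For readers less familiar with the survival metric quoted above, here is a minimal sketch of Harrell's concordance index (C-index) in plain Python. It illustrates the definition of the metric only and is not the evaluation code in this repository; the function name `concordance_index` and the toy values are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the case was censored
    risks  : predicted risk scores (higher = shorter expected survival)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when case i had an observed event
            # strictly before case j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # correctly ordered pair
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy, made-up values: 4 of the 5 comparable pairs are ordered correctly.
c_index = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.4, 0.6, 0.1],
)
print(f"C-index: {c_index:.2f}")  # C-index: 0.80
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.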

Key Features

Feature	Description
WSI-only Inference	Only requires histology images at inference time - no genomic data needed
Multi-task Support	Supports 27+ molecular subtyping and biomarker prediction tasks
Survival Prediction	Predicts patient prognosis across multiple cancer types
Foundation Model Backbone	Uses CONCH for robust WSI feature extraction
Pathway-guided Learning	Leverages 50 Hallmark pathways for cross-modal reconstruction

Installation

Requirements

Python 3.8+
PyTorch 1.12+
CUDA 11.8+ (for GPU acceleration)

Step-by-Step Installation

# 1. Clone repository
git clone https://github.com/ChengJin-git/PathLUPI.git
cd PathLUPI

# 2. Create conda environment
conda create -n pathlupi python=3.8
conda activate pathlupi

# 3. Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# 4. Install other dependencies
pip install -r requirements.txt

# 5. Install ASlide for WSI reading (optional; adds support for more WSI formats)
pip install git+https://github.com/MrPeterJin/ASlide.git

Verify Installation

import torch
from models.PathLUPI.network import PathLUPI

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

# Test model instantiation
model = PathLUPI(omic_sizes=[200]*50, n_classes=2, path_size=512)
print("PathLUPI model loaded successfully!")

Quick Start

Option 1: Inference on a New WSI (Recommended for New Users)

If you have a trained model and want to predict molecular subtypes from a new WSI:

python inference.py \
    --checkpoint checkpoints/BRCA_ER_best.pth.tar \
    --wsi_features /path/to/your/wsi_features.pt \
    --task BRCA_ERSub \
    --signatures labels/signatures/hallmarks_signatures.csv

Option 2: Train a New Model

python main.py \
    --model PathLUPI \
    --task BRCA_ERSub \
    --excel_file splits/biomarker/BRCA_ER_Splits.csv \
    --label labels/biomarker/BRCA_Biomarker_Labels.csv \
    --root_path /path/to/wsi/features \
    --root_omic /path/to/transcriptomic/data \
    --signatures labels/signatures/hallmarks_signatures.csv \
    --modal WSI_Gene \
    --fold 0,1,2,3,4

Inference Guide (Predict from WSI)

This section explains how to use a trained PathLUPI model to make predictions on new WSI samples.

Prerequisites

Trained Model Checkpoint (.pth.tar file)
WSI Features (.pt file extracted using foundation models like CONCH, UNI, Virchow, etc.)
Pathway Signatures (hallmarks_signatures.csv)

Step 1: Extract Features from Your WSI

We recommend using PrePATH for feature extraction. Please refer to the PrePATH repository for detailed installation and usage instructions.

Expected output: A .pt file containing features of shape (N_patches, feature_dim)

CONCH: 512-dim features
UNI: 1024-dim features
Virchow2: 2560-dim features (1280-dim CLS token concatenated with the 1280-dim mean of the patch tokens)
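
Before running inference, it can help to confirm that a feature file matches the dimension listed above for its extractor. The sketch below does this check on a shape tuple using only the standard library. The helper name `check_feature_shape` and the `EXPECTED_DIMS` table are hypothetical, and the commented-out loading step assumes the `.pt` file stores a single tensor.

```python
# Feature dimensions as listed in this README.
EXPECTED_DIMS = {"CONCH": 512, "UNI": 1024, "Virchow2": 2560}


def check_feature_shape(shape, extractor):
    """Return N_patches if `shape` is (N_patches, feature_dim) for `extractor`."""
    if extractor not in EXPECTED_DIMS:
        raise KeyError(f"unknown extractor: {extractor}")
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D tensor, got shape {shape}")
    n_patches, dim = shape
    if dim != EXPECTED_DIMS[extractor]:
        raise ValueError(
            f"{extractor} features should be {EXPECTED_DIMS[extractor]}-dim, got {dim}"
        )
    return n_patches


# With PyTorch installed, the shape would typically come from (assumption:
# the file holds one tensor):
#   features = torch.load("/path/to/features/slide_name.pt")
#   check_feature_shape(tuple(features.shape), "CONCH")
print(check_feature_shape((1200, 512), "CONCH"))  # 1200
```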

Step 2: Run Inference

python inference.py \
    --checkpoint /path/to/trained_model.pth.tar \
    --wsi_features /path/to/features/slide_name.pt \
    --task BRCA_ERSub \
    --signatures labels/signatures/hallmarks_signatures.csv \
    --output predictions.json

Step 3: Interpret Results

For Subtyping Tasks:

{
  "predicted_class": 1,
  "predicted_label": "ER+",
  "probabilities": {
    "ER-": 0.1234,
    "ER+": 0.8766
  }
}

For Survival Prediction:

{
  "risk_score": 2.345,
  "survival_probability": [0.95, 0.85, 0.72, 0.58]
}
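
As a small illustration of consuming these outputs downstream, the following sketch parses both result formats and prints a one-line summary. The key names are taken from the examples above; the helper `summarize_prediction` is hypothetical and not part of the repository.

```python
import json


def summarize_prediction(result):
    """Return a one-line summary for a subtyping or survival result dict."""
    if "predicted_label" in result:
        label = result["predicted_label"]
        prob = result["probabilities"][label]
        return f"{label} (p={prob:.4f})"
    if "risk_score" in result:
        return f"risk={result['risk_score']:.3f}"
    raise ValueError("unrecognized prediction format")


# The two example outputs shown above, as JSON strings.
subtyping = json.loads(
    '{"predicted_class": 1, "predicted_label": "ER+", '
    '"probabilities": {"ER-": 0.1234, "ER+": 0.8766}}'
)
survival = json.loads(
    '{"risk_score": 2.345, "survival_probability": [0.95, 0.85, 0.72, 0.58]}'
)

print(summarize_prediction(subtyping))  # ER+ (p=0.8766)
print(summarize_prediction(survival))   # risk=2.345
```

In practice the dict would be read from the file passed to `--output`, for example with `json.load(open("predictions.json"))`.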

Inference Parameters

Parameter	Description	Required
`--checkpoint`	Path to trained model checkpoint	✅
`--wsi_features`	Path to extracted WSI features (.pt)	✅
`--task`	Prediction task (e.g., BRCA_ERSub)	✅
`--signatures`	Path to pathway signatures CSV	✅
`--output`	Output file path (JSON)	❌
`--device`	Device (cuda/cpu)	❌
`--region_num`	Number of pathway regions (default: 50)	❌
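
To run inference on many slides, one option is to build one `inference.py` command per feature file from the flags in the table above. The sketch below only constructs the argument lists and does not execute anything. The helper name `build_inference_commands` and the example paths are hypothetical.

```python
from pathlib import Path


def build_inference_commands(feature_paths, checkpoint, task, signatures, out_dir):
    """Build one inference.py argument list per WSI feature file."""
    commands = []
    for feat in feature_paths:
        feat = Path(feat)
        # Write each prediction to <out_dir>/<slide_name>.json
        out = Path(out_dir) / f"{feat.stem}.json"
        commands.append([
            "python", "inference.py",
            "--checkpoint", str(checkpoint),
            "--wsi_features", str(feat),
            "--task", task,
            "--signatures", str(signatures),
            "--output", str(out),
        ])
    return commands


cmds = build_inference_commands(
    feature_paths=["features/slide_a.pt", "features/slide_b.pt"],
    checkpoint="checkpoints/BRCA_ER_best.pth.tar",
    task="BRCA_ERSub",
    signatures="labels/signatures/hallmarks_signatures.csv",
    out_dir="preds",
)
for cmd in cmds:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`, and the feature paths can be collected with `sorted(Path(feature_dir).glob("*.pt"))`.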

Training Guide

Training Workflow

1. Prepare Data → 2. Configure Task → 3. Train Model → 4. Evaluate

Single Task Training

python main.py \
    --model PathLUPI \
    --task BRCA_ERSub \
    --excel_file splits/biomarker/BRCA_ER_Splits.csv \
    --label labels/biomarker/BRCA_Biomarker_Labels.csv \
    --root_path /path/to/wsi/features/conch \
    --root_omic /path/to/transcriptomic/data \
    --signatures labels/signatures/hallmarks_signatures.csv \
    --modal WSI_Gene \
    --fold 0,1,2,3,4 \
    --num_epoch 30 \
    --lr 2e-4 \
    --region_num 50 \
    --loss ce_l1

Batch Training (All Tasks)

# Molecular subtyping tasks
bash scripts/subtyping_conch.sh

# Survival prediction tasks
bash scripts/survival_conch.sh

Training Parameters

Parameter	Description	Default
`--model`	Model architecture	PathLUPI
`--task`	Prediction task	-
`--fold`	Cross-validation folds	0,1,2,3,4
`--num_epoch`	Training epochs	30
`--lr`	Learning rate	2e-4
`--region_num`	Number of pathways	50
`--ratio`	Loss weight ratio (λ)	0.3
`--loss`	Loss function	ce_l1 / nll_surv
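
The table describes `--ratio` only as a loss weight (λ). As a rough illustration of how such a weight can balance a classification term against a reconstruction term, here is a sketch in plain Python. The additive form `CE + λ · L1` and the helper names `cross_entropy`, `l1_loss`, and `combined_loss` are assumptions made for illustration; the repository's actual `ce_l1` loss may be defined differently.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def l1_loss(pred, ref):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def combined_loss(logits, target, recon, omic, ratio=0.3):
    """Assumed form: classification loss plus ratio-weighted reconstruction loss."""
    return cross_entropy(logits, target) + ratio * l1_loss(recon, omic)


# Toy values: uniform logits give CE = ln(2); the L1 term is 1.0.
loss = combined_loss(logits=[0.0, 0.0], target=0, recon=[1.0, 2.0], omic=[1.0, 4.0])
print(f"{loss:.4f}")  # 0.9931
```

Under this assumed form, a larger `--ratio` puts more emphasis on reconstructing the transcriptomic signal relative to the classification objective.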

Statistical Testing

We provide standardized statistical testing methodologies for model comparison:

One-sided Wilcoxon Signed-Rank Test

Used for pairwise comparison between PathLUPI and baseline methods across multiple tasks:

from scipy.stats import wilcoxon

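# Compare performance across tasks.
# pathlupi_scores and baseline_scores are paired, equal-length sequences
# holding one metric value (e.g. AUC) per task, in the same task order.
# H0: PathLUPI performance <= baseline performance
# H1: PathLUPI performance > baseline performance
stat, p_value = wilcoxon(pathlupi_scores, baseline_scores, alternative='greater')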

Bootstrap Confidence Intervals

Used for estimating uncertainty in performance metrics:

from sklearn.utils import resample
import numpy as np

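def bootstrap_ci(y_true, y_pred, metric_fn, n_bootstrap=1000, ci=0.95):
    """Calculate a percentile bootstrap confidence interval for a metric."""
    # Convert to arrays so index-array selection also works for plain lists
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    all_indices = np.arange(len(y_true))
    scores = []
    for _ in range(n_bootstrap):
        # Resample case indices with replacement
        indices = resample(all_indices, replace=True)
        score = metric_fn(y_true[indices], y_pred[indices])
        scores.append(score)

    # Percentile interval around the bootstrap distribution
    alpha = (1 - ci) / 2
    lower = np.percentile(scores, alpha * 100)
    upper = np.percentile(scores, (1 - alpha) * 100)
    return np.mean(scores), lower, upper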

Project Structure

PathLUPI/
├── main.py                     # Main training script
├── inference.py                # Inference script (WSI-only prediction)
├── requirements.txt            # Python dependencies
│
├── models/PathLUPI/            # Model implementation
│   ├── network.py              # PathLUPI architecture
│   ├── engine.py               # Training/validation engine
│   ├── util.py                 # Model utilities (attention, losses)
│   ├── wsi_embed.py            # WSI re-embedding module
│   └── rmsa.py                 # Region-based multi-head self-attention
│
├── datasets/                   # Dataset loaders
│   ├── TCGA_Subtype.py         # Subtyping dataset
│   └── TCGA_Survival.py        # Survival dataset
│
├── labels/                     # Label files
│   ├── biomarker/              # Biomarker labels (ER, HER2, BRAF, etc.)
│   ├── molecular/              # Molecular subtype labels (PAM50, CMS, etc.)
│   └── signatures/             # Pathway signatures (Hallmark pathways)
│
├── splits/                     # Train/val splits
│   ├── biomarker/              # Biomarker task splits
│   ├── molecular/              # Molecular subtyping splits
│   └── survival/               # Survival prediction splits
│
├── scripts/                    # Batch training scripts
│   ├── subtyping_conch.sh      # All subtyping tasks
│   └── survival_conch.sh       # All survival tasks
│
└── utils/                      # Utilities
    ├── options.py              # Argument parser
    ├── loss.py                 # Loss functions
    ├── optimizer.py            # Optimizer definitions
    └── scheduler.py            # Learning rate schedulers

Supported Tasks

Biomarker Prediction Tasks

Task	Cancer Type	Classes	Description
`BLCA_FGFR3Sub`	Bladder	2	FGFR3 mutation status
`BRCA_ERSub`	Breast	2	Estrogen Receptor status
`BRCA_PRSub`	Breast	2	Progesterone Receptor status
`BRCA_HER2Sub`	Breast	2	HER2 amplification status
`BRCA_TNBCSub`	Breast	2	Triple-negative classification
`BRCA_PIK3CASub`	Breast	2	PIK3CA mutation status
`BRCA_TP53Sub`	Breast	2	TP53 mutation status
`GBMLGG_IDH1Sub`	Brain	2	IDH1 mutation status
`CRC_BRAFSub`	Colorectal	2	BRAF mutation status
`CRC_KRASSub`	Colorectal	2	KRAS mutation status
`CRC_TP53Sub`	Colorectal	2	TP53 mutation status
`LUAD_EGFRSub`	Lung	2	EGFR mutation status
`LUAD_KRASSub`	Lung	2	KRAS mutation status
`LUAD_TP53Sub`	Lung	2	TP53 mutation status
`LIHC_TP53Sub`	Liver	2	TP53 mutation status
`SKCM_BRAFSub`	Melanoma	2	BRAF mutation status
`NSCLC_TMBSub`	Lung	2	Tumor Mutation Burden

Molecular Subtyping Tasks

Task	Cancer Type	Classes	Subtypes
`BRCA_MolSub`	Breast	4	LumA, LumB, Basal, Her2
`CRC_CMSSub`	Colorectal	4	CMS1, CMS2, CMS3, CMS4
`BLCA_MolSub`	Bladder	4	Luminal, Lum_infiltrated, Lum_papillary, Basal_squamous
`GBMLGG_MolSub`	Brain	5	G-CIMP-high, Codel, Mesenchymal, Classic, Other
`HNSC_MolSub`	Head/Neck	4	Classical, Basal, Mesenchymal, Atypical
`KIRC_MolSub`	Kidney	4	KIRC.1, KIRC.2, KIRC.3, KIRC.4
`PanGI_GISub`	Pan-GI	5	MSI, CIN, EBV, HM-SNV, GS
`UCEC_MolSub`	Endometrial	4	CN_HIGH, CN_LOW, MSI, POLE

Survival Prediction

Cancer Type	Cohort
Breast Cancer	BRCA
Colorectal Cancer	CRC
Glioblastoma	GBM
Low-Grade Glioma	LGG
Lung Adenocarcinoma	LUAD
Lung Squamous	LUSC
Bladder	BLCA
Head/Neck	HNSC
Liver	LIHC
Kidney	KIRP
Melanoma	SKCM
Stomach	STAD
Endometrial	UCEC

Data Preparation

1. Download TCGA Data

Install GDC Data Transfer Tool:

# Download from GDC
wget https://gdc.cancer.gov/system/files/public/file/gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
unzip gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
chmod +x gdc-client

# Download WSI slides
./gdc-client download -m gdc_manifest_slides.txt -d ./tcga_slides/

Download RNA-seq data from cBioPortal:

# Download from: https://www.cbioportal.org/datasets
wget https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz
tar -xzf brca_tcga_pub2015.tar.gz

2. Extract WSI Features

We recommend using PrePATH for feature extraction. Please refer to the PrePATH repository for detailed installation and usage instructions.

Expected feature format:

File type: .pt (PyTorch tensor)
Shape: (N_patches, feature_dim) where feature_dim depends on the model (CONCH: 512, UNI: 1024, Virchow2: 2560)

3. Prepare Gene Expression Data

Expected format (CSV):

Gene,Value
TP53,5.234
EGFR,3.123
KRAS,7.456
...
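
A file in this format can be read with the standard library alone. The sketch below loads the `Gene,Value` rows into a dictionary; the helper name `read_expression` is hypothetical, and the sample rows are the ones shown above.

```python
import csv
import io


def read_expression(handle):
    """Read a Gene,Value CSV into a {gene_symbol: expression_value} dict."""
    reader = csv.DictReader(handle)
    return {row["Gene"]: float(row["Value"]) for row in reader}


# In practice, pass an open file: read_expression(open(path, newline=""))
sample = io.StringIO("Gene,Value\nTP53,5.234\nEGFR,3.123\nKRAS,7.456\n")
expr = read_expression(sample)
print(expr["TP53"])  # 5.234
```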

4. Prepare Pathway Signatures

Download Hallmark gene sets from MSigDB:

Visit: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp
Download: h.all.v2025.1.Hs.symbols.gmt
Convert to CSV format with pathway names as columns
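
The conversion step can be sketched as follows. A GMT file is tab-separated, with one gene set per line: the set name, a description field, then the member genes. The exact layout expected for `hallmarks_signatures.csv` beyond "pathway names as columns" is an assumption here (genes listed down each column, shorter columns padded with empty cells), and the helper names and toy pathway names are hypothetical.

```python
import csv
import io
from itertools import zip_longest


def gmt_to_columns(gmt_lines):
    """Parse GMT lines into {pathway_name: [gene, ...]}."""
    pathways = {}
    for line in gmt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        pathways[name] = [g for g in genes if g]
    return pathways


def write_signature_csv(pathways, handle):
    """Write pathways as CSV columns, padding shorter gene lists with ''."""
    writer = csv.writer(handle)
    names = list(pathways)
    writer.writerow(names)
    for row in zip_longest(*(pathways[n] for n in names), fillvalue=""):
        writer.writerow(row)


# Toy input with placeholder pathway names (not real Hallmark gene sets).
toy_gmt = [
    "PATHWAY_A\tdescription\tTP53\tEGFR\n",
    "PATHWAY_B\tdescription\tKRAS\n",
]
buf = io.StringIO()
write_signature_csv(gmt_to_columns(toy_gmt), buf)
print(buf.getvalue())
```

For the real file, iterate over `open("h.all.v2025.1.Hs.symbols.gmt")` and write to a file opened with `newline=""`.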

FAQ

Q: Do I need transcriptomic data for inference?

A: No! PathLUPI only requires WSI features at inference time. Transcriptomic data is only needed during training to learn genome-anchored representations.

Q: What WSI feature extractors are supported?

A: We primarily use CONCH (512-dim features). Other foundation models such as UNI, Virchow, or ResNet50 can also be used by setting the path_size parameter to the extractor's feature dimension.

Q: How do I use my own dataset?

A: Follow these steps:

1. Extract WSI features using PrePATH
2. Prepare a splits CSV with columns: ID, WSI, Gene, Fold 0, Fold 1, etc.
3. Prepare a labels CSV with your target labels
4. Run training with your custom files

Q: What GPU memory is required?

A: Approximately 10-20 GB VRAM for training, 4-6 GB for inference, depending on the embedding dimension and patch count of your WSI.

Citation

If you find this work useful, please cite:

@article{jin2024pathlupi,
  title={Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images},
  author={Jin, Cheng and Zhou, Fengtao and Yu, Yunfang and Ma, Jiabo and Wang, Yihui and Xu, Yingxue and Zhou, Huajun and Jiang, Hao and Luo, Luyang and Mao, Luhui and He, Zifan and Zhang, Xiuming and Zhang, Jing and Chan, Ronald Cheong Kin and Yao, Herui and Chen, Hao},
  year={2024}
}

Ethical Considerations

This study adhered to the Declaration of Helsinki and received ethical approval from the Human and Artifact Research Ethics Committee of The Hong Kong University of Science and Technology (HREP-2024-0423). All data used were anonymized and obtained through appropriate data use agreements.

License

This project is licensed under the CC-BY-NC-ND 4.0 License - see the LICENSE file for details.

Acknowledgement

This work was supported by:

National Natural Science Foundation of China (No. 62202403)
Innovation and Technology Commission (Project No. MHP/002/22 and ITCPD/17-9)
Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: T45-401/22-N)
National Key R&D Program of China (Project No. 2023YFE0204000)

Contact

For questions or issues, please open a GitHub issue or contact Cheng Jin at cheng.jin@connect.ust.hk
