Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Cheong Kin Chan, Herui Yao, and Hao Chen
- Abstract
- Key Features
- Installation
- Quick Start
- Inference Guide
- Training Guide
- Statistical Testing
- Data Preparation
- Supported Tasks
- Project Structure
- Citation
Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSIs) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction from WSIs alone at inference. In an extensive evaluation across 49 molecular oncology tasks spanning 11,257 cases in 20 cohorts, PathLUPI demonstrated superior performance compared to conventional methods trained solely on WSIs.
Key Results:
- AUC ≥ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks
- C-index ≥ 0.70 in survival cohorts of 5 major cancer types
| Feature | Description |
|---|---|
| WSI-only Inference | Only requires histology images at inference time - no genomic data needed |
| Multi-task Support | Supports 27+ molecular subtyping and biomarker prediction tasks |
| Survival Prediction | Predicts patient prognosis across multiple cancer types |
| Foundation Model Backbone | Uses CONCH for robust WSI feature extraction |
| Pathway-guided Learning | Leverages 50 Hallmark pathways for cross-modal reconstruction |
- Python 3.8+
- PyTorch 1.12+
- CUDA 11.8+ (for GPU acceleration)
# 1. Clone repository
git clone https://github.com/ChengJin-git/PathLUPI.git
cd PathLUPI
# 2. Create conda environment
conda create -n pathlupi python=3.8
conda activate pathlupi
# 3. Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# 4. Install other dependencies
pip install -r requirements.txt
# 5. Install ASlide for WSI reading (optional, for more WSI format support)
pip install git+https://github.com/MrPeterJin/ASlide.git

Verify the installation:

import torch
from models.PathLUPI.network import PathLUPI
# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
# Test model instantiation
model = PathLUPI(omic_sizes=[200]*50, n_classes=2, path_size=512)
print("PathLUPI model loaded successfully!")If you have a trained model and want to predict molecular subtypes from a new WSI:
python inference.py \
--checkpoint checkpoints/BRCA_ER_best.pth.tar \
--wsi_features /path/to/your/wsi_features.pt \
--task BRCA_ERSub \
--signatures labels/signatures/hallmarks_signatures.csv

To train PathLUPI on a molecular subtyping task:

python main.py \
--model PathLUPI \
--task BRCA_ERSub \
--excel_file splits/biomarker/BRCA_ER_Splits.csv \
--label labels/biomarker/BRCA_Biomarker_Labels.csv \
--root_path /path/to/wsi/features \
--root_omic /path/to/transcriptomic/data \
--signatures labels/signatures/hallmarks_signatures.csv \
--modal WSI_Gene \
--fold 0,1,2,3,4

This section explains how to use a trained PathLUPI model to make predictions on new WSI samples.
- Trained Model Checkpoint (`.pth.tar` file)
- WSI Features (`.pt` file extracted using foundation models like CONCH, UNI, Virchow, etc.)
- Pathway Signatures (`hallmarks_signatures.csv`)
We recommend using PrePATH for feature extraction. Please refer to the PrePATH repository for detailed installation and usage instructions.
Expected output: A .pt file containing features of shape (N_patches, feature_dim)
- CONCH: 512-dim features
- UNI: 1024-dim features
- Virchow2: 2560-dim features (1280 × 2 with CLS token)
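As a quick sanity check, here is a minimal sketch for loading an extracted feature file and confirming its shape (the file path below is a placeholder):

```python
import torch

# Load a feature file produced by the extraction step (placeholder path)
features = torch.load("features/TCGA-XX-XXXX.pt", map_location="cpu")

# Expect a 2-D tensor of shape (N_patches, feature_dim),
# e.g. 512 for CONCH, 1024 for UNI, 2560 for Virchow2
print(features.shape)
assert features.ndim == 2, "expected (N_patches, feature_dim)"
```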
python inference.py \
--checkpoint /path/to/trained_model.pth.tar \
--wsi_features /path/to/features/slide_name.pt \
--task BRCA_ERSub \
--signatures labels/signatures/hallmarks_signatures.csv \
--output predictions.json

For Subtyping Tasks:
{
"predicted_class": 1,
"predicted_label": "ER+",
"probabilities": {
"ER-": 0.1234,
"ER+": 0.8766
}
}

For Survival Prediction:
{
"risk_score": 2.345,
"survival_probability": [0.95, 0.85, 0.72, 0.58]
}

| Parameter | Description | Required |
|---|---|---|
| `--checkpoint` | Path to trained model checkpoint | ✅ |
| `--wsi_features` | Path to extracted WSI features (`.pt`) | ✅ |
| `--task` | Prediction task (e.g., `BRCA_ERSub`) | ✅ |
| `--signatures` | Path to pathway signatures CSV | ✅ |
| `--output` | Output file path (JSON) | ❌ |
| `--device` | Device (`cuda`/`cpu`) | ❌ |
| `--region_num` | Number of pathway regions (default: 50) | ❌ |
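The JSON output written via `--output` can be consumed directly by downstream code; here is a minimal sketch, assuming the subtyping format shown above:

```python
import json

# Load the prediction written by inference.py via --output
with open("predictions.json") as f:
    pred = json.load(f)

# Subtyping output: predicted label plus per-class probabilities
print(f"Predicted: {pred['predicted_label']}")
for label, prob in pred["probabilities"].items():
    print(f"  {label}: {prob:.4f}")
```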
1. Prepare Data → 2. Configure Task → 3. Train Model → 4. Evaluate
python main.py \
--model PathLUPI \
--task BRCA_ERSub \
--excel_file splits/biomarker/BRCA_ER_Splits.csv \
--label labels/biomarker/BRCA_Biomarker_Labels.csv \
--root_path /path/to/wsi/features/conch \
--root_omic /path/to/transcriptomic/data \
--signatures labels/signatures/hallmarks_signatures.csv \
--modal WSI_Gene \
--fold 0,1,2,3,4 \
--num_epoch 30 \
--lr 2e-4 \
--region_num 50 \
--loss ce_l1

To run all tasks with the provided batch scripts:

# Molecular subtyping tasks
bash scripts/subtyping_conch.sh
# Survival prediction tasks
bash scripts/survival_conch.sh

| Parameter | Description | Default |
|---|---|---|
| `--model` | Model architecture | PathLUPI |
| `--task` | Prediction task | - |
| `--fold` | Cross-validation folds | 0,1,2,3,4 |
| `--num_epoch` | Training epochs | 30 |
| `--lr` | Learning rate | 2e-4 |
| `--region_num` | Number of pathways | 50 |
| `--ratio` | Loss weight ratio (λ) | 0.3 |
| `--loss` | Loss function | ce_l1 / nll_surv |
We provide standardized statistical testing methodologies for model comparison:
The Wilcoxon signed-rank test is used for pairwise comparison between PathLUPI and baseline methods across multiple tasks:
from scipy.stats import wilcoxon
# Compare performance across tasks
# H0: PathLUPI performance <= baseline performance
# H1: PathLUPI performance > baseline performance
stat, p_value = wilcoxon(pathlupi_scores, baseline_scores, alternative='greater')

Bootstrap resampling is used for estimating uncertainty in performance metrics:
from sklearn.utils import resample
import numpy as np
def bootstrap_ci(y_true, y_pred, metric_fn, n_bootstrap=1000, ci=0.95):
    """Calculate bootstrap confidence interval for a metric."""
    scores = []
    for _ in range(n_bootstrap):
        indices = resample(range(len(y_true)), replace=True)
        score = metric_fn(y_true[indices], y_pred[indices])
        scores.append(score)
    alpha = (1 - ci) / 2
    lower = np.percentile(scores, alpha * 100)
    upper = np.percentile(scores, (1 - alpha) * 100)
    return np.mean(scores), lower, upper
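For example, a minimal usage sketch with synthetic placeholder labels and scores, using AUC as the metric (the data here stand in for a real held-out test set):

```python
from sklearn.metrics import roc_auc_score
import numpy as np

# Synthetic placeholder data standing in for test labels and model scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=200), 0, 1)

mean_auc, lower, upper = bootstrap_ci(y_true, y_prob, roc_auc_score)
print(f"AUC = {mean_auc:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```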
PathLUPI/
├── main.py                  # Main training script
├── inference.py # Inference script (WSI-only prediction)
├── requirements.txt # Python dependencies
│
├── models/PathLUPI/ # Model implementation
│ ├── network.py # PathLUPI architecture
│ ├── engine.py # Training/validation engine
│ ├── util.py # Model utilities (attention, losses)
│ ├── wsi_embed.py # WSI re-embedding module
│ └── rmsa.py # Region-based multi-head self-attention
│
├── datasets/ # Dataset loaders
│ ├── TCGA_Subtype.py # Subtyping dataset
│ └── TCGA_Survival.py # Survival dataset
│
├── labels/ # Label files
│ ├── biomarker/ # Biomarker labels (ER, HER2, BRAF, etc.)
│ ├── molecular/ # Molecular subtype labels (PAM50, CMS, etc.)
│ └── signatures/ # Pathway signatures (Hallmark pathways)
│
├── splits/ # Train/val splits
│ ├── biomarker/ # Biomarker task splits
│ ├── molecular/ # Molecular subtyping splits
│ └── survival/ # Survival prediction splits
│
├── scripts/ # Batch training scripts
│ ├── subtyping_conch.sh # All subtyping tasks
│ └── survival_conch.sh # All survival tasks
│
└── utils/ # Utilities
├── options.py # Argument parser
├── loss.py # Loss functions
├── optimizer.py # Optimizer definitions
└── scheduler.py # Learning rate schedulers
Biomarker prediction tasks:

| Task | Cancer Type | Classes | Description |
|---|---|---|---|
| `BLCA_FGFR3Sub` | Bladder | 2 | FGFR3 mutation status |
| `BRCA_ERSub` | Breast | 2 | Estrogen Receptor status |
| `BRCA_PRSub` | Breast | 2 | Progesterone Receptor status |
| `BRCA_HER2Sub` | Breast | 2 | HER2 amplification status |
| `BRCA_TNBCSub` | Breast | 2 | Triple-negative classification |
| `BRCA_PIK3CASub` | Breast | 2 | PIK3CA mutation status |
| `BRCA_TP53Sub` | Breast | 2 | TP53 mutation status |
| `GBMLGG_IDH1Sub` | Brain | 2 | IDH1 mutation status |
| `CRC_BRAFSub` | Colorectal | 2 | BRAF mutation status |
| `CRC_KRASSub` | Colorectal | 2 | KRAS mutation status |
| `CRC_TP53Sub` | Colorectal | 2 | TP53 mutation status |
| `LUAD_EGFRSub` | Lung | 2 | EGFR mutation status |
| `LUAD_KRASSub` | Lung | 2 | KRAS mutation status |
| `LUAD_TP53Sub` | Lung | 2 | TP53 mutation status |
| `LIHC_TP53Sub` | Liver | 2 | TP53 mutation status |
| `SKCM_BRAFSub` | Melanoma | 2 | BRAF mutation status |
| `NSCLC_TMBSub` | Lung | 2 | Tumor Mutation Burden |
Molecular subtyping tasks:

| Task | Cancer Type | Classes | Subtypes |
|---|---|---|---|
| `BRCA_MolSub` | Breast | 4 | LumA, LumB, Basal, Her2 |
| `CRC_CMSSub` | Colorectal | 4 | CMS1, CMS2, CMS3, CMS4 |
| `BLCA_MolSub` | Bladder | 4 | Luminal, Lum_infiltrated, Lum_papillary, Basal_squamous |
| `GBMLGG_MolSub` | Brain | 5 | G-CIMP-high, Codel, Mesenchymal, Classic, Other |
| `HNSC_MolSub` | Head/Neck | 4 | Classical, Basal, Mesenchymal, Atypical |
| `KIRC_MolSub` | Kidney | 4 | KIRC.1, KIRC.2, KIRC.3, KIRC.4 |
| `PanGI_GISub` | Pan-GI | 5 | MSI, CIN, EBV, HM-SNV, GS |
| `UCEC_MolSub` | Endometrial | 4 | CN_HIGH, CN_LOW, MSI, POLE |
Survival prediction cohorts:

| Cancer Type | Cohort |
|---|---|
| Breast Cancer | BRCA |
| Colorectal Cancer | CRC |
| Glioblastoma | GBM |
| Low-Grade Glioma | LGG |
| Lung Adenocarcinoma | LUAD |
| Lung Squamous | LUSC |
| Bladder | BLCA |
| Head/Neck | HNSC |
| Liver | LIHC |
| Kidney | KIRP |
| Melanoma | SKCM |
| Stomach | STAD |
| Endometrial | UCEC |
Install GDC Data Transfer Tool:
# Download from GDC
wget https://gdc.cancer.gov/system/files/public/file/gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
unzip gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
chmod +x gdc-client
# Download WSI slides
./gdc-client download -m gdc_manifest_slides.txt -d ./tcga_slides/

Download RNA-seq data from cBioPortal:
# Download from: https://www.cbioportal.org/datasets
wget https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz
tar -xzf brca_tcga_pub2015.tar.gz

We recommend using PrePATH for feature extraction. Please refer to the PrePATH repository for detailed installation and usage instructions.
Expected feature format:
- File type: `.pt` (PyTorch tensor)
- Shape: `(N_patches, feature_dim)`, where `feature_dim` depends on the model (CONCH: 512, UNI: 1024, Virchow2: 2560)
Expected format (CSV):
Gene,Value
TP53,5.234
EGFR,3.123
KRAS,7.456
...
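As a sketch of how one might produce per-case files in this format from a cBioPortal download (the expression-matrix filename, its `Hugo_Symbol`/`Entrez_Gene_Id` columns, and the one-CSV-per-case layout are assumptions; adjust them to your study and to how you organize `--root_omic`):

```python
import pandas as pd
from pathlib import Path

# Placeholder filename: point this at the expression matrix in your cBioPortal study
expr = pd.read_csv("brca_tcga_pub2015/data_mrna_seq_v2_rsem.txt", sep="\t")
expr = expr.drop(columns=["Entrez_Gene_Id"], errors="ignore").set_index("Hugo_Symbol")

out_dir = Path("transcriptomics")
out_dir.mkdir(exist_ok=True)

# Write one Gene,Value CSV per sample column, matching the format shown above
for sample in expr.columns:
    (expr[sample]
        .rename("Value")
        .rename_axis("Gene")
        .reset_index()
        .to_csv(out_dir / f"{sample}.csv", index=False))
```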
Download Hallmark gene sets from MSigDB:
- Visit: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp
- Download: `h.all.v2025.1.Hs.symbols.gmt`
- Convert to CSV format with pathway names as columns (see the sketch below)
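A minimal conversion sketch, assuming the standard `.gmt` layout (set name, description, then tab-separated gene symbols per line); the exact CSV layout expected by `labels/signatures/hallmarks_signatures.csv` may differ, so compare against the file shipped with the repository:

```python
import csv
import pandas as pd

# Parse the MSigDB .gmt file: one pathway per line
# (set name, description/URL, then tab-separated gene symbols)
pathways = {}
with open("h.all.v2025.1.Hs.symbols.gmt") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = row
        pathways[name] = genes

# One column per Hallmark pathway, padded with NaN to equal length
signatures = pd.DataFrame({name: pd.Series(genes) for name, genes in pathways.items()})
signatures.to_csv("hallmarks_signatures.csv", index=False)
```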
Q: Do I need transcriptomic data at inference time?

A: No! PathLUPI only requires WSI features at inference time. Transcriptomic data is only needed during training to learn genome-anchored representations.
Q: Which foundation model should I use for WSI feature extraction?

A: We primarily use CONCH (512-dim features). Other foundation models such as UNI, Virchow, or ResNet50 can also be used by adjusting the `path_size` parameter accordingly.
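For example, switching the quick-start instantiation to 1024-dim UNI features only requires matching `path_size` to the backbone dimension (the other arguments are copied from the Quick Start example above):

```python
from models.PathLUPI.network import PathLUPI

# Same constructor as in Quick Start, with path_size matched to UNI's 1024-dim features
model = PathLUPI(omic_sizes=[200]*50, n_classes=2, path_size=1024)
```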
Q: How do I train PathLUPI on my own dataset?

A:
- Extract WSI features using PrePATH
- Prepare a splits CSV with columns: `ID`, `WSI`, `Gene`, `Fold 0`, `Fold 1`, etc. (see the sketch below)
- Prepare a labels CSV with your target labels
- Run training with your custom files
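The safest way to get the split format right is to mirror one of the split files shipped in `splits/`; a minimal inspection sketch:

```python
import pandas as pd

# Inspect a provided split file and copy its column layout for your own dataset
splits = pd.read_csv("splits/biomarker/BRCA_ER_Splits.csv")
print(splits.columns.tolist())   # expect ID, WSI, Gene, Fold 0, Fold 1, ...
print(splits.head())
```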
Q: How much GPU memory do I need?

A: Approximately 10-20 GB of VRAM for training and 4-6 GB for inference, depending on the embedding dimension and the patch count of your WSI.
If you find this work useful, please cite:
@article{jin2024pathlupi,
title={Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images},
author={Jin, Cheng and Zhou, Fengtao and Yu, Yunfang and Ma, Jiabo and Wang, Yihui and Xu, Yingxue and Zhou, Huajun and Jiang, Hao and Luo, Luyang and Mao, Luhui and He, Zifan and Zhang, Xiuming and Zhang, Jing and Chan, Ronald Cheong Kin and Yao, Herui and Chen, Hao},
year={2024}
}

This study adhered to the Declaration of Helsinki and received ethical approval from the Human and Artifact Research Ethics Committee of The Hong Kong University of Science and Technology (HREP-2024-0423). All data used were anonymized and obtained through appropriate data use agreements.
This project is licensed under the CC-BY-NC-ND 4.0 License - see the LICENSE file for details.
This work was supported by:
- National Natural Science Foundation of China (No. 62202403)
- Innovation and Technology Commission (Project No. MHP/002/22 and ITCPD/17-9)
- Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: T45-401/22-N)
- National Key R&D Program of China (Project No. 2023YFE0204000)
For questions or issues, please open a GitHub issue or contact Cheng Jin at cheng.jin@connect.ust.hk