PathLUPI: Genome-Anchored Foundation Model for Molecular Prediction from Histology Images

License: CC BY-NC-ND 4.0

Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Cheong Kin Chan, Herui Yao, and Hao Chen



Abstract

Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSIs) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. In an extensive evaluation across 49 molecular oncology tasks on 11,257 cases from 20 cohorts, PathLUPI outperformed conventional methods trained solely on WSIs.

Key Results:

  • AUC ≥ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks
  • C-index ≥ 0.70 in survival cohorts of 5 major cancer types

Key Features

Feature                      Description
WSI-only Inference           Only requires histology images at inference time; no genomic data needed
Multi-task Support           Supports 27+ molecular subtyping and biomarker prediction tasks
Survival Prediction          Predicts patient prognosis across multiple cancer types
Foundation Model Backbone    Uses CONCH for robust WSI feature extraction
Pathway-guided Learning      Leverages 50 Hallmark pathways for cross-modal reconstruction

Installation

Requirements

  • Python 3.8+
  • PyTorch 1.12+
  • CUDA 11.8+ (for GPU acceleration)

Step-by-Step Installation

# 1. Clone repository
git clone https://github.com/ChengJin-git/PathLUPI.git
cd PathLUPI

# 2. Create conda environment
conda create -n pathlupi python=3.8
conda activate pathlupi

# 3. Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# 4. Install other dependencies
pip install -r requirements.txt

# 5. Install ASlide for WSI reading (optional, for more WSI format support)
pip install git+https://github.com/MrPeterJin/ASlide.git

Verify Installation

import torch
from models.PathLUPI.network import PathLUPI

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

# Test model instantiation
model = PathLUPI(omic_sizes=[200]*50, n_classes=2, path_size=512)
print("PathLUPI model loaded successfully!")

Quick Start

Option 1: Inference on a New WSI (Recommended for New Users)

If you have a trained model and want to predict molecular subtypes from a new WSI:

python inference.py \
    --checkpoint checkpoints/BRCA_ER_best.pth.tar \
    --wsi_features /path/to/your/wsi_features.pt \
    --task BRCA_ERSub \
    --signatures labels/signatures/hallmarks_signatures.csv

Option 2: Train a New Model

python main.py \
    --model PathLUPI \
    --task BRCA_ERSub \
    --excel_file splits/biomarker/BRCA_ER_Splits.csv \
    --label labels/biomarker/BRCA_Biomarker_Labels.csv \
    --root_path /path/to/wsi/features \
    --root_omic /path/to/transcriptomic/data \
    --signatures labels/signatures/hallmarks_signatures.csv \
    --modal WSI_Gene \
    --fold 0,1,2,3,4

Inference Guide (Predict from WSI)

This section explains how to use a trained PathLUPI model to make predictions on new WSI samples.

Prerequisites

  1. Trained Model Checkpoint (.pth.tar file)
  2. WSI Features (.pt file extracted using foundation models like CONCH, UNI, Virchow, etc.)
  3. Pathway Signatures (hallmarks_signatures.csv)

Step 1: Extract Features from Your WSI

We recommend using PrePATH for feature extraction. Please refer to the PrePATH repository for detailed installation and usage instructions.

Expected output: A .pt file containing features of shape (N_patches, feature_dim)

  • CONCH: 512-dim features
  • UNI: 1024-dim features
  • Virchow2: 2560-dim features (1280-dim CLS token concatenated with the 1280-dim mean patch token)
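
A mismatch between the encoder and the feature dimension the model expects is easier to catch before inference than after. The helper below is a minimal sketch and is not part of the PathLUPI codebase: `check_feature_shape` and `EXPECTED_DIMS` are hypothetical names, and the dimensions are simply the ones listed above. It also assumes the .pt file stores a single 2-D tensor.

```python
# Hypothetical helper (not part of PathLUPI): validate the shape of an
# extracted feature matrix against the encoder dimensions listed above.
EXPECTED_DIMS = {"CONCH": 512, "UNI": 1024, "Virchow2": 2560}


def check_feature_shape(shape, encoder):
    """Return the patch count if `shape` is (N_patches, feature_dim) for `encoder`."""
    if encoder not in EXPECTED_DIMS:
        raise ValueError(f"unknown encoder: {encoder!r}")
    if len(shape) != 2:
        raise ValueError(f"expected (N_patches, feature_dim), got shape {tuple(shape)}")
    n_patches, feature_dim = shape
    if feature_dim != EXPECTED_DIMS[encoder]:
        raise ValueError(
            f"{encoder} features should be {EXPECTED_DIMS[encoder]}-dim, got {feature_dim}"
        )
    return n_patches


# With PyTorch installed, the shape would typically come from the saved tensor
# (assumption: the file holds one tensor rather than a dict):
#   features = torch.load("/path/to/features/slide_name.pt", map_location="cpu")
#   check_feature_shape(tuple(features.shape), "CONCH")
print(check_feature_shape((1200, 512), "CONCH"))  # prints 1200
```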

Step 2: Run Inference

python inference.py \
    --checkpoint /path/to/trained_model.pth.tar \
    --wsi_features /path/to/features/slide_name.pt \
    --task BRCA_ERSub \
    --signatures labels/signatures/hallmarks_signatures.csv \
    --output predictions.json
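
To run the same command over a folder of feature files, a small wrapper can assemble one call per slide. This is a sketch under assumptions: the function names are hypothetical, only the flags documented in this README are used, and `dry_run=True` builds the commands without executing anything.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(checkpoint, feature_file, task, signatures, output_dir):
    """Assemble one inference.py call using only the flags shown above."""
    feature_file = Path(feature_file)
    output = Path(output_dir) / f"{feature_file.stem}.json"
    return [
        sys.executable, "inference.py",
        "--checkpoint", str(checkpoint),
        "--wsi_features", str(feature_file),
        "--task", task,
        "--signatures", str(signatures),
        "--output", str(output),
    ]


def run_batch(checkpoint, feature_dir, task, signatures, output_dir, dry_run=True):
    """Build (and optionally run) one command per .pt file in feature_dir."""
    commands = [
        build_inference_command(checkpoint, f, task, signatures, output_dir)
        for f in sorted(Path(feature_dir).glob("*.pt"))
    ]
    if not dry_run:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing slide
    return commands


# Show the command for a single (placeholder) feature file.
example = build_inference_command(
    "/path/to/trained_model.pth.tar",
    "/path/to/features/slide_name.pt",
    "BRCA_ERSub",
    "labels/signatures/hallmarks_signatures.csv",
    "predictions",
)
print(" ".join(example))
```

Writing one JSON file per slide keeps a failed run resumable, since finished slides can be skipped on the next pass.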

Step 3: Interpret Results

For Subtyping Tasks:

{
  "predicted_class": 1,
  "predicted_label": "ER+",
  "probabilities": {
    "ER-": 0.1234,
    "ER+": 0.8766
  }
}

For Survival Prediction:

{
  "risk_score": 2.345,
  "survival_probability": [0.95, 0.85, 0.72, 0.58]
}
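
Downstream code usually needs only a few fields from these files. The sketch below reads either format and returns the key values; it assumes the JSON layouts are exactly as shown above, and `summarize_prediction` is a hypothetical helper rather than something shipped with the repository. The time points behind `survival_probability` are not documented here, so the list is passed through unchanged.

```python
import json


def summarize_prediction(result):
    """Reduce one inference result (formats shown above) to its key fields."""
    if "probabilities" in result:
        # Subtyping / biomarker output: report the most probable class.
        label, prob = max(result["probabilities"].items(), key=lambda kv: kv[1])
        return {"kind": "subtyping", "label": label, "probability": prob}
    if "risk_score" in result:
        # Survival output: keep the score and the probabilities as given.
        return {
            "kind": "survival",
            "risk_score": result["risk_score"],
            "survival_probability": result.get("survival_probability", []),
        }
    raise ValueError("unrecognised prediction format")


example = json.loads(
    '{"predicted_class": 1, "predicted_label": "ER+", '
    '"probabilities": {"ER-": 0.1234, "ER+": 0.8766}}'
)
print(summarize_prediction(example))
```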

Inference Parameters

Parameter         Description
--checkpoint      Path to trained model checkpoint
--wsi_features    Path to extracted WSI features (.pt)
--task            Prediction task (e.g., BRCA_ERSub)
--signatures      Path to pathway signatures CSV
--output          Output file path (JSON)
--device          Device (cuda/cpu)
--region_num      Number of pathway regions (default: 50)

Training Guide

Training Workflow

1. Prepare Data → 2. Configure Task → 3. Train Model → 4. Evaluate

Single Task Training

python main.py \
    --model PathLUPI \
    --task BRCA_ERSub \
    --excel_file splits/biomarker/BRCA_ER_Splits.csv \
    --label labels/biomarker/BRCA_Biomarker_Labels.csv \
    --root_path /path/to/wsi/features/conch \
    --root_omic /path/to/transcriptomic/data \
    --signatures labels/signatures/hallmarks_signatures.csv \
    --modal WSI_Gene \
    --fold 0,1,2,3,4 \
    --num_epoch 30 \
    --lr 2e-4 \
    --region_num 50 \
    --loss ce_l1

Batch Training (All Tasks)

# Molecular subtyping tasks
bash scripts/subtyping_conch.sh

# Survival prediction tasks
bash scripts/survival_conch.sh

Training Parameters

Parameter       Description                Default
--model         Model architecture         PathLUPI
--task          Prediction task            -
--fold          Cross-validation folds     0,1,2,3,4
--num_epoch     Training epochs            30
--lr            Learning rate              2e-4
--region_num    Number of pathways         50
--ratio         Loss weight ratio (λ)      0.3
--loss          Loss function              ce_l1 / nll_surv

Statistical Testing

We provide standardized statistical testing methodologies for model comparison:

One-sided Wilcoxon Signed-Rank Test

Used for pairwise comparison between PathLUPI and baseline methods across multiple tasks:

from scipy.stats import wilcoxon

# Compare performance across tasks
# H0: PathLUPI performance <= baseline performance
# H1: PathLUPI performance > baseline performance
stat, p_value = wilcoxon(pathlupi_scores, baseline_scores, alternative='greater')
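
To make the test concrete, the sketch below computes the one-sided exact p-value by enumerating every sign assignment, using only the standard library. It illustrates what the test measures and is not a replacement for `scipy.stats.wilcoxon`. The function name is hypothetical and the scores are made-up placeholders, not PathLUPI results. Zero differences are dropped, which mirrors SciPy's default `zero_method='wilcox'`.

```python
from itertools import product


def signed_rank_exact_p(x, y):
    """Return (W+, p) for the paired one-sided test of H1: x > y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Test statistic: sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under H0 every rank is positive or negative with probability 1/2,
    # so count the sign assignments that give a statistic at least as large.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, count / 2 ** n


# Placeholder per-task scores for illustration only.
pathlupi_scores = [0.85, 0.78, 0.91, 0.70, 0.82, 0.88]
baseline_scores = [0.80, 0.75, 0.84, 0.71, 0.80, 0.82]
w_plus, p_value = signed_rank_exact_p(pathlupi_scores, baseline_scores)
print(f"W+ = {w_plus}, p = {p_value}")  # W+ = 20.0, p = 0.03125
```

Enumeration costs 2^n evaluations, so this is only practical for a small number of tasks; for larger n, use the SciPy call above.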

Bootstrap Confidence Intervals

Used for estimating uncertainty in performance metrics:

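import numpy as np
from sklearn.utils import resample

def bootstrap_ci(y_true, y_pred, metric_fn, n_bootstrap=1000, ci=0.95, random_state=None):
    """Calculate a percentile bootstrap confidence interval for a metric."""
    # Convert to arrays so that index-array selection also works for lists
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.RandomState(random_state)  # set random_state for reproducibility
    all_indices = np.arange(len(y_true))

    scores = []
    for _ in range(n_bootstrap):
        # Resample case indices with replacement.
        # Note: metrics such as AUC fail if a resample contains a single class.
        indices = resample(all_indices, replace=True, random_state=rng)
        scores.append(metric_fn(y_true[indices], y_pred[indices]))

    alpha = (1 - ci) / 2
    lower = np.percentile(scores, alpha * 100)
    upper = np.percentile(scores, (1 - alpha) * 100)
    return np.mean(scores), lower, upper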

Project Structure

PathLUPI/
├── main.py                     # Main training script
├── inference.py                # Inference script (WSI-only prediction)
├── requirements.txt            # Python dependencies
│
├── models/PathLUPI/            # Model implementation
│   ├── network.py              # PathLUPI architecture
│   ├── engine.py               # Training/validation engine
│   ├── util.py                 # Model utilities (attention, losses)
│   ├── wsi_embed.py            # WSI re-embedding module
│   └── rmsa.py                 # Region-based multi-head self-attention
│
├── datasets/                   # Dataset loaders
│   ├── TCGA_Subtype.py         # Subtyping dataset
│   └── TCGA_Survival.py        # Survival dataset
│
├── labels/                     # Label files
│   ├── biomarker/              # Biomarker labels (ER, HER2, BRAF, etc.)
│   ├── molecular/              # Molecular subtype labels (PAM50, CMS, etc.)
│   └── signatures/             # Pathway signatures (Hallmark pathways)
│
├── splits/                     # Train/val splits
│   ├── biomarker/              # Biomarker task splits
│   ├── molecular/              # Molecular subtyping splits
│   └── survival/               # Survival prediction splits
│
├── scripts/                    # Batch training scripts
│   ├── subtyping_conch.sh      # All subtyping tasks
│   └── survival_conch.sh       # All survival tasks
│
└── utils/                      # Utilities
    ├── options.py              # Argument parser
    ├── loss.py                 # Loss functions
    ├── optimizer.py            # Optimizer definitions
    └── scheduler.py            # Learning rate schedulers

Supported Tasks

Biomarker Prediction Tasks

Task              Cancer Type    Classes    Description
BLCA_FGFR3Sub     Bladder        2          FGFR3 mutation status
BRCA_ERSub        Breast         2          Estrogen Receptor status
BRCA_PRSub        Breast         2          Progesterone Receptor status
BRCA_HER2Sub      Breast         2          HER2 amplification status
BRCA_TNBCSub      Breast         2          Triple-negative classification
BRCA_PIK3CASub    Breast         2          PIK3CA mutation status
BRCA_TP53Sub      Breast         2          TP53 mutation status
GBMLGG_IDH1Sub    Brain          2          IDH1 mutation status
CRC_BRAFSub       Colorectal     2          BRAF mutation status
CRC_KRASSub       Colorectal     2          KRAS mutation status
CRC_TP53Sub       Colorectal     2          TP53 mutation status
LUAD_EGFRSub      Lung           2          EGFR mutation status
LUAD_KRASSub      Lung           2          KRAS mutation status
LUAD_TP53Sub      Lung           2          TP53 mutation status
LIHC_TP53Sub      Liver          2          TP53 mutation status
SKCM_BRAFSub      Melanoma       2          BRAF mutation status
NSCLC_TMBSub      Lung           2          Tumor Mutation Burden

Molecular Subtyping Tasks

Task             Cancer Type    Classes    Subtypes
BRCA_MolSub      Breast         4          LumA, LumB, Basal, Her2
CRC_CMSSub       Colorectal     4          CMS1, CMS2, CMS3, CMS4
BLCA_MolSub      Bladder        4          Luminal, Lum_infiltrated, Lum_papillary, Basal_squamous
GBMLGG_MolSub    Brain          5          G-CIMP-high, Codel, Mesenchymal, Classic, Other
HNSC_MolSub      Head/Neck      4          Classical, Basal, Mesenchymal, Atypical
KIRC_MolSub      Kidney         4          KIRC.1, KIRC.2, KIRC.3, KIRC.4
PanGI_GISub      Pan-GI         5          MSI, CIN, EBV, HM-SNV, GS
UCEC_MolSub      Endometrial    4          CN_HIGH, CN_LOW, MSI, POLE

Survival Prediction

Cancer Type            Cohort
Breast Cancer          BRCA
Colorectal Cancer      CRC
Glioblastoma           GBM
Low-Grade Glioma       LGG
Lung Adenocarcinoma    LUAD
Lung Squamous          LUSC
Bladder                BLCA
Head/Neck              HNSC
Liver                  LIHC
Kidney                 KIRP
Melanoma               SKCM
Stomach                STAD
Endometrial            UCEC

Data Preparation

1. Download TCGA Data

Install GDC Data Transfer Tool:

# Download from GDC
wget https://gdc.cancer.gov/system/files/public/file/gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
unzip gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
chmod +x gdc-client

# Download WSI slides
./gdc-client download -m gdc_manifest_slides.txt -d ./tcga_slides/

Download RNA-seq data from cBioPortal:

# Download from: https://www.cbioportal.org/datasets
wget https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz
tar -xzf brca_tcga_pub2015.tar.gz

2. Extract WSI Features

We recommend using PrePATH for feature extraction. Please refer to the PrePATH repository for detailed installation and usage instructions.

Expected feature format:

  • File type: .pt (PyTorch tensor)
  • Shape: (N_patches, feature_dim) where feature_dim depends on the model (CONCH: 512, UNI: 1024, Virchow2: 2560)

3. Prepare Gene Expression Data

Expected format (CSV):

Gene,Value
TP53,5.234
EGFR,3.123
KRAS,7.456
...
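
A file in this format can be read into a gene-to-value mapping and then grouped by pathway. The sketch below shows one way to do that with the standard library. The pathway names and gene lists are hypothetical placeholders (the real ones come from the Hallmark signatures file), and filling missing genes with 0.0 is an assumption for illustration, not necessarily what PathLUPI does internally.

```python
import csv
import io


def read_expression(handle):
    """Read a Gene,Value CSV into a {gene: float} dict."""
    reader = csv.DictReader(handle)
    return {row["Gene"]: float(row["Value"]) for row in reader}


def group_by_pathway(expression, pathways, fill_value=0.0):
    """Return one value list per pathway, in each pathway's gene order.

    Genes absent from the expression table get `fill_value` (an assumption).
    """
    return {
        name: [expression.get(gene, fill_value) for gene in genes]
        for name, genes in pathways.items()
    }


# In practice, pass an open file: read_expression(open(path, newline=""))
example_csv = "Gene,Value\nTP53,5.234\nEGFR,3.123\nKRAS,7.456\n"
expression = read_expression(io.StringIO(example_csv))

# Hypothetical gene lists, standing in for the Hallmark signatures.
pathways = {"PATHWAY_A": ["TP53", "KRAS"], "PATHWAY_B": ["EGFR", "MYC"]}
print(group_by_pathway(expression, pathways))
# {'PATHWAY_A': [5.234, 7.456], 'PATHWAY_B': [3.123, 0.0]}
```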

4. Prepare Pathway Signatures

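Download the Hallmark gene sets from MSigDB. The commands in this README read the signatures from labels/signatures/hallmarks_signatures.csv.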


FAQ

Q: Do I need transcriptomic data for inference?

A: No! PathLUPI only requires WSI features at inference time. Transcriptomic data is only needed during training to learn genome-anchored representations.

Q: What WSI feature extractors are supported?

A: We primarily use CONCH (512-dim features). Other foundation models such as UNI, Virchow, or ResNet50 can also be used by setting the path_size parameter to the extractor's feature dimension.

Q: How do I use my own dataset?

A:

  1. Extract WSI features using PrePATH
  2. Prepare a splits CSV with columns: ID, WSI, Gene, Fold 0, Fold 1, etc.
  3. Prepare a labels CSV with your target labels
  4. Run training with your custom files
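
Step 2 can be scripted. The sketch below writes a splits CSV with the column names listed above, but everything about the cell contents is an assumption: the file names in the WSI and Gene columns and the "train"/"val" markers in the fold columns are placeholders. Compare against the files shipped under splits/ and adapt the values before training.

```python
import csv
import io
import random


def make_splits(sample_ids, n_folds=5, seed=0):
    """Assign every sample to exactly one validation fold.

    Cell values ("train"/"val") and file-name patterns are hypothetical.
    """
    sample_ids = list(sample_ids)
    shuffled = sample_ids[:]
    random.Random(seed).shuffle(shuffled)
    # Round-robin over the shuffled order gives near-equal fold sizes.
    val_fold = {sid: i % n_folds for i, sid in enumerate(shuffled)}

    rows = []
    for sid in sample_ids:
        row = {"ID": sid, "WSI": f"{sid}.pt", "Gene": f"{sid}.csv"}
        for k in range(n_folds):
            row[f"Fold {k}"] = "val" if val_fold[sid] == k else "train"
        rows.append(row)
    return rows


rows = make_splits([f"CASE-{i:03d}" for i in range(10)])
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```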

Q: What GPU memory is required?

A: Approximately 10-20 GB VRAM for training, 4-6 GB for inference, depending on the embedding dimension and patch count of your WSI.


Citation

If you find this work useful, please cite:

@article{jin2024pathlupi,
  title={Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images},
  author={Jin, Cheng and Zhou, Fengtao and Yu, Yunfang and Ma, Jiabo and Wang, Yihui and Xu, Yingxue and Zhou, Huajun and Jiang, Hao and Luo, Luyang and Mao, Luhui and He, Zifan and Zhang, Xiuming and Zhang, Jing and Chan, Ronald Cheong Kin and Yao, Herui and Chen, Hao},
  year={2024}
}

Ethical Considerations

This study adhered to the Declaration of Helsinki and received ethical approval from the Human and Artifact Research Ethics Committee of The Hong Kong University of Science and Technology (HREP-2024-0423). All data used were anonymized and obtained through appropriate data use agreements.


License

This project is licensed under the CC BY-NC-ND 4.0 License; see the LICENSE file for details.


Acknowledgement

This work was supported by:

  • National Natural Science Foundation of China (No. 62202403)
  • Innovation and Technology Commission (Project No. MHP/002/22 and ITCPD/17-9)
  • Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: T45-401/22-N)
  • National Key R&D Program of China (Project No. 2023YFE0204000)

Contact

For questions or issues, please open a GitHub issue or contact Cheng Jin at cheng.jin@connect.ust.hk
