Add LLM inference support to JMLC API #2430

Open

kubraaksux wants to merge 24 commits into apache:main from kubraaksux:llm-api

Conversation

kubraaksux commented Feb 13, 2026

Adds LLM text generation to the JMLC API, using Py4J to bridge Java and Python (HuggingFace models). A usage sketch follows the change list below.

Changes

  • Connection.java: loadModel() / releaseModel() to start and stop the Python worker (300s timeout for large models)
  • PreparedScript.java: generateBatchWithMetrics() for batch inference via FrameBlock — now uses a single generateBatch() call to the Python worker instead of a per-prompt loop
  • LLMCallback.java: Java interface for the Py4J callback, including generateBatch() for batched GPU inference
  • llm_worker.py: Python worker that loads HuggingFace models and serves inference requests, with batched tokenization and model.generate() for GPU parallelism
  • JMLCLLMInferenceTest.java: Integration test using distilgpt2
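
For orientation, here is a minimal, hypothetical sketch of how the Java side could be driven, assuming the API surface described above (loadModel(modelName, workerScriptPath), generateBatchWithMetrics(FrameBlock, boolean), releaseModel()). Import paths, the DML body, and the FrameBlock handling are illustrative assumptions, not the PR's final code:

```java
import org.apache.sysds.api.jmlc.Connection;
import org.apache.sysds.api.jmlc.PreparedScript;
import org.apache.sysds.common.Types.ValueType;
import org.apache.sysds.runtime.frame.data.FrameBlock;

public class LlmJmlcSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = new Connection()) {
      // Start the Python worker (model name and script path are illustrative).
      conn.loadModel("distilgpt2", "src/main/python/llm_worker.py");

      // DML with read()/write(); variable names must match the names registered below.
      String dml = "prompts = read(\"./in\", data_type=\"frame\");\n"
                 + "results = prompts;\n"                               // placeholder body
                 + "write(results, \"./out\", data_type=\"frame\");\n";
      PreparedScript ps = conn.prepareScript(dml,
          new String[]{"prompts"}, new String[]{"results"});

      // One prompt per row of a single-column string frame.
      FrameBlock prompts = new FrameBlock(new ValueType[]{ValueType.STRING});
      prompts.appendRow(new Object[]{"Write a haiku about data systems."});

      // batched=true selects the GPU-batched path described below.
      FrameBlock metrics = ps.generateBatchWithMetrics(prompts, true);
      System.out.println(metrics.getNumRows() + " result rows");

      conn.releaseModel();
    }
  }
}
```

The DML here is only a placeholder; the point is the overall lifecycle (loadModel, prepareScript, batch call, releaseModel) and that the prompts/results names line up with the registered inputs/outputs (see the naming note further down the thread).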

GPU batching

The latest update adds true GPU batching: all prompts are tokenized together (with padding) and processed in a single model.generate() call. This achieves 3-14x speedup over the previous sequential per-prompt approach on NVIDIA H100, making SystemDS JMLC faster than sequential vLLM for batch workloads. See #2431 for full benchmark results.

Test

mvn test -Dtest=JMLCLLMInferenceTest -pl .

Also evaluated with Qwen/Qwen2.5-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.3 on NVIDIA H100 in the benchmarking framework (#2431).

- Connection.java: Changed loadModel(modelName) to loadModel(modelName, workerScriptPath)
- Connection.java: Removed findPythonScript() method
- LLMCallback.java: Added Javadoc for generate() method
- JMLCLLMInferenceTest.java: Updated to pass script path to loadModel()
- Connection.java: Auto-find available ports for Py4J communication
- Connection.java: Add loadModel() overload for manual port override
- Connection.java: Use destroyForcibly() with waitFor() for clean shutdown
- llm_worker.py: Accept python_port as command line argument
- Move worker script from src/main/python/systemds/ to src/main/python/ to avoid shadowing the Python stdlib operator module
- Add generateWithTokenCount() returning JSON with input/output token counts
- Update generateBatchWithMetrics() to include input_tokens and output_tokens columns
- Add CUDA auto-detection with device_map=auto for multi-GPU support in llm_worker.py
- Check Python process liveness during startup instead of a blind 60s timeout, since 7B+ models need more time to load weights into GPU memory (see the sketch after this list)
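
The liveness check could look roughly like the following (a hypothetical helper, not the actual Connection.java code): poll the worker process while waiting for it to become ready, and fail fast if it dies, instead of sleeping through one fixed timeout.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating the startup check; the real loadModel()
// logic in Connection.java may differ.
final class WorkerStartup {
  static void awaitWorkerReady(Process worker, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (!worker.isAlive())
        throw new IllegalStateException(
          "Python worker exited during startup, exit code " + worker.exitValue());
      if (workerRespondsToPing())       // placeholder for the actual readiness probe
        return;
      TimeUnit.MILLISECONDS.sleep(200); // poll instead of one blind sleep
    }
    throw new IllegalStateException("Python worker not ready after " + timeoutMs + " ms");
  }

  // Placeholder: in the PR this would be a Py4J handshake or port check.
  private static boolean workerRespondsToPing() { return false; }
}
```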
Contributor

Hi, how are these changes related to the llm inference?

Author

Hi, you're right. They seem to be from Nicolas Korjahn's shampoo optimizer code that was already in my branch when I branched off main, and they got accidentally included in my commit. I've reverted them now, so the file should be back to its original state. Sorry about that!

This file was accidentally modified in a prior commit. Restoring the
original vectorized SIMD implementation.
- LLMCallback.java: add generateBatch() interface method
- PreparedScript.java: replace per-prompt for-loop with single batch call
- llm_worker.py: implement batched tokenization and model.generate()

Achieves 3-14x speedup over sequential inference on H100.
generateBatchWithMetrics() now accepts a boolean batched parameter:
true for GPU-batched (new), false for original sequential for-loop.
@e-strauss
Contributor

Hi @kubraaksux , thanks for the contribution!
I have a concern about the current approach: I’m not sure moving LLM inference into Python is the right direction, especially since most calls still go through Python wrapper functions and there’s additional overhead from using Py4J.
Also, as implemented now, it seems we’re bypassing SystemDS’s core functionality entirely.
Looping in @mboehm7 .

@kubraaksux
Author

> Hi @kubraaksux, thanks for the contribution! I have a concern about the current approach: I’m not sure moving LLM inference into Python is the right direction, especially since most calls still go through Python wrapper functions and there’s additional overhead from using Py4J. Also, as implemented now, it seems we’re bypassing SystemDS’s core functionality entirely. Looping in @mboehm7.

Hi @e-strauss, thanks for the feedback. Both points are valid.

I redesigned the approach. Instead of the Py4J bridge, llmPredict is now a native parameterized built-in. The DML goes through the full compilation pipeline: parser → hops → lops → CP instruction. The instruction makes HTTP calls directly via java.net.HttpURLConnection.
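
Roughly, the kind of call the CP instruction makes could look like this (a hedged sketch: the endpoint, payload shape, timeout values, and class/method names are assumptions, not the actual instruction code):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the HTTP call pattern used by the llmPredict instruction.
final class LlmHttpSketch {
  static String post(String endpoint, String jsonBody) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
    try {
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Content-Type", "application/json");
      conn.setConnectTimeout(10_000);   // connect timeout guards against an unreachable server
      conn.setReadTimeout(300_000);     // generous read timeout for long generations
      conn.setDoOutput(true);
      // try-with-resources ensures the request/response streams are always closed
      try (OutputStream os = conn.getOutputStream()) {
        os.write(jsonBody.getBytes(StandardCharsets.UTF_8));
      }
      StringBuilder sb = new StringBuilder();
      try (BufferedReader br = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        for (String line; (line = br.readLine()) != null; )
          sb.append(line);
      }
      return sb.toString();
    }
    finally {
      conn.disconnect();
    }
  }
}
```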

Thanks again for catching this early.

- Use proper imports instead of inline fully-qualified class names
- Add try-with-resources for HTTP streams to prevent resource leaks
- Add connect/read timeouts to HTTP calls
- Add lineage tracing support for llmPredict
- Add checkInvalidParameters validation in parser
- Remove .claude/.env/meeting_notes from .gitignore
- Trim verbose docstrings
@e-strauss
Contributor

Hey @kubraaksux, just sharing my thoughts here — not trying to push in any direction, since I’m not the project supervisor. Let’s wait for Matthias’s feedback.

Supports parallel HTTP calls to the inference server via
ExecutorService. Default concurrency=1 keeps sequential behavior.
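
A minimal sketch of that fan-out pattern, reusing the hypothetical post() helper from the HTTP sketch above (class and parameter names are assumptions about intent, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one HTTP request per prompt, issued through a bounded pool.
// concurrency=1 degenerates to the original sequential behavior.
final class ParallelLlmCalls {
  static List<String> generateAll(List<String> payloads, String endpoint, int concurrency)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, concurrency));
    try {
      List<Callable<String>> tasks = new ArrayList<>();
      for (String p : payloads)
        tasks.add(() -> LlmHttpSketch.post(endpoint, p)); // post() from the sketch above
      List<String> results = new ArrayList<>();
      for (Future<String> f : pool.invokeAll(tasks))      // invokeAll preserves input order
        results.add(f.get());
      return results;
    }
    finally {
      pool.shutdown();
    }
  }
}
```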
@kubraaksux
Author

kubraaksux commented Feb 16, 2026

> Hey @kubraaksux, just sharing my thoughts here — not trying to push in any direction, since I’m not the project supervisor. Let’s wait for Matthias’s feedback.

Your points were helpful. I've reworked the approach accordingly. Looking forward to @mboehm7's input.

JMLC requires the LHS variable name in read() assignments to match
the input name registered in prepareScript(). Changed X/R to
prompts/results so RewriteRemovePersistentReadWrite correctly
converts persistent reads to transient reads.
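
To make the constraint concrete, a hypothetical sketch (not the test's actual code): the left-hand sides of read()/write() in the DML must use the same names registered as inputs and outputs in prepareScript(), otherwise the rewrite cannot bind them.

```java
import org.apache.sysds.api.jmlc.Connection;
import org.apache.sysds.api.jmlc.PreparedScript;

// Hypothetical sketch of the JMLC naming constraint; the DML body is a placeholder.
final class NamingSketch {
  static PreparedScript prepare(Connection conn) throws Exception {
    String dml =
        "prompts = read(\"./in\", data_type=\"frame\");\n"   // LHS matches input "prompts"
      + "results = prompts;\n"                                // placeholder body
      + "write(results, \"./out\", data_type=\"frame\");\n";  // LHS matches output "results"
    return conn.prepareScript(dml,
        new String[]{"prompts"},   // registered input names
        new String[]{"results"});  // registered output names
  }
}
```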