Issue description
gpuLayers: "auto" can underestimate VRAM pressure from concurrent GPU processes, causing an unrecoverable CUDA SIGABRT (exit 134) instead of a catchable error or graceful CPU fallback.
Expected Behavior
When gpuLayers: "auto" cannot fit the model in available VRAM, the behavior should be one of:
- Fall back to fewer GPU layers (or CPU-only) automatically
- Throw a catchable `InsufficientMemoryError` (or similar) that the application can handle (see the sketch below)
- At minimum, provide an error message before the process terminates
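For illustration, a minimal sketch of how an application could react if a catchable error were thrown instead of the native abort. The `InsufficientMemoryError` name is the hypothetical class suggested above, and the CPU fallback assumes `gpuLayers: 0` keeps every layer off the GPU:

```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const modelPath = "/path/to/embedding-model.gguf";

let model;
try {
  model = await llama.loadModel({ modelPath, gpuLayers: "auto" });
} catch (err) {
  // Hypothetical error class -- it does not exist today; this is the desired behavior.
  if (err instanceof Error && err.name === "InsufficientMemoryError") {
    // Retry fully on the CPU instead of letting the process die with SIGABRT.
    model = await llama.loadModel({ modelPath, gpuLayers: 0 });
  } else {
    throw err;
  }
}
```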
Actual Behavior
The process is killed by a native CUDA abort (SIGABRT, exit code 134) with no opportunity for JavaScript error handling. The crash originates from ggml-cuda.cu:96 in the native library. This is uncatchable in JavaScript -- no try/catch, no process.on('uncaughtException'), no process.on('SIGABRT') can intercept it.
The crash output looks like:
/home/runner/work/node-llama-cpp/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: CUDA error
[node-llama-cpp] CUDA error: out of memory
Followed by a native stack trace and SIGABRT.
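None of the usual guards get a chance to run; the native abort terminates the process before control returns to JavaScript. A minimal sketch (model path hypothetical):

```ts
import { getLlama } from "node-llama-cpp";

process.on("uncaughtException", (err) => {
  // Never reached: the process is killed by the native abort(), not by a JS exception.
  console.error("uncaughtException:", err);
});

try {
  const llama = await getLlama();
  const model = await llama.loadModel({ modelPath: "/path/to/embedding-model.gguf" });
  await model.createEmbeddingContext();
} catch (err) {
  // Never reached for the same reason.
  console.error("caught:", err);
}
```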
Steps to reproduce
Methodology:
- We built a CUDA VRAM ballast tool to allocate precise amounts of VRAM, simulating a concurrent large model (e.g., a 70B parameter model running in ollama)
- We ran `qmd embed` (which uses node-llama-cpp with `gpuLayers: "auto"` and the embeddinggemma model, ~300 MB) at different VRAM pressure levels; free VRAM at each level was verified with nvidia-smi (see the helper sketched below)
- Test hardware: NVIDIA GeForce RTX 3090 Ti (24 GB VRAM)
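The free-VRAM check at each pressure level is just a parse of nvidia-smi's CSV query output; a small helper along these lines (not part of node-llama-cpp, the function name is ours) reproduces it:

```ts
import { execFileSync } from "node:child_process";

// Free VRAM of GPU 0 in MiB, parsed from `nvidia-smi --query-gpu=memory.free`.
function getFreeVramMiB(): number {
  const out = execFileSync(
    "nvidia-smi",
    ["--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    { encoding: "utf8" }
  );
  return parseInt(out.trim().split("\n")[0], 10);
}

console.log(`Free VRAM: ${getFreeVramMiB()} MiB`);
```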
Minimal reproduction:
import { getLlama } from "node-llama-cpp";
// Precondition: another process is using most of the GPU VRAM
// (e.g., ollama running a 70B model, or use nvidia-smi to verify <400 MiB free)
const llama = await getLlama();
const model = await llama.loadModel({
modelPath: "/path/to/embedding-model.gguf",
// gpuLayers: "auto" -- this is the default
});
const context = await model.createEmbeddingContext();
// → Process killed with SIGABRT (exit code 134)

Test results at various VRAM pressure levels:
| VRAM Pressure | Free VRAM | Behavior | Exit Code |
|---|---|---|---|
| None (baseline) | 24,123 MiB | Normal operation, 88% GPU util, 79.4 KB/s | 0 |
| 16 GB ballast | ~7,856 MiB | Works fine, 88% GPU util, 75.4 KB/s | 0 (data, not clean exit) |
| 22 GB ballast | ~1,856 MiB | OOM Killed by Linux OOM killer | 137 (SIGKILL) |
| 23 GB ballast | ~856 MiB | Partial GPU offload, 3.3x slower, data errors | 137 |
| 23.5 GB ballast | ~356 MiB | CUDA OOM crash (SIGABRT) | 134 |
Key observation: At 356 MiB free VRAM, gpuLayers: "auto" attempted to load layers to GPU despite insufficient memory. Instead of detecting the failure and falling back to CPU (or fewer layers), the native CUDA allocation failed with cudaMalloc: out of memory, which triggered SIGABRT in the native library.
My Environment
| Dependency | Version |
|---|---|
| Operating System | Ubuntu 24.04.3 LTS (x64), kernel 6.8.0-94-generic |
| CPU | AMD Ryzen 7 7800X3D 8-Core Processor |
| GPU | NVIDIA GeForce RTX 3090 Ti (24,564 MiB VRAM) |
| NVIDIA Driver | 570.211.01 |
| Runtime | Bun 1.3.9 (also tested Node.js 22.22.0) |
| TypeScript version | 5.9.3 |
| node-llama-cpp version | 3.14.5 |
| Prebuilt binaries | b7347 |
`npx --yes node-llama-cpp inspect gpu` output:
OS: Ubuntu 24.04.3 LTS (x64)
Node: 22.22.0 (x64)
TypeScript: 5.9.3
node-llama-cpp: 3.14.5
Prebuilt binaries: b7347
CUDA: available
Vulkan: available
CUDA device: NVIDIA GeForce RTX 3090 Ti
CUDA used VRAM: 1.1% (266.19MB/23.56GB)
CUDA free VRAM: 98.89% (23.3GB/23.56GB)
Vulkan device: AMD Ryzen 7 7800X3D 8-Core Processor (RADV RAPHAEL_MENDOCINO)
Vulkan used VRAM: 0.12% (60.76MB/47.22GB)
Vulkan free VRAM: 99.87% (47.16GB/47.22GB)
CPU model: AMD Ryzen 7 7800X3D 8-Core Processor
Math cores: 8
Used RAM: 6.76% (6.32GB/93.44GB)
Free RAM: 93.23% (87.12GB/93.44GB)
Used swap: 1.73% (142.25MB/8GB)
Max swap size: 8GB
mmap: supported
Additional Context
Why this matters for applications
Applications using node-llama-cpp for embeddings (like qmd) are run on developer machines alongside other GPU-consuming processes (ollama, ComfyUI, training jobs). The SIGABRT is unrecoverable -- there is no way to catch it in JavaScript, so applications cannot provide error messages, retry with CPU-only, or suggest workarounds to users.
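About the only recourse an application has today is to isolate the load in a child process and inspect how it exited -- clumsy, and it loses the original error. A sketch, where embed-worker.js and the EMBED_FORCE_CPU flag are hypothetical names:

```ts
import { spawnSync } from "node:child_process";

// Run the embedding step (getLlama + loadModel) in a separate worker script,
// because the CUDA abort cannot be intercepted inside the same process.
const first = spawnSync(process.execPath, ["embed-worker.js"], { stdio: "inherit" });

// A direct child killed by SIGABRT reports signal === "SIGABRT";
// a shell wrapper would surface it as exit code 134 (128 + 6).
if (first.signal === "SIGABRT" || first.status === 134) {
  // Retry with the GPU disabled (the worker reads this hypothetical env flag).
  const retry = spawnSync(process.execPath, ["embed-worker.js"], {
    stdio: "inherit",
    env: { ...process.env, EMBED_FORCE_CPU: "1" }
  });
  process.exitCode = retry.status ?? 1;
}
```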
The auto estimation gap
Looking at resolveModelGpuLayersOption.ts, the auto mode estimates VRAM requirements from GGUF metadata before loading. When VRAM is consumed by external processes between estimation and actual CUDA allocation, the estimate becomes stale. The troubleshooting docs acknowledge estimation inaccuracies ("The built-in estimation mechanism may overestimate requirements") but the dangerous direction is underestimation.
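Until the estimation is hardened, an application can narrow (but not close) this window by re-checking free VRAM immediately before loading. A sketch, assuming the Llama instance exposes getVramState() with free VRAM in bytes; the model size and headroom factor are illustrative:

```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();

// Illustrative numbers: ~300 MB embedding model, 1.5x headroom on top of it.
const approxModelBytes = 300 * 1024 * 1024;
const headroomFactor = 1.5;

// Re-check free VRAM right before loading, so allocations made by external
// processes (ollama, ComfyUI, training jobs) since startup are accounted for.
const { free } = await llama.getVramState();

const model = await llama.loadModel({
  modelPath: "/path/to/embedding-model.gguf",
  gpuLayers: free > approxModelBytes * headroomFactor ? "auto" : 0
});
```

This only narrows the race -- another process can still claim VRAM between the check and the actual allocation -- which is why the native-level mitigations below would be more robust.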
Possible mitigations (suggestions, not demands)
- Safety margin: Build a safety margin into the auto VRAM estimation (e.g., reserve 10-15% of estimated free VRAM)
- Catch and retry: Catch native CUDA allocation failures and retry with fewer layers (or fall back to CPU)
- Double-check: Use a second VRAM check after estimation but before actual allocation
- Conservative mode: Expose a "conservative" auto mode that leaves more VRAM headroom
Related issues
- bug: Qwen Embedded doesn't work. #519 -- crash with `gpuLayers: "auto"` on Vulkan (similar failure mode on a different backend)
- bug: Failed to create context on M5 #549 -- context creation failures (potentially related to VRAM estimation)
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, and I know how to start.