
bug: gpuLayers "auto" causes unrecoverable CUDA SIGABRT when VRAM is consumed by external processes #551

Issue description

gpuLayers: "auto" can underestimate VRAM pressure from concurrent GPU processes, causing an unrecoverable CUDA SIGABRT (exit 134) instead of a catchable error or graceful CPU fallback.

Expected Behavior

When gpuLayers: "auto" cannot fit the model in available VRAM, the behavior should be one of:

  1. Fall back to fewer GPU layers (or CPU-only) automatically
  2. Throw a catchable InsufficientMemoryError (or similar) that the application can handle (see the sketch after this list)
  3. At minimum, provide an error message before the process terminates
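
For illustration, this is roughly the recovery path that options 1 and 2 would enable. InsufficientMemoryError does not exist today; the catch block and the CPU-only retry below are a sketch of desired behavior, not current library behavior:

import { getLlama } from "node-llama-cpp";

const modelPath = "/path/to/embedding-model.gguf";
const llama = await getLlama();

let model;
try {
    // Desired: if "auto" cannot fit the layers in VRAM, this rejects with a
    // catchable error instead of aborting the whole process.
    model = await llama.loadModel({ modelPath }); // gpuLayers: "auto" is the default
} catch (error) {
    // A hypothetical InsufficientMemoryError (or similar) would land here, so
    // the application could degrade gracefully instead of dying with SIGABRT.
    model = await llama.loadModel({ modelPath, gpuLayers: 0 }); // CPU-only retry
}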

Actual Behavior

The process is killed by a native CUDA abort (SIGABRT, exit code 134) with no opportunity for JavaScript error handling. The crash originates from ggml-cuda.cu:96 in the native library. This is uncatchable in JavaScript -- no try/catch, no process.on('uncaughtException'), no process.on('SIGABRT') can intercept it.

The crash output looks like:

/home/runner/work/node-llama-cpp/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: CUDA error
[node-llama-cpp] CUDA error: out of memory

Followed by a native stack trace and SIGABRT.
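
For completeness, none of the usual JavaScript-level safety nets run before the abort; a minimal illustration:

process.on("uncaughtException", (error) => {
    // Never reached: the abort happens in native code, not as a JS exception.
    console.error("uncaughtException", error);
});
process.on("unhandledRejection", (reason) => {
    // Never reached either: the pending loadModel()/createEmbeddingContext()
    // promise never gets a chance to reject before the process dies.
    console.error("unhandledRejection", reason);
});
// The SIGABRT is raised by abort() inside the native library, so the process
// terminates before any JavaScript handler can run.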

Steps to reproduce

Methodology:

  1. We built a CUDA VRAM ballast tool to allocate precise amounts of VRAM, simulating a concurrent large model (e.g., a 70B parameter model running in ollama)
  2. We ran qmd embed (which uses node-llama-cpp with gpuLayers: "auto" and the embeddinggemma model, ~300 MB) at different VRAM pressure levels
  3. Test hardware: NVIDIA GeForce RTX 3090 Ti (24 GB VRAM)

Minimal reproduction:

import { getLlama } from "node-llama-cpp";

// Precondition: another process is using most of the GPU VRAM
// (e.g., ollama running a 70B model, or use nvidia-smi to verify <400 MiB free)

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "/path/to/embedding-model.gguf",
  // gpuLayers: "auto"  -- this is the default
});
const context = await model.createEmbeddingContext();
// → Process killed with SIGABRT exit code 134

Test results at various VRAM pressure levels:

VRAM Pressure     Free VRAM     Behavior                                        Exit Code
None (baseline)   24,123 MiB    Normal operation, 88% GPU util, 79.4 KB/s       0
16 GB ballast     ~7,856 MiB    Works fine, 88% GPU util, 75.4 KB/s             0 (data, not clean exit)
22 GB ballast     ~1,856 MiB    OOM-killed by the Linux OOM killer              137 (SIGKILL)
23 GB ballast     ~856 MiB      Partial GPU offload, 3.3x slower, data errors   137
23.5 GB ballast   ~356 MiB      CUDA OOM crash (SIGABRT)                        134

Key observation: At 356 MiB free VRAM, gpuLayers: "auto" attempted to load layers to GPU despite insufficient memory. Instead of detecting the failure and falling back to CPU (or fewer layers), the native CUDA allocation failed with cudaMalloc: out of memory, which triggered SIGABRT in the native library.

My Environment

Dependency               Version
Operating System         Ubuntu 24.04.3 LTS (x64), kernel 6.8.0-94-generic
CPU                      AMD Ryzen 7 7800X3D 8-Core Processor
GPU                      NVIDIA GeForce RTX 3090 Ti (24,564 MiB VRAM)
NVIDIA Driver            570.211.01
Runtime                  Bun 1.3.9 (also tested with Node.js 22.22.0)
TypeScript version       5.9.3
node-llama-cpp version   3.14.5
Prebuilt binaries        b7347

npx --yes node-llama-cpp inspect gpu output:

OS: Ubuntu 24.04.3 LTS (x64)
Node: 22.22.0 (x64)
TypeScript: 5.9.3

node-llama-cpp: 3.14.5
Prebuilt binaries: b7347

CUDA: available
Vulkan: available

CUDA device: NVIDIA GeForce RTX 3090 Ti
CUDA used VRAM: 1.1% (266.19MB/23.56GB)
CUDA free VRAM: 98.89% (23.3GB/23.56GB)

Vulkan device: AMD Ryzen 7 7800X3D 8-Core Processor (RADV RAPHAEL_MENDOCINO)
Vulkan used VRAM: 0.12% (60.76MB/47.22GB)
Vulkan free VRAM: 99.87% (47.16GB/47.22GB)

CPU model: AMD Ryzen 7 7800X3D 8-Core Processor
Math cores: 8
Used RAM: 6.76% (6.32GB/93.44GB)
Free RAM: 93.23% (87.12GB/93.44GB)
Used swap: 1.73% (142.25MB/8GB)
Max swap size: 8GB
mmap: supported

Additional Context

Why this matters for applications

Applications using node-llama-cpp for embeddings (like qmd) are run on developer machines alongside other GPU-consuming processes (ollama, ComfyUI, training jobs). The SIGABRT is unrecoverable -- there is no way to catch it in JavaScript, so applications cannot provide error messages, retry with CPU-only, or suggest workarounds to users.
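
About the only isolation available to applications today is at the process level. A rough sketch of what that forces them to do (embed-worker.js is a hypothetical worker script that performs the loadModel and embedding calls):

import { fork } from "node:child_process";

// Because the native abort kills the whole process, the embedding work has to
// run in a disposable child process just to survive this failure mode.
const worker = fork("./embed-worker.js"); // hypothetical worker script

worker.on("exit", (code, signal) => {
    if (code === 134 || signal === "SIGABRT") {
        // The parent survives the crash and can retry the worker in CPU-only
        // mode (e.g., via an env flag the worker maps to gpuLayers: 0).
        console.error("Embedding worker hit a CUDA OOM abort; retrying on CPU");
    }
});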

The auto estimation gap

Looking at resolveModelGpuLayersOption.ts, the auto mode estimates VRAM requirements from GGUF metadata before loading. When VRAM is consumed by external processes between estimation and actual CUDA allocation, the estimate becomes stale. The troubleshooting docs acknowledge estimation inaccuracies ("The built-in estimation mechanism may overestimate requirements"), but the dangerous direction is underestimation: overestimating merely leaves VRAM unused, while underestimating leads to the uncatchable native abort described above.
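
One way an application can narrow (but not close) this window is to re-check free VRAM immediately before loading and refuse to gamble on "auto" when it is tight. A sketch; getVramState() is my assumption for how the library exposes the free/total figures shown by inspect gpu, and the 1 GB threshold is arbitrary for a ~300 MB model:

import { getLlama } from "node-llama-cpp";

const llama = await getLlama();

// Assumption: getVramState() reports the same free/total VRAM figures that
// `inspect gpu` prints above; adjust to whatever API actually exposes them.
const { free } = await llama.getVramState();

const model = await llama.loadModel({
    modelPath: "/path/to/embedding-model.gguf",
    // Re-check right before loading: with less than ~1 GB free (model size
    // plus headroom), stay on the CPU instead of trusting "auto".
    gpuLayers: free < 1_000_000_000 ? 0 : "auto"
});

This still races with external processes between the check and the actual allocation, which is why a library-side fix would be preferable.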

Possible mitigations (suggestions, not demands)

  1. Safety margin: Build a safety margin into the auto VRAM estimation (e.g., reserve 10-15% of estimated free VRAM); a rough sketch follows this list
  2. Catch and retry: Catch native CUDA allocation failures and retry with fewer layers (or fall back to CPU)
  3. Double-check: Use a second VRAM check after estimation but before actual allocation
  4. Conservative mode: Expose a "conservative" auto mode that leaves more VRAM headroom
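
A rough sketch of what idea 1 could look like conceptually; the helper and constant below are hypothetical, not existing library code:

// Hypothetical helper: apply a headroom factor to the measured free VRAM
// before the auto resolver decides how many layers fit.
const VRAM_SAFETY_MARGIN = 0.15; // reserve 15% of free VRAM for other processes

function usableVram(freeVramBytes: number): number {
    return Math.floor(freeVramBytes * (1 - VRAM_SAFETY_MARGIN));
}

// The auto resolver would then compare its per-layer estimate against
// usableVram(free) instead of the raw free figure, so a few hundred MiB
// consumed by concurrent GPU processes between estimation and allocation
// no longer pushes the real cudaMalloc over the edge.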

Related issues

Relevant Features Used

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.
