Issue description
gpuLayers: "auto" can underestimate VRAM pressure from concurrent GPU processes, causing an unrecoverable CUDA SIGABRT (exit 134) instead of a catchable error or graceful CPU fallback.
Expected Behavior
When gpuLayers: "auto" cannot fit the model in available VRAM, the behavior should be one of:
- Fall back to fewer GPU layers (or CPU-only) automatically
- Throw a catchable `InsufficientMemoryError` (or similar) that the application can handle (see the sketch below)
- At minimum, provide an error message before the process terminates
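For illustration, a minimal sketch of how an application could react if a catchable error were thrown instead of the native abort. The `InsufficientMemoryError` name is the hypothetical class suggested above, and the CPU fallback assumes `gpuLayers: 0` keeps every layer off the GPU:

```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const modelPath = "/path/to/embedding-model.gguf";

let model;
try {
  model = await llama.loadModel({ modelPath, gpuLayers: "auto" });
} catch (err) {
  // Hypothetical error class -- it does not exist today; this is the desired behavior.
  if (err instanceof Error && err.name === "InsufficientMemoryError") {
    // Retry fully on the CPU instead of letting the process die with SIGABRT.
    model = await llama.loadModel({ modelPath, gpuLayers: 0 });
  } else {
    throw err;
  }
}
```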
Actual Behavior
The process is killed by a native CUDA abort (SIGABRT, exit code 134) with no opportunity for JavaScript error handling. The crash originates from ggml-cuda.cu:96 in the native library. This is uncatchable in JavaScript -- no try/catch, no process.on('uncaughtException'), no process.on('SIGABRT') can intercept it.
The crash output looks like:
/home/runner/work/node-llama-cpp/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: CUDA error
[node-llama-cpp] CUDA error: out of memory
Followed by a native stack trace and SIGABRT.
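None of the usual guards get a chance to run; the native abort terminates the process before control returns to JavaScript. A minimal sketch (model path hypothetical):

```ts
import { getLlama } from "node-llama-cpp";

process.on("uncaughtException", (err) => {
  // Never reached: the process is killed by the native abort(), not by a JS exception.
  console.error("uncaughtException:", err);
});

try {
  const llama = await getLlama();
  const model = await llama.loadModel({ modelPath: "/path/to/embedding-model.gguf" });
  await model.createEmbeddingContext();
} catch (err) {
  // Never reached for the same reason.
  console.error("caught:", err);
}
```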
Steps to reproduce
Methodology:
- We built a CUDA VRAM ballast tool to allocate precise amounts of VRAM, simulating a concurrent large model (e.g., a 70B parameter model running in ollama)
- We ran `qmd embed` (which uses node-llama-cpp with `gpuLayers: "auto"` and the embeddinggemma model, ~300 MB) at different VRAM pressure levels; free VRAM at each level was verified with nvidia-smi (see the helper sketched below)
- Test hardware: NVIDIA GeForce RTX 3090 Ti (24 GB VRAM)
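The free-VRAM check at each pressure level is just a parse of nvidia-smi's CSV query output; a small helper along these lines (not part of node-llama-cpp, the function name is ours) reproduces it:

```ts
import { execFileSync } from "node:child_process";

// Free VRAM of GPU 0 in MiB, parsed from `nvidia-smi --query-gpu=memory.free`.
function getFreeVramMiB(): number {
  const out = execFileSync(
    "nvidia-smi",
    ["--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    { encoding: "utf8" }
  );
  return parseInt(out.trim().split("\n")[0], 10);
}

console.log(`Free VRAM: ${getFreeVramMiB()} MiB`);
```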
Minimal reproduction:
import { getLlama } from "node-llama-cpp";
// Precondition: another process is using most of the GPU VRAM
// (e.g., ollama running a 70B model, or use nvidia-smi to verify <400 MiB free)
const llama = await getLlama();
const model = await llama.loadModel({
modelPath: "/path/to/embedding-model.gguf",
// gpuLayers: "auto" -- this is the default
});
const context = await model.createEmbeddingContext();
// → Process killed with SIGABRT (exit code 134)

Test results at various VRAM pressure levels:
| VRAM Pressure | Free VRAM | Behavior | Exit Code |
|---|---|---|---|
| None (baseline) | 24,123 MiB | Normal operation, 88% GPU util, 79.4 KB/s | 0 |
| 16 GB ballast | ~7,856 MiB | Works fine, 88% GPU util, 75.4 KB/s | 0 (data, not clean exit) |
| 22 GB ballast | ~1,856 MiB | OOM Killed by Linux OOM killer | 137 (SIGKILL) |
| 23 GB ballast | ~856 MiB | Partial GPU offload, 3.3x slower, data errors | 137 |
| 23.5 GB ballast | ~356 MiB | CUDA OOM crash (SIGABRT) | 134 |
Key observation: At 356 MiB free VRAM, gpuLayers: "auto" attempted to load layers to GPU despite insufficient memory. Instead of detecting the failure and falling back to CPU (or fewer layers), the native CUDA allocation failed with cudaMalloc: out of memory, which triggered SIGABRT in the native library.
My Environment
| Dependency | Version |
|---|---|
| Operating System | Ubuntu 24.04.3 LTS (x64), kernel 6.8.0-94-generic |
| CPU | AMD Ryzen 7 7800X3D 8-Core Processor |
| GPU | NVIDIA GeForce RTX 3090 Ti (24,564 MiB VRAM) |
| NVIDIA Driver | 570.211.01 |
| Runtime | Bun 1.3.9 (also tested Node.js 22.22.0) |
| TypeScript version | 5.9.3 |
| node-llama-cpp version | 3.14.5 |
| Prebuilt binaries | b7347 |
`npx --yes node-llama-cpp inspect gpu` output:
OS: Ubuntu 24.04.3 LTS (x64)
Node: 22.22.0 (x64)
TypeScript: 5.9.3
node-llama-cpp: 3.14.5
Prebuilt binaries: b7347
CUDA: available
Vulkan: available
CUDA device: NVIDIA GeForce RTX 3090 Ti
CUDA used VRAM: 1.1% (266.19MB/23.56GB)
CUDA free VRAM: 98.89% (23.3GB/23.56GB)
Vulkan device: AMD Ryzen 7 7800X3D 8-Core Processor (RADV RAPHAEL_MENDOCINO)
Vulkan used VRAM: 0.12% (60.76MB/47.22GB)
Vulkan free VRAM: 99.87% (47.16GB/47.22GB)
CPU model: AMD Ryzen 7 7800X3D 8-Core Processor
Math cores: 8
Used RAM: 6.76% (6.32GB/93.44GB)
Free RAM: 93.23% (87.12GB/93.44GB)
Used swap: 1.73% (142.25MB/8GB)
Max swap size: 8GB
mmap: supported
Additional Context
Why this matters for applications
Applications using node-llama-cpp for embeddings (like qmd) are run on developer machines alongside other GPU-consuming processes (ollama, ComfyUI, training jobs). The SIGABRT is unrecoverable -- there is no way to catch it in JavaScript, so applications cannot provide error messages, retry with CPU-only, or suggest workarounds to users.
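About the only recourse an application has today is to isolate the load in a child process and inspect how it exited -- clumsy, and it loses the original error. A sketch, where embed-worker.js and the EMBED_FORCE_CPU flag are hypothetical names:

```ts
import { spawnSync } from "node:child_process";

// Run the embedding step (getLlama + loadModel) in a separate worker script,
// because the CUDA abort cannot be intercepted inside the same process.
const first = spawnSync(process.execPath, ["embed-worker.js"], { stdio: "inherit" });

// A direct child killed by SIGABRT reports signal === "SIGABRT";
// a shell wrapper would surface it as exit code 134 (128 + 6).
if (first.signal === "SIGABRT" || first.status === 134) {
  // Retry with the GPU disabled (the worker reads this hypothetical env flag).
  const retry = spawnSync(process.execPath, ["embed-worker.js"], {
    stdio: "inherit",
    env: { ...process.env, EMBED_FORCE_CPU: "1" }
  });
  process.exitCode = retry.status ?? 1;
}
```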
The auto estimation gap
Looking at resolveModelGpuLayersOption.ts, the auto mode estimates VRAM requirements from GGUF metadata before loading. When VRAM is consumed by external processes between estimation and actual CUDA allocation, the estimate becomes stale. The troubleshooting docs acknowledge estimation inaccuracies ("The built-in estimation mechanism may overestimate requirements") but the dangerous direction is underestimation.
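Until the estimation is hardened, an application can narrow (but not close) this window by re-checking free VRAM immediately before loading. A sketch, assuming the Llama instance exposes getVramState() with free VRAM in bytes; the model size and headroom factor are illustrative:

```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();

// Illustrative numbers: ~300 MB embedding model, 1.5x headroom on top of it.
const approxModelBytes = 300 * 1024 * 1024;
const headroomFactor = 1.5;

// Re-check free VRAM right before loading, so allocations made by external
// processes (ollama, ComfyUI, training jobs) since startup are accounted for.
const { free } = await llama.getVramState();

const model = await llama.loadModel({
  modelPath: "/path/to/embedding-model.gguf",
  gpuLayers: free > approxModelBytes * headroomFactor ? "auto" : 0
});
```

This only narrows the race -- another process can still claim VRAM between the check and the actual allocation -- which is why the native-level mitigations below would be more robust.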
Possible mitigations (suggestions, not demands)
- Safety margin: Build a safety margin into the auto VRAM estimation (e.g., reserve 10-15% of estimated free VRAM)
- Catch and retry: Catch native CUDA allocation failures and retry with fewer layers (or fall back to CPU)
- Double-check: Use a second VRAM check after estimation but before actual allocation
- Conservative mode: Expose a "conservative" auto mode that leaves more VRAM headroom
Related issues
- bug: Qwen Embedded doesn't work. #519 -- crash with `gpuLayers: "auto"` on Vulkan (similar failure mode on a different backend)
- bug: Failed to create context on M5 #549 -- context creation failures (potentially related to VRAM estimation)
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, and I know how to start.