Merged

Changes from 5 commits
5 changes: 5 additions & 0 deletions docs/ref/checks/custom_prompt_check.md
@@ -20,6 +20,10 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
- **`model`** (required): Model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## Implementation Notes

@@ -42,3 +46,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether the custom validation criteria were met
- **`confidence`**: Confidence score (0.0 to 1.0) for the validation
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
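
As a rough illustration (not part of this change), the `info` payload differs only by the `reason` field. The dictionaries below use hypothetical values; only the field names come from the list above.

```python
# Hypothetical `info` payloads from the custom prompt check.
# Field names follow the docs above; the values are invented.
info_without_reasoning = {  # include_reasoning=false (default)
    "flagged": True,
    "confidence": 0.84,
    "threshold": 0.7,
}

info_with_reasoning = {  # include_reasoning=true
    **info_without_reasoning,
    "reason": "The text matches the configured detection criteria.",
}
```
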
23 changes: 15 additions & 8 deletions docs/ref/checks/hallucination_detection.md
@@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"knowledge_source": "vs_abc123"
"knowledge_source": "vs_abc123",
"include_reasoning": false
}
}
```
@@ -24,6 +25,10 @@ Flags model text containing factual claims that are clearly contradicted or not
- **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
- When `false`: Returns only `flagged` and `confidence` to save tokens
- When `true`: Additionally returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
- Recommended: Keep disabled for production (default); enable for development/debugging

### Tuning guidance

@@ -102,7 +107,9 @@ See [`examples/hallucination_detection/`](https://github.com/openai/openai-guard

## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary:
Returns a `GuardrailResult` with the following `info` dictionary.

**With `include_reasoning=true`:**

```json
{
Expand All @@ -117,15 +124,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

### Fields

- **`flagged`**: Whether the content was flagged as potentially hallucinated
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`reasoning`**: Explanation of why the content was flagged
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim")
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported
- **`verified_statements`**: Statements that are supported by your documents
- **`threshold`**: The confidence threshold that was configured

Tip: `hallucination_type` is typically one of `factual_error`, `unsupported_claim`, or `none`.
- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*
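
A hedged consumption sketch: the `info` values below are invented, only the field names come from the list above, and the reasoning fields are read with `.get()` because they are omitted by default.

```python
# Hypothetical hallucination-detection `info` payload (include_reasoning=true).
info = {
    "flagged": True,
    "confidence": 0.9,
    "threshold": 0.7,
    "reasoning": "One claim contradicts the reference documents.",
    "hallucination_type": "factual_error",
    "hallucinated_statements": ["The fund returned 25% in 2024."],
    "verified_statements": ["The fund was launched in 2020."],
}

if info["flagged"] and info["confidence"] >= info["threshold"]:
    # Reasoning fields are absent when include_reasoning=false, so use .get().
    for claim in info.get("hallucinated_statements", []):
        print(f"Unsupported claim: {claim}")
```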

## Benchmark Results

9 changes: 7 additions & 2 deletions docs/ref/checks/jailbreak.md
@@ -33,7 +33,8 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
"name": "Jailbreak",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -42,6 +43,10 @@

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

### Tuning guidance

@@ -70,7 +75,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether a jailbreak attempt was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged)
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
- **`used_conversation_history`**: Boolean indicating whether conversation history was analyzed
- **`checked_text`**: JSON payload containing the conversation history and latest input that was analyzed

7 changes: 6 additions & 1 deletion docs/ref/checks/llm_base.md
@@ -9,7 +9,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
"name": "LLM Base",
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -18,6 +19,10 @@

- **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `true`: The LLM generates and returns detailed reasoning for its decisions (e.g., `reason`, `reasoning`, `observation`, `evidence` fields)
- When `false`: The LLM only returns the essential fields (`flagged` and `confidence`), reducing token generation costs
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging
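
A minimal, self-contained sketch of what this flag controls. The classes below are simplified stand-ins for the repo's `LLMOutput` and `LLMReasoningOutput` (assuming pydantic v2), not the actual definitions:

```python
# Toy stand-ins for the repo's output schemas; real definitions may differ.
from pydantic import BaseModel, Field


class LLMOutput(BaseModel):
    flagged: bool
    confidence: float = Field(ge=0.0, le=1.0)


class LLMReasoningOutput(LLMOutput):
    reason: str  # extra explanation the LLM must generate


def select_output_model(include_reasoning: bool) -> type[LLMOutput]:
    """Pick the structured-output schema the guardrail asks the LLM to fill."""
    return LLMReasoningOutput if include_reasoning else LLMOutput


print(sorted(select_output_model(False).model_json_schema()["properties"]))
# ['confidence', 'flagged']
print(sorted(select_output_model(True).model_json_schema()["properties"]))
# ['confidence', 'flagged', 'reason']
```

Requesting fewer fields means the model generates fewer tokens, which is where the cost savings come from.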

## What It Does

5 changes: 5 additions & 0 deletions docs/ref/checks/nsfw.md
@@ -29,6 +29,10 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

### Tuning guidance

@@ -51,6 +55,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether NSFW content was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*

### Examples

9 changes: 7 additions & 2 deletions docs/ref/checks/off_topic_prompts.md
@@ -20,6 +20,10 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
- **`model`** (required): Model to use for analysis (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Description of your business scope and acceptable topics
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## Implementation Notes

@@ -39,6 +43,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`flagged`**: Whether the content aligns with your business scope
- **`confidence`**: Confidence score (0.0 to 1.0) for the prompt injection detection assessment
- **`flagged`**: Whether the content is off-topic (outside your business scope)
- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
12 changes: 10 additions & 2 deletions docs/ref/checks/prompt_injection_detection.md
@@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
"name": "Prompt Injection Detection",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -40,6 +41,10 @@

- **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include the `observation` and `evidence` fields in the output (default: `false`)
- When `true`: Returns detailed `observation` explaining what the action is doing and `evidence` with specific quotes/details
- When `false`: Omits reasoning fields to save tokens (typically 100-300 tokens per check)
- Recommended: Keep disabled for production (default); enable for development/debugging

**Flags as MISALIGNED:**

@@ -77,13 +82,16 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`observation`**: What the AI action is doing
- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
- **`flagged`**: Whether the action is misaligned (boolean)
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`evidence`**: Specific evidence from conversation supporting the decision - *only included when `include_reasoning=true`*
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from conversation
- **`action`**: The list of function calls or tool outputs analyzed for alignment

**Note**: When `include_reasoning=false` (the default), the `observation` and `evidence` fields are omitted to reduce token generation costs.
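
For comparison, a hypothetical default payload with `include_reasoning=false`: the field names follow the list above, the values are invented, and `observation`/`evidence` are simply absent.

```python
# Hypothetical prompt-injection-detection `info` payload (include_reasoning=false).
info_default = {
    "flagged": False,
    "confidence": 0.2,
    "threshold": 0.7,
    "user_goal": "Book a refundable flight to Paris",
    "action": [{"type": "function_call", "name": "search_flights"}],  # illustrative shape
}
```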

## Benchmark Results

### Dataset Description
60 changes: 37 additions & 23 deletions src/guardrails/checks/text/hallucination_detection.py
@@ -94,8 +94,8 @@ class HallucinationDetectionOutput(LLMOutput):
Extends the base LLM output with hallucination-specific details.

Attributes:
flagged (bool): Whether the content was flagged as potentially hallucinated.
confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated.
flagged (bool): Whether the content was flagged as potentially hallucinated (inherited).
confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated (inherited).
reasoning (str): Detailed explanation of the analysis.
hallucination_type (str | None): Type of hallucination detected.
hallucinated_statements (list[str] | None): Specific statements flagged as
@@ -104,16 +104,6 @@ class HallucinationDetectionOutput(LLMOutput):
by the documents.
"""

flagged: bool = Field(
...,
description="Indicates whether the content was flagged as potentially hallucinated.",
)
confidence: float = Field(
...,
description="Confidence score (0.0 to 1.0) that the input is hallucinated.",
ge=0.0,
le=1.0,
)
reasoning: str = Field(
...,
description="Detailed explanation of the hallucination analysis.",
@@ -184,14 +174,6 @@ class HallucinationDetectionOutput(LLMOutput):
3. **Clearly contradicted by the documents** - Claims that directly contradict the documents → FLAG
4. **Completely unsupported by the documents** - Claims that cannot be verified from the documents → FLAG

Respond with a JSON object containing:
- "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
- "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
- "reasoning": string (detailed explanation of your analysis)
- "hallucination_type": string (type of issue, if detected: "factual_error", "unsupported_claim", or "none" if supported)
- "hallucinated_statements": array of strings (specific factual statements that may be hallucinated)
- "verified_statements": array of strings (specific factual statements that are supported by the documents)

**CRITICAL GUIDELINES**:
- Flag content if ANY factual claims are unsupported or contradicted (even if some claims are supported)
- Allow conversational, opinion-based, or general content to pass through
@@ -206,6 +188,30 @@ class HallucinationDetectionOutput(LLMOutput):
).strip()


# Instruction for output format when reasoning is enabled
REASONING_OUTPUT_INSTRUCTION = textwrap.dedent(
"""
Respond with a JSON object containing:
- "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
- "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
- "reasoning": string (detailed explanation of your analysis)
- "hallucination_type": string (type of issue, if detected: "factual_error", "unsupported_claim", or "none" if supported)
- "hallucinated_statements": array of strings (specific factual statements that may be hallucinated)
- "verified_statements": array of strings (specific factual statements that are supported by the documents)
"""
).strip()


# Instruction for output format when reasoning is disabled
BASE_OUTPUT_INSTRUCTION = textwrap.dedent(
"""
Respond with a JSON object containing:
- "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
- "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
"""
).strip()


async def hallucination_detection(
ctx: GuardrailLLMContextProto,
candidate: str,
@@ -242,15 +248,23 @@ async def hallucination_detection(
)

try:
# Create the validation query
validation_query = f"{VALIDATION_PROMPT}\n\nText to validate:\n{candidate}"
# Build the prompt based on whether reasoning is requested
if config.include_reasoning:
output_instruction = REASONING_OUTPUT_INSTRUCTION
output_format = HallucinationDetectionOutput
else:
output_instruction = BASE_OUTPUT_INSTRUCTION
output_format = LLMOutput

# Create the validation query with appropriate output instructions
validation_query = f"{VALIDATION_PROMPT}\n\n{output_instruction}\n\nText to validate:\n{candidate}"

# Use the Responses API with file search and structured output
response = await _invoke_openai_callable(
ctx.guardrail_llm.responses.parse,
input=validation_query,
model=config.model,
text_format=HallucinationDetectionOutput,
text_format=output_format,
tools=[{"type": "file_search", "vector_store_ids": [config.knowledge_source]}],
)

17 changes: 5 additions & 12 deletions src/guardrails/checks/text/jailbreak.py
@@ -40,8 +40,6 @@
import textwrap
from typing import Any

from pydantic import Field

from guardrails.registry import default_spec_registry
from guardrails.spec import GuardrailSpecMetadata
from guardrails.types import GuardrailLLMContextProto, GuardrailResult, token_usage_to_dict
@@ -50,6 +48,7 @@
LLMConfig,
LLMErrorOutput,
LLMOutput,
LLMReasoningOutput,
create_error_result,
run_llm,
)
@@ -226,15 +225,6 @@
MAX_CONTEXT_TURNS = 10


class JailbreakLLMOutput(LLMOutput):
"""LLM output schema including rationale for jailbreak classification."""

reason: str = Field(
...,
description=("Justification for why the input was flagged or not flagged as a jailbreak."),
)


def _build_analysis_payload(conversation_history: list[Any] | None, latest_input: str) -> str:
"""Return a JSON payload with recent turns and the latest input."""
trimmed_input = latest_input.strip()
@@ -251,12 +241,15 @@ async def jailbreak(ctx: GuardrailLLMContextProto, data: str, config: LLMConfig)
conversation_history = getattr(ctx, "get_conversation_history", lambda: None)() or []
analysis_payload = _build_analysis_payload(conversation_history, data)

# Use LLMReasoningOutput (with reason) if reasoning is enabled, otherwise use base LLMOutput
output_model = LLMReasoningOutput if config.include_reasoning else LLMOutput

analysis, token_usage = await run_llm(
analysis_payload,
SYSTEM_PROMPT,
ctx.guardrail_llm,
config.model,
JailbreakLLMOutput,
output_model,
)

if isinstance(analysis, LLMErrorOutput):