Guardrails

This page serves as a catalog of guardrails. It explains the theory of operation and the specific configuration options for each type of guardrail. For in-depth usage instructions, see the Tutorials.

All guardrails operate on a common data model called the Entailment Frame; how each guardrail type populates and interprets the frame is also documented here.
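
As a rough mental model, the frame can be pictured as a simple record. The sketch below is illustrative only; the field names follow the per-guardrail descriptions on this page and are not taken from an actual client schema.

```python
# Illustrative sketch of the Entailment Frame data model; field names follow
# the per-guardrail descriptions on this page, not an actual client schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntailmentFrame:
    context: Optional[str] = None       # optional background knowledge, e.g. RAG documents
    question: str = ""                  # the question to be answered about the context
    answer: Optional[str] = None        # a candidate answer to be verified
    eval: Optional[str] = None          # output: overall assessment, "Yes" or "No"
    confidence: Optional[float] = None  # output: guardrail-specific confidence measure
    proof: Optional[dict] = None        # output: guardrail-specific supporting detail
```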

Common Configuration

All guardrails share the same basic information fields.

Basic Information

  • Guardrail ID: A generated UUID used to refer to the guardrail programmatically; the Python client uses this ID to select which guardrail to run.
  • Application: The application container to which the guardrail belongs.
  • Guardrail Name: A human-readable label for this guardrail. This is not used programmatically.
  • Guardrail Type: The technique behind the guardrail; the rest of this page focuses on the different guardrail types.

Consensus

Theory of Operation

Consensus is an agentic guardrail that performs an LLM-as-a-Judge assessment of the query with several parallel LLM requests, consolidating their responses into a majority vote.

```mermaid
flowchart LR
    E[("Entailment Frame")] --> J1["LLM Judge"]
    E --> J2["LLM Judge"]
    E --> J3["LLM Judge"]
    J1 --> V["Vote"]
    V --> EVAL["Eval"]
    J2 --> V
    J3 --> V
```
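
The voting step itself is straightforward. The sketch below is illustrative only: the judge callable stands in for a real LLM-as-a-Judge request rather than any documented API. It shows parallel Yes/No verdicts consolidated into a majority-vote eval, with confidence reported as the fraction of judges in agreement.

```python
# Illustrative sketch of consensus voting. The "judge" callable stands in for
# a real LLM-as-a-Judge request and is not part of any documented API.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def consensus_vote(
    judge: Callable[[str, str, str], str],  # (context, question, answer) -> "Yes" or "No"
    context: str,
    question: str,
    answer: str,
    iterations: int = 5,
) -> dict:
    """Run several judges in parallel and consolidate their verdicts by majority vote."""
    with ThreadPoolExecutor(max_workers=iterations) as pool:
        votes = list(pool.map(lambda _: judge(context, question, answer), range(iterations)))
    verdict, count = Counter(votes).most_common(1)[0]
    # Eval is the majority verdict; Confidence is the share of judges agreeing with it.
    return {"eval": verdict, "confidence": count / iterations}
```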

Entailment Frame

  • Context: Optional background knowledge, e.g., RAG documents.
  • Question: The question to be answered about the context (or from an LLM's parametric knowledge).
  • Answer: An answer to the question to be verified.
  • Eval: The overall assessment: Yes if the answer is deemed correct; No otherwise.
    • Confidence: The percentage of judges in agreement.
  • Proof: N/A

Configuration Options

Consensus Configuration

  • LLM: Which LLM to use for the judge. At this time, all judges use the same LLM.
  • Iterations: The number of judges to use.

Consistency Checking

Theory of Operation

When an LLM "knows" something, it tends to provide a consistent response. However, when information is missing from its parametric memory (and from the context), it will still answer with its highest-probability guess; this is a common source of hallucination. The consistency checking guardrail rephrases the context, question, and answer and asks the LLM to determine entailment for each rephrasing: does the answer follow from the context and question? If the answer is genuinely supported, the LLM responds consistently across rephrasings; widely varying responses indicate uncertainty in the answer.

A semantic similarity score is used to weight the rephrasing results. This score is determined by taking the cosine distance in an embedding space between the original frame and the rephrased version.

```mermaid
flowchart LR
    E[("Entailment Frame")] --> R["Rephrase"]
    R --> V1["Rephrased Frame"]
    R --> V2["Rephrased Frame"]
    R --> V3["Rephrased Frame"]
    V1 -->|Weight by Similarity| ENT["Assess Entailment"]
    V2 -->|Weight by Similarity| ENT
    V3 -->|Weight by Similarity| ENT
    ENT --> EVAL["Eval"]
```
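
As a rough sketch of the weighting step, the snippet below weights each rephrased frame's entailment verdict by the cosine similarity between its embedding and the original frame's embedding, then aggregates the weighted verdicts into a degree of entailment. The rephrase, embed, and entails callables stand in for the configured Utility LLM, embedding model, and primary LLM; using cosine similarity as the weight and the 0.5 decision threshold are assumptions of the sketch.

```python
# Illustrative sketch of similarity-weighted consistency checking. The three
# callables stand in for the configured Utility LLM (rephrase), embedding
# model (embed), and primary LLM (entails); none is a documented API.
import math
from typing import Callable, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def consistency_check(
    frame: str,                               # original context + question + answer, as text
    rephrase: Callable[[str], str],           # Utility LLM: produce one rephrased frame
    embed: Callable[[str], Sequence[float]],  # embedding model
    entails: Callable[[str], bool],           # primary LLM: does the answer follow?
    iterations: int = 5,
) -> dict:
    original_vec = embed(frame)
    weighted_yes, total_weight = 0.0, 0.0
    for _ in range(iterations):
        variant = rephrase(frame)
        weight = cosine_similarity(original_vec, embed(variant))  # closer rephrasings count more
        weighted_yes += weight * (1.0 if entails(variant) else 0.0)
        total_weight += weight
    degree = weighted_yes / total_weight if total_weight else 0.0
    # The 0.5 threshold is an assumption for illustration; Confidence is the degree of entailment.
    return {"eval": "Yes" if degree >= 0.5 else "No", "confidence": degree}
```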

Entailment Frame

  • Context: Optional background knowledge, e.g., RAG documents.
  • Question: The question to be answered about the context (or from an LLM's parametric knowledge).
  • Answer: An answer to the question to be verified.
  • Eval: The overall assessment: Yes if the answer is entailed by the combination of context and question; No otherwise.
    • Confidence: The degree of entailment.
  • Proof: N/A

Configuration Options

Consistency Checking Configuration

  • LLM: Which LLM to use for the primary assessment of entailment.
  • Iterations: The number of rephrased samples to analyze.
  • Utility LLM: Which LLM to use to perform the rephrases. Using a weaker LLM is often helpful to encourage variety.
  • Embedding Model: Which embedding model to use for semantic similarity measurements between rephrasings.

Critique and Revise

Theory of Operation

Critique & Revise configures two adversarial agents that work together to verify the frame. The critique agent analyzes whether the context and question entail the answer. The review agent then reviews the critique agent's assessment and, if it disagrees, provides feedback explaining why. This iterates until the two agents agree or a preset iteration limit is reached; in either case, the most recent critique result is returned as the final eval.

There is an additional counterfactual mode in which the critique agent attempts to generate counterfactuals that disprove entailment; it is more adversarial than the neutral critique agent in the default configuration.

```mermaid
flowchart LR
    E[("Entailment Frame")] --> C["Critique"]
    C --> R["Review"]
    R --> A{"Agreement?"}
    A -->|Yes| V["Eval"]
    A -->|No| I{"Iteration Limit?"}
    I -->|Yes| V
    I -->|No| C
```
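
The debate loop can be sketched as follows; the critique and review callables stand in for the two configured agents and are not a documented API. The loop ends on agreement or when the iteration limit is reached, and in either case the most recent critique verdict becomes the final eval.

```python
# Illustrative sketch of the critique/review debate loop. The "critique" and
# "review" callables stand in for the two LLM agents, not a documented API.
from typing import Callable, Optional, Tuple

def critique_and_revise(
    frame: str,                                                # context + question + answer, as text
    critique: Callable[[str, Optional[str]], str],             # (frame, feedback) -> "Yes" or "No"
    review: Callable[[str, str], Tuple[bool, Optional[str]]],  # (frame, verdict) -> (agrees, feedback)
    iterations: int = 3,
) -> dict:
    feedback: Optional[str] = None
    verdict = "No"
    for _ in range(iterations):
        verdict = critique(frame, feedback)        # critique agent assesses entailment
        agrees, feedback = review(frame, verdict)  # review agent accepts or explains its disagreement
        if agrees:
            break
    # Whether the agents agreed or the iteration limit was hit,
    # the most recent critique verdict is returned as the final eval.
    return {"eval": verdict}
```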

Entailment Frame

  • Context: Optional background knowledge, e.g., RAG documents.
  • Question: The question to be answered about the context (or from an LLM's parametric knowledge).
  • Answer: An answer to the question to be verified.
  • Eval: The overall assessment: Yes if the answer is entailed by the combination of context and question; No otherwise.
    • Confidence: N/A
  • Proof: N/A

Configuration Options

Critique & Revise Configuration

  • LLM: Which LLM to use for the agents.
  • Iterations: The number of iterations before ending the "debate".
  • Counterfactuals: Check to enable counterfactual mode.

Human Review

Theory of Operation

Human Review is a manual review by human analysts and uses no automation. The raw entailment frame is presented in the Human Review Dashboard as both input and output. Note that this is the same dashboard that some other guardrails optionally use for review of automation-produced assessments.

Entailment Frame

  • Context: Optional background knowledge, e.g., RAG documents.
  • Question: The question to be answered about the context.
  • Answer: An answer to the question to be verified.
  • Eval: The overall assessment: Yes if the answer is deemed correct; No otherwise.
    • Confidence: The percentage of reviewers in agreement.
  • Proof: Open-ended, per the organization's review guidelines.

Configuration Options

N/A

LLM-as-a-Judge

Theory of Operation

LLM-as-a-Judge uses an LLM agent to assess the accuracy and quality of an input frame. This relies on the ability of the configured LLM to reason over the frame internally.

```mermaid
flowchart LR
    E[("Entailment Frame")] --> J1["LLM Judge"]
    J1 --> EVAL["Eval"]
```
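
In code, a single-judge assessment reduces to one prompted call over the frame. The sketch below is illustrative only; the complete callable stands in for the configured LLM, and the prompt wording is an assumption rather than the prompt actually used by the guardrail.

```python
# Illustrative sketch of a single LLM-as-a-Judge call. "complete" stands in
# for the configured LLM, and the prompt wording is an assumption.
from typing import Callable

JUDGE_PROMPT = """You are a careful judge.
Context: {context}
Question: {question}
Answer: {answer}
Is the answer correct given the context (or well-established knowledge)?
Reply with exactly "Yes" or "No"."""

def llm_as_a_judge(complete: Callable[[str], str], context: str, question: str, answer: str) -> dict:
    reply = complete(JUDGE_PROMPT.format(context=context, question=question, answer=answer))
    # Normalize the completion into the frame's Yes/No eval.
    return {"eval": "Yes" if reply.strip().lower().startswith("yes") else "No"}
```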

Entailment Frame

  • Context: Optional background knowledge, e.g., RAG documents.
  • Question: The question to be answered about the context (or from an LLM's parametric knowledge).
  • Answer: An answer to the question to be verified.
  • Eval: The overall assessment: Yes if the answer is deemed correct; No otherwise.
    • Confidence: The percentage of judges in agreement.
  • Proof: N/A

Configuration Options

LLM-as-a-Judge Configuration

  • LLM: Which LLM to use for the judge.

Policy Rules

Theory of Operation

A common usage pattern for guardrails is to assess compliance of a chunk of input text against a written (natural language) policy.

When configuring a Policy Rules guardrail, we:

  1. Extract (from a complex policy document) a simplified list of (natural language) rules that demonstrate compliance.
  2. Formalize those rules into a domain-specific language (DSL) that helps integrate an SMT solver with an LLM-powered system.
  3. Generate natural language questions that provide the inputs for the DSL program; these are designed to be straightforward for LLM-powered data extraction or human-powered data review, avoiding complex reasoning when processing a chunk.

```mermaid
flowchart LR
    P[("Policy Document")] --> R["Rules"]
    R --> D["DSL Code"]
    D --> Q["Questions"]
    D --> G["Guardrail"]
    Q --> G
```
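
As a loose illustration of these configuration-time artifacts: the actual DSL is not documented here, so the sketch below approximates it directly with z3 (SMT solver) expressions, and every rule name, variable, and question wording is invented for the example.

```python
# Illustrative configuration-time artifacts for a hypothetical data-handling
# policy. The DSL is approximated here with raw z3 expressions, and every
# rule, variable, and question below is invented for the example.
from z3 import Bool, Int

# Variables filled in by answers extracted from each chunk at inference time.
has_encryption = Bool("has_encryption")
retention_days = Int("retention_days")

# Simplified natural-language rules, formalized as assertions.
rules = {
    "encryption_required": has_encryption,        # "Data must be encrypted at rest."
    "retention_limited":   retention_days <= 90,  # "Data may be retained for at most 90 days."
}

# Generated questions that supply the inputs; each maps to one variable and
# asks for a simple fact rather than complex reasoning over the chunk.
questions = {
    "has_encryption": "Does this text state that the data is encrypted at rest?",
    "retention_days": "For how many days does this text say the data is retained?",
}
```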

At inference time:

  1. The questions are posed to an LLM about the input chunk. Because a given chunk may not address every part of the policy, not all questions need be answered.
  2. (Optional) LLM answers can be reviewed and corrected by a human analyst.
  3. The DSL is executed using these answers as inputs.

```mermaid
flowchart LR
    E[("Entailment Frame")] -->|Questions| L["LLM"]
    L -->|Answers| S["Solver w/ DSL"]
    L -.->|Answers| H["Human Review Dashboard"]
    H -.-> S
    S --> EVAL["Eval"]
```
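
Continuing the hypothetical example above, inference-time evaluation might look like the sketch below: extracted answers become solver facts, each rule is checked separately to produce the Proof, and Confidence is the fraction of satisfied assertions. Treating an unanswered question as leaving its rule unsatisfied, and requiring every rule to hold for a Yes eval, are assumptions of the sketch.

```python
# Illustrative inference-time evaluation for the hypothetical rules above.
# "facts" holds answers extracted by the LLM (or corrected in the Human Review
# Dashboard) expressed as z3 constraints; unanswered questions are omitted.
from z3 import Bool, Int, Not, Solver, unsat

has_encryption = Bool("has_encryption")
retention_days = Int("retention_days")
rules = {
    "encryption_required": has_encryption,
    "retention_limited":   retention_days <= 90,
}

facts = [has_encryption, retention_days == 30]  # answers extracted from one chunk

proof = {}
for name, assertion in rules.items():
    solver = Solver()
    solver.add(*facts)
    solver.add(Not(assertion))
    # If the facts contradict the negated rule, the rule is entailed by the chunk.
    proof[name] = 1.0 if solver.check() == unsat else 0.0

confidence = sum(proof.values()) / len(proof)        # percentage of satisfied assertions
overall_eval = "Yes" if confidence == 1.0 else "No"  # assumption: Yes only if every rule holds
print(overall_eval, confidence, proof)
```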

Entailment Frame

  • Context: An input "chunk" to be analyzed for (non-)compliance with a configured policy.
  • Question: An implicit "Is this consistent with policy X?" (in practice, treated as a continuation of the context).
  • Answer: N/A
  • Eval: The overall assessment: Yes if the chunk is deemed compliant with the configured policy; No otherwise.
    • Confidence: The percentage of rules (DSL assertions) that have been satisfied.
  • Proof: A separate True/False (1.0/0.0) result for each assertion, one per policy rule. This allows a detailed assessment of why the overall eval is Yes or No.

Configuration Options

Policy Rules Configuration

The following basic configuration items set up the Policy Rules configuration tool.

  • LLM: Which LLM to use for answer extraction at inference time.
  • Utility LLM: Which LLM to use for configuration-time rule extraction, code generation, and question generation.
  • Human Review Enabled: Whether to pause inference and route extracted questions to the Human Review Dashboard for verification and correction.

Additionally, there are a number of fields pertaining to the analysis and configuration of the policy rules. These are described in the Policy Rules Tutorial.