Guardrails
This page serves as a catalog of guardrails. It explains the theory of operation and the specific configuration options for each type of guardrail. For in-depth usage instructions, see the Tutorials.
All guardrails operate on a common data model called the Entailment Frame. This is also documented here for each guardrail type.
Common Configuration
All guardrails share the same basic information fields.
- Guardrail ID: A generated UUID used to refer to the guardrail programmatically; the Python client uses this ID to select which guardrail to run.
- Application: The application container to which the guardrail belongs.
- Guardrail Name: A human-readable label for this guardrail. This is not used programmatically.
- Guardrail Type: The technique behind the guardrail; the rest of this page focuses on the different guardrail types.
Consensus
Theory of Operation
Consensus is an agentic guardrail that performs an LLM-as-a-Judge assessment of the query with several parallel LLM requests, consolidating their responses into a majority vote.
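The mechanism can be pictured with the minimal sketch below. It is not the product implementation; the `judge` callable is a placeholder for whatever call is made to the configured judge LLM.

```python
# Minimal sketch of the majority-vote mechanism, assuming a caller-supplied
# `judge` callable that wraps the configured LLM and returns "Yes" or "No".
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def consensus_vote(judge: Callable[[str, str, str], str],
                   context: str, question: str, answer: str,
                   iterations: int = 5) -> dict:
    # Run the judges in parallel; each sees the same entailment frame.
    with ThreadPoolExecutor(max_workers=iterations) as pool:
        votes = list(pool.map(lambda _: judge(context, question, answer),
                              range(iterations)))
    # Consolidate the parallel verdicts into a majority vote.
    winner, count = Counter(votes).most_common(1)[0]
    return {"eval": winner, "confidence": count / iterations}
```

With an even number of judges a tie is possible; the sketch above simply takes the most common vote, so an odd Iterations setting yields a cleaner majority.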
Entailment Frame
- Context: Optional background knowledge, e.g., RAG documents.
- Question: The question to be answered about the context (or from an LLM's parametric knowledge).
- Answer: An answer to the question to be verified.
- Eval: The overall assessment: Yes if the answer is deemed correct; No otherwise.
- Confidence: The percentage of judges in agreement.
- Proof: N/A
Configuration Options
- LLM: Which LLM to use for the judge. At this time, all judges use the same LLM.
- Iterations: The number of judges to use.
Consistency Checking
Theory of Operation
When an LLM "knows" something, it tends to respond consistently. When the relevant information is missing from its parametric memory (and from the context), however, it still answers with its highest-probability guess, which is a common source of hallucination. The consistency checking guardrail rephrases the context, question, and answer and asks the LLM to determine entailment: does the answer follow from the context and question? If it does, the LLM gives consistent verdicts across rephrasings; widely differing answers indicate uncertainty in the answer.
A semantic similarity score is used to weight the rephrasing results. This score is determined by taking the cosine distance in an embedding space between the original frame and the rephrased version.
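The interplay of rephrasing and similarity weighting is sketched below. The `rephrase`, `check_entailment`, and `embed` callables are placeholders for the Utility LLM, the primary LLM, and the configured embedding model; none of these names belong to the product API.

```python
# Sketch of the consistency-checking loop: rephrase the frame, re-check
# entailment, and weight each verdict by its similarity to the original frame.
import numpy as np

def consistency_check(frame: str, rephrase, check_entailment, embed,
                      iterations: int = 5) -> dict:
    base = embed(frame)
    weighted_yes, total_weight = 0.0, 0.0
    for _ in range(iterations):
        variant = rephrase(frame)              # utility LLM rewrites the frame
        verdict = check_entailment(variant)    # primary LLM returns "Yes" or "No"
        vec = embed(variant)
        # Cosine similarity between the original and rephrased frame.
        weight = float(np.dot(base, vec) /
                       (np.linalg.norm(base) * np.linalg.norm(vec)))
        weighted_yes += weight * (verdict == "Yes")
        total_weight += weight
    confidence = weighted_yes / total_weight if total_weight else 0.0
    return {"eval": "Yes" if confidence >= 0.5 else "No", "confidence": confidence}
```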
Entailment Frame
- Context: Optional background knowledge, e.g., RAG documents.
- Question: The question to be answered about the context (or from an LLM's parametric knowledge).
- Answer: An answer to the question to be verified.
- Eval: The overall assessment: Yes if the answer is entailed by the combination of context and question; No otherwise.
- Confidence: The degree of entailment.
- Proof: N/A
Configuration Options
- LLM: Which LLM to use for the primary assessment of entailment.
- Iterations: The number of rephrased samples to analyze.
- Utility LLM: Which LLM to use to perform the rephrases. Using a weaker LLM is often helpful to encourage variety.
- Embedding Model: Which embedding model to use for semantic similarity measurements between rephrasings.
Critique & Revise
Theory of Operation
Critique & Revise configures two adversarial agents that work together to verify the frame. The critique agent analyzes the context and question for entailment of the answer. The review agent then reviews the critique agent's assessment and, if it disagrees, provides feedback on why. This iterates until the two agents agree or a preset iteration limit is reached; in either case, the most recent critique result is returned as the final eval.
There is an additional counterfactual mode in which the critique agent attempts to generate counterfactuals that disprove entailment; it is more adversarial than the neutral critique agent in the default configuration.
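A rough sketch of the debate loop, assuming hypothetical `critique` and `review` callables that wrap the two agents:

```python
# Sketch only: `critique` returns "Yes"/"No" given the frame and any prior
# feedback; `review` returns (agrees, feedback). Neither is a product API.
def critique_and_revise(frame: str, critique, review, iterations: int = 3) -> str:
    verdict, feedback = None, None
    for _ in range(iterations):
        verdict = critique(frame, feedback)        # critique agent assesses entailment
        agrees, feedback = review(frame, verdict)  # review agent accepts or pushes back
        if agrees:
            break
    # Whether the agents converged or the iteration limit was hit,
    # the most recent critique result is the final eval.
    return verdict
```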
Entailment Frame
- Context: Optional background knowledge, e.g., RAG documents.
- Question: The question to be answered about the context (or from an LLM's parametric knowledge).
- Answer: An answer to the question to be verified.
- Eval: The overall assessment: Yes if the answer is entailed by the combination of context and question; No otherwise.
- Confidence: N/A
- Proof: N/A
Configuration Options
- LLM: Which LLM to use for the agents.
- Iterations: The number of iterations before ending the "debate".
- Counterfactuals: Check to enable counterfactual mode.
Human Review
Theory of Operation
Human Review is a manual review by human analysts, and uses no automation. The raw entailment frame is presented in the Human Review Dashboard as both input and output. Note: this is the same Dashboard that is optionally used for review of automation-produced assessments in some other guardrails.
Entailment Frame
- Context: Optional background knowledge, e.g., RAG documents.
- Question: The question to be answered about the context.
- Answer: An answer to the question to be verified.
- Eval: The overall assessment: Yes if the answer is deemed correct; No otherwise.
- Confidence: N/A
- Proof: Open per organization's review guidelines.
Configuration Options
N/A
LLM-as-a-Judge
Theory of Operation
LLM-as-a-Judge uses an LLM agent to assess the accuracy and quality of an input frame. This relies on the ability of the configured LLM to reason over the frame internally.
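An illustrative judge prompt is shown below; the guardrail's actual prompt is internal and may differ.

```python
# Illustrative only: a single-judge prompt over the entailment frame.
JUDGE_PROMPT = """You are a strict judge of factual correctness.
Context: {context}
Question: {question}
Answer: {answer}
Is the answer correct given the context and question? Reply with Yes or No."""

def build_judge_prompt(context: str, question: str, answer: str) -> str:
    return JUDGE_PROMPT.format(context=context, question=question, answer=answer)
```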
Entailment Frame
- Context: Optional background knowledge, e.g., RAG documents.
- Question: The question to be answered about the context (or from an LLM's parametric knowledge).
- Answer: An answer to the question to be verified.
- Eval: The overall assessment: Yes if the answer is deemed correct; No otherwise.
- Confidence: The percentage of judges in agreement.
- Proof: N/A
Configuration Options
- LLM: Which LLM to use for the judge.
Policy Rules
Theory of Operation
A common usage pattern for guardrails is to assess compliance of a chunk of input text against a written (natural language) policy.
When configuring a Policy Rules guardrail, we:
- Extract (from a complex policy document) a simplified list of (natural language) rules that demonstrate compliance.
- Formalize those rules into a domain specific language (DSL) that helps integrate an SMT solver with an LLM-powered system.
- Generate natural language questions that provide the inputs for the DSL program; these are designed to be straightforward for LLM-powered data extraction (or human-powered data review), avoiding complex reasoning when processing a chunk. A sketch of this formalization follows the list.
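The sketch below is purely illustrative: the real DSL is product-specific, and the rule, variable names, and answers here are invented. It shows the general idea of turning an extracted rule into a boolean assertion over question answers, expressed with the open-source Z3 SMT solver's Python bindings.

```python
# Illustrative only: formalizing one extracted rule as an SMT assertion.
# Rule (natural language): "If the document contains personal data, it must
# state a retention period."
from z3 import Bool, Implies, Solver, sat

contains_personal_data = Bool("contains_personal_data")    # from question 1
states_retention_period = Bool("states_retention_period")  # from question 2

# Answers extracted from a chunk by the LLM (or corrected in human review).
answers = {contains_personal_data: True, states_retention_period: False}

solver = Solver()
solver.add(Implies(contains_personal_data, states_retention_period))  # the rule
for variable, value in answers.items():
    solver.add(variable == value)

print("compliant" if solver.check() == sat else "non-compliant")
```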
Rule Extraction Methods
The system supports two methods for extracting rules from policy documents:
- Basic Method: Uses a simple LLM prompt to extract rules directly from the entire policy document.
- RAG Method: Uses a Retrieval-Augmented Generation (RAG) pipeline (sketched below) that:
  - Chunks the policy document using semantic or size-based segmentation
  - Extracts rules from each chunk in parallel
  - Deduplicates rules using semantic similarity
  - Filters rules for topical relevance using similarity scoring
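A rough sketch of that pipeline, with hypothetical `chunk_document`, `extract_rules`, and `embed` callables standing in for the segmentation step, the per-chunk LLM call, and the embedding model:

```python
# Sketch of the RAG extraction pipeline: chunk, extract in parallel, deduplicate.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def extract_policy_rules(document: str, chunk_document, extract_rules, embed,
                         dedup_threshold: float = 0.9) -> list[str]:
    chunks = chunk_document(document)
    with ThreadPoolExecutor() as pool:
        rule_lists = list(pool.map(extract_rules, chunks))  # one LLM call per chunk
    candidates = [rule for rules in rule_lists for rule in rules]
    # Deduplicate: drop any rule whose embedding is too close to one already kept.
    kept, kept_vecs = [], []
    for rule in candidates:
        vec = embed(rule)
        vec = vec / np.linalg.norm(vec)
        if all(float(np.dot(vec, other)) < dedup_threshold for other in kept_vecs):
            kept.append(rule)
            kept_vecs.append(vec)
    return kept
```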
Filtering Strategies
When using the RAG method, you can choose between two filtering strategies to ensure extracted rules are relevant to your query theme:
- Embedding Similarity: Uses cosine similarity between rule embeddings and query embeddings to filter rules. This approach is fast and works well for general topical filtering.
- Provence Reranker: Uses a specialized neural reranker model (`naver/provence-reranker-debertav3-v1`) that analyzes the semantic relationship between rules and the query. This approach provides more accurate filtering by understanding context and meaning at a deeper level, but requires additional computational resources for the first-time model download.
Both strategies use the same relevance threshold (0.0-1.0), where higher values result in more strict filtering and lower values are more permissive.
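The embedding similarity strategy reduces to thresholded cosine similarity, roughly as sketched below; the Provence Reranker strategy replaces the scoring step with the reranker model. The embedding model named here is an arbitrary example, not necessarily the one the guardrail uses.

```python
# Sketch of embedding-similarity filtering against a relevance threshold.
from sentence_transformers import SentenceTransformer
import numpy as np

def filter_rules(rules: list[str], query_theme: str,
                 relevance_threshold: float = 0.5) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, not the product's
    vectors = model.encode(rules + [query_theme])
    rule_vecs, query_vec = vectors[:-1], vectors[-1]
    scores = rule_vecs @ query_vec / (
        np.linalg.norm(rule_vecs, axis=1) * np.linalg.norm(query_vec))
    # Higher thresholds keep fewer, more on-topic rules; lower are more permissive.
    return [rule for rule, score in zip(rules, scores) if score >= relevance_threshold]
```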
At inference time:
- The configured questions are put to an LLM about the input chunk. Because a given chunk may not address every part of the policy, not all questions need be answered.
- (Optional) LLM answers can be reviewed and corrected by a human analyst.
- The DSL is executed using these answers as inputs (see the sketch after this list).
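Putting those steps together, a hedged sketch with hypothetical `answer_question` and `run_rule` callables standing in for LLM answer extraction and per-rule DSL execution:

```python
# Sketch of the inference-time flow for one input chunk.
def evaluate_chunk(chunk: str, questions: list[str], rules: list,
                   answer_question, run_rule) -> dict:
    # Ask each configured question of the chunk; unaddressed ones may stay None.
    answers = {q: answer_question(chunk, q) for q in questions}
    # (Optional) human review and correction of `answers` would happen here.
    proof = [run_rule(rule, answers) for rule in rules]     # True/False per rule
    satisfied = sum(1.0 for ok in proof if ok)
    return {
        "eval": "Yes" if all(proof) else "No",
        "confidence": satisfied / len(proof) if proof else 1.0,  # % of rules satisfied
        "proof": proof,
    }
```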
Entailment Frame
- Context: An input "chunk" to be analyzed for (non-)compliance with a configured policy.
- Question: An implicit "Is this consistent with policy X?" (practically speaking, continued context).
- Answer: N/A
- Eval: The overall assessment: Yes if the chunk is deemed compliant with the policy; No otherwise.
- Confidence: The percentage of rules (DSL assertions) that have been satisfied.
- Proof: A separate True/False (`1.0`/`0.0`) result for each assertion, corresponding to each policy rule. This allows a detailed assessment of why the overall assessment is Yes/No.
Configuration Options
The following basic configuration items set up the Policy Rules configuration tool.
- LLM: Which LLM to use for answer extraction at inference time.
- Utility LLM: Which LLM to use for configuration-time rule extraction, code generation, and question generation.
- Human Review Enabled: Whether to pause inference and route extracted questions to the Human Review Dashboard for verification and correction.
Additionally, there are a number of fields pertaining to the analysis and configuration of the policy rules. These are described in the Policy Rules Tutorial.
Complex Workflows
For advanced verification scenarios that require multiple guardrails working together, see the Applications documentation. Applications enable you to create DAG (Directed Acyclic Graph) workflows that orchestrate multiple guardrails with custom logic and dependencies.
Testing Guardrails
For interactive testing of individual guardrails and applications, see the Playground documentation. The Playground provides a user-friendly interface for testing verification scenarios without writing code.