Datasets

Datasets are collections of Documents assembled for batch testing. They answer the question: "How does my ruleset perform across a range of inputs?"

Rather than testing one document at a time, a dataset lets you run a ruleset against many documents in a single batch, producing results you can compare and analyze systematically.

What Datasets Are For

With a dataset, you can:

Batch test -- Run a ruleset against many documents at once instead of one at a time
Measure consistency -- Use variance testing to see how stable results are across repeated runs
Organize test data -- Group documents by scenario, compliance area, or testing purpose
Iterate systematically -- Modify DSAIL rules or questions, re-run against the same dataset, and compare results
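As a rough illustration of what measuring consistency can mean, the sketch below computes an agreement rate per document across repeated runs. The metric, the outcome labels, and the document IDs are all assumptions made for this example; the platform's variance testing may calculate stability differently.

```python
from collections import Counter


def agreement_rate(outcomes):
    """Return the fraction of repeated runs that match the most common outcome."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes)


# Hypothetical pass/fail outcomes for two documents over five repeated runs.
repeated_runs = {
    "doc-a": ["pass", "pass", "pass", "pass", "pass"],
    "doc-b": ["pass", "fail", "pass", "pass", "fail"],
}

stability = {doc_id: agreement_rate(results) for doc_id, results in repeated_runs.items()}

for doc_id, rate in stability.items():
    print(f"{doc_id}: {rate:.0%} of runs agree")
```

A rate of 1.0 means every run produced the same outcome for that document; lower values point to documents where the ruleset behaves inconsistently.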

How Datasets Work

The typical workflow for using datasets is:

Create a dataset with a name and description
Add documents from your project's document library
Optionally generate synthetic records to expand coverage
Run tests against the dataset using Runs
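The workflow above can be pictured with a small in-memory model. This is only a conceptual sketch: the `Dataset` class, the `run_batch` helper, and the Python function standing in for a ruleset are hypothetical and are not part of the platform, where rulesets are written in DSAIL and executed through Runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Dataset:
    """Hypothetical stand-in for a dataset: a name, a description, and documents."""
    name: str
    description: str = ""
    documents: Dict[str, str] = field(default_factory=dict)  # document ID -> content


def run_batch(dataset: Dataset, ruleset: Callable[[str], bool]) -> Dict[str, bool]:
    """Apply one ruleset to every document and collect a result per document."""
    return {doc_id: ruleset(content) for doc_id, content in dataset.documents.items()}


# 1. Create a dataset with a name and description.
dataset = Dataset("refund-policy-checks", "Refund scenarios for batch testing")

# 2. Add documents (illustrative content only).
dataset.documents["doc-1"] = "Refunds are issued within 30 days."
dataset.documents["doc-2"] = "No refund window is stated."


# 3. Run a ruleset against the whole dataset in one batch.
def mentions_refund_window(content: str) -> bool:
    """Toy rule used in place of a real DSAIL ruleset."""
    return "30 days" in content


results = run_batch(dataset, mentions_refund_window)
print(results)
```

Because the dataset stays fixed, changing the rule and calling `run_batch` again gives results that can be compared directly with the previous batch.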

Gold and Silver Documents

Documents in a dataset are classified by quality tier:

Gold documents: Documents you created or uploaded yourself. Gold data is considered authoritative because a human has provided or reviewed the content.
Silver documents: Synthetic records generated by an LLM. Silver data is useful for expanding a dataset quickly, but it should be reviewed for accuracy because it is machine-generated.
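One way to think about the two tiers is as a label attached to each document. The sketch below is a hypothetical model (the `Tier` enum, `Document` class, and `group_by_tier` helper are invented for illustration) that shows how documents could be split into gold and silver groups.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Tier(Enum):
    GOLD = "gold"      # created or uploaded by a person
    SILVER = "silver"  # generated synthetically by an LLM


@dataclass(frozen=True)
class Document:
    """Hypothetical document record carrying its quality tier."""
    doc_id: str
    content: str
    tier: Tier


def group_by_tier(documents: Iterable[Document]) -> Dict[Tier, List[Document]]:
    """Split documents into gold and silver groups, preserving input order."""
    grouped: Dict[Tier, List[Document]] = {Tier.GOLD: [], Tier.SILVER: []}
    for document in documents:
        grouped[document.tier].append(document)
    return grouped


docs = [
    Document("doc-1", "Hand-written test case", Tier.GOLD),
    Document("doc-2", "LLM-generated variation", Tier.SILVER),
    Document("doc-3", "Uploaded contract excerpt", Tier.GOLD),
]

grouped = group_by_tier(docs)
print([d.doc_id for d in grouped[Tier.GOLD]])    # ['doc-1', 'doc-3']
print([d.doc_id for d in grouped[Tier.SILVER]])  # ['doc-2']
```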

Working with Datasets

The Datasets page shows all datasets in your project.

Creating a Dataset

Navigate to the Datasets page using the sidebar
Click New Dataset
Enter a name and, optionally, a description

Adding Documents

After creating a dataset, open it to view its contents. Documents are organized into Gold and Silver sections. Click Add Documents to select documents from your project's library. You can add the same document to multiple datasets.

Importing Documents

You can create documents in bulk by importing a CSV file. Each row in the CSV becomes a new document added to the dataset as gold data.

To import documents:

Open an existing dataset by clicking on it
Click Import CSV in the dataset detail modal
Select a CSV file where each row represents a new document
The platform creates a new document for each row and adds it to the dataset

This is useful when you have a large number of test cases prepared in a spreadsheet and want to load them all at once rather than creating each document individually.
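If the test cases live in code rather than a spreadsheet, a CSV can be produced with Python's standard `csv` module. The sketch below is a minimal example under stated assumptions: the column names `title` and `content` and the presence of a header row are placeholders, since the layout the importer expects is not described here. Check the expected format before importing.

```python
import csv
import io

# Hypothetical test cases; the column names are placeholders, not a documented schema.
test_cases = [
    {"title": "Late refund request",
     "content": "Customer asks for a refund after 45 days."},
    {"title": "Refund, with receipt",
     "content": "Customer returns item within 10 days,\nreceipt attached."},
]


def build_csv(rows, fieldnames):
    """Serialize rows to CSV text, quoting commas and embedded newlines correctly."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv(test_cases, ["title", "content"])

# Read the text back to confirm there is exactly one data row per intended document.
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(f"{len(parsed)} rows ready for import")

# To save the file for upload, write csv_text with newline="" to avoid blank lines:
# with open("test_cases.csv", "w", newline="", encoding="utf-8") as handle:
#     handle.write(csv_text)
```

Using the `csv` module instead of joining strings by hand keeps fields that contain commas or line breaks in a single row, so each test case still maps to one document.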

Generating Synthetic Records

To expand a dataset without manually creating every document, use synthetic data generation. The dataset must already contain at least one gold or silver record to serve as a basis for generation.

Click Generate Records from the dataset's menu
Select the LLM model to use for generation
Choose how many records to create (1, 10, or 100)
Optionally check "Gold only" to use only gold documents as the basis for generation

Generated records are tagged as silver data and should be reviewed for accuracy.
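The options above can be summarized as a small planning function. This is a hypothetical sketch, not the platform's implementation: the function names, the record dictionaries, and the choice to raise an error when "Gold only" leaves no basis records are all assumptions. No LLM is called here.

```python
ALLOWED_COUNTS = (1, 10, 100)  # the record counts offered in the dialog


def select_seed_records(records, gold_only=False):
    """Pick the records that serve as the basis for generation."""
    seeds = [r for r in records if r["tier"] == "gold"] if gold_only else list(records)
    if not seeds:
        # Assumption: generation cannot proceed without at least one basis record.
        raise ValueError("dataset needs at least one eligible record to generate from")
    return seeds


def plan_generation(records, count, gold_only=False):
    """Validate the request and describe what a generation job would use."""
    if count not in ALLOWED_COUNTS:
        raise ValueError(f"count must be one of {ALLOWED_COUNTS}")
    seeds = select_seed_records(records, gold_only)
    return {
        "seed_ids": [r["id"] for r in seeds],
        "count": count,
        "output_tier": "silver",  # generated records are tagged as silver data
    }


records = [
    {"id": "doc-1", "tier": "gold"},
    {"id": "doc-2", "tier": "silver"},
]

plan = plan_generation(records, 10, gold_only=True)
print(plan)
```

Restricting the basis to gold documents keeps machine-generated content from being used to produce more machine-generated content, which may matter when silver records have not yet been reviewed.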

Related Concepts

Rulesets define the rules that are tested against dataset documents
Documents provide the content that appears in datasets
Runs execute rulesets against datasets for batch testing
DSAIL Language defines the assertions evaluated during test runs