Datasets

Datasets are collections of Documents assembled for batch testing. They answer the question: "How does my ruleset perform across a range of inputs?"

Rather than testing one document at a time, a dataset lets you run a ruleset against many documents in a single batch, producing results you can compare and analyze systematically.

What Datasets Are For

With a dataset, you can:

  • Batch test -- Run a ruleset against many documents at once instead of one at a time
  • Measure consistency -- Use variance testing to see how stable results are across repeated runs
  • Organize test data -- Group documents by scenario, compliance area, or testing purpose
  • Iterate systematically -- Modify DSAIL rules or questions, re-run against the same dataset, and compare results
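As an illustration of what "measuring consistency" can mean, the sketch below computes a simple agreement rate per document across repeated runs. It is not the platform's variance-testing implementation: the `agreement_rate` helper, the `"pass"`/`"fail"` outcome labels, and the shape of `repeated_runs` are all assumptions made for this example.

```python
from collections import Counter


def agreement_rate(outcomes):
    """Return the share of repeated runs that match the most common outcome."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes)


# Hypothetical outcomes of running the same ruleset four times per document.
repeated_runs = {
    "doc-a": ["pass", "pass", "pass", "pass"],
    "doc-b": ["pass", "fail", "pass", "fail"],
}

rates = {doc: agreement_rate(outcomes) for doc, outcomes in repeated_runs.items()}
for doc, rate in rates.items():
    print(f"{doc}: {rate:.0%} agreement")
```

A document whose agreement rate is well below 100% is a candidate for closer inspection, since the same rule gave different answers on identical input.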

How Datasets Work

The typical workflow for using datasets is:

  1. Create a dataset with a name and description
  2. Add documents from your project's document library
  3. Optionally generate synthetic records to expand coverage
  4. Run tests against the dataset using Runs

Gold and Silver Documents

Documents in a dataset are classified by quality tier:

Gold documents
Documents you created or uploaded yourself. Gold data is considered authoritative because a human has provided or reviewed the content.
Silver documents
Synthetic records generated by an LLM. Silver data is useful for expanding a dataset quickly, but should be reviewed for accuracy since it is machine-generated.
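The two tiers can be pictured as a label attached to each document. The following is a minimal sketch of that idea, not the platform's data model: the `Tier` enum, the `DatasetDocument` class, and the `needs_review` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    GOLD = "gold"      # created or uploaded by a person
    SILVER = "silver"  # generated by an LLM


@dataclass
class DatasetDocument:
    name: str
    content: str
    tier: Tier


def needs_review(doc: DatasetDocument) -> bool:
    """Silver documents are machine-generated, so flag them for review."""
    return doc.tier is Tier.SILVER


# Hypothetical dataset contents.
docs = [
    DatasetDocument("invoice-001", "Total due: 1,200 USD", Tier.GOLD),
    DatasetDocument("invoice-001-variant", "Total due: 980 USD", Tier.SILVER),
]

review_queue = [d.name for d in docs if needs_review(d)]
print(review_queue)
```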

Working with Datasets

The Datasets page shows all datasets in your project.

[Screenshot: Datasets page]

Creating a Dataset

  1. Navigate to the Datasets page using the sidebar
  2. Click New Dataset
  3. Enter a name and optional description

Adding Documents

After creating a dataset, open it to view its contents. Documents are organized into Gold and Silver sections. Click Add Documents to select documents from your project's library. You can add the same document to multiple datasets.

Importing Documents

You can create documents in bulk by importing a CSV file. Each row in the CSV becomes a new document added to the dataset as gold data.

To import documents:

  1. Open an existing dataset by clicking on it
  2. Click Import CSV in the dataset detail modal
  3. Select a CSV file where each row represents a new document
  4. The platform creates a new document for each row and adds it to the dataset

This is useful when you have a large number of test cases prepared in a spreadsheet and want to load them all at once rather than creating each document individually.
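If your test cases live in code rather than a spreadsheet, a short script can produce the CSV. The sketch below uses Python's standard `csv` module. The column names `name` and `content` are assumptions made for this example; check the platform's import dialog for the columns it actually expects.

```python
import csv
import io

# Hypothetical test cases. The column names are assumptions,
# not the platform's documented import schema.
test_cases = [
    {"name": "refund-request-01",
     "content": "Customer requests a refund after 45 days."},
    {"name": "refund-request-02",
     "content": 'Customer requests a refund, citing "damaged" goods.'},
]


def to_csv(rows):
    """Serialize rows to CSV text, one row per document."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["name", "content"])
    writer.writeheader()
    writer.writerows(rows)  # quoting of commas and quotes is handled by csv
    return buffer.getvalue()


csv_text = to_csv(test_cases)

# Read the text back to confirm every row survives the round trip.
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(f"{len(parsed)} rows ready for import")
```

To produce a file for upload, write `csv_text` to disk with `open(path, "w", newline="")`. Using the `csv` module instead of joining strings by hand keeps content containing commas or quotation marks intact.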

Generating Synthetic Records

To expand a dataset without manually creating every document, use synthetic data generation. The dataset must already contain at least one gold or silver record to serve as a basis for generation.

  1. Click Generate Records from the dataset's menu
  2. Select the LLM model to use for generation
  3. Choose how many records to create (1, 10, or 100)
  4. Optionally check "Gold only" to use only gold documents as the basis for generation

Generated records are tagged as silver data and should be reviewed for accuracy.
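The constraints above can be summarized as a small validation sketch: the dataset needs at least one existing record, the count is one of 1, 10, or 100, and "Gold only" narrows the basis to gold documents. This is an illustration, not the platform's code. The function names, the dictionary shapes, and the choice to raise an error when "Gold only" is checked but no gold documents exist are assumptions.

```python
ALLOWED_COUNTS = (1, 10, 100)


def select_seeds(documents, gold_only=False):
    """Pick the documents that serve as the basis for generation."""
    if gold_only:
        seeds = [d for d in documents if d["tier"] == "gold"]
    else:
        seeds = list(documents)
    if not seeds:
        # Assumed behavior: generation cannot start without a basis record.
        raise ValueError("no documents available as a basis for generation")
    return seeds


def plan_generation(documents, count, gold_only=False):
    """Validate the request and describe what would be generated."""
    if count not in ALLOWED_COUNTS:
        raise ValueError(f"count must be one of {ALLOWED_COUNTS}")
    seeds = select_seeds(documents, gold_only)
    return {
        "seed_names": [d["name"] for d in seeds],
        "count": count,
        "output_tier": "silver",  # generated records are tagged as silver
    }


# Hypothetical dataset with one gold and one silver record.
dataset = [
    {"name": "claim-001", "tier": "gold"},
    {"name": "claim-001-synthetic", "tier": "silver"},
]

plan = plan_generation(dataset, count=10, gold_only=True)
print(plan)
```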

Related Topics

  • Rulesets define the rules that are tested against dataset documents
  • Documents provide the content that appears in datasets
  • Runs execute rulesets against datasets for batch testing
  • DSAIL Language defines the assertions evaluated during test runs