Evaluate LLM outputs using the familiar Vitest testing framework.
npm install -D vitest-evals
import { describeEval } from "vitest-evals";
describeEval("capital cities", {
data: async () => [
{ input: "What is the capital of France?", expected: "Paris" },
{ input: "What is the capital of Japan?", expected: "Tokyo" }
],
task: async (input) => {
const response = await queryLLM(input);
return response; // Simple string return
},
scorers: [async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0
})],
threshold: 0.8
});
Tasks process inputs and return outputs. Two formats are supported:
// Simple: just return a string
const task = async (input) => "response";
// With tool tracking: return a TaskResult
const task = async (input) => ({
result: "response",
toolCalls: [
{ name: "search", arguments: { query: "..." }, result: {...} }
]
});
Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own:
// Built-in scorer
import { ToolCallScorer } from "vitest-evals";
// Or import individually
import { ToolCallScorer } from "vitest-evals/scorers/toolCallScorer";
describeEval("tool usage", {
data: async () => [
{ input: "Search weather", expectedTools: [{ name: "weather_api" }] }
],
task: weatherTask,
scorers: [ToolCallScorer()]
});
// Custom scorer
const LengthScorer = async ({ output }) => ({
score: output.length > 50 ? 1.0 : 0.0
});
// TypeScript scorer with custom options
import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";
interface CustomOptions extends BaseScorerOptions {
minLength: number;
}
const TypedScorer: ScoreFn<CustomOptions> = async (opts) => ({
score: opts.output.length >= opts.minLength ? 1.0 : 0.0
});
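As a usage sketch (MinLengthScorer and answerQuestion are hypothetical names; the factory pattern simply mirrors how ToolCallScorer is configured below), a custom option can also be bound when the scorer is created:
// Bind the option up front so the returned function matches the standard scorer shape
const MinLengthScorer = (minLength: number): ScoreFn<BaseScorerOptions> =>
  async ({ output }) => ({
    score: output.length >= minLength ? 1.0 : 0.0
  });

describeEval("long answers", {
  data: async () => [{ input: "Explain photosynthesis in detail" }],
  task: answerQuestion, // hypothetical task that returns a string
  scorers: [MinLengthScorer(100)],
  threshold: 1.0
});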
ToolCallScorer evaluates whether the expected tools were called with the correct arguments.
// Basic usage - strict matching, any order
describeEval("search test", {
data: async () => [{
input: "Find Italian restaurants",
expectedTools: [
{ name: "search", arguments: { type: "restaurant" } },
{ name: "filter", arguments: { cuisine: "italian" } }
]
}],
task: myTask,
scorers: [ToolCallScorer()]
});
// Strict evaluation - exact order and parameters
scorers: [ToolCallScorer({
ordered: true, // Tools must be in exact order
params: "strict" // Parameters must match exactly
})]
// Mixed evaluation - partial credit allowed, but no extra tools
scorers: [ToolCallScorer({
requireAll: false, // Partial matches give partial credit
allowExtras: false // No additional tools allowed
})]
Default behavior:
- Strict parameter matching (exact equality required)
- Any order allowed
- Extra tools allowed
- All expected tools required
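For reference, here is a sketch that spells these defaults out using the option names shown above (assuming each option maps one-to-one to the behavior described):
scorers: [ToolCallScorer({
  ordered: false,     // any order allowed
  params: "strict",   // exact argument equality
  requireAll: true,   // every expected tool must be called
  allowExtras: true   // additional tool calls don't fail the eval
})]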
See src/ai-sdk-integration.test.ts for a complete example with the Vercel AI SDK.
Transform provider responses into the TaskResult format:
// Vercel AI SDK
const { text, toolCalls, toolResults } = await generateText(...);
return {
result: text,
toolCalls: toolCalls?.map((call, i) => ({
id: call.toolCallId,
name: call.toolName,
arguments: call.args,
result: toolResults?.[i]?.result,
status: toolResults?.[i]?.error ? 'failed' : 'completed'
}))
};
For sophisticated evaluation, use autoevals scorers:
import { Factuality, ClosedQA } from "autoevals";
scorers: [
Factuality, // LLM-based factuality checking
ClosedQA.partial({
criteria: "Does the answer mention Paris?"
})
]
Here's an example of implementing your own LLM-based factuality scorer using the Vercel AI SDK:
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
const Factuality = (model = openai('gpt-4o')) => async ({ input, output, expected }) => {
if (!expected) {
return { score: 1.0, metadata: { rationale: "No expected answer" } };
}
const { object } = await generateObject({
model,
prompt: `
Compare the factual content of the submitted answer with the expert answer.
Question: ${input}
Expert: ${expected}
Submission: ${output}
Options:
(A) Subset of expert answer
(B) Superset of expert answer
(C) Same content as expert
(D) Contradicts expert answer
(E) Different but factually equivalent
`,
schema: z.object({
answer: z.enum(['A', 'B', 'C', 'D', 'E']),
rationale: z.string()
})
});
const scores = { A: 0.4, B: 0.6, C: 1, D: 0, E: 1 };
return {
score: scores[object.answer],
metadata: { rationale: object.rationale, answer: object.answer }
};
};
// Usage
scorers: [Factuality()]
describeEval("gpt-4 tests", {
skipIf: () => !process.env.OPENAI_API_KEY,
// ...
});
import "vitest-evals";
test("capital check", () => {
const simpleFactuality = async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0
});
expect("What is the capital of France?").toEval(
"Paris",
answerQuestion,
simpleFactuality,
0.8
);
});
Create vitest.evals.config.ts:
import { defineConfig } from "vitest/config";
import defaultConfig from "./vitest.config";
export default defineConfig({
...defaultConfig,
test: {
...defaultConfig.test,
include: ["src/**/*.eval.{js,ts}"],
},
});
Run evals separately:
vitest --config=vitest.evals.config.ts
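If you prefer an npm script, one possible package.json wiring (the "eval" script name is just an example):
{
  "scripts": {
    "test": "vitest",
    "eval": "vitest --config=vitest.evals.config.ts"
  }
}
Then run npm run eval.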
For local development, install dependencies and run the test suite:
npm install
npm test