feat: add tool call support and ToolCallScorer #20

dcramer · 2025-07-12T19:09:27Z

Summary

This PR improves the vitest-evals library by:

- Removing the misleading Factuality scorer (was just string comparison)
- Redesigning the ToolCallScorer API for better separation of concerns
- Fixing strict parameter comparison to be order-independent

Changes

ToolCallScorer API Redesign

The scorer now cleanly separates tool matching from parameter matching:

Configuration Options:

- ordered (default: false) - Whether tools must be called in exact order
- requireAll (default: true) - Whether all expected tools must be called
- allowExtras (default: true) - Whether to allow additional tool calls
- params (default: "strict") - How to match parameters:
  - "strict" - Deep equality (order-independent for objects)
  - "fuzzy" - Case-insensitive, subset matching, numeric tolerance
  - Custom function - Your own comparison logic
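As a rough illustration of the options above, the sketch below models them as a TypeScript type and fills in the documented defaults. The names `ToolCallScorerOptions`, `ParamMatcher`, `resolveOptions`, and `sameType`, as well as the `(expected, actual) => boolean` signature for the custom function, are assumptions made for illustration; the PR description does not spell out the real types.

```typescript
// Hypothetical types mirroring the options listed above; the library's
// exported types may differ.
type ParamMatcher = (expected: unknown, actual: unknown) => boolean;

interface ToolCallScorerOptions {
  ordered?: boolean;
  requireAll?: boolean;
  allowExtras?: boolean;
  params?: "strict" | "fuzzy" | ParamMatcher;
}

// Fill in the defaults documented in this PR description.
function resolveOptions(
  options: ToolCallScorerOptions = {},
): Required<ToolCallScorerOptions> {
  return {
    ordered: options.ordered ?? false,
    requireAll: options.requireAll ?? true,
    allowExtras: options.allowExtras ?? true,
    params: options.params ?? "strict",
  };
}

// Example custom matcher (assumed signature): compare only the "type" field.
const sameType: ParamMatcher = (expected, actual) => {
  const e = expected as { type?: unknown } | null;
  const a = actual as { type?: unknown } | null;
  return e?.type === a?.type;
};

const resolved = resolveOptions({ params: sameType, allowExtras: false });
```

Options that are not passed fall back to the defaults listed above, so `resolved.ordered` is `false` and `resolved.requireAll` is `true` here.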

Key Improvements:

- Test data now defines WHAT tools are expected (via expectedTools)
- Scorer config defines HOW to evaluate them
- Clearer separation between tool-level and parameter-level concerns
- More predictable defaults (strict matching)
- Fixed JSON.stringify issue - strict comparison now properly handles object key order
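To make the JSON.stringify issue concrete, here is a minimal sketch of an order-independent deep equality check. It is not the code from this PR: `deepEqual` is a hypothetical helper, and treating array order as significant is an assumption.

```typescript
// Order-independent deep equality: a sketch, not the library's actual code.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (a === null || b === null) return false;
  if (typeof a !== "object" || typeof b !== "object") return false;

  // Arrays: same length and element-wise equal (array order still matters).
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b)) return false;
    const arrB: unknown[] = b;
    return a.length === arrB.length && a.every((item, i) => deepEqual(item, arrB[i]));
  }

  // Plain objects: same key set and equal values, regardless of key order.
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key) =>
      Object.prototype.hasOwnProperty.call(objB, key) &&
      deepEqual(objA[key], objB[key]),
  );
}

const x = { type: "restaurant", limit: 5 };
const y = { limit: 5, type: "restaurant" };

const viaStringify = JSON.stringify(x) === JSON.stringify(y); // false: key order differs
const viaDeepEqual = deepEqual(x, y); // true: same keys and values
```

The string comparison fails because JSON.stringify serializes keys in insertion order, so two objects with identical content can produce different strings.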

Example Usage:

// Define expected tools in test data
describeEval("tool usage", {
  data: async () => [{
    input: "Search for restaurants",
    expectedTools: [
      { name: "search", arguments: { type: "restaurant" } },
      { name: "filter", arguments: { cuisine: "italian" } }
    ]
  }],
  task: myTask,
  scorers: [
    ToolCallScorer({ params: "fuzzy" }) // Flexible matching
  ]
});
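For the scorer to have something to compare against, the task needs to report the tool calls it made. The sketch below shows one possible shape for `myTask`. The `ToolCall` and `TaskResult` interfaces are declared locally as assumptions (the PR only says that TaskResult supports structured responses and that toolCalls are passed through), and the hard-coded return value stands in for a real LLM call.

```typescript
// Assumed shapes, for illustration only; check the library's exported types.
interface ToolCall {
  name: string;
  arguments?: Record<string, unknown>;
}

interface TaskResult {
  result: string;
  toolCalls?: ToolCall[];
}

// Hypothetical stand-in for `myTask`: a real task would call an LLM and
// report the tool calls it actually made.
async function myTask(input: string): Promise<TaskResult> {
  return {
    result: `Found Italian restaurants for: ${input}`,
    toolCalls: [
      { name: "search", arguments: { type: "restaurant" } },
      { name: "filter", arguments: { cuisine: "italian" } },
    ],
  };
}
```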

Breaking Changes

- Default parameter matching is now strict (was fuzzy)
- expectedTools moved from scorer config to test data
- Renamed options for clarity:
  - requireAllTools → requireAll
  - allowExtraTools → allowExtras
  - strictArgs → params: "strict"

Migration Guide

// Old
ToolCallScorer({
  tools: [...],
  strictArgs: true,
  allowExtraTools: false
})

// New
// Tools go in test data's expectedTools
ToolCallScorer({
  params: "strict", // default now
  allowExtras: false
})
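The renames can also be expressed as a small conversion helper. This is only a sketch: `migrateOptions`, `OldOptions`, `NewOptions`, and `ExpectedTool` are hypothetical names, and the old `tools` option is assumed to have been a list of tool names (the commit notes mention a change from string[] to objects).

```typescript
// Old and new option shapes as described in this PR; type names are hypothetical.
interface OldOptions {
  tools?: string[];
  strictArgs?: boolean;
  allowExtraTools?: boolean;
  requireAllTools?: boolean;
  ordered?: boolean;
}

interface NewOptions {
  ordered?: boolean;
  requireAll?: boolean;
  allowExtras?: boolean;
  params?: "strict" | "fuzzy";
}

interface ExpectedTool {
  name: string;
  arguments?: Record<string, unknown>;
}

function migrateOptions(old: OldOptions): {
  options: NewOptions;
  expectedTools: ExpectedTool[];
} {
  const options: NewOptions = {
    // The old default was fuzzy, so keep that unless strictArgs was set.
    params: old.strictArgs ? "strict" : "fuzzy",
  };
  if (old.ordered !== undefined) options.ordered = old.ordered;
  if (old.requireAllTools !== undefined) options.requireAll = old.requireAllTools;
  if (old.allowExtraTools !== undefined) options.allowExtras = old.allowExtraTools;

  // Tool names move into the test data as { name } objects.
  const expectedTools = (old.tools ?? []).map((name) => ({ name }));
  return { options, expectedTools };
}

const migrated = migrateOptions({
  tools: ["search", "filter"],
  strictArgs: true,
  allowExtraTools: false,
});
```

The returned `options` would go to ToolCallScorer, while `expectedTools` would be placed on each test case in `data`.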

Commits

- Remove Factuality scorer and redesign ToolCallScorer API - Separated data from config
- Fix extensibility issues in configuration - Improved naming consistency
- Improve ToolCallScorer configuration API - Better separation of concerns, clearer names
- Fix strict equality comparison - Replaced JSON.stringify with proper deep equality that handles object key order

- Add comprehensive ToolCall type supporting multiple LLM providers
- Add ToolCallScorer for evaluating tool usage patterns
- Update TaskResult to support both string and structured responses
- Improve type system with BaseScorerOptions and generic ScoreFn
- Add AI SDK integration examples
- Simplify README and move detailed docs to README-old.md
- Show Factuality scorer as example (not built-in) to keep library lightweight

BREAKING CHANGE: None - maintains full backward compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

codecov · 2025-07-12T19:10:07Z

Codecov Report

Attention: Patch coverage is 90.67164% with 25 lines in your changes missing coverage. Please review.

Project coverage is 84.50%. Comparing base (1c70cf3) to head (37eb70f).
Report is 1 commit behind head on main.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/scorers/toolCallScorer.ts | 92.85% | 18 Missing ⚠️ |
| src/index.ts | 53.33% | 7 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #20       +/-   ##
===========================================
+ Coverage   69.34%   84.50%   +15.15%     
===========================================
  Files           2        4        +2     
  Lines         137      400      +263     
  Branches       28      115       +87     
===========================================
+ Hits           95      338      +243     
- Misses         42       62       +20

| Flag | Coverage Δ |
| --- | --- |
| unittests | `84.50% <90.67%> (+15.15%)` ⬆️ |

- Separate data expectations from evaluation logic
  - Expected tools now defined in test data with arguments
  - Scorer config focuses on HOW to evaluate (ordered, strictArgs, etc)
- Add smart fuzzy matching by default:
  - Case-insensitive strings
  - Numeric tolerance (0.1% or 0.001)
  - Object/array subset matching
- Add configuration options:
  - ordered: require exact sequence
  - strictArgs: exact argument matching
  - allowExtraTools: control extra tool tolerance
  - requireAllTools: enable partial credit
  - argMatcher: custom comparison function
- Improve error messages and metadata
- Add comprehensive tests and examples
- Update AI SDK integration examples

BREAKING CHANGE: expectedTools format changed from string[] to { name: string, arguments?: any }[] and moved from scorer config to test data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
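The fuzzy rules listed in this commit message (case-insensitive strings, numeric tolerance, subset matching) could look roughly like the sketch below. `fuzzyMatch` is a hypothetical helper, not the scorer's implementation, and reading "0.1% or 0.001" as "whichever tolerance is larger" is an assumption.

```typescript
// Fuzzy comparison sketch: case-insensitive strings, numeric tolerance,
// and subset matching for objects and arrays. Not the library's actual code.
function fuzzyMatch(expected: unknown, actual: unknown): boolean {
  if (typeof expected === "string" && typeof actual === "string") {
    return expected.toLowerCase() === actual.toLowerCase();
  }
  if (typeof expected === "number" && typeof actual === "number") {
    // Assumed reading of "0.1% or 0.001": whichever tolerance is larger.
    const tolerance = Math.max(Math.abs(expected) * 0.001, 0.001);
    return Math.abs(expected - actual) <= tolerance;
  }
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual)) return false;
    const actualItems: unknown[] = actual;
    // Every expected item must fuzzy-match some actual item.
    return expected.every((e) => actualItems.some((a) => fuzzyMatch(e, a)));
  }
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object" || Array.isArray(actual)) {
      return false;
    }
    const exp = expected as Record<string, unknown>;
    const act = actual as Record<string, unknown>;
    // Only keys present in `expected` are checked; extra actual keys are ignored.
    return Object.keys(exp).every(
      (key) => key in act && fuzzyMatch(exp[key], act[key]),
    );
  }
  return expected === actual;
}

// Extra keys and different casing in the actual arguments are tolerated.
const matches = fuzzyMatch(
  { cuisine: "italian" },
  { cuisine: "ITALIAN", limit: 5 },
);
```

Subset matching is deliberately one-directional: a key that is expected but missing from the actual arguments still fails the match.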

- Separate tool matching concerns from parameter matching
- Add 'params' option for parameter matching strategy ('strict' | 'fuzzy' | custom)
- Change default to strict parameter matching for better predictability
- Rename 'requireAllTools' to clearer 'requireAll'
- Rename 'allowExtraTools' to shorter 'allowExtras'
- Keep 'ordered' for tool order matching

BREAKING CHANGE: Default parameter matching is now strict instead of fuzzy. Use params: 'fuzzy' to restore previous behavior.

- Implement order-independent deep equality comparison
- Handle arrays, objects, primitives, null/undefined correctly
- Add test to verify object key order doesn't affect comparison
- Fix edge case test to use fuzzy params as intended

- Fix expectedTools example to use object array instead of string array
- Rename internal allowExtraTools to allowExtras for consistency
- Improves clarity and reduces confusion in the API

- Pass requireAll parameter to evaluateOrderedTools function
- Add partial credit logic when requireAll is false in ordered mode
- Add test to verify partial credit works correctly in ordered mode
- Makes ordered and unordered modes consistent in handling requireAll
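One way to picture partial credit in ordered mode is the sketch below. It is not the PR's `evaluateOrderedTools`: `scoreOrdered` is a hypothetical helper, the `matched / expected` formula is an assumption, and parameter matching and `allowExtras` are left out (unexpected calls are simply skipped).

```typescript
interface ToolCall {
  name: string;
}

// Sketch of ordered matching: walk the actual calls once and advance through
// the expected list whenever the next expected tool name appears.
// The scoring formula is an assumption, not taken from the PR.
function scoreOrdered(
  expected: ToolCall[],
  actual: ToolCall[],
  requireAll: boolean,
): number {
  if (expected.length === 0) return 1;
  let matched = 0;
  for (const call of actual) {
    if (matched < expected.length && call.name === expected[matched]?.name) {
      matched++;
    }
  }
  if (matched === expected.length) return 1;
  // With requireAll disabled, award partial credit for the tools found in order.
  return requireAll ? 0 : matched / expected.length;
}

const expectedCalls = [{ name: "search" }, { name: "filter" }, { name: "book" }];
const actualCalls = [{ name: "search" }, { name: "filter" }];

const strictScore = scoreOrdered(expectedCalls, actualCalls, true);
const partialScore = scoreOrdered(expectedCalls, actualCalls, false);
```

Here two of the three expected tools appear in order, so the strict score is 0 and the partial score is 2/3 under the assumed formula.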

Copilot

Pull Request Overview

This PR adds support for evaluating LLM tool calls by introducing the ToolCallScorer, redesigning its API, and updating the core describeEval runner and build configuration.

- Update build config to ignore test files during bundling
- Implement a configurable ToolCallScorer with strict/fuzzy matching and ordered/unordered tool evaluation
- Extend describeEval and ScoreFn to pass through and include toolCalls in test metadata

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| tsup.config.ts | Exclude test files from the bundled entry |
| src/scorers/toolCallScorer.ts | Add new `ToolCallScorer` logic with deep-equality and fuzzy matching |
| src/index.ts | Update `describeEval`, `ScoreFn`, and `TaskFn` to handle `toolCalls` |
| src/scorers/index.ts | Re-export updated scorer types |
| package.json | Add exports for `scorers` directory and adjust `files` field |
| README.md | Refresh documentation and examples for tool call support |

Comments suppressed due to low confidence (4)

tsup.config.ts:4

The tsup config currently excludes .test.ts and .test.*.ts files but not *.spec.ts files. Consider adding "!src/**/*.spec.ts" to prevent bundling spec tests in production builds.

  entry: ["src/**/*.ts", "!src/**/*.test.ts", "!src/**/*.test.*.ts"],

README.md:104

[nitpick] These example snippets aren't wrapped in a fenced code block, so they may not render correctly. Enclose them in triple backticks (e.g., ```javascript) before and after to ensure proper formatting.

scorers: [ToolCallScorer({

README.md:23

[nitpick] In this Quick Start example, response could be an object from queryLLM. If you intend to return a raw string, consider using response.text or clarifying the shape of response in the example.

    return response; // Simple string return

src/scorers/toolCallScorer.ts:221

[nitpick] Consider adding a JSDoc comment for this internal helper to explain its parameters (expected, actual, options) and return value, which will aid future maintainability.

function evaluateOrderedTools(

dcramer · 2025-07-13T00:16:37Z

May end up changing this after we put it into use but gonna ship this first pass.

dcramer and others added 2 commits July 12, 2025 12:44

dcramer requested a review from Copilot July 12, 2025 23:35

dcramer requested a review from Copilot July 12, 2025 23:44

fix: Correct README example and unify allowExtras naming

61b687a

- Fix expectedTools example to use object array instead of string array
- Rename internal allowExtraTools to allowExtras for consistency
- Improves clarity and reduces confusion in the API

dcramer requested a review from Copilot July 12, 2025 23:47

dcramer requested a review from Copilot July 12, 2025 23:54

improve test coverage

37eb70f

dcramer requested a review from Copilot July 13, 2025 00:10

Copilot AI reviewed Jul 13, 2025

dcramer merged commit 9754a44 into main Jul 13, 2025
9 checks passed

dcramer deleted the better-scorers-tools branch July 13, 2025 00:16
