
feat: add tool call support and ToolCallScorer #20


Merged
merged 7 commits into from
Jul 13, 2025


dcramer

@dcramer dcramer commented Jul 12, 2025

Summary

This PR improves the vitest-evals library by:

  1. Removing the misleading Factuality scorer (was just string comparison)
  2. Redesigning the ToolCallScorer API for better separation of concerns
  3. Fixing strict parameter comparison to be order-independent

Changes

ToolCallScorer API Redesign

The scorer now cleanly separates tool matching from parameter matching:

Configuration Options:

  • ordered (default: false) - Whether tools must be called in exact order
  • requireAll (default: true) - Whether all expected tools must be called
  • allowExtras (default: true) - Whether to allow additional tool calls
  • params (default: "strict") - How to match parameters:
    • "strict" - Deep equality (order-independent for objects)
    • "fuzzy" - Case-insensitive, subset matching, numeric tolerance
    • Custom function - Your own comparison logic
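A minimal sketch of how the options above and their defaults could be typed and resolved. The names `ToolCallScorerOptions`, `ParamMatcher`, and `resolveOptions` are hypothetical, and the `(expected, actual) => boolean` signature for the custom function is an assumption, not something this PR confirms.

```typescript
// Hypothetical types for illustration only; option names mirror the list above.
// The custom matcher signature is an assumption.
type ParamMatcher = (expected: unknown, actual: unknown) => boolean;

interface ToolCallScorerOptions {
  ordered?: boolean;      // tools must be called in exact order
  requireAll?: boolean;   // all expected tools must be called
  allowExtras?: boolean;  // additional tool calls are tolerated
  params?: "strict" | "fuzzy" | ParamMatcher;
}

// Fill in the documented defaults for any option the caller left out.
function resolveOptions(
  options: ToolCallScorerOptions = {},
): Required<ToolCallScorerOptions> {
  return {
    ordered: options.ordered ?? false,
    requireAll: options.requireAll ?? true,
    allowExtras: options.allowExtras ?? true,
    params: options.params ?? "strict",
  };
}

// Example custom matcher: only checks that both values have the same type.
const sameType: ParamMatcher = (expected, actual) =>
  typeof expected === typeof actual;

const resolved = resolveOptions({ params: sameType, allowExtras: false });
```

Resolving defaults in one place keeps the tool-level flags (`ordered`, `requireAll`, `allowExtras`) separate from the parameter-level strategy (`params`).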

Key Improvements:

  • Test data now defines WHAT tools are expected (via expectedTools)
  • Scorer config defines HOW to evaluate them
  • Clearer separation between tool-level and parameter-level concerns
  • More predictable defaults (strict matching)
  • Fixed JSON.stringify issue - strict comparison now properly handles object key order

Example Usage:

// Define expected tools in test data
describeEval("tool usage", {
  data: async () => [{
    input: "Search for restaurants",
    expectedTools: [
      { name: "search", arguments: { type: "restaurant" } },
      { name: "filter", arguments: { cuisine: "italian" } }
    ]
  }],
  task: myTask,
  scorers: [
    ToolCallScorer({ params: "fuzzy" }) // Flexible matching
  ]
});

Breaking Changes

  • Default parameter matching is now strict (was fuzzy)
  • expectedTools moved from scorer config to test data
  • Renamed options for clarity:
    • requireAllTools → requireAll
    • allowExtraTools → allowExtras
    • strictArgs → params: "strict"

Migration Guide

// Old
ToolCallScorer({
  tools: [...],
  strictArgs: true,
  allowExtraTools: false
})

// New
// Tools go in test data's expectedTools
ToolCallScorer({
  params: "strict", // default now
  allowExtras: false
})
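The mapping in the migration guide can be sketched as a small helper. This is an illustration, not part of the library: `migrateOptions` and the interface names are hypothetical, and treating the old `tools` value as a `string[]` is an assumption based on the commit notes in this PR.

```typescript
// Hypothetical shapes. Old option names come from the rename list above;
// the string[] form of `tools` is an assumption.
interface OldOptions {
  tools?: string[];
  strictArgs?: boolean;
  allowExtraTools?: boolean;
  requireAllTools?: boolean;
}

interface ExpectedTool {
  name: string;
  arguments?: unknown;
}

interface NewOptions {
  requireAll?: boolean;
  allowExtras?: boolean;
  params: "strict" | "fuzzy";
}

// Split an old config into test data (expectedTools) and scorer options.
function migrateOptions(old: OldOptions): {
  expectedTools: ExpectedTool[];
  options: NewOptions;
} {
  const options: NewOptions = {
    // The old default was fuzzy, so only strictArgs: true maps to "strict".
    params: old.strictArgs ? "strict" : "fuzzy",
  };
  if (old.requireAllTools !== undefined) options.requireAll = old.requireAllTools;
  if (old.allowExtraTools !== undefined) options.allowExtras = old.allowExtraTools;
  return {
    expectedTools: (old.tools ?? []).map((name) => ({ name })),
    options,
  };
}

const migrated = migrateOptions({
  tools: ["search", "filter"],
  strictArgs: true,
  allowExtraTools: false,
});
```

Here `migrated.expectedTools` would go into each test case's data, and `migrated.options` would be passed to the scorer.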

Commits

  1. Remove Factuality scorer and redesign ToolCallScorer API - Separated data from config
  2. Fix extensibility issues in configuration - Improved naming consistency
  3. Improve ToolCallScorer configuration API - Better separation of concerns, clearer names
  4. Fix strict equality comparison - Replaced JSON.stringify with proper deep equality that handles object key order

- Add comprehensive ToolCall type supporting multiple LLM providers
- Add ToolCallScorer for evaluating tool usage patterns
- Update TaskResult to support both string and structured responses
- Improve type system with BaseScorerOptions and generic ScoreFn
- Add AI SDK integration examples
- Simplify README and move detailed docs to README-old.md
- Show Factuality scorer as example (not built-in) to keep library lightweight

BREAKING CHANGE: None - maintains full backward compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

codecov bot commented Jul 12, 2025

Codecov Report

Attention: Patch coverage is 90.67164% with 25 lines in your changes missing coverage. Please review.

Project coverage is 84.50%. Comparing base (1c70cf3) to head (37eb70f).
Report is 1 commit behind head on main.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scorers/toolCallScorer.ts | 92.85% | 18 Missing ⚠️ |
| src/index.ts | 53.33% | 7 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #20       +/-   ##
===========================================
+ Coverage   69.34%   84.50%   +15.15%     
===========================================
  Files           2        4        +2     
  Lines         137      400      +263     
  Branches       28      115       +87     
===========================================
+ Hits           95      338      +243     
- Misses         42       62       +20     

| Flag | Coverage Δ |
|---|---|
| unittests | 84.50% <90.67%> (+15.15%) ⬆️ |


dcramer and others added 2 commits July 12, 2025 12:44
- Separate data expectations from evaluation logic
- Expected tools now defined in test data with arguments
- Scorer config focuses on HOW to evaluate (ordered, strictArgs, etc)
- Add smart fuzzy matching by default:
  - Case-insensitive strings
  - Numeric tolerance (0.1% or 0.001)
  - Object/array subset matching
- Add configuration options:
  - ordered: require exact sequence
  - strictArgs: exact argument matching
  - allowExtraTools: control extra tool tolerance
  - requireAllTools: enable partial credit
  - argMatcher: custom comparison function
- Improve error messages and metadata
- Add comprehensive tests and examples
- Update AI SDK integration examples

BREAKING CHANGE: expectedTools format changed from string[] to
{ name: string, arguments?: any }[] and moved from scorer config to test data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
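The fuzzy rules listed in the commit above (case-insensitive strings, numeric tolerance, subset matching) can be sketched as follows. This is not the library's implementation: `fuzzyMatch` is a hypothetical name, and reading "0.1% or 0.001" as "within 0.1% relative or 0.001 absolute" is an assumption.

```typescript
// Illustrative sketch only. The tolerance interpretation (0.1% relative
// OR 0.001 absolute) is an assumption about the commit note.
function fuzzyMatch(expected: unknown, actual: unknown): boolean {
  // Strings compare case-insensitively.
  if (typeof expected === "string" && typeof actual === "string") {
    return expected.toLowerCase() === actual.toLowerCase();
  }
  // Numbers match within a small absolute or relative tolerance.
  if (typeof expected === "number" && typeof actual === "number") {
    const diff = Math.abs(expected - actual);
    return diff <= 0.001 || diff <= Math.abs(expected) * 0.001;
  }
  // Arrays: every expected item must match some actual item (subset).
  if (Array.isArray(expected)) {
    return (
      Array.isArray(actual) &&
      expected.every((e) => actual.some((a) => fuzzyMatch(e, a)))
    );
  }
  // Objects: every expected key must exist and match; extra keys are ignored.
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object" || Array.isArray(actual)) {
      return false;
    }
    const act = actual as Record<string, unknown>;
    return Object.entries(expected as Record<string, unknown>).every(
      ([key, value]) => key in act && fuzzyMatch(value, act[key]),
    );
  }
  // Everything else (booleans, null, undefined) must be identical.
  return expected === actual;
}
```

Under this sketch, `{ cuisine: "Italian" }` matches `{ cuisine: "italian", limit: 5 }`, while a missing expected key fails the match.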
- Separate tool matching concerns from parameter matching
- Add 'params' option for parameter matching strategy ('strict' | 'fuzzy' | custom)
- Change default to strict parameter matching for better predictability
- Rename 'requireAllTools' to clearer 'requireAll'
- Rename 'allowExtraTools' to shorter 'allowExtras'
- Keep 'ordered' for tool order matching

BREAKING CHANGE: Default parameter matching is now strict instead of fuzzy.
Use params: 'fuzzy' to restore previous behavior.
@dcramer dcramer requested a review from Copilot July 12, 2025 23:35

- Implement order-independent deep equality comparison
- Handle arrays, objects, primitives, null/undefined correctly
- Add test to verify object key order doesn't affect comparison
- Fix edge case test to use fuzzy params as intended
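To illustrate why key order matters here: `JSON.stringify` serializes object keys in insertion order, so two objects with the same content can produce different strings. The sketch below contrasts that with an order-independent comparison. `deepEqual` is a hypothetical helper written for this example, not the function added in this commit; it assumes plain JSON-like values and keeps array comparison order-sensitive.

```typescript
// Illustrative sketch, assuming plain JSON-like values (no Dates, Maps, cycles).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  // Remaining primitives and null/undefined mismatches are unequal.
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") {
    return false;
  }
  // Arrays must have the same length and equal items at each index.
  if (Array.isArray(a) || Array.isArray(b)) {
    return (
      Array.isArray(a) &&
      Array.isArray(b) &&
      a.length === b.length &&
      a.every((item, i) => deepEqual(item, b[i]))
    );
  }
  // Objects must have the same key set; key order is irrelevant.
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  return (
    keysA.length === keysB.length &&
    keysA.every(
      (key) =>
        Object.prototype.hasOwnProperty.call(objB, key) &&
        deepEqual(objA[key], objB[key]),
    )
  );
}

const left = { type: "restaurant", limit: 5 };
const right = { limit: 5, type: "restaurant" };

// String comparison is sensitive to key order; deep comparison is not.
const sameByStringify = JSON.stringify(left) === JSON.stringify(right); // false
const sameByDeepEqual = deepEqual(left, right); // true
```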
@dcramer dcramer requested a review from Copilot July 12, 2025 23:44

- Fix expectedTools example to use object array instead of string array
- Rename internal allowExtraTools to allowExtras for consistency
- Improves clarity and reduces confusion in the API
@dcramer dcramer requested a review from Copilot July 12, 2025 23:47

- Pass requireAll parameter to evaluateOrderedTools function
- Add partial credit logic when requireAll is false in ordered mode
- Add test to verify partial credit works correctly in ordered mode
- Makes ordered and unordered modes consistent in handling requireAll
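One way to picture partial credit in ordered mode is sketched below. It compares tool names only and assumes the score is the fraction of expected tools matched in sequence, with extra calls allowed in between. The formula and the name `scoreOrdered` are assumptions for illustration, not the logic of `evaluateOrderedTools`.

```typescript
// Illustrative sketch: names only, extras tolerated, and the partial-credit
// formula (matched / expected) is an assumption.
function scoreOrdered(
  expected: string[],
  actual: string[],
  requireAll: boolean,
): number {
  if (expected.length === 0) return 1;

  // Walk the actual calls, advancing through expected tools in order.
  let matched = 0;
  for (const name of actual) {
    if (matched < expected.length && name === expected[matched]) {
      matched++;
    }
  }

  if (matched === expected.length) return 1;
  // With requireAll, anything short of a full match fails outright.
  return requireAll ? 0 : matched / expected.length;
}

const partial = scoreOrdered(["search", "filter"], ["search"], false); // 0.5
const strict = scoreOrdered(["search", "filter"], ["search"], true); // 0
```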
@dcramer dcramer requested a review from Copilot July 12, 2025 23:54

@dcramer dcramer requested a review from Copilot July 13, 2025 00:10

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds support for evaluating LLM tool calls by introducing the ToolCallScorer, redesigning its API, and updating the core describeEval runner and build configuration.

  • Update build config to ignore test files during bundling
  • Implement a configurable ToolCallScorer with strict/fuzzy matching and ordered/unordered tool evaluation
  • Extend describeEval and ScoreFn to pass through and include toolCalls in test metadata

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
|---|---|
| tsup.config.ts | Exclude test files from the bundled entry |
| src/scorers/toolCallScorer.ts | Add new ToolCallScorer logic with deep-equality and fuzzy matching |
| src/index.ts | Update describeEval, ScoreFn, and TaskFn to handle toolCalls |
| src/scorers/index.ts | Re-export updated scorer types |
| package.json | Add exports for scorers directory and adjust files field |
| README.md | Refresh documentation and examples for tool call support |

Comments suppressed due to low confidence (4)

tsup.config.ts:4

  • The tsup config currently excludes .test.ts and .test.*.ts files but not *.spec.ts files. Consider adding "!src/**/*.spec.ts" to prevent bundling spec tests in production builds.
  entry: ["src/**/*.ts", "!src/**/*.test.ts", "!src/**/*.test.*.ts"],

README.md:104

  • [nitpick] These example snippets aren't wrapped in a fenced code block, so they may not render correctly. Enclose them in triple backticks (e.g., ```javascript) before and after to ensure proper formatting.
scorers: [ToolCallScorer({ 

README.md:23

  • [nitpick] In this Quick Start example, response could be an object from queryLLM. If you intend to return a raw string, consider using response.text or clarifying the shape of response in the example.
    return response; // Simple string return

src/scorers/toolCallScorer.ts:221

  • [nitpick] Consider adding a JSDoc comment for this internal helper to explain its parameters (expected, actual, options) and return value, which will aid future maintainability.
function evaluateOrderedTools(

@dcramer

dcramer commented Jul 13, 2025

May end up changing this after we put it into use but gonna ship this first pass.

@dcramer dcramer merged commit 9754a44 into main Jul 13, 2025
9 checks passed
@dcramer dcramer deleted the better-scorers-tools branch July 13, 2025 00:16