feat: add tool call support and ToolCallScorer #20
- Add comprehensive ToolCall type supporting multiple LLM providers
- Add ToolCallScorer for evaluating tool usage patterns
- Update TaskResult to support both string and structured responses
- Improve type system with BaseScorerOptions and generic ScoreFn
- Add AI SDK integration examples
- Simplify README and move detailed docs to README-old.md
- Show Factuality scorer as example (not built-in) to keep library lightweight

BREAKING CHANGE: None - maintains full backward compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files

Coverage Diff | main | #20 | +/- |
---|---|---|---|
Coverage | 69.34% | 84.50% | +15.15% |
Files | 2 | 4 | +2 |
Lines | 137 | 400 | +263 |
Branches | 28 | 115 | +87 |
Hits | 95 | 338 | +243 |
Misses | 42 | 62 | +20 |
- Separate data expectations from evaluation logic
- Expected tools now defined in test data with arguments
- Scorer config focuses on HOW to evaluate (ordered, strictArgs, etc)
- Add smart fuzzy matching by default:
  - Case-insensitive strings
  - Numeric tolerance (0.1% or 0.001)
  - Object/array subset matching
- Add configuration options:
  - ordered: require exact sequence
  - strictArgs: exact argument matching
  - allowExtraTools: control extra tool tolerance
  - requireAllTools: enable partial credit
  - argMatcher: custom comparison function
- Improve error messages and metadata
- Add comprehensive tests and examples
- Update AI SDK integration examples

BREAKING CHANGE: expectedTools format changed from string[] to { name: string, arguments?: any }[] and moved from scorer config to test data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
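The fuzzy matching rules listed in this commit can be illustrated with a minimal sketch. This is not the code from `src/scorers/toolCallScorer.ts`; the function names `fuzzyMatch` and `isRecord` are hypothetical, and reading "0.1% or 0.001" as "whichever tolerance is larger" is an assumption.

```typescript
// Hypothetical sketch of fuzzy argument matching; not the library's implementation.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function fuzzyMatch(expected: unknown, actual: unknown): boolean {
  // Strings: compare case-insensitively.
  if (typeof expected === "string" && typeof actual === "string") {
    return expected.toLowerCase() === actual.toLowerCase();
  }
  // Numbers: accept a 0.1% relative or 0.001 absolute difference
  // (assumption: the larger of the two tolerances applies).
  if (typeof expected === "number" && typeof actual === "number") {
    const tolerance = Math.max(Math.abs(expected) * 0.001, 0.001);
    return Math.abs(expected - actual) <= tolerance;
  }
  // Arrays: every expected item must match some actual item (subset match).
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual)) return false;
    const actualItems: unknown[] = actual;
    return expected.every((e) => actualItems.some((a) => fuzzyMatch(e, a)));
  }
  // Objects: every expected key must match; extra actual keys are ignored.
  if (isRecord(expected)) {
    if (!isRecord(actual)) return false;
    const exp = expected;
    const act = actual;
    return Object.keys(exp).every((key) => fuzzyMatch(exp[key], act[key]));
  }
  // Everything else (booleans, null, undefined): strict identity.
  return expected === actual;
}
```

Under these assumptions, `{ city: "Paris" }` matches `{ city: "PARIS", units: "c" }`, while the reverse comparison fails because `units` is missing from the actual arguments.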
- Separate tool matching concerns from parameter matching
- Add 'params' option for parameter matching strategy ('strict' | 'fuzzy' | custom)
- Change default to strict parameter matching for better predictability
- Rename 'requireAllTools' to clearer 'requireAll'
- Rename 'allowExtraTools' to shorter 'allowExtras'
- Keep 'ordered' for tool order matching

BREAKING CHANGE: Default parameter matching is now strict instead of fuzzy. Use params: 'fuzzy' to restore previous behavior.
- Implement order-independent deep equality comparison
- Handle arrays, objects, primitives, null/undefined correctly
- Add test to verify object key order doesn't affect comparison
- Fix edge case test to use fuzzy params as intended
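A deep equality check that ignores object key order, as this commit describes, might look like the sketch below. The name `deepEqualUnordered` is hypothetical, and treating arrays as order-sensitive is an assumption (the PR only states that object key order is ignored).

```typescript
// Hypothetical sketch of strict matching: deep equality that ignores object
// key order. Arrays are compared position by position (assumption).
function deepEqualUnordered(a: unknown, b: unknown): boolean {
  // Identical primitives, same reference, both null, or both undefined.
  if (a === b) return true;
  // Any remaining primitive or null mismatch is unequal (null !== undefined).
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") {
    return false;
  }
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b)) return false;
    const left: unknown[] = a;
    const right: unknown[] = b;
    if (left.length !== right.length) return false;
    return left.every((item, index) => deepEqualUnordered(item, right[index]));
  }
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  if (keysA.length !== Object.keys(objB).length) return false;
  // Look keys up by name, so the order they were declared in is irrelevant.
  return keysA.every(
    (key) =>
      Object.prototype.hasOwnProperty.call(objB, key) &&
      deepEqualUnordered(objA[key], objB[key]),
  );
}
```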
- Fix expectedTools example to use object array instead of string array
- Rename internal allowExtraTools to allowExtras for consistency
- Improves clarity and reduces confusion in the API
- Pass requireAll parameter to evaluateOrderedTools function
- Add partial credit logic when requireAll is false in ordered mode
- Add test to verify partial credit works correctly in ordered mode
- Makes ordered and unordered modes consistent in handling requireAll
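One way to picture partial credit in ordered mode is the sketch below. The PR does not say how the score is computed, so the formula (expected tools found in order, divided by total expected tools) is an assumption, as are the names `scoreOrderedCalls`, `ExpectedCall`, and `ActualCall`. Argument matching and `allowExtras` are left out to keep the example short.

```typescript
// Hypothetical sketch of ordered scoring with partial credit; the scoring
// formula is an assumption, not the behavior of evaluateOrderedTools.
interface ExpectedCall {
  name: string;
}

interface ActualCall {
  name: string;
}

function scoreOrderedCalls(
  expected: ExpectedCall[],
  actual: ActualCall[],
  requireAll: boolean,
): number {
  if (expected.length === 0) return 1;
  let matched = 0;
  let cursor = 0; // index in `actual` where the next search starts
  for (const want of expected) {
    // Only look at calls after the previous match, which enforces order.
    const offset = actual.slice(cursor).findIndex((call) => call.name === want.name);
    if (offset === -1) continue; // this expected tool was not called in order
    matched += 1;
    cursor += offset + 1;
  }
  if (matched === expected.length) return 1;
  // requireAll: all-or-nothing. Otherwise: the fraction of tools matched.
  return requireAll ? 0 : matched / expected.length;
}
```

With expected tools `search`, `fetch`, `summarize` and actual calls `search`, `summarize`, this sketch yields 0 when `requireAll` is true and 2/3 when it is false.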
Pull Request Overview
This PR adds support for evaluating LLM tool calls by introducing the `ToolCallScorer`, redesigning its API, and updating the core `describeEval` runner and build configuration.
- Update build config to ignore test files during bundling
- Implement a configurable `ToolCallScorer` with strict/fuzzy matching and ordered/unordered tool evaluation
- Extend `describeEval` and `ScoreFn` to pass through and include `toolCalls` in test metadata
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
File | Description |
---|---|
tsup.config.ts | Exclude test files from the bundled entry |
src/scorers/toolCallScorer.ts | Add new ToolCallScorer logic with deep‐equality and fuzzy matching |
src/index.ts | Update describeEval, ScoreFn, and TaskFn to handle toolCalls |
src/scorers/index.ts | Re-export updated scorer types |
package.json | Add exports for scorers directory and adjust files field |
README.md | Refresh documentation and examples for tool call support |
Comments suppressed due to low confidence (4)
tsup.config.ts:4
- The tsup config currently excludes `.test.ts` and `.test.*.ts` files but not `*.spec.ts` files. Consider adding `"!src/**/*.spec.ts"` to prevent bundling spec tests in production builds.
entry: ["src/**/*.ts", "!src/**/*.test.ts", "!src/**/*.test.*.ts"],
README.md:104
- [nitpick] These example snippets aren't wrapped in a fenced code block, so they may not render correctly. Enclose them in triple backticks (e.g., ```javascript) before and after to ensure proper formatting.
scorers: [ToolCallScorer({
README.md:23
- [nitpick] In this Quick Start example, `response` could be an object from `queryLLM`. If you intend to return a raw string, consider using `response.text` or clarifying the shape of `response` in the example.
return response; // Simple string return
src/scorers/toolCallScorer.ts:221
- [nitpick] Consider adding a JSDoc comment for this internal helper to explain its parameters (`expected`, `actual`, `options`) and return value, which will aid future maintainability.
function evaluateOrderedTools(
May end up changing this after we put it into use but gonna ship this first pass.
Summary
This PR improves the vitest-evals library by:
Changes
ToolCallScorer API Redesign
The scorer now cleanly separates tool matching from parameter matching:
Configuration Options:
- `ordered` (default: false) - Whether tools must be called in exact order
- `requireAll` (default: true) - Whether all expected tools must be called
- `allowExtras` (default: true) - Whether to allow additional tool calls
- `params` (default: "strict") - How to match parameters:
  - `"strict"` - Deep equality (order-independent for objects)
  - `"fuzzy"` - Case-insensitive, subset matching, numeric tolerance

Key Improvements:
`expectedTools`)

Example Usage:
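As a rough illustration of how the options and the test-data format described in this PR fit together, here is a self-contained sketch. The types and the `withDefaults` helper are hypothetical local stand-ins, not the library's exports; only the option names, their defaults, and the `{ name, arguments? }` shape come from the PR text. The custom matcher signature is an assumption.

```typescript
// Hypothetical local types mirroring the options described in this PR.
// The real exports of vitest-evals may be shaped differently.
type ParamsMatcher = (expected: unknown, actual: unknown) => boolean; // assumed signature

interface ScorerOptionsSketch {
  ordered?: boolean;
  requireAll?: boolean;
  allowExtras?: boolean;
  params?: "strict" | "fuzzy" | ParamsMatcher;
}

interface ExpectedToolSketch {
  name: string;
  arguments?: Record<string, unknown>;
}

// Fill in the documented defaults for any option left unset.
function withDefaults(options: ScorerOptionsSketch = {}): Required<ScorerOptionsSketch> {
  return {
    ordered: options.ordered ?? false,
    requireAll: options.requireAll ?? true,
    allowExtras: options.allowExtras ?? true,
    params: options.params ?? "strict",
  };
}

// Expected tools live in the test data, not in the scorer config.
const sampleCase: { input: string; expectedTools: ExpectedToolSketch[] } = {
  input: "What's the weather in Paris?",
  expectedTools: [{ name: "get_weather", arguments: { city: "Paris" } }],
};

// The scorer config only describes HOW to evaluate.
const resolvedOptions = withDefaults({ ordered: true, params: "fuzzy" });
```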
Breaking Changes
- `expectedTools` moved from scorer config to test data
- `requireAllTools` → `requireAll`
- `allowExtraTools` → `allowExtras`
- `strictArgs` → `params: "strict"`
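The renames above could be applied mechanically with a small mapping helper. The sketch below is hypothetical: `migrateOptions` and both interfaces are illustrative and not part of the library. It assumes that a missing or false `strictArgs` should map to `params: "fuzzy"`, since fuzzy matching was the earlier default according to the commit history.

```typescript
// Hypothetical helper that maps the old option names to the new ones.
interface LegacyScorerOptions {
  ordered?: boolean;
  requireAllTools?: boolean;
  allowExtraTools?: boolean;
  strictArgs?: boolean;
}

interface MigratedScorerOptions {
  ordered?: boolean;
  requireAll?: boolean;
  allowExtras?: boolean;
  params: "strict" | "fuzzy";
}

function migrateOptions(legacy: LegacyScorerOptions): MigratedScorerOptions {
  const migrated: MigratedScorerOptions = {
    // Assumption: the old default was fuzzy, so only an explicit
    // strictArgs: true becomes "strict"; everything else stays fuzzy.
    params: legacy.strictArgs ? "strict" : "fuzzy",
  };
  if (legacy.ordered !== undefined) migrated.ordered = legacy.ordered;
  if (legacy.requireAllTools !== undefined) migrated.requireAll = legacy.requireAllTools;
  if (legacy.allowExtraTools !== undefined) migrated.allowExtras = legacy.allowExtraTools;
  return migrated;
}
```

Note that `expectedTools` is not handled here: per the list above, it moves out of the scorer config and into each test case's data.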
Migration Guide
Commits