Releases: confident-ai/deepeval
🎉 New Conversational Evaluation, LiteLLM Integration
In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated.
Previously we treated a conversation as a list of LLMTestCases, which is not necessarily the case. Now a conversational test case is made up of a list of Turns instead, which follows OpenAI's standard messages format:
from deepeval.test_case import Turn
turns = [Turn(role="user", content="...")]
Docs here: https://deepeval.com/docs/evaluation-test-cases#conversational-test-case
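A minimal sketch of wrapping those turns in a conversational test case (assuming ConversationalTestCase accepts the list directly via turns=; see the docs link above for the exact API):
from deepeval.test_case import ConversationalTestCase, Turn

# OpenAI-style role/content turns
turns = [
    Turn(role="user", content="What's my account balance?"),
    Turn(role="assistant", content="Your balance is $42.17."),
]

# Assumption: ConversationalTestCase wraps the list of Turns via `turns=`
convo_test_case = ConversationalTestCase(turns=turns)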
New Loading Bars, And Cloud Storage
Added new loading bars for component-level evals, and the deepeval view command to see results on Confident AI.
LLM Evals - v3.0
🚀 DeepEval v3.0 — Evaluate Any LLM Workflow, Anywhere
We’re excited to introduce DeepEval v3.0, a major milestone that transforms how you evaluate LLM applications — from complex multi-step agents to simple prompt chains. This release brings component-level granularity, production-ready observability, and simulation tools to empower devs building modern AI systems.
🔍 Component-Level Evaluation for Agentic Workflows
You can now apply DeepEval metrics to any step of your LLM workflow — tools, memories, retrievers, generators — and monitor them in both development and production.
- Evaluate individual function calls, not just final outputs
- Works with any framework or custom agent logic
- Real-time evaluation in production using observe() (see the sketch after this list)
- Track sub-component performance over time
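As a quick sketch of what this looks like (mirroring the tracing example in the v3.0 pre-release notes further down this page; the generate function and metric choice are illustrative):
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span_test_case

# Any sub-component of your workflow can be decorated and scored on its own
@observe(metrics=[AnswerRelevancyMetric()])
def generate(query: str, context: str) -> str:
    response = f"Answer based on: {context}"  # placeholder for your actual LLM call
    # Attach a test case so the metrics on this component can evaluate its output
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response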
🧠 Conversation Simulation
Automatically simulate realistic multi-turn conversations to test your chatbots and agents.
- Define model goals and user behavior
- Generate labeled conversations at scale
- Use DeepEval metrics to assess response quality
- Customize turn count, persona types, and more (sketch below)
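A rough, heavily hedged sketch of the simulator (the user_intentions and model_callback parameters below are assumptions; the Conversation Simulator docs linked further down this page have the exact signature):
from deepeval.conversation_simulator import ConversationSimulator

# Assumption: user intentions map a goal to how many conversations to simulate
simulator = ConversationSimulator(
    user_intentions={"open a new checking account": 2},
)

# Assumption: your chatbot is exposed as a callback returning the next reply
def model_callback(user_input: str) -> str:
    return "..."  # call your LLM application here

# Simulated conversations come back as conversational test cases, ready for evaluation
conversational_test_cases = simulator.simulate(model_callback=model_callback)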
🧬 Generate Goldens from Goldens
Bootstrapping eval datasets just got easier. Now you can exponentially expand your test cases using LLM-generated variants of existing goldens.
- Transform goldens into many meaningful test cases
- Preserve structure while diversifying content
- Control tone, complexity, length, and more (sketch below)
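A hedged sketch with the Synthesizer (the generate_goldens_from_goldens method and its goldens parameter are assumptions based on this release's description; see the synthesizer docs for the exact API):
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden

goldens = [Golden(input="How do I reset my password?")]

synthesizer = Synthesizer()
# Assumption: each existing golden is expanded into several LLM-generated variants
expanded_goldens = synthesizer.generate_goldens_from_goldens(goldens=goldens)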
🔒 Red Teaming Moved to DeepTeam
All red teaming functionality now lives in its own focused project: DeepTeam. DeepTeam is built for LLM security — adversarial testing, attack generation, and vulnerability discovery.
🛠️ Install or Upgrade
pip install deepeval --upgrade
🧠 Why v3.0 Matters
DeepEval v3.0 is more than an evaluation framework — it's a foundation for LLM observability. Whether you're debugging agents, simulating conversations, or continuously monitoring production performance, DeepEval now meets you wherever your LLM logic runs.
Ready to explore?
📚 Full docs at deepeval.com →
G-Eval Rubric
Rubric Available for G-Eval
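A hedged sketch of attaching a rubric to G-Eval (the Rubric import path and its score_range / expected_outcome fields are assumptions; check the G-Eval metric docs for the exact shape):
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    # Assumption: rubrics pin score ranges to expected outcomes to confine scoring
    rubric=[
        Rubric(score_range=(0, 4), expected_outcome="Factually incorrect."),
        Rubric(score_range=(5, 10), expected_outcome="Mostly or fully correct."),
    ],
)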
Cleanup Tracing, Component Evals, Etc.
In this release we've cleaned up some dependencies to separate out dev packages, and added more verbose tracing logs for debugging.
v3.0 Pre-Release
🚨 Breaking Changes
⚠️ This release introduces breaking changes in preparation for DeepEval v3.0.
Please review carefully and adjust your code as needed.
The evaluate() function now has "configs"
- Previously the evaluate() function had 13+ arguments to control display, async behavior, caching, etc., and it was growing out of control. We've now abstracted these into "configs" instead:
from deepeval.evaluate.configs import AsyncConfig
from deepeval import evaluate
evaluate(..., async_config=AsyncConfig(max_concurrent=20))
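The other argument groups follow the same pattern; a hedged sketch (the DisplayConfig, CacheConfig, and ErrorConfig classes and their fields below are assumptions, so double-check against the full docs linked underneath):
from deepeval.evaluate.configs import AsyncConfig, CacheConfig, DisplayConfig, ErrorConfig
from deepeval import evaluate

evaluate(
    ...,
    async_config=AsyncConfig(max_concurrent=20),        # async / concurrency behavior
    display_config=DisplayConfig(print_results=False),  # assumed field: what gets printed
    cache_config=CacheConfig(write_cache=True),         # assumed field: result caching
    error_config=ErrorConfig(ignore_errors=True),       # assumed field: error handling
)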
Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#configs-for-evaluate
Red Teaming Officially Migrated to DeepTeam
This shouldn't be a surprise, but DeepTeam now takes care of everything red teaming related for the foreseeable future. Docs here: https://trydeepteam.com
🥳 New Feature
Dynamic Evaluations for Nested Components
Nested components are a mess to evaluate. In this version, in preparation for v3.0, we introduced dynamic evals, where you can apply a different set of metrics to different components in your LLM application:
import openai

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    # Legacy OpenAI SDK call (openai < 1.0)
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message["content"]
    # Attach a test case so the metrics on this span can be evaluated
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response
Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#setup-tracing-highly-recommended
Dependency Cleaning
Cleaned up dependencies for upcoming 3.0 release:
- Removed automatic updates; they are now opt-in: https://www.deepeval.com/docs/miscellaneous
- Removed instructor; double-checked and it wasn't used anywhere
- Removed LlamaIndex and made it optional; it's only needed for one module
Conversation Simulator
The latest conversation simulator simulates fake user interactions to generate conversations on your behalf. These conversations can be used for evaluation right afterwards, similar to the goldens Synthesizer. Docs here: https://docs.confident-ai.com/docs/evaluation-conversation-simulator
Better Custom Model Support
What's New 🔥
- Migrated default provider models to support Synthesizer
- Default model providers now live in a different directory; those using deepeval < 2.5.6 might need to update imports
Custom Prompts for Metrics
What's New 🔥
- Custom prompt template overriding for all RAG metrics. This was introduced for folks using weaker models for evaluation, or models in general that don't fit too well with OpenAI's prompt formatting, which is what most of deepeval's metrics are built around. You can still use your favorite metrics and algorithms, but now with a custom template if required. Example here: https://docs.confident-ai.com/docs/metrics-answer-relevancy#customize-your-template
- Fixes to our model providers. Now more stable and usable.
- Added save_as() for datasets to save test cases as well (see the sketch after this list): https://docs.confident-ai.com/docs/evaluation-datasets#save-your-dataset
- Bug fixes for Synthesizer
- Improvements to prompt templates of DAGMetric: https://docs.confident-ai.com/docs/metrics-dag
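A hedged sketch of save_as() (the file_type and directory parameters are assumptions based on the dataset docs linked above):
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(
    test_cases=[LLMTestCase(input="...", actual_output="...")]
)

# Assumption: save_as writes the dataset, test cases included, to disk
dataset.save_as(file_type="json", directory="./saved-datasets")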