Commit 330e138

docs for MGEval

1 parent 30280d5 commit 330e138

16 files changed: +188, -7 lines changed

docs/docs/metrics-conversational-g-eval.mdx

Lines changed: 3 additions & 1 deletion
@@ -16,7 +16,9 @@ import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
 <MetricTagsDisplayer custom={true} chatbot={true} />
 
-The conversational G-Eval is an adopted version of `deepeval`'s popular [`GEval` metric](/docs/metrics-llm-evals) but for evaluating entire conversations instead. It is currently the best way to define custom criteria to evaluate multi-turn conversations in `deepeval`. By defining a custom `ConversationalGEval`, you can easily determine whether your LLM chatbot is able to consistently generate responses that are up to standard with your custom criteria **throughout a conversation**.
+The conversational G-Eval is an adapted version of `deepeval`'s popular [`GEval` metric](/docs/metrics-llm-evals) but for evaluating entire conversations instead.
+
+It is currently the best way to define custom criteria to evaluate multi-turn conversations in `deepeval`. By defining a custom `ConversationalGEval`, you can easily determine whether your LLM chatbot is able to consistently generate responses that are up to standard with your custom criteria **throughout a conversation**.
 
 ## Required Arguments

docs/docs/multimodal-metrics-answer-relevancy.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Multimodal Answer Relevancy
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The multimodal answer relevancy metric measures the quality of your Multimodal RAG pipeline's generator by evaluating how relevant the `actual_output` of your MLLM application is compared to the provided `input`. `deepeval`'s multimodal answer relevancy metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-contextual-precision.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Multimodal Contextual Precision
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The multimodal contextual precision metric measures your RAG pipeline's retriever by evaluating whether nodes in your `retrieval_context` that are relevant to the given `input` are ranked higher than irrelevant ones. `deepeval`'s multimodal contextual precision metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-contextual-recall.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Multimodal Contextual Recall
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The multimodal contextual recall metric measures the quality of your RAG pipeline's retriever by evaluating the extent to which the `retrieval_context` aligns with the `expected_output`. `deepeval`'s contextual recall metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-contextual-relevancy.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Multimodal Contextual Relevancy
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The multimodal contextual relevancy metric measures the quality of your multimodal RAG pipeline's retriever by evaluating the overall relevance of the information presented in your `retrieval_context` for a given `input`. `deepeval`'s multimodal contextual relevancy metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-faithfulness.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Multimodal Faithfulness
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The multimodal faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the `actual_output` factually aligns with the contents of your `retrieval_context`. `deepeval`'s multimodal faithfulness metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-g-eval.mdx

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
id: multimodal-metrics-g-eval
title: Multimodal G-Eval
sidebar_label: Multimodal G-Eval
---

<head>
  <link rel="canonical" href="https://deepeval.com/docs/metrics-llm-evals" />
</head>

import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";

<MetricTagsDisplayer custom={true} multimodal={true} />

The multimodal G-Eval is an adapted version of `deepeval`'s popular [`GEval` metric](/docs/metrics-llm-evals) but for evaluating multimodal LLM interactions instead.

It is currently the best way to define custom criteria to evaluate text + images in `deepeval`. By defining a custom `MultimodalGEval`, you can easily determine how well your MLLMs are generating, editing, and referencing images, for example.

## Required Arguments

To use the `MultimodalGEval`, you'll have to provide the following arguments when creating an [`MLLMTestCase`](/docs/evaluation-test-cases#mllm-test-case):

- `input`
- `actual_output`

You'll also need to supply any additional arguments such as `expected_output` and `context` if your evaluation criteria depend on these parameters.

:::tip
The `input`s and `actual_output`s of an `MLLMTestCase` are lists of strings and/or `MLLMImage` objects.
:::
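
For instance, a hypothetical `MLLMTestCase` mixing text with both a local and a remote image might look like the sketch below (the paths and URL are placeholders, and treating a plain URL as remote when `local` is omitted is an assumption):

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

# Hypothetical test case: the input interleaves text with a local and a remote image.
image_test_case = MLLMTestCase(
    input=[
        "Does the generated logo match this reference image?",
        MLLMImage(url="./reference_logo.png", local=True),        # local file
        MLLMImage(url="https://example.com/generated_logo.png"),  # remote URL (assumed default when local is omitted)
    ],
    actual_output=["Yes, the generated logo matches the reference's colors and layout."],
)
```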

## Usage

To create a custom metric that uses MLLMs for evaluation, simply instantiate a `MultimodalGEval` class and **define an evaluation criteria in everyday language**:

```python
from deepeval import evaluate
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paper_plane_1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paper_plane_2", local=True),
    ],
)

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])
```

There are **THREE** mandatory and **SEVEN** optional parameters required when instantiating a `MultimodalGEval` class (a short sketch combining the optional ones follows this list):

- `name`: name of custom metric.
- `criteria`: a description outlining the specific evaluation aspects for each test case.
- `evaluation_params`: a list of type `MLLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
- [Optional] `evaluation_steps`: a list of strings outlining the exact steps the LLM should take for evaluation. If `evaluation_steps` is not provided, `MultimodalGEval` will generate a series of `evaluation_steps` on your behalf based on the provided `criteria`.
- [Optional] `rubric`: a list of `Rubric`s that allows you to [confine the range](/docs/metrics-llm-evals#rubric) of the final metric score.
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, **OR** [any custom LLM model](/docs/metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method](/docs/metrics-introduction#measuring-metrics-in-async). Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.

:::danger
For accurate and valid results, only the parameters that are mentioned in `criteria`/`evaluation_steps` should be included as a member of `evaluation_params`.
:::
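
Below is a minimal sketch of how these optional parameters fit together; the specific values are illustrative choices, not recommendations:

```python
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams

# Illustrative configuration: shows where the optional parameters above go.
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",      # or any custom model of type DeepEvalBaseLLM
    threshold=0.7,       # passing threshold, defaults to 0.5
    strict_mode=False,   # True enforces a binary 1/0 score
    async_mode=True,     # concurrent execution within measure()
    verbose_mode=True,   # print intermediate steps to the console
)
```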

### Evaluation Steps

Similar to regular [`GEval`](/docs/metrics-llm-evals), providing `evaluation_steps` tells `MultimodalGEval` to follow your `evaluation_steps` for evaluation instead of first generating them from `criteria`, which allows for more controllable metric scores:

```python
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    evaluation_steps=[
        "Evaluate whether the images and the accompanying text in the actual output logically match and support each other.",
        "Check if the visual elements (images) enhance or contradict the meaning conveyed by the text.",
        "If there is a lack of coherence, identify where and how the text and images diverge or create confusion.",
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

### Rubric

You can also provide `Rubric`s through the `rubric` argument to confine your evaluation MLLM to output in specific score ranges:

```python
from deepeval.metrics.g_eval import Rubric
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    rubric=[
        Rubric(score_range=(1, 3), expected_outcome="Text and image are incoherent or conflicting."),
        Rubric(score_range=(4, 7), expected_outcome="Partial coherence with some mismatches."),
        Rubric(score_range=(8, 10), expected_outcome="Text and image are clearly coherent and aligned."),
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

Note that `score_range` ranges from **0 - 10, inclusive** and different `Rubric`s must not have overlapping `score_range`s. You can also specify `score_range`s where the start and end values are the same to represent a single score, as in the example below.
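
For example, a single-score entry might look like this (the outcome wording is illustrative):

```python
from deepeval.metrics.g_eval import Rubric

# A score_range whose start and end are equal maps the outcome to that single score.
perfect_coherence = Rubric(score_range=(10, 10), expected_outcome="Text and image are perfectly coherent.")
```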

:::tip
This is an optional improvement done by `deepeval` in addition to the original implementation in the `GEval` paper.
:::

### As a standalone

You can also run `MultimodalGEval` on a single test case as a standalone, one-off execution.

```python
...

text_image_coherence.measure(m_test_case)
print(text_image_coherence.score, text_image_coherence.reason)
```

:::caution
This is great for debugging or if you wish to build your own evaluation pipeline, but you will **NOT** get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the `evaluate()` function or `deepeval test run` offers.
:::

## How Is It Calculated?

The `MultimodalGEval` is an adapted version of [`GEval`](/docs/metrics-llm-evals), so like `GEval`, the `MultimodalGEval` metric is a two-step algorithm that first generates a series of `evaluation_steps` using chain of thought (CoT) based on the given `criteria`, before using the generated `evaluation_steps` to determine the final score using the `evaluation_params` provided through the `MLLMTestCase`.

Unlike regular `GEval`, however, `MultimodalGEval` takes images into consideration as well.

:::tip
Similar to the original [G-Eval paper](https://arxiv.org/abs/2303.16634), the `MultimodalGEval` metric uses the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation. This step was introduced in the paper to minimize bias in LLM scoring, and is automatically handled by `deepeval` (unless you're using a custom LLM).
:::
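
As a rough sketch of the weighted-summation idea described above (a conceptual illustration, not `deepeval`'s internal implementation):

```python
# Conceptual sketch: the final score is the probability-weighted sum of the candidate
# scores the judge model could emit, rather than a single sampled integer.
def weighted_score(score_probs: dict[int, float]) -> float:
    """score_probs maps each candidate score (e.g. 0-10) to its output-token probability."""
    total = sum(score_probs.values())
    return sum(score * prob for score, prob in score_probs.items()) / total

# A judge that spreads its probability mass over 7-10 lands at roughly 8.4:
print(weighted_score({7: 0.1, 8: 0.45, 9: 0.4, 10: 0.05}))  # 8.4
```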

docs/docs/multimodal-metrics-image-coherence.mdx

Lines changed: 3 additions & 4 deletions
@@ -12,6 +12,9 @@ sidebar_label: Image Coherence
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer multimodal={true} />
 
 The Image Coherence metric assesses the **coherent alignment of images with their accompanying text**, evaluating how effectively the visual content complements and enhances the textual narrative. `deepeval`'s Image Coherence metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

@@ -26,10 +29,6 @@ To use the `ImageCoherence`, you'll have to provide the following arguments when
 - `input`
 - `actual_output`
 
-:::note
-Remember that the `actual_output` of an `MLLMTestCase` is a list of strings and `Image` objects. If multiple images are provided in the actual output, The final score will be the average of each image's coherence.
-:::
-
 The `input` and `actual_output` are required to create an `MLLMTestCase` (and hence required by all metrics) even though they might not be used for metric calculation. Read the [How Is It Calculated](#how-is-it-calculated) section below to learn more.
 
 ## Usage

docs/docs/multimodal-metrics-image-editing.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Image Editing
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The Image Editing metric assesses the performance of **image editing tasks** by evaluating the quality of synthesized images based on semantic consistency and perceptual quality (similar to the `TextToImageMetric`). `deepeval`'s Image Editing metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-image-helpfulness.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Image Helpfulness
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The Image Helpfulness metric assesses how effectively images **contribute to a user's comprehension of the text**, including providing additional insights, clarifying complex ideas, or supporting textual details. `deepeval`'s Image Helpfulness metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-image-reference.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Image Reference
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The Image Reference metric evaluates how accurately images **are referred to or explained** by accompanying text. `deepeval`'s Image Reference metric is a self-explaining MLLM-Eval, meaning it provides a rationale for its assigned score.

docs/docs/multimodal-metrics-text-to-image.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Text to Image
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The Text to Image metric assesses the performance of **image generation tasks** by evaluating the quality of synthesized images based on semantic consistency and perceptual quality. `deepeval`'s Text to Image metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

docs/docs/multimodal-metrics-tool-correctness.mdx

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ sidebar_label: Multimodal Tool Correctness
 </head>
 
 import Equation from "@site/src/components/Equation";
+import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";
+
+<MetricTagsDisplayer custom={true} multimodal={true} />
 
 The multimodal tool correctness metric is an agentic LLM metric that assesses your multimodal LLM agent's function/tool calling ability. It is calculated by comparing whether every tool that is expected to be used was indeed called.

docs/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@ module.exports = {
       type: "category",
       label: "Multimodal Metrics",
       items: [
+        "multimodal-metrics-g-eval",
         "multimodal-metrics-image-coherence",
         "multimodal-metrics-image-helpfulness",
         "multimodal-metrics-image-reference",

docs/src/components/MetricTagsDisplayer/MetricTagsDisplayer.module.css

Lines changed: 6 additions & 0 deletions
@@ -59,3 +59,9 @@
   border: 1px solid white;
   color: white;
 }
+
+.multimodal {
+  background-color: #fef2e9;
+  border: 1px solid #fcc092;
+  color: #924000;
+}

docs/src/components/MetricTagsDisplayer/index.jsx

Lines changed: 3 additions & 2 deletions
@@ -1,17 +1,18 @@
 import React from "react";
 import styles from "./MetricTagsDisplayer.module.css";
 
-const MetricTagsDisplayer = ({ usesLLMs=true, referenceless=false, referenceBased=false, rag=false, agent=false, chatbot=false, custom=false, safety=false }) => {
+const MetricTagsDisplayer = ({ usesLLMs=true, referenceless=false, referenceBased=false, rag=false, agent=false, chatbot=false, custom=false, safety=false, multimodal=false }) => {
   return (
     <div className={styles.metricTagsDisplayer}>
-      {usesLLMs && <div className={`${styles.pill} ${styles.usesLLM}`}>LLM-as-a-judge</div>}
+      {usesLLMs && <div className={`${styles.pill} ${styles.usesLLM}`}>{multimodal ? "M" : ""}LLM-as-a-judge</div>}
       {referenceless && <div className={`${styles.pill} ${styles.referenceless}`}>Referenceless metric</div>}
       {referenceBased && <div className={`${styles.pill} ${styles.referenceBased}`}>Reference-based metric</div>}
       {rag && <div className={`${styles.pill} ${styles.rag}`}>RAG metric</div>}
       {agent && <div className={`${styles.pill} ${styles.agent}`}>Agent metric</div>}
       {chatbot && <div className={`${styles.pill} ${styles.chatbot}`}>Chatbot metric</div>}
       {custom && <div className={`${styles.pill} ${styles.custom}`}>Custom metric</div>}
       {safety && <div className={`${styles.pill} ${styles.safety}`}>Safety metric</div>}
+      {multimodal && <div className={`${styles.pill} ${styles.multimodal}`}>Multimodal</div>}
     </div>
   );
 };

0 commit comments