---
id: multimodal-metrics-g-eval
title: Multimodal G-Eval
sidebar_label: Multimodal G-Eval
---

<head>
  <link rel="canonical" href="https://deepeval.com/docs/metrics-llm-evals" />
</head>

import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";

<MetricTagsDisplayer custom={true} multimodal={true} />

The multimodal G-Eval is an adapted version of `deepeval`'s popular [`GEval` metric](/docs/metrics-llm-evals), built for evaluating multimodal LLM interactions instead.

It is currently the best way to define custom criteria to evaluate text + images in `deepeval`. By defining a custom `MultimodalGEval`, you can easily determine how well your MLLMs are generating, editing, and referencing images, for example.

## Required Arguments

To use the `MultimodalGEval`, you'll have to provide the following arguments when creating an [`MLLMTestCase`](/docs/evaluation-test-cases#mllm-test-case):

- `input`
- `actual_output`

You'll also need to supply any additional arguments, such as `expected_output` and `context`, if your evaluation criteria depend on these parameters.

:::tip
The `input`s and `actual_output`s of an `MLLMTestCase` are lists of strings and/or `MLLMImage` objects.
:::
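
For example, a test case for an image-editing interaction might interleave text and images in both lists. The prompt, file path, and URL below are purely illustrative:

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

# Strings and MLLMImage objects can be freely interleaved in both lists
mllm_test_case = MLLMTestCase(
    input=[
        "Change the color of the shoes in this image to red",
        MLLMImage(url="./shoes.png", local=True),  # hypothetical local file
    ],
    actual_output=[
        "Here are the edited shoes:",
        MLLMImage(url="https://example.com/edited-shoes.png", local=False),  # hypothetical hosted image
    ],
)
```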

## Usage

To create a custom metric that uses MLLMs for evaluation, simply instantiate a `MultimodalGEval` class and **define your evaluation criteria in everyday language**:

```python
from deepeval import evaluate
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paper_plane_1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paper_plane_2", local=True)
    ]
)
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])
```

There are **THREE** mandatory and **SEVEN** optional parameters when instantiating a `MultimodalGEval` class (see the example after this list):

- `name`: the name of your custom metric.
- `criteria`: a description outlining the specific evaluation aspects for each test case.
- `evaluation_params`: a list of type `MLLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
- [Optional] `evaluation_steps`: a list of strings outlining the exact steps the LLM should take for evaluation. If `evaluation_steps` is not provided, `MultimodalGEval` will generate a series of `evaluation_steps` on your behalf based on the provided `criteria`.
- [Optional] `rubric`: a list of `Rubric`s that allows you to [confine the range](/docs/metrics-llm-evals#rubric) of the final metric score.
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, **OR** [any custom LLM model](/docs/metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-metrics-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
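
Here is a sketch that sets a few of the optional parameters above; the values are illustrative rather than recommendations:

```python
...

strict_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,       # raise the passing bar from the default 0.5
    model="gpt-4o",      # or any custom model of type DeepEvalBaseLLM
    async_mode=True,     # allow concurrent execution inside measure()
    verbose_mode=True,   # print intermediate evaluation steps to the console
)
```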

:::danger
For accurate and valid results, only the parameters that are mentioned in `criteria`/`evaluation_steps` should be included as members of `evaluation_params`.
:::
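
For instance, if your criteria compares the actual output against the input, both parameters should appear in `evaluation_params`. This pairing is only an illustration:

```python
...

prompt_alignment = MultimodalGEval(
    name="Prompt Alignment",
    criteria="Assess whether the text and images in the actual output follow the instructions given in the input.",
    evaluation_params=[
        MLLMTestCaseParams.INPUT,          # mentioned in the criteria
        MLLMTestCaseParams.ACTUAL_OUTPUT,  # mentioned in the criteria
    ],
)
```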

### Evaluation Steps

Similar to regular [`GEval`](/docs/metrics-llm-evals), providing `evaluation_steps` tells `MultimodalGEval` to follow your `evaluation_steps` for evaluation instead of first generating them from `criteria`, which allows for more controllable metric scores:

```python
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    evaluation_steps=[
        "Evaluate whether the images and the accompanying text in the actual output logically match and support each other.",
        "Check if the visual elements (images) enhance or contradict the meaning conveyed by the text.",
        "If there is a lack of coherence, identify where and how the text and images diverge or create confusion.",
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

### Rubric

You can also provide `Rubric`s through the `rubric` argument to confine your evaluation MLLM to output in specific score ranges:

```python
from deepeval.metrics.g_eval import Rubric
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    rubric=[
        Rubric(score_range=(1, 3), expected_outcome="Text and image are incoherent or conflicting."),
        Rubric(score_range=(4, 7), expected_outcome="Partial coherence with some mismatches."),
        Rubric(score_range=(8, 10), expected_outcome="Text and image are clearly coherent and aligned."),
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

Note that `score_range` ranges from **0 - 10, inclusive** and different `Rubric`s must not have overlapping `score_range`s. You can also specify `score_range`s where the start and end values are the same to represent a single score.
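
For example, you could reserve the top score for flawless outputs by using a single-score band alongside wider ranges; the bands below are illustrative, not prescriptive:

```python
...

detailed_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    rubric=[
        Rubric(score_range=(0, 5), expected_outcome="Text and image are incoherent or only loosely related."),
        Rubric(score_range=(6, 9), expected_outcome="Mostly coherent, with minor mismatches."),
        Rubric(score_range=(10, 10), expected_outcome="Text and image are flawlessly coherent and aligned."),
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```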

:::tip
This is an optional improvement done by `deepeval` in addition to the original implementation in the `GEval` paper.
:::

### As a standalone

You can also run `MultimodalGEval` on a single test case as a standalone, one-off execution.

```python
...

text_image_coherence.measure(m_test_case)
print(text_image_coherence.score, text_image_coherence.reason)
```

:::caution
This is great for debugging or if you wish to build your own evaluation pipeline, but you will **NOT** get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the `evaluate()` function or `deepeval test run` offers.
:::
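
If you do want those benefits, the same metric can instead be used inside a test file. A minimal sketch, where the file and function names are placeholders:

```python
# test_text_image_coherence.py
from deepeval import assert_test
...

def test_text_image_coherence():
    # fails the test if the metric score falls below its threshold
    assert_test(m_test_case, [text_image_coherence])
```

You can then run it with `deepeval test run test_text_image_coherence.py`.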

## How Is It Calculated?

The `MultimodalGEval` is an adapted version of [`GEval`](/docs/metrics-llm-evals), so like `GEval`, the `MultimodalGEval` metric is a two-step algorithm: it first generates a series of `evaluation_steps` using chain of thoughts (CoTs) based on the given `criteria`, and then uses the generated `evaluation_steps` to determine the final score based on the `evaluation_params` provided through the `MLLMTestCase`.

Unlike regular `GEval` though, `MultimodalGEval` takes images into consideration as well.

:::tip
Similar to the original [G-Eval paper](https://arxiv.org/abs/2303.16634), the `MultimodalGEval` metric uses the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation. This step was introduced in the paper to minimize bias in LLM scoring, and is automatically handled by `deepeval` (unless you're using a custom LLM).
:::
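
Conceptually, the weighted summation looks like the sketch below. The candidate scores and token probabilities are made-up numbers purely for illustration, since `deepeval` performs this normalization internally:

```python
# Hypothetical probabilities the evaluation LLM assigned to each candidate score token
token_probabilities = {8: 0.10, 9: 0.60, 10: 0.30}

# Weighted summation: each candidate score is weighted by its token probability
weighted_score = sum(score * prob for score, prob in token_probabilities.items())

print(weighted_score)  # 9.2 on the 0-10 scale, i.e. 0.92 once scaled to a 0-1 metric score
```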