---
id: multimodal-metrics-g-eval
title: Multimodal G-Eval
sidebar_label: Multimodal G-Eval
---

<head>
  <link rel="canonical" href="https://deepeval.com/docs/metrics-llm-evals" />
</head>

import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";

<MetricTagsDisplayer custom={true} multimodal={true} />

The multimodal G-Eval is an adapted version of `deepeval`'s popular [`GEval` metric](/docs/metrics-llm-evals), built for evaluating multimodal LLM interactions instead.

It is currently the best way to define custom criteria to evaluate text + images in `deepeval`. By defining a custom `MultimodalGEval`, you can easily determine how well your MLLMs are generating, editing, and referencing images, for example.

## Required Arguments

To use the `MultimodalGEval`, you'll have to provide the following arguments when creating an [`MLLMTestCase`](/docs/evaluation-test-cases#mllm-test-case):

- `input`
- `actual_output`

You'll also need to supply any additional arguments, such as `expected_output` and `context`, if your evaluation criteria depend on these parameters.

:::tip
The `input`s and `actual_output`s of an `MLLMTestCase` are lists of strings and/or `MLLMImage` objects.
:::
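
For example, a test case for an image-editing interaction might interleave text and images in both lists. The prompt, file path, and URL below are purely illustrative:

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

# Strings and MLLMImage objects can be freely interleaved in both lists
mllm_test_case = MLLMTestCase(
    input=[
        "Change the color of the shoes in this image to red",
        MLLMImage(url="./shoes.png", local=True),  # hypothetical local file
    ],
    actual_output=[
        "Here are the edited shoes:",
        MLLMImage(url="https://example.com/edited-shoes.png", local=False),  # hypothetical hosted image
    ],
)
```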

## Usage

To create a custom metric that uses MLLMs for evaluation, simply instantiate a `MultimodalGEval` class and **define your evaluation criteria in everyday language**:

```python
from deepeval import evaluate
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paper_plane_1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paper_plane_2", local=True)
    ]
)
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])
```

There are **THREE** mandatory and **SEVEN** optional parameters when instantiating a `MultimodalGEval` class (see the example after this list):

- `name`: the name of your custom metric.
- `criteria`: a description outlining the specific evaluation aspects for each test case.
- `evaluation_params`: a list of type `MLLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
- [Optional] `evaluation_steps`: a list of strings outlining the exact steps the LLM should take for evaluation. If `evaluation_steps` is not provided, `MultimodalGEval` will generate a series of `evaluation_steps` on your behalf based on the provided `criteria`.
- [Optional] `rubric`: a list of `Rubric`s that allows you to [confine the range](/docs/metrics-llm-evals#rubric) of the final metric score.
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, **OR** [any custom LLM model](/docs/metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-metrics-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
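
Here is a sketch that sets a few of the optional parameters above; the values are illustrative rather than recommendations:

```python
...

strict_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,       # raise the passing bar from the default 0.5
    model="gpt-4o",      # or any custom model of type DeepEvalBaseLLM
    async_mode=True,     # allow concurrent execution inside measure()
    verbose_mode=True,   # print intermediate evaluation steps to the console
)
```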

:::danger
For accurate and valid results, only the parameters that are mentioned in `criteria`/`evaluation_steps` should be included as members of `evaluation_params`.
:::
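
For instance, if your criteria compares the actual output against the input, both parameters should appear in `evaluation_params`. This pairing is only an illustration:

```python
...

prompt_alignment = MultimodalGEval(
    name="Prompt Alignment",
    criteria="Assess whether the text and images in the actual output follow the instructions given in the input.",
    evaluation_params=[
        MLLMTestCaseParams.INPUT,          # mentioned in the criteria
        MLLMTestCaseParams.ACTUAL_OUTPUT,  # mentioned in the criteria
    ],
)
```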

### Evaluation Steps

Similar to regular [`GEval`](/docs/metrics-llm-evals), providing `evaluation_steps` tells `MultimodalGEval` to follow your `evaluation_steps` for evaluation instead of first generating them from `criteria`, which allows for more controllable metric scores:

```python
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    evaluation_steps=[
        "Evaluate whether the images and the accompanying text in the actual output logically match and support each other.",
        "Check if the visual elements (images) enhance or contradict the meaning conveyed by the text.",
        "If there is a lack of coherence, identify where and how the text and images diverge or create confusion.",
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

### Rubric

You can also provide `Rubric`s through the `rubric` argument to confine your evaluation MLLM to output in specific score ranges:

```python
from deepeval.metrics.g_eval import Rubric
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    rubric=[
        Rubric(score_range=(1, 3), expected_outcome="Text and image are incoherent or conflicting."),
        Rubric(score_range=(4, 7), expected_outcome="Partial coherence with some mismatches."),
        Rubric(score_range=(8, 10), expected_outcome="Text and image are clearly coherent and aligned."),
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

Note that `score_range` ranges from **0 - 10, inclusive** and different `Rubric`s must not have overlapping `score_range`s. You can also specify `score_range`s where the start and end values are the same to represent a single score.
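
For example, you could reserve the top score for flawless outputs by using a single-score band alongside wider ranges; the bands below are illustrative, not prescriptive:

```python
...

detailed_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    rubric=[
        Rubric(score_range=(0, 5), expected_outcome="Text and image are incoherent or only loosely related."),
        Rubric(score_range=(6, 9), expected_outcome="Mostly coherent, with minor mismatches."),
        Rubric(score_range=(10, 10), expected_outcome="Text and image are flawlessly coherent and aligned."),
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```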

:::tip
This is an optional improvement done by `deepeval` in addition to the original implementation in the `GEval` paper.
:::

### As a standalone

You can also run `MultimodalGEval` on a single test case as a standalone, one-off execution.

```python
...

text_image_coherence.measure(m_test_case)
print(text_image_coherence.score, text_image_coherence.reason)
```

:::caution
This is great for debugging or if you wish to build your own evaluation pipeline, but you will **NOT** get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the `evaluate()` function or `deepeval test run` offers.
:::
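
If you do want those benefits, the same metric can instead be used inside a test file. A minimal sketch, where the file and function names are placeholders:

```python
# test_text_image_coherence.py
from deepeval import assert_test
...

def test_text_image_coherence():
    # fails the test if the metric score falls below its threshold
    assert_test(m_test_case, [text_image_coherence])
```

You can then run it with `deepeval test run test_text_image_coherence.py`.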

## How Is It Calculated?

The `MultimodalGEval` is an adapted version of [`GEval`](/docs/metrics-llm-evals), so like `GEval`, the `MultimodalGEval` metric is a two-step algorithm: it first generates a series of `evaluation_steps` using chain of thoughts (CoTs) based on the given `criteria`, and then uses the generated `evaluation_steps` to determine the final score based on the `evaluation_params` provided through the `MLLMTestCase`.

Unlike regular `GEval` though, `MultimodalGEval` takes images into consideration as well.

:::tip
Similar to the original [G-Eval paper](https://arxiv.org/abs/2303.16634), the `MultimodalGEval` metric uses the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation. This step was introduced in the paper to minimize bias in LLM scoring, and is automatically handled by `deepeval` (unless you're using a custom LLM).
:::
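
Conceptually, the weighted summation looks like the sketch below. The candidate scores and token probabilities are made-up numbers purely for illustration, since `deepeval` performs this normalization internally:

```python
# Hypothetical probabilities the evaluation LLM assigned to each candidate score token
token_probabilities = {8: 0.10, 9: 0.60, 10: 0.30}

# Weighted summation: each candidate score is weighted by its token probability
weighted_score = sum(score * prob for score, prob in token_probabilities.items())

print(weighted_score)  # 9.2 on the 0-10 scale, i.e. 0.92 once scaled to a 0-1 metric score
```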