
Commit 9dc8b1c

Merge pull request #255 from confident-ai/feature/updatedocs
Feature/updatedocs
2 parents: 58b6d76 + 4534c23

3 files changed: 98 additions & 53 deletions

docs/docs/evaluation-metrics.mdx

Lines changed: 13 additions & 23 deletions
@@ -42,40 +42,30 @@ A custom LLM evaluated metric is a custom metric whose evaluation is powered by

 ```python
 from deepeval.metrics.llm_eval_metric import LLMEvalMetric
+from deepeval.types import LLMTestCaseParams

-funny_metric = LLMEvalMetric(
-    name="Funny",
-    criteria="How funny it is",
+summarization_metric = LLMEvalMetric(
+    name="Summarization",
+    criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
+    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
     minimum_score=0.5
 )
 ```

-There are two mandatory and two optional parameters required when instantiating an `LLMEvalMetric` class:
+There are three mandatory and one optional parameter required when instantiating an `LLMEvalMetric` class:

-- `name`
-- `criteria`
+- `name`: the name of the metric
+- `criteria`: a description outlining the specific evaluation aspects for each test case.
+- `evaluation_params`: a list of type `LLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
 - [Optional] `minimum_score`
-- [Optional] `completion_function`

-All instances of `LLMEvalMetric` return a score ranging from 0-1. A metric is only successful if the evaluation score is equal to or greater than `minimum_score`.
+All instances of `LLMEvalMetric` return a score ranging from 0 - 1. A metric is only successful if the evaluation score is equal to or greater than `minimum_score`.

-:::info
-`LLMEvalMetric` may or may not require `context` or `expected_output` supplied to `LLMTestCase`, but we recommend providing both arguments where possible for the most accurate evaluation.
+:::danger
+For accurate and valid results, only the parameters mentioned in `criteria` should be included as members of `evaluation_params`.
 :::

-You can also supply a custom `completion_function` if, for example, you want to utilize another LLM provider to evaluate your `LLMTestCase`. By default, `deepeval` uses the `openai` chat completion function.
-
-```python
-def make_chat_completion_request(prompt: str):
-    response = openai.ChatCompletion.create(
-        model="gpt-3.5-turbo",
-        messages=[
-            {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": prompt},
-        ],
-    )
-    return response.choices[0].message.content
-```
+By default, `LLMEvalMetric` is evaluated using `GPT-4` from OpenAI.

 ## Custom Classic Metrics

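For readers skimming this diff, here is a rough sketch of how the newly documented metric could be exercised on its own, outside of `assert_test`. The `LLMTestCase` import path and the `measure()` / `is_successful()` calls are assumptions based on deepeval's usual metric interface, not something shown in this commit.

```python
# Sketch only: assumes deepeval's standard metric interface (measure / is_successful)
# and the deepeval.test_case import path; neither is confirmed by this diff.
from deepeval.test_case import LLMTestCase
from deepeval.metrics.llm_eval_metric import LLMEvalMetric
from deepeval.types import LLMTestCaseParams

summarization_metric = LLMEvalMetric(
    name="Summarization",
    criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    minimum_score=0.5,
)

# Hypothetical test case for illustration
test_case = LLMTestCase(
    input="What if these shoes don't fit? I want a full refund.",
    actual_output="If the shoes don't fit, the customer wants a full refund.",
)

score = summarization_metric.measure(test_case)  # 0 - 1 score produced by the evaluating LLM
print(score, summarization_metric.is_successful())  # successful if score >= minimum_score
```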
docs/docs/getting-started.mdx

Lines changed: 49 additions & 15 deletions
@@ -103,14 +103,20 @@ from deepeval.metrics.llm_eval_metric import LLMEvalMetric

 ...

-def test_humor():
-    input = "What if these shoes don't fit?"
+def test_summarization():
+    input = "What if these shoes don't fit? I want a full refund."

-    # Replace this with the actual output of your LLM application
-    actual_output = "We offer a 30-day full refund at no extra cost."
-    funny_metric = LLMEvalMetric(name="Funny Metric", criteria="How funny it is", minimum_score=0.3)
+    # Replace this with the actual output from your LLM application
+    actual_output = "If the shoes don't fit, the customer wants a full refund."
+
+    summarization_metric = LLMEvalMetric(
+        name="Summarization",
+        criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
+        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
+        minimum_score=0.5
+    )
     test_case = LLMTestCase(input=input, actual_output=actual_output)
-    assert_test(test_case, [length_metric])
+    assert_test(test_case, [summarization_metric])
 ```

 ### Classic Metrics

@@ -181,10 +187,15 @@ def test_everything():
     actual_output = "We offer a 30-day full refund at no extra cost."
     factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.7)
     length_metric = LengthMetric(max_length=10)
-    funny_metric = LLMEvalMetric(name="Funny Metric", criteria="How funny it is", minimum_score=0.3)
+    summarization_metric = LLMEvalMetric(
+        name="Summarization",
+        criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
+        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
+        minimum_score=0.5
+    )

     test_case = LLMTestCase(input=input, actual_output=actual_output, context=context)
-    assert_test(test_case, [factual_consistency_metric, length_metric, funny_metric])
+    assert_test(test_case, [factual_consistency_metric, length_metric, summarization_metric])
 ```

 In this scenario, `test_everything` only passes if all metrics are passing. Run `deepeval test run` again to see the results:

@@ -267,20 +278,16 @@ deepeval test run test_bulk.py

 If you have reached this point, you've likely run `deepeval test run` multiple times. To keep track of all future evaluation results created by `deepeval`, login to **[Confident AI](https://app.confident-ai.com/auth/signup)** by running the following command:

-```
-
+```console
 deepeval login
-
 ```

 **Confident AI** is the platform powering `deepeval`, and offers deep insights to help you quickly figure out how to best implement your LLM application. Follow the instructions displayed on the CLI to create an account, get your Confident API key, and paste it in the CLI.

 Once you've pasted your Confident API key in the CLI, run:

-```
-
-deepeval test run test_examply.py
-
+```console
+deepeval test run test_example.py
 ```

 ### View Test Run

@@ -295,6 +302,33 @@ You can also view individual test cases for enhanced debugging:

 ![ok](https://d2lsxfc3p6r9rv.cloudfront.net/dashboard2.png)

+### Compare Hyperparameters
+
+To log hyperparameters (such as the prompt templates used) for your LLM application, paste the following code into `test_example.py`:
+
+```python title="test_example.py"
+import deepeval
+
+...
+
+@deepeval.set_hyperparameters
+def hyperparameters():
+    return {
+        "chunk_size": 500,
+        "temperature": 0,
+        "model": "GPT-4",
+        "prompt_template": """You are a helpful assistant, answer the following question in a non-judgemental tone.
+
+Question:
+{question}
+""",
+    }
+```
+
+Execute `deepeval test run test_example.py` again to start comparing hyperparameters for each test run.
+
+![ok](https://d2lsxfc3p6r9rv.cloudfront.net/dashboard3.png)
+
 ## Full Example

 You can find the full example [here on our Github](https://github.com/confident-ai/deepeval/blob/main/examples/getting_started/test_example.py).

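The logged `prompt_template` is only useful for comparison if it matches the prompt the application actually sends to the model. Here is a minimal, hypothetical sketch of keeping the two in sync; the `build_prompt` helper is illustrative and not part of this commit.

```python
# Hypothetical helper: reuse the same template that is logged via
# @deepeval.set_hyperparameters so dashboard comparisons reflect the real prompt.
PROMPT_TEMPLATE = """You are a helpful assistant, answer the following question in a non-judgemental tone.

Question:
{question}
"""

def build_prompt(question: str) -> str:
    # Fill the {question} placeholder before the prompt is sent to the LLM.
    return PROMPT_TEMPLATE.format(question=question)

print(build_prompt("What if these shoes don't fit?"))
```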
examples/getting_started/test_example.py

Lines changed: 36 additions & 15 deletions
@@ -5,6 +5,7 @@
 from deepeval.metrics.llm_eval_metric import LLMEvalMetric
 from deepeval.types import LLMTestCaseParams
 from deepeval.metrics.base_metric import BaseMetric
+import deepeval


 def test_factual_consistency():

@@ -22,19 +23,23 @@ def test_factual_consistency():
     assert_test(test_case, [factual_consistency_metric])


-def test_humor():
-    input = "What if these shoes don't fit?"
+def test_summarization():
+    input = "What if these shoes don't fit? I want a full refund."

     # Replace this with the actual output from your LLM application
-    actual_output = "We offer a 30-day full refund at no extra cost."
-    funny_metric = LLMEvalMetric(
-        name="Funny Metric",
-        criteria="How funny the actual output is",
-        minimum_score=0.3,
-        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
+    actual_output = "If the shoes don't fit, the customer wants a full refund."
+
+    summarization_metric = LLMEvalMetric(
+        name="Summarization",
+        criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
+        evaluation_params=[
+            LLMTestCaseParams.INPUT,
+            LLMTestCaseParams.ACTUAL_OUTPUT,
+        ],
+        minimum_score=0.5,
     )
     test_case = LLMTestCase(input=input, actual_output=actual_output)
-    assert_test(test_case, [funny_metric])
+    assert_test(test_case, [summarization_metric])


 class LengthMetric(BaseMetric):

@@ -78,16 +83,32 @@ def test_everything():
     actual_output = "We offer a 30-day full refund at no extra cost."
     factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.7)
     length_metric = LengthMetric(max_length=10)
-    funny_metric = LLMEvalMetric(
-        name="Funny Metric",
-        criteria="How funny it is",
-        minimum_score=0.3,
-        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
+    summarization_metric = LLMEvalMetric(
+        name="Summarization",
+        criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
+        evaluation_params=[
+            LLMTestCaseParams.INPUT,
+            LLMTestCaseParams.ACTUAL_OUTPUT,
+        ],
+        minimum_score=0.5,
     )

     test_case = LLMTestCase(
         input=input, actual_output=actual_output, context=context
     )
     assert_test(
-        test_case, [factual_consistency_metric, length_metric, funny_metric]
+        test_case,
+        [factual_consistency_metric, length_metric, summarization_metric],
     )
+
+
+@deepeval.set_hyperparameters
+def hyperparameters():
+    return {
+        "model": "GPT-4",
+        "prompt_template": """You are a helpful assistant, answer the following question in a non-judgemental tone.
+
+Question:
+{question}
+""",
+    }

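To try the updated example, the docs run `deepeval test run test_example.py` from the file's own directory; from the repository root the equivalent would presumably be the full relative path (path handling here is an assumption, since the command appears to hand the file path to the underlying test runner).

```console
deepeval test run examples/getting_started/test_example.py
```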