
Commit 065246b

Update readme.md with open-webui & envoy-ai-gateway usage (#365)
* Update readme.md (Signed-off-by: kerthcet <[email protected]>)
* Update (Signed-off-by: kerthcet <[email protected]>)
* Update (Signed-off-by: kerthcet <[email protected]>)
* Update (Signed-off-by: kerthcet <[email protected]>)
* Update (Signed-off-by: kerthcet <[email protected]>)

Signed-off-by: kerthcet <[email protected]>
1 parent 846cb80 commit 065246b

File tree

10 files changed (+228, -245 lines)


README.md

Lines changed: 7 additions & 8 deletions
```diff
@@ -32,13 +32,13 @@ Easy, advanced inference platform for large language models on Kubernetes
 
 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
 - **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
-- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.
 - **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
-- **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.
 - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
 - **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
+- **AI Gateway Support**: Offering capabilities like token-based rate limiting, model routing with the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
+- **Build-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capacities like function call, RAG, web search and more, see configurations [here](./docs/open-webui.md).
 - **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
-- **Build-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), see configurations [here](./docs/open-webui.md).
+- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.
 
 ## Quick Start
 
```
```diff
@@ -51,7 +51,7 @@ Read the [Installation](./docs/installation.md) for guidance.
 Here's a toy example for deploying `facebook/opt-125m`, all you need to do
 is to apply a `Model` and a `Playground`.
 
-If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md), or more [examples](/docs/examples/README.md) here.
+If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md).
 
 > Note: if your model needs Huggingface token for weight downloads, please run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>` ahead.
 
```
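For orientation, the two resources the quick start applies look roughly like the sketch below. This is a hedged example based on the project's published samples; the exact field names and API versions may differ between releases, so treat it as illustrative only.

```bash
# Minimal sketch of the quick-start resources: an OpenModel pointing at the
# HuggingFace model and a Playground that claims it. Field names follow the
# project's examples and may differ between releases.
kubectl apply -f - <<EOF
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
EOF
```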
```diff
@@ -118,14 +118,13 @@ curl http://localhost:8080/v1/completions \
 
 ### More than quick-start
 
-If you want to learn more about this project, please refer to [develop.md](./docs/develop.md).
+Please refer to [examples](./docs/examples/README.md) for more tutorials or read [develop.md](./docs/develop.md) to learn more about the project.
 
 ## Roadmap
 
-- Gateway support for traffic routing
-- Metrics support
 - Serverless support for cloud-agnostic users
-- CLI tool support
+- Prefill-Decode disaggregated serving
+- KV cache offload support
 - Model training, fine tuning in the long-term
 
 ## Community
```

chart/Chart.lock

Lines changed: 8 additions & 2 deletions
```diff
@@ -2,5 +2,11 @@ dependencies:
 - name: open-webui
   repository: https://helm.openwebui.com/
   version: 6.4.0
-digest: sha256:2520f6e26f2e6fd3e51c5f7f940eef94217c125a9828b0f59decedbecddcdb29
-generated: "2025-04-21T00:50:06.532039+08:00"
+- name: gateway-helm
+  repository: oci://registry-1.docker.io/envoyproxy/
+  version: 0.0.0-latest
+- name: ai-gateway-helm
+  repository: oci://registry-1.docker.io/envoyproxy/
+  version: v0.0.0-latest
+digest: sha256:c7b1aa22097a6a1a6f4dd04beed3287ab8ef2ae1aec8a9a4ec7a71251be23e4c
+generated: "2025-04-22T20:15:43.343515+08:00"
```

chart/Chart.yaml

Lines changed: 6 additions & 6 deletions
```diff
@@ -25,11 +25,11 @@ dependencies:
     version: "6.4.0"
     repository: "https://helm.openwebui.com/"
     condition: open-webui.enabled
-  - name: envoy-gateway
-    version: v1.3.2
-    repository: oci://docker.io/envoyproxy/gateway-helm
+  - name: gateway-helm
+    version: 0.0.0-latest
+    repository: "oci://registry-1.docker.io/envoyproxy/"
     condition: envoy-gateway.enabled
-  - name: envoy-ai-gateway
-    version: v0.1.5
-    repository: oci://docker.io/envoyproxy/ai-gateway-helm
+  - name: ai-gateway-helm
+    version: v0.0.0-latest
+    repository: "oci://registry-1.docker.io/envoyproxy/"
     condition: envoy-ai-gateway.enabled
```
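Renaming the dependencies to the charts' published names and repinning their versions also requires re-resolving the lockfile, which is what the `Chart.lock` change above reflects. A typical sketch, assuming it is run from the repository root:

```bash
# Re-resolve the dependencies declared in chart/Chart.yaml and
# rewrite chart/Chart.lock with the new digest.
helm dependency update ./chart
```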

chart/values.global.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -34,7 +34,7 @@ prometheus:
   enabled: true
 
 open-webui:
-  enabled: false
+  enabled: true
   persistence:
     enabled: false
   enableOpenaiApi: true
```
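Since Open WebUI is now enabled by default, users who do not want the bundled chat UI can switch it off at install time. A hedged sketch; the release name, chart path, and namespace below are illustrative:

```bash
# Install or upgrade llmaz with the bundled Open WebUI disabled.
helm upgrade --install llmaz ./chart \
  --namespace llmaz-system --create-namespace \
  --set open-webui.enabled=false
```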

docs/envoy-ai-gateway.md

Lines changed: 106 additions & 0 deletions
# Envoy AI Gateway

[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway to handle request traffic from application clients to Generative AI services.

## How to use

### 1. Enable Envoy Gateway and Envoy AI Gateway

Both are enabled by default in `values.global.yaml` and will be deployed in the llmaz-system namespace.

```yaml
envoy-gateway:
  enabled: true
envoy-ai-gateway:
  enabled: true
```

However, [Envoy Gateway](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway](https://aigateway.envoyproxy.io/docs/getting-started/) can also be deployed standalone in case you want to run them in other namespaces.
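The linked installation guides are authoritative for a standalone setup; as a rough sketch, it would look something like the commands below. The release names, namespaces, and the rolling `v0.0.0-latest` versions are illustrative assumptions; pin real versions in practice.

```bash
# Install Envoy Gateway standalone (illustrative release name and namespace).
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system --create-namespace

# Install Envoy AI Gateway standalone on top of it.
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system --create-namespace
```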
### 2. Basic AI Gateway Example

To expose your models via Envoy Gateway, you need to create a GatewayClass, a Gateway, and an AIGatewayRoute. The following example shows how to do this.

We'll deploy two models, `Qwen/Qwen2-0.5B-Instruct-GGUF` and `Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF`, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.

The full example is [here](./examples/envoy-ai-gateway/basic.yaml); apply it to your cluster.
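Applying it from the repository root would look roughly like this; a minimal sketch, assuming the path linked above and the Gateway API CRDs installed by Envoy Gateway:

```bash
# Create the GatewayClass, Gateway, AIGatewayRoute and the two model deployments.
kubectl apply -f docs/examples/envoy-ai-gateway/basic.yaml

# Check that the Gateway has been accepted and programmed by Envoy Gateway.
kubectl get gateway -A
```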
### 3. Check Envoy AI Gateway APIs

If Open-WebUI is enabled, you can chat via the web UI (recommended); see the [documentation](./open-webui.md). Otherwise, follow the steps below to test the Envoy AI Gateway APIs.

I. Port-forward the `LoadBalancer` service in llmaz-system on port 8080.
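For example, a hedged sketch: the service name is generated by Envoy Gateway, so look it up first, and the target port of 80 is an assumption.

```bash
# Find the generated Envoy LoadBalancer service in llmaz-system.
kubectl get svc -n llmaz-system

# Forward local port 8080 to it (replace the placeholder with the real name).
kubectl -n llmaz-system port-forward svc/<envoy-lb-service> 8080:80
```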
II. Query the models endpoint, e.g. `curl http://localhost:8080/v1/models | jq .`; the available models will be listed. The expected response looks like this:
```json
{
  "data": [
    {
      "id": "qwen2-0.5b",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    },
    {
      "id": "qwen2.5-coder",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    }
  ],
  "object": "list"
}
```
III. Query `http://localhost:8080/v1/chat/completions` to chat with the model. Here we ask the `qwen2-0.5b` model; the query looks like:
```bash
curl -H "Content-Type: application/json" -d '{
    "model": "qwen2-0.5b",
    "messages": [
        {
            "role": "system",
            "content": "Hi."
        }
    ]
}' http://localhost:8080/v1/chat/completions | jq .
```
The expected response looks like this:
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
      }
    }
  ],
  "created": 1745327371,
  "model": "qwen2-0.5b",
  "system_fingerprint": "b5124-bc091a4d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 10,
    "total_tokens": 20
  },
  "id": "chatcmpl-AODlT8xnf4OjJwpQH31XD4yehHLnurr0",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 319.876,
    "prompt_per_token_ms": 319.876,
    "prompt_per_second": 3.1262114069201816,
    "predicted_n": 10,
    "predicted_ms": 1309.393,
    "predicted_per_token_ms": 130.9393,
    "predicted_per_second": 7.63712651587415
  }
}
```

docs/examples/envoy-ai-gateway/README.md

Lines changed: 0 additions & 101 deletions
This file was deleted.
