Easy, advanced inference platform for large language models on Kubernetes.
- **Ease of Use**: Users can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, such as [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), and [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz brings the latest cutting-edge research, like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) (WIP), to Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), and object stores. llmaz automatically handles model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
- **AI Gateway Support**: Offers capabilities like token-based rate limiting and model routing through the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
- **Built-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more; see the configuration [here](./docs/open-webui.md).
- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development with ongoing architecture reframing.
## Quick Start
Read the [Installation](./docs/installation.md) for guidance.
Here's a toy example for deploying `facebook/opt-125m`; all you need to do is apply a `Model` and a `Playground`.
If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md).
> Note: if your model requires a Hugging Face token for downloading weights, please run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>` beforehand.
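
To make the toy example concrete, here is a minimal sketch of what such a manifest pair can look like. The API groups, kinds, and field names below are illustrative assumptions and may not match the actual llmaz CRDs exactly; refer to the [llama.cpp example](/docs/examples/llamacpp/README.md) for authoritative manifests.

```yaml
# Sketch only: API versions and field names are illustrative assumptions,
# not necessarily the exact llmaz CRD schema.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m     # model weights fetched from the model hub
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m              # binds the Playground to the Model above
```

Per the Quick Start above, applying both resources with `kubectl apply -f` should be all that's needed.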
# Envoy AI Gateway

[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway to handle request traffic from application clients to Generative AI services.
## How to use
### 1. Enable Envoy Gateway and Envoy AI Gateway
Both of them are enabled by default in `values.global.yaml` and will be deployed in the `llmaz-system` namespace.
```yaml
envoy-gateway:
  enabled: true
envoy-ai-gateway:
  enabled: true
```
However, [Envoy Gateway](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway](https://aigateway.envoyproxy.io/docs/getting-started/) can also be deployed standalone if you want to run them in other namespaces.
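
As a rough sketch of what a standalone installation can look like (chart locations, release names, and namespaces below are assumptions; follow the linked install guides for the exact, up-to-date commands):

```bash
# Illustrative only; consult the Envoy Gateway and Envoy AI Gateway install guides
# for the current chart references and versions.
helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version <version> \
  --namespace envoy-gateway-system --create-namespace

helm install aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version <version> \
  --namespace envoy-ai-gateway-system --create-namespace
```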
### 2. Basic AI Gateway Example
To expose your models via Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.
We'll deploy two models, `Qwen/Qwen2-0.5B-Instruct-GGUF` and `Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF`, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.
The full example is [here](./examples/envoy-ai-gateway/basic.yaml); apply it to your cluster.
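
For orientation, a trimmed sketch of the three resources is shown below. The controller name, listener, schema, and backend references are illustrative assumptions; the linked `basic.yaml` is the authoritative version.

```yaml
# Trimmed, illustrative sketch; see examples/envoy-ai-gateway/basic.yaml for the real manifests.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-ai-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: envoy-ai-gateway
spec:
  gatewayClassName: envoy-ai-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: envoy-ai-gateway
spec:
  schema:
    name: OpenAI                      # requests follow the OpenAI API shape
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: envoy-ai-gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model     # route on the requested model name
              value: qwen2-0.5b
      backendRefs:
        - name: qwen2-0.5b            # backend serving the first model
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen2.5-coder
      backendRefs:
        - name: qwen2.5-coder         # backend serving the second model
```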
### 3. Check Envoy AI Gateway APIs
If Open WebUI is enabled, you can chat via the web UI (recommended); see the [documentation](./open-webui.md). Otherwise, follow the steps below to test the Envoy AI Gateway APIs.
I. Port-forward the `LoadBalancer` service in the `llmaz-system` namespace to local port 8080.
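
For example (the generated service name varies per installation, so it is shown as a placeholder below):

```bash
# List services in llmaz-system and find the Envoy-managed LoadBalancer service.
kubectl get svc -n llmaz-system

# Forward local port 8080 to the gateway service; replace <envoy-service>
# (and the target port, if different) with what you found above.
kubectl port-forward -n llmaz-system svc/<envoy-service> 8080:80
```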
II. Query `http://localhost:8080/v1/models | jq .` to list the available models. The expected response looks like this:
```json
{
  "data": [
    {
      "id": "qwen2-0.5b",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    },
    {
      "id": "qwen2.5-coder",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    }
  ],
  "object": "list"
}
```
III. Query `http://localhost:8080/v1/chat/completions` to chat with the model. Here we ask the `qwen2-0.5b` model; the query will look like the sketch below.
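
The request body is not included in this excerpt; as a minimal sketch, an OpenAI-compatible chat completion request could look like this (the prompt is illustrative):

```bash
# Illustrative request against the port-forwarded gateway; adjust the prompt as you like.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-0.5b",
    "messages": [
      {"role": "user", "content": "Hello! Who are you?"}
    ]
  }' | jq .
```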