Easy, advanced inference platform for large language models on Kubernetes.
- **Ease of Use**: Users can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, such as [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), and [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz brings the latest cutting-edge research, like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) (WIP), to Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), and object stores. llmaz automatically handles model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
- **AI Gateway Support**: Offers capabilities like token-based rate limiting and model routing through the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
- **Built-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more; see the configuration [here](./docs/open-webui.md).
- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development with ongoing architecture reframing.
## Quick Start
Read the [Installation](./docs/installation.md) for guidance.
Here's a toy example for deploying `facebook/opt-125m`; all you need to do is apply a `Model` and a `Playground`.
If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md).
> Note: if your model requires a Hugging Face token for downloading weights, please run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>` beforehand.
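
To make the toy example concrete, here is a minimal sketch of what such a manifest pair can look like. The API groups, kinds, and field names below are illustrative assumptions and may not match the actual llmaz CRDs exactly; refer to the [llama.cpp example](/docs/examples/llamacpp/README.md) for authoritative manifests.

```yaml
# Sketch only: API versions and field names are illustrative assumptions,
# not necessarily the exact llmaz CRD schema.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m     # model weights fetched from the model hub
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m              # binds the Playground to the Model above
```

Per the Quick Start above, applying both resources with `kubectl apply -f` should be all that's needed.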
# Envoy AI Gateway

[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway to handle request traffic from application clients to Generative AI services.
## How to use
### 1. Enable Envoy Gateway and Envoy AI Gateway
Both of them are enabled by default in `values.global.yaml` and will be deployed in the `llmaz-system` namespace.
```yaml
envoy-gateway:
  enabled: true
envoy-ai-gateway:
  enabled: true
```
However, [Envoy Gateway](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway](https://aigateway.envoyproxy.io/docs/getting-started/) can also be deployed standalone if you want to run them in other namespaces.
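
As a rough sketch of what a standalone installation can look like (chart locations, release names, and namespaces below are assumptions; follow the linked install guides for the exact, up-to-date commands):

```bash
# Illustrative only; consult the Envoy Gateway and Envoy AI Gateway install guides
# for the current chart references and versions.
helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version <version> \
  --namespace envoy-gateway-system --create-namespace

helm install aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version <version> \
  --namespace envoy-ai-gateway-system --create-namespace
```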
### 2. Basic AI Gateway Example
To expose your models via Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.
We'll deploy two models, `Qwen/Qwen2-0.5B-Instruct-GGUF` and `Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF`, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.
The full example is [here](./examples/envoy-ai-gateway/basic.yaml); apply it to your cluster.
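
For orientation, a trimmed sketch of the three resources is shown below. The controller name, listener, schema, and backend references are illustrative assumptions; the linked `basic.yaml` is the authoritative version.

```yaml
# Trimmed, illustrative sketch; see examples/envoy-ai-gateway/basic.yaml for the real manifests.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-ai-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: envoy-ai-gateway
spec:
  gatewayClassName: envoy-ai-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: envoy-ai-gateway
spec:
  schema:
    name: OpenAI                      # requests follow the OpenAI API shape
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: envoy-ai-gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model     # route on the requested model name
              value: qwen2-0.5b
      backendRefs:
        - name: qwen2-0.5b            # backend serving the first model
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen2.5-coder
      backendRefs:
        - name: qwen2.5-coder         # backend serving the second model
```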
### 3. Check Envoy AI Gateway APIs
If Open WebUI is enabled, you can chat via the web UI (recommended); see the [documentation](./open-webui.md). Otherwise, follow the steps below to test the Envoy AI Gateway APIs.
I. Port-forward the `LoadBalancer` service in the `llmaz-system` namespace to local port 8080.
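
For example (the generated service name varies per installation, so it is shown as a placeholder below):

```bash
# List services in llmaz-system and find the Envoy-managed LoadBalancer service.
kubectl get svc -n llmaz-system

# Forward local port 8080 to the gateway service; replace <envoy-service>
# (and the target port, if different) with what you found above.
kubectl port-forward -n llmaz-system svc/<envoy-service> 8080:80
```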
II. Query `http://localhost:8080/v1/models | jq .` to list the available models. The expected response looks like this:
```json
{
  "data": [
    {
      "id": "qwen2-0.5b",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    },
    {
      "id": "qwen2.5-coder",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    }
  ],
  "object": "list"
}
```
III. Query `http://localhost:8080/v1/chat/completions` to chat with the model. Here we ask the `qwen2-0.5b` model; the query will look like the sketch below.
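
The request body is not included in this excerpt; as a minimal sketch, an OpenAI-compatible chat completion request could look like this (the prompt is illustrative):

```bash
# Illustrative request against the port-forwarded gateway; adjust the prompt as you like.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-0.5b",
    "messages": [
      {"role": "user", "content": "Hello! Who are you?"}
    ]
  }' | jq .
```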