
Commit f159c34

Update documentation (#449)
* Update readme.md

* Update documentation

Signed-off-by: kerthcet <[email protected]>
1 parent 09332c8 commit f159c34

File tree: 7 files changed, +30 −25 lines

README.md

Lines changed: 3 additions & 4 deletions
@@ -40,13 +40,12 @@ Easy, advanced inference platform for large language models on Kubernetes
  - **Ease of Use**: People can quickly deploy an LLM service with minimal configurations.

  - **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).

- - **Heterogeneous Devices Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Kube-Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
+ - **Heterogeneous Cluster Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.

  - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.

- - **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
+ - **Distributed Inference**: Multi-host & homogeneous xPyD support with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Heterogeneous xPyD will be implemented in the future.

  - **AI Gateway Support**: Offering capabilities like token-based rate limiting and model routing with the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).

+ - **Scaling Efficiency**: Horizontal Pod scaling with [HPA](./docs/examples/hpa/README.md) driven by LLM-based metrics, and node (spot instance) autoscaling with [Karpenter](https://github.com/kubernetes-sigs/karpenter).

  - **Built-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more; see configurations [here](./site/content/en/docs/integrations/open-webui.md).

- - **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.

- - **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.

  ## Quick Start
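As context for the "minimal configurations" claim above, a complete deployment typically boils down to two small resources. This is a rough sketch following llmaz's quick start, not part of this commit; the model name and `modelID` are illustrative assumptions:

```yaml
# Register a model from HuggingFace (model name and ID are examples only).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
# Serve it with the default backend; llmaz handles the model loading.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
```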

site/content/en/_index.md

Lines changed: 10 additions & 11 deletions
@@ -24,36 +24,35 @@ title: llmaz
  People can quickly deploy an LLM service with minimal configurations.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-cogs" title="Broad Backends Support" %}}
+ {{% blocks/feature icon="fas fa-cubes" title="Broad Backends Support" %}}

  llmaz supports a wide range of advanced inference backends for different scenarios, like <a href="https://github.com/vllm-project/vllm">vLLM</a>, <a href="https://github.com/huggingface/text-generation-inference">Text-Generation-Inference</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a>. Find the full list of supported backends <a href="/InftyAI/llmaz/blob/main/docs/support-backends.md">here</a>.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-exchange-alt" title="Accelerator Fungibility" %}}
+ {{% blocks/feature icon="fas fa-random" title="Heterogeneous Cluster Support" %}}

  llmaz supports serving the same LLM with various accelerators to optimize cost and performance.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-warehouse" title="Various Model Providers" %}}
+ {{% blocks/feature icon="fas fa-list-alt" title="Various Model Providers" %}}

  llmaz supports a wide range of model providers, such as <a href="https://huggingface.co/" rel="nofollow">HuggingFace</a>, <a href="https://www.modelscope.cn" rel="nofollow">ModelScope</a>, ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-network-wired" title="Multi-Host Support" %}}
-
- llmaz supports both single-host and multi-host scenarios with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0.
+ {{% blocks/feature icon="fas fa-sitemap" title="Distributed Serving" %}}
+
+ Multi-host & homogeneous xPyD distributed serving support with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0. Heterogeneous xPyD will be implemented in the future.

  {{% /blocks/feature %}}

  {{% blocks/feature icon="fas fa-door-open" title="AI Gateway Support" %}}

  Offering capabilities like token-based rate limiting and model routing with the integration of <a href="https://aigateway.envoyproxy.io/" rel="nofollow">Envoy AI Gateway</a>.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-comments" title="Build-in ChatUI" %}}
-
- Out-of-the-box chatbot support with the integration of <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capacities like function call, RAG, web search and more, see configurations <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.
+ {{% blocks/feature icon="fas fa-expand-arrows-alt" title="Scaling Efficiency" %}}
+
+ Horizontal Pod scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> based on LLM-focused metrics, and node (spot instance) autoscaling with <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a>.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-expand-arrows-alt" title="Scaling Efficiency" %}}
-
- llmaz supports horizontal scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> by default and will integrate with autoscaling components like <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">Cluster-Autoscaler</a> or <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a> for smart scaling across different clouds.
+ {{% blocks/feature icon="fas fa-comments" title="Built-in ChatUI" %}}
+
+ Out-of-the-box chatbot support with the integration of <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capabilities like function calling, RAG, web search and more; see configurations <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.

  {{% /blocks/feature %}}

- {{% blocks/feature icon="fas fa-box-open" title="Efficient Model Distribution (WIP)" %}}
-
- Out-of-the-box model cache system support with <a href="https://github.com/InftyAI/Manta">Manta</a>, still under development right now with architecture reframing.
+ {{% blocks/feature icon="fas fa-ellipsis-h" title="More in the future" %}}

  {{% /blocks/feature %}}

  {{% /blocks/section %}}
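The "Scaling Efficiency" feature above pairs HPA with LLM-focused metrics. As a hedged illustration (not part of this commit), an `autoscaling/v2` HPA keyed to a queue-depth metric might look like the following; the metric name assumes vLLM's `num_requests_waiting` is exposed through a custom-metrics adapter, and the workload names are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0--5b-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen2-0--5b          # hypothetical inference workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_waiting  # assumes an adapter exposes this metric
      target:
        type: AverageValue
        averageValue: "5"      # scale out when queued requests exceed 5 per pod
```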

site/content/en/docs/develop.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
  ---
  title: Develop Guidance
- weight: 3
+ weight: 4
  description: >
    This section contains development guidance for people who want to learn more about this project.
  ---
site/content/en/docs/integrations/support-backends.md

Lines changed: 7 additions & 6 deletions

@@ -1,6 +1,6 @@
  ---
- title: Supported Inference Backends
- weight: 5
+ title: Broad Inference Backends Support
+ weight: 1
  ---

  If you want to integrate more backends into llmaz, please refer to this [PR](https://github.com/InftyAI/llmaz/pull/182). It's always welcome.

@@ -9,6 +9,11 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

  [llama.cpp](https://github.com/ggerganov/llama.cpp) is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

+ ## ollama
+
+ [ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models; it is based on llama.cpp and aims at local deployment.
+
  ## SGLang

  [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.

@@ -21,10 +26,6 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

  [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.

- ## ollama
-
- [ollama](https://github.com/ollama/ollama) is running with Llama 3.2, Mistral, Gemma 2, and other large language models, based on llama.cpp, aims for local deploy.
-
  ## vLLM

  [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
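Since this page now leads with backend breadth, here is a hypothetical sketch of how a specific backend might be chosen per Playground. The `backendRuntimeConfig` and `backendName` field names are assumptions about the llmaz API, not confirmed by this commit; check the API reference before relying on them:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b-sglang     # hypothetical name
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b     # hypothetical model registered elsewhere
  backendRuntimeConfig:        # assumed field shape
    backendName: sglang        # one of the backends listed above
```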
site/content/en/docs/features/distributed-inference.md

Lines changed: 6 additions & 0 deletions

@@ -0,0 +1,6 @@
+ ---
+ title: Distributed Inference
+ weight: 3
+ ---
+
+ Support multi-host & homogeneous xPyD distributed serving with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Heterogeneous xPyD will be implemented in the future.
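For readers unfamiliar with LWS, the multi-host support described here builds on LeaderWorkerSet, where one leader pod plus N worker pods form a single model replica. A minimal sketch of that underlying primitive (illustrative values, not something generated by llmaz in this commit):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multi-host-serving     # hypothetical name
spec:
  replicas: 2                  # two independent model replicas
  leaderWorkerTemplate:
    size: 4                    # each replica spans 4 pods: 1 leader + 3 workers
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: vllm/vllm-openai:latest   # illustrative backend image
          resources:
            limits:
              nvidia.com/gpu: "1"
```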

site/content/en/docs/features/heterogeneous-cluster-support.md

Lines changed: 2 additions & 2 deletions

@@ -1,6 +1,6 @@
  ---
  title: Heterogeneous Cluster Support
- weight: 1
+ weight: 2
  ---

  A `llama2-7B` model can run on a __1xA100__ GPU, but also on a __1xA10__, a __1x4090__, and a variety of other GPU types; that's what we call resource fungibility. In practical scenarios, we may have a heterogeneous cluster with different GPU types, and high-end GPUs are often out of stock, so to meet both the SLOs of the service and the cost targets, we need to schedule workloads across different GPU types. With the [ResourceFungibility](https://github.com/InftyAI/scheduler-plugins/blob/main/pkg/plugins/resource_fungibility) plugin in the InftyAI scheduler, we can achieve this with at most 8 alternative GPU types.

@@ -20,4 +20,4 @@ globalConfig:
    scheduler-name: inftyai-scheduler
  ```

- then run `make helm-upgrade` to install or upgrade llmaz.
+ Run `make helm-upgrade` to install or upgrade llmaz.
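To make the scheduling behavior concrete: the alternative GPU types are expressed as flavors on the model, and the scheduler tries them in order. A rough sketch assuming the `inferenceConfig.flavors` shape from llmaz's model API; field names may differ across versions, and the model/flavor names are illustrative:

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama2-7b
spec:
  familyName: llama2
  source:
    modelHub:
      modelID: meta-llama/Llama-2-7b-hf
  inferenceConfig:
    flavors:                   # assumed field; at most 8 alternatives are honored
    - name: a100               # preferred GPU type
      limits:
        nvidia.com/gpu: 1
    - name: a10                # fallback when A100s are stocked out
      limits:
        nvidia.com/gpu: 1
```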

site/content/en/docs/reference/_index.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
  ---
  title: Reference
- weight: 4
+ weight: 5
  description: >
    This section contains the llmaz reference information.
  menu:
