README.md (3 additions, 4 deletions)
@@ -40,13 +40,12 @@ Easy, advanced inference platform for large language models on Kubernetes
 - **Easy of Use**: People can quickly deploy an LLM service with minimal configuration.
 - **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
-- **Heterogeneous Devices Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Kube-Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
+- **Heterogeneous Cluster Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
 - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
-- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
+- **Distributed Inference**: Multi-host & homogeneous xPyD support with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Heterogeneous xPyD will be implemented in the future.
 - **AI Gateway Support**: Offering capabilities like token-based rate limiting and model routing with the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
+- **Scaling Efficiency**: Horizontal Pod scaling with [HPA](./docs/examples/hpa/README.md) using LLM-based metrics and node (spot instance) autoscaling with [Karpenter](https://github.com/kubernetes-sigs/karpenter).
 - **Build-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more; see configurations [here](./site/content/en/docs/integrations/open-webui.md).
-- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
-- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.
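To make the "Easy of Use" claim above concrete, here is a rough sketch of what a minimal llmaz deployment can look like: an `OpenModel` describing where the weights come from, plus a `Playground` that serves it. The API groups, versions, and field names below are recalled from llmaz's example manifests rather than taken from this diff, so treat them as assumptions and check the repo's examples for the authoritative form.

```yaml
# Hedged sketch of a minimal deployment; kinds and field names are assumptions
# recalled from llmaz's examples and may differ from the version in this PR.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct   # llmaz pulls the weights from HuggingFace automatically
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b                # served with the default backend unless one is configured
```

If the field names hold, `kubectl apply -f` on these two objects is all a basic deployment requires.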
llmaz supports a wide range of advanced inference backends for different scenarios, like <a href="https://github.com/vllm-project/vllm">vLLM</a>, <a href="https://github.com/huggingface/text-generation-inference">Text-Generation-Inference</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a>. Find the full list of supported backends <a href="/InftyAI/llmaz/blob/main/docs/support-backends.md">here</a>.

llmaz supports serving the same LLM with various accelerators to optimize cost and performance.

{{% /blocks/feature %}}

-{{% blocks/feature icon="fas fa-warehouse" title="Various Model Providers" %}}
+{{% blocks/feature icon="fas fa-list-alt" title="Various Model Providers" %}}

llmaz supports a wide range of model providers, such as <a href="https://huggingface.co/" rel="nofollow">HuggingFace</a>, <a href="https://www.modelscope.cn" rel="nofollow">ModelScope</a>, ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.

Multi-host & homogeneous xPyD distributed serving support with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0. Heterogeneous xPyD will be implemented in the future.

Offering capabilities like token-based rate limiting and model routing with the integration of <a href="https://aigateway.envoyproxy.io/" rel="nofollow">Envoy AI Gateway</a>.

Out-of-the-box chatbot support with the integration of <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capabilities like function calling, RAG, web search and more; see configurations <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.

Horizontal Pod scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> based on LLM-focused metrics and node (spot instance) autoscaling with <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a>.

llmaz supports horizontal scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> by default and will integrate with autoscaling components like <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">Cluster-Autoscaler</a> or <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a> for smart scaling across different clouds.

{{% /blocks/feature %}}

-{{% blocks/feature icon="fas fa-box-open" title="Efficient Model Distribution (WIP)" %}}
-Out-of-the-box model cache system support with <a href="https://github.com/InftyAI/Manta">Manta</a>, still under development right now with architecture reframing.
+{{% blocks/feature icon="fas fa-ellipsis-h" title="More in the future" %}}
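The "Scaling Efficiency" text above mentions Horizontal Pod scaling driven by LLM-focused metrics. A minimal sketch with the standard `autoscaling/v2` API is below; the workload name and the metric name are hypothetical placeholders rather than values from llmaz's HPA example, and a metrics adapter such as prometheus-adapter is assumed to expose the backend's queue-depth metric to the custom metrics API.

```yaml
# Sketch: scale an inference Deployment on request queue depth instead of CPU.
# The Deployment name and metric name below are placeholders, not llmaz defaults.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0--5b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen2-0--5b                      # hypothetical serving workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting  # assumed metric name after adapter renaming
        target:
          type: AverageValue
          averageValue: "5"                # scale out when more than 5 requests are queued per replica
```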
If you want to integrate more backends into llmaz, please refer to this [PR](https://github.com/InftyAI/llmaz/pull/182). It's always welcomed.
@@ -9,6 +9,11 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 [llama.cpp](https://github.com/ggerganov/llama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

+## ollama
+
+[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.
+
 ## SGLang

 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
@@ -21,10 +26,6 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

-## ollama
-
-[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.
-
 ## vLLM

 [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
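The page above only describes the backends themselves; a hedged sketch of how one of them is selected for a `Playground` follows. The `backendRuntimeConfig`/`backendName` fields are recalled from llmaz's examples and may be spelled differently in the current API, so treat the whole snippet as an assumption.

```yaml
# Hypothetical: serve an existing model with SGLang instead of the default backend.
# Field names are assumptions; check the Playground API in the llmaz repo.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b-sglang
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    backendName: sglang    # one of the backends listed above, e.g. vllm, sglang, llamacpp, tgi, ollama
```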
Support multi-host & homogeneous xPyD distributed serving with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Heterogeneous xPyD will be implemented in the future.
site/content/en/docs/features/heterogeneous-cluster-support.md (2 additions, 2 deletions)
@@ -1,6 +1,6 @@
 ---
 title: Heterogeneous Cluster Support
-weight: 1
+weight: 2
 ---
A `llama2-7B` model can run on a __1xA100__ GPU, but also on a __1xA10__ GPU, or even a __1x4090__ and a variety of other GPU types; this is what we call resource fungibility. In practical scenarios we may have a heterogeneous cluster with different GPU types, and high-end GPUs are often out of stock, so to meet the service SLOs as well as the cost target we need to schedule workloads across different GPU types. With the [ResourceFungibility](https://github.com/InftyAI/scheduler-plugins/blob/main/pkg/plugins/resource_fungibility) plugin in the InftyAI scheduler, we can achieve this with at most 8 alternative GPU types.
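A hedged sketch of how the resource fungibility described above can be expressed: one model that lists several acceptable GPU flavors so the scheduler can fall back across them. The `inferenceConfig.flavors` structure is recalled from llmaz's example manifests and is an assumption here, not something shown on this page.

```yaml
# Sketch only: one model, several acceptable GPU types (the text above allows up to 8).
# Field names are assumptions based on llmaz's examples.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama2-7b
spec:
  familyName: llama2
  source:
    modelHub:
      modelID: meta-llama/Llama-2-7b-hf
  inferenceConfig:
    flavors:               # candidate GPU types; the InftyAI scheduler picks whichever is schedulable
      - name: a100
        limits:
          nvidia.com/gpu: 1
      - name: a10
        limits:
          nvidia.com/gpu: 1
      - name: rtx-4090
        limits:
          nvidia.com/gpu: 1
```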
@@ -20,4 +20,4 @@ globalConfig:
   scheduler-name: inftyai-scheduler
 ```

-then run `make helm-upgrade` to install or upgrade llmaz.
+Run `make helm-upgrade` to install or upgrade llmaz.
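For reference, the values override implied by this hunk is small. The two keys visible in the diff are kept verbatim below; whether `scheduler-name` nests directly under `globalConfig` in the chart's `values.yaml` or under an intermediate key is not visible in the hunk, so take the nesting as an assumption.

```yaml
# Helm values override for llmaz (nesting assumed from the hunk above)
globalConfig:
  scheduler-name: inftyai-scheduler
```

With that in place, `make helm-upgrade` installs or upgrades llmaz with the InftyAI scheduler enabled.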