Description
This task is part of the OSPP program; see https://summer-ospp.ac.cn/org/prodetail/257c80102?list=org&navpage=org for details.
(1) Background: llmaz (https://github.com/InftyAI/llmaz) is an open-source inference platform for large language models, aimed at providing efficient model inference and serving capabilities. With the rise of cloud computing, Serverless architectures have become an attractive choice for optimizing cost and performance thanks to on-demand resource allocation and automatic elastic scaling. KEDA (Kubernetes Event-Driven Autoscaling) is an event-driven autoscaling framework for Kubernetes that supports a wide range of scaling triggers.
(2) Existing Work: Currently, llmaz relies on manually configured Kubernetes clusters for resource management and deployment, with model inference services supported by statically allocated Pods. KEDA is widely used in the Kubernetes ecosystem, supporting autoscaling based on CPU, memory, or external events (e.g., message queues), but llmaz has not yet integrated KEDA or implemented Serverless capabilities.
What would you like to be added:
(4) Desired Improvements: By integrating KEDA, llmaz can scale out and in automatically based on workload events (e.g., HTTP request volume, queue message count), supporting dynamic adjustment from zero to multiple instances; a minimal trigger sketch is included below. This will optimize resource allocation, reduce waste during idle periods, improve response performance under high load, and simplify operational workflows.
(5) Ultimate Goal: Implement Serverless elastic scaling for llmaz using KEDA, enabling event-driven autoscaling, optimizing resource utilization, and providing dynamic instance management from zero to one. The goal is to build an efficient, cost-effective, and easy-to-maintain Serverless large language model inference service platform.
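
As a rough illustration of the intended integration, below is a minimal sketch of a KEDA ScaledObject expressed as a Python dict. The Deployment name `llmaz-inference`, the Prometheus address, and the `http_requests_total` query are illustrative assumptions rather than existing llmaz resources; the field names follow KEDA's `keda.sh/v1alpha1` ScaledObject CRD and its Prometheus scaler.

```python
# Sketch only: a KEDA ScaledObject (keda.sh/v1alpha1) driving an assumed
# "llmaz-inference" Deployment from a Prometheus HTTP-request-rate metric.
# All names, the Prometheus address, and the query are illustrative placeholders.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llmaz-inference-scaler", "namespace": "llmaz-system"},
    "spec": {
        "scaleTargetRef": {"name": "llmaz-inference"},  # workload to scale
        "minReplicaCount": 0,    # allow scale-to-zero when idle
        "maxReplicaCount": 10,   # cap replicas under load
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": 'sum(rate(http_requests_total{service="llmaz-inference"}[1m]))',
                    "threshold": "50",  # target requests/sec per replica
                },
            }
        ],
    },
}
```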
Why is this needed:
(3) Shortcomings: llmaz's current deployment approach requires pre-allocated, fixed resources that cannot adjust dynamically to the actual workload, leading to low resource utilization when idle and increased response latency under high load. In addition, manually managing Pod scale-out and scale-in increases operational complexity, and the lack of zero-to-one elasticity prevents a true Serverless architecture.
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.
- Integrate KEDA with llmaz to enable event-driven autoscaling functionality.
- Develop lightweight KEDA trigger configurations for HTTP request and queue load scaling (see the queue-trigger sketch after this list).
- Provide zero-to-one Serverless instance management for llmaz.
- Automatically optimize llmaz resource allocation and reclamation based on KEDA.
- Write llmaz Serverless deployment documentation and produce performance test reports.
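
To make the queue-load requirement concrete, here is a hedged sketch of how such a trigger might be created programmatically with the official Kubernetes Python client. The RabbitMQ connection string, queue name, and the `llmaz-inference` Deployment are assumptions for illustration and are not part of the current llmaz codebase.

```python
# Sketch only: create a queue-driven KEDA ScaledObject for an assumed
# "llmaz-inference" Deployment using the official Kubernetes Python client.
# The RabbitMQ connection string and queue name are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod

queue_scaler = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llmaz-inference-queue-scaler", "namespace": "llmaz-system"},
    "spec": {
        "scaleTargetRef": {"name": "llmaz-inference"},
        "minReplicaCount": 0,   # scale to zero when the queue is empty
        "maxReplicaCount": 10,
        "triggers": [
            {
                "type": "rabbitmq",  # KEDA's RabbitMQ scaler
                "metadata": {
                    "host": "amqp://user:password@rabbitmq.default:5672/",
                    "queueName": "llmaz-inference-requests",
                    "mode": "QueueLength",
                    "value": "20",   # target pending messages per replica
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="llmaz-system",
    plural="scaledobjects",
    body=queue_scaler,
)
```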
@pacoxu will be the mentor of this task.