Commit 77a04d1

Merge pull request #230 from kerthcet/release/0.0.9
Release/0.0.9
2 parents 6bf48cf + 598c196 commit 77a04d1

File tree: 13 files changed, +1254 -435 lines

.github/ISSUE_TEMPLATE/new-release.md

Lines changed: 1 addition & 1 deletion
@@ -15,8 +15,8 @@ Please do not remove items from the checklist
 - [ ] Prepare the image and files
 - [ ] Run `PLATFORMS=linux/amd64 make image-push GIT_TAG=$VERSION` to build and push an image.
 - [ ] Run `make artifacts GIT_TAG=$VERSION` to generate the artifact.
-- [ ] Run `make helm-package` to package the helm chart and update the index.yaml.
 - [ ] Update `chart/Chart.yaml` and `docs/installation.md`, the helm version is different with the app version.
+- [ ] Run `make helm-package` to package the helm chart and update the index.yaml.
 - [ ] Submit a PR and merge it.
 - [ ] An OWNER [prepares a draft release](https://github.com/inftyai/llmaz/releases)
 - [ ] Create a new tag

chart/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -13,9 +13,9 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.0.4
+version: 0.0.5
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
 # It is recommended to use it with quotes.
-appVersion: 0.0.8
+appVersion: 0.0.9
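
After this bump, the version stanza of chart/Chart.yaml reads as sketched below (comments elided); note the chart version (0.0.5) is deliberately decoupled from the app version (0.0.9), matching the checklist reminder above:

# chart/Chart.yaml (excerpt)
version: 0.0.5     # helm chart version, bumped independently
appVersion: 0.0.9  # version of llmaz being deployed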

chart/crds/openmodel-crd.yaml

Lines changed: 7 additions & 15 deletions
@@ -95,28 +95,20 @@ spec:
               pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
             description: |-
-              Requests defines the required accelerators to serve the model, like nvidia.com/gpu: 8.
-              When GPU number is greater than 8, like 32, then multi-host inference is enabled and
-              32/8=4 hosts will be grouped as an unit, each host will have a resource request as
-              nvidia.com/gpu: 8. The may change in the future if the GPU number limit is broken.
-              Not recommended to set the cpu and memory usage here.
-              If using playground, you can define the cpu/mem usage at backendConfig.
-              If using service, you can define the cpu/mem at the container resources.
-              Note: if you define the same accelerator requests at playground/service as well,
+              Requests defines the required accelerators to serve the model for each replica,
+              like <nvidia.com/gpu: 8>. For multi-hosts cases, the requests here indicates
+              the resource requirements for each replica. This may change in the future.
+              Not recommended to set the cpu and memory usage here:
+              - if using playground, you can define the cpu/mem usage at backendConfig.
+              - if using inference service, you can define the cpu/mem at the container resources.
+              However, if you define the same accelerator requests at playground/service as well,
               the requests here will be covered.
             type: object
           required:
           - name
           type: object
         maxItems: 8
       type: array
-      preheat:
-        default: false
-        description: |-
-          Preheat represents whether we should preload the model, by default will use Manta(https://github.com/InftyAI/Manta)
-          to preload the model, so you should enable the Manta in prior.
-          Note: right now, we only support preloading models from Huggingface.
-        type: boolean
       source:
         description: |-
           Source represents the source of the model, there're several ways to load
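
Under the new wording, each flavor's requests are read per replica. A minimal hypothetical fragment is sketched below; only the name and requests fields and the 8-entry cap are defined by the hunk above, while the surrounding list field and the flavor names are illustrative:

# Hypothetical flavor list fragment (the list's field name is not shown in this hunk)
flavors:
- name: a100              # `name` is required by the CRD
  requests:
    nvidia.com/gpu: 8     # per-replica accelerator request
- name: h100
  requests:
    nvidia.com/gpu: 4
# The CRD caps this list at 8 entries (maxItems: 8).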

chart/crds/playground-crd.yaml

Lines changed: 32 additions & 3 deletions
@@ -45,13 +45,21 @@ spec:
               BackendRuntimeConfig represents the inference backendRuntime configuration
               under the hood, e.g. vLLM, which is the default backendRuntime.
             properties:
-              args:
+              argFlags:
                 description: |-
-                  Args represents the arguments appended to the backend.
-                  You can add new args or overwrite the default args.
+                  ArgFlags represents the argument flags appended to the backend.
+                  You can add new flags or overwrite the default flags.
                 items:
                   type: string
                 type: array
+              argName:
+                description: |-
+                  ArgName represents the argument name set in the backendRuntimeArg.
+                  If not set, will be derived by the model role, e.g. if one model's role
+                  is <draft>, the argName will be set to <speculative-decoding>. Better to
+                  set the argName explicitly.
+                  By default, the argName will be treated as <default> in runtime.
+                type: string
               envs:
                 description: Envs represents the environments set to the container.
                 items:
@@ -214,6 +222,27 @@ spec:
               from the default version.
             type: string
           type: object
+          elasticConfig:
+            description: |-
+              ElasticConfig defines the configuration for elastic usage,
+              e.g. the max/min replicas. Default to 0 ~ Inf+.
+              This requires to install the HPA first or will not work.
+            properties:
+              maxReplicas:
+                description: |-
+                  MaxReplicas indicates the maximum number of inference workloads based on the traffic.
+                  Default to nil means there's no limit for the instance number.
+                format: int32
+                type: integer
+              minReplicas:
+                default: 1
+                description: |-
+                  MinReplicas indicates the minimum number of inference workloads based on the traffic.
+                  Default to nil means we can scale down the instances to 1.
+                  If minReplicas set to 0, it requires to install serverless component at first.
+                format: int32
+                type: integer
+            type: object
           modelClaim:
             description: |-
               ModelClaim represents claiming for one model, it's a simplified use case
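
Putting the two hunks together, a Playground using the renamed argFlags plus the newly relocated elasticConfig might look like the sketch below. The apiVersion, names, flag value, and the modelName field under modelClaim are assumptions for illustration; only argFlags, argName, elasticConfig, minReplicas, and maxReplicas come from the CRD above:

apiVersion: inference.llmaz.io/v1alpha1   # assumed group/version
kind: Playground
metadata:
  name: qwen2-demo                        # illustrative name
spec:
  modelClaim:
    modelName: qwen2-0.5b                 # assumed field; the hunk only shows modelClaim itself
  backendRuntimeConfig:
    argName: default                      # per the CRD, unset argName is treated as <default>
    argFlags:                             # renamed from `args` in this release
    - --max-model-len=8192                # illustrative vLLM flag
  elasticConfig:                          # moved here from the Service CRD (see next file)
    minReplicas: 1                        # CRD default
    maxReplicas: 3                        # nil would mean no upper bound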

chart/crds/service-crd.yaml

Lines changed: 0 additions & 21 deletions
@@ -43,27 +43,6 @@ spec:
               Service controller will maintain multi-flavor of workloads with
               different accelerators for cost or performance considerations.
             properties:
-              elasticConfig:
-                description: |-
-                  ElasticConfig defines the configuration for elastic usage,
-                  e.g. the max/min replicas. Default to 0 ~ Inf+.
-                  This requires to install the HPA first or will not work.
-                properties:
-                  maxReplicas:
-                    description: |-
-                      MaxReplicas indicates the maximum number of inference workloads based on the traffic.
-                      Default to nil means there's no limit for the instance number.
-                    format: int32
-                    type: integer
-                  minReplicas:
-                    default: 1
-                    description: |-
-                      MinReplicas indicates the minimum number of inference workloads based on the traffic.
-                      Default to nil means we can scale down the instances to 1.
-                      If minReplicas set to 0, it requires to install serverless component at first.
-                    format: int32
-                    type: integer
-                type: object
               modelClaims:
                 description: ModelClaims represents multiple claims for different
                   models.
