What's Changed
- [Core] Do not initialize conda for users if using docker image by @SeungjinYang in #5303
- [AWS] Use consistent list for instance termination logic by @SeungjinYang in #5316
- Fix: validate dag should not block uvicorn by @aylei in #5328
- [UX] Rename file to match import path by @DanielZhangQD in #5349
- update installation docs and Dockerfile to reflect omegaconf by @cg505 in #5356
- add typechecking for boto clients/resources by @cg505 in #5319
- [GCP] add helptext about API srv missing `gcloud` when installed using wget by @SeungjinYang in #5335
- support `all_users` in wait for job status in back compact tests by @zpoint in #5346
- fixes from #5335 by @SeungjinYang in #5361
- [k8s] Hints for querying stale kube current context by @kyuds in #5273
- [Docs] Remove meetup announcement banner by @romilbhardwaj in #5364
- API version schema check for release pipeline by @zpoint in #5309
- [GCP] Support hyperdisk-balanced for a3 series by @JiangJiaWei1103 in #5351
- [k8s] Better error message for stale jobs controller by @kyuds in #5274
- [GH] Fix release action by @Michaelvll in #5373
- [RunPod] Use zone to provision in a specific data center ID by @Kovbo in #5166
- [UX] remove references to SKYPILOT_GLOBAL_CONFIG envvar in docs by @SeungjinYang in #5374
- [aws] fix logic to detect rule with all traffic from security group by @SeungjinYang in #5332
- [UX] Remove credentials from the dashboard URL and update the dashboard build hint by @DanielZhangQD in #5363
- supress warning by setting default value of `asyncio_default_fixture_loop_scope` by @zpoint in #5348
- [Docs] Fix migration guide version by @romilbhardwaj in #5369
- update AWS credential setup docs, consolidate cloud auth docs by @cg505 in #5122
- [k8s] Force terminate misbehaving pods by @romilbhardwaj in #5370
- Fix broken test `test_gcp_disk_tier` by @zpoint in #5393
- Remove `pytest.ini` to remove test warning by @zpoint in #5379
- Refine API server deployment doc by @aylei in #5295
- [Docs] Add API server tuning guide by @aylei in #5176
- Introduce High Availability Service Controller by @andylizf in #4564
- [API server] Fix worker number for non-local low resource env by @aylei in #5409
- [k8s] Fix IPv6 SSH by @kyuds in #5413
- Fix terminating k8s cluster by @aylei in #5412
- [UX] `api info`: display dashboard on last line by @concretevitamin in #5417
- [UX] Minor fix. by @concretevitamin in #5420
- [config] remove omegaconf as dependency by @SeungjinYang in #5375
- [k8s] idea: allow an accelerator to map to multiple label values by @SeungjinYang in #5343
- [Nebius] Don't cache session across multiple requests by @SalikovAlex in #5347
- [Nebius] Add Docker support for Nebius cloud by @SalikovAlex in #5334
- Qwen3 235b example by @Michaelvll in #5425
- [UX][k8s] show-gpus for all allowed contexts by @kyuds in #5362
- [API server] make server config conherent by @aylei in #5414
- [Catalog] use v7 for latest runpod by @Michaelvll in #5422
- [UX] Update dashboard favicon with transparent background by @DanielZhangQD in #5426
- [Core][RunPod] Show error for RunPod multi-node by @kyuds in #5368
- Support SDK backward compatibility test by @zpoint in #5398
- Add helm support for RunPod credentials by @funkypenguin in #5214
- [aws] script to get default security group name for aws by @SeungjinYang in #5427
- Update pypi description by @Michaelvll in #5444
- avoid using removed LEGACY_SINGLETON_REGION constant by @cg505 in #5441
- [Docs] Fix DWS/Kueue title and URL by @Michaelvll in #5443
- release pipeline trigger filter based on name by @zpoint in #5367
- [Doc] Add runpod credentials setup for API server by @aylei in #5433
- Fix flaky of `test_multi_echo` -- change sshd config to support large number of jobs by @zpoint in #5323
- Support launch controller and jobs on different cloud for smoke test by @zpoint in #5435
- (Helm chart) Add configurable ingress host by @turtlebasket in #5452
- [Example] AWS EFA Example by @KeplerC in #5318
- [runpod] preserve docker configured environment variables by @SeungjinYang in #5451
- [Docs] Clarify Nebius credential setup by @Michaelvll in #5298
- [k8s] gpu bin packing via affinity by @SeungjinYang in #5423
- [docs] leave in instructions to deal with omegaconf until next stable release by @SeungjinYang in #5460
- [k8s] CPU only jobs to prefer nodes without GPUs by @SeungjinYang in #5357
- Fix failure of `test_kubernetes_context_failover` by @zpoint in #5455
- Fix flaky of `test_cancel_launch_and_exec_async` by @zpoint in #5456
- [GCP] Remap series-specific disk types by @JiangJiaWei1103 in #5457
- add task envs to event_callback by @ggilley in #5474
- [Nebius] Conditionally mount AWS credential files for Nebius profile by @SalikovAlex in #5464
- Remove upper limit on urllib3 version by @vnavkal in #5469
- fix controller cluster name breaking by @cg505 in #5482
- [UX][k8s] backwards compatibility for k8s show-gpus by @kyuds in #5488
- [UX] Make `sky check` parallel by @kyuds in #5483
- reload AWS_SESSION_TOKEN and KUBECONFIG on local API server by @cg505 in #5478
- [jobs/serve] validate controller name before updating value by @cg505 in #5486
- [Nebius] Add support config file and remove hardcode by @SalikovAlex in #5463
- [k8s] do not consider nodes with exact cpu/mem requirements by @SeungjinYang in #5481
- [docs] snippet on multi node jobs in k8s by @SeungjinYang in #5495
- [k8s] fix helm chart deployment of API server by @SeungjinYang in #5507
- Release pipeline refactor - automated release by @zpoint in #5470
- Use more specific header name by @colinjc in #5515
- remove API version bump, add bw compatibility code by @SeungjinYang in #5522
- chore: minor fix to api server documentation by @SeungjinYang in #5512
- Reload AWS default profile for local API server by @aylei in #5511
- [Examples] Llama 3.1 lora finetuning torch version pin by @romilbhardwaj in #5531
- [Docker] Add private docker registry by @Michaelvll in #5526
- Add optional version parameter to docker build pipeline to prevent version mismatch by @zpoint in #5525
- [GCP] Correctly delete cpu mig instance by @Michaelvll in #5524
- [Docs] move `ordered` to a section by @zpoint in #5540
- [Docs] Add a few tutorial links. by @concretevitamin in #5546
- [Docs] Minor fix to the container tabs by @Michaelvll in #5551
- [k8s] New UX for `show-gpus`, and add back the 0 GPU nodes by @Michaelvll in #5490
- [Docs] Fix quote for URL by @Michaelvll in #5516
- [LLM] Update gemma example by @Michaelvll in #5071
- [Dashboard] Add vscode connection snippet by @romilbhardwaj in #5552
- [Nebius] Add support for internal IP usage in Nebius configurations by @SalikovAlex in #5513
- [docs] add docs for setting up TCP IAP tunneling by @cg505 in #5549
- Optimize the throughput of websocket proxy by @aylei in #5539
- [API server] handle logs request in coroutine by @aylei in #5366
- [Doc] Add link to local kind deployment doc by @DanielZhangQD in #5322
- [Doc] Add helm values spec reference by @aylei in #5415
- [Core] Support gcp gpu direct tcpx by @DanielZhangQD in #5553
- [docker] update docker login creds when relaunching cluster by @cg505 in #5559
- [Nebius] add ~/.ssh/sky-cluster-key by @SalikovAlex in #5564
- [GCP] Add H200 to catalog by @Michaelvll in #5101
- Revert "[API server] handle logs request in coroutine" by @aylei in #5580
- [Doc] Update docs for efa by @DanielZhangQD in #5471
- Add helm release in release pipeline by @aylei in #5568
- [optimizer] chore: only check clouds again if there is a cloud to check by @SeungjinYang in #5541
- [API server] handle logs request in coroutine by @aylei in #5582
- remove mypy exclusion by @zpoint in #5528
- [UX] Update dashboard routes and verify test package before releasing by @DanielZhangQD in #5394
- [Core] Add backward compatibility for instance naming in k8s by @DanielZhangQD in #5581
- allow specifying autostop in resources by @cg505 in #5577
- Bump image tag in helm default values by @aylei in #5584
- [Build] Fix check script in build actions by @DanielZhangQD in #5594
- Fix `test_gcp_disk_tier` by @zpoint in #5592
- Release branch name fix by @zpoint in #5595
- Release 0.9.3 by @github-actions in #5598
New Contributors
- @turtlebasket made their first contribution in #5452
- @ggilley made their first contribution in #5474
- @vnavkal made their first contribution in #5469
Full Changelog: v0.9.2...v0.9.3