Description
If you are running a task with the Celery executor and there aren't enough worker slots free to pick it up, the task stays in the queued state until a worker frees up. But if the task launches a pod via KubernetesPodOperator and the cluster doesn't have enough capacity to schedule the pod, then Kubernetes returns an error and the task fails.
Use case/motivation
Ideally the task would remain in the queued state until there are enough Kubernetes resources to accommodate it, but that feels like a massive change.
So instead I'd propose that the task catch this type of Kubernetes exception, go into deferred mode for a configurable interval, and then retry until the pod is created. In this scenario the time spent deferred would count against the task timeout, whereas time spent queued in Airflow doesn't, but I'd argue that's still better than a task failure.
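To make the proposed behavior concrete, here is a minimal, self-contained sketch of the defer-and-retry loop. All names here (`InsufficientCapacityError`, `run_with_deferral`, `create_pod`) are hypothetical stand-ins, not real Airflow or Kubernetes APIs; in a real implementation the retry would go through Airflow's deferral machinery rather than a loop.

```python
class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for Kubernetes rejecting a pod for lack of resources."""


def run_with_deferral(create_pod, defer_interval, task_timeout):
    """Retry `create_pod` until it succeeds, "deferring" between attempts.

    Returns (result, elapsed). As proposed above, the simulated time spent
    deferred counts against task_timeout, and exceeding it fails the task.
    """
    elapsed = 0
    while True:
        try:
            return create_pod(), elapsed
        except InsufficientCapacityError:
            elapsed += defer_interval  # time in deferred mode accrues
            if elapsed > task_timeout:
                raise TimeoutError("task timed out while deferred")
```

With a `create_pod` that fails twice before the cluster has room, the task succeeds on the third attempt with 2 × `defer_interval` charged against its timeout, instead of failing outright on the first capacity error.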
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct