
KubernetesPodOperator failures when cluster over utilized #51789

Open
@johnhoran

Description


If you are running a task with the Celery executor and there aren't enough worker slots free to pick it up, the task stays in a queued state until a worker frees up and can pick it up. But if the task uses the KubernetesPodOperator and the cluster doesn't have enough space to accommodate the pod, Kubernetes returns an error and the task fails.

Use case/motivation

Ideally the task would remain in a queued state until there are enough Kubernetes resources to accommodate it, but that feels like a massive change.
So instead I'd propose that the task catch this type of Kubernetes exception and go into deferred mode for a configurable amount of time, retrying until the pod gets created. In this scenario the time spent in deferred mode would count against the task timeout, while time spent queued in Airflow doesn't, but I'd argue that is still better than task failure.
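The catch-and-defer loop I have in mind could be sketched roughly like this. Everything here is illustrative: `CapacityError`, `launch_with_defer`, and `try_create_pod` are hypothetical names, not real Airflow or Kubernetes-provider APIs.

```python
# Hypothetical sketch of the proposed behaviour: when pod creation fails
# because the cluster is over-utilized, defer and retry instead of failing
# the task outright. None of these names are actual Airflow APIs.

class CapacityError(Exception):
    """Stands in for the Kubernetes error returned when the cluster
    cannot accommodate the pod (e.g. quota exceeded / FailedScheduling)."""

def launch_with_defer(try_create_pod, defer, max_attempts=5):
    """Try to create the pod; on a capacity error, call `defer` (which in
    Airflow terms would release the worker slot until a trigger fires)
    and retry, up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return try_create_pod()
        except CapacityError:
            if attempt == max_attempts:
                raise  # give up: the cluster never freed capacity in time
            defer()  # wait out the configurable interval, then resume
```

In a real implementation this logic would presumably live in the operator's `execute` path, with `defer()` replaced by the operator deferring on a time-based trigger, and the capacity check inspecting the actual exception raised by the Kubernetes API client.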

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata


Assignees

No one assigned

    Labels

    kind:feature (Feature Requests), needs-triage (label for new issues that we didn't triage yet), provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)
