
KubernetesPodOperator failures when cluster over utilized #51789

Open
@johnhoran

Description


If you are running a task with the Celery executor and there aren't enough worker slots free to pick it up, the task stays in a queued state until a worker frees up and can pick it up. But if the task uses the KubernetesPodOperator and the cluster doesn't have enough space to accommodate the pod, Kubernetes returns an error and the task fails.

Use case/motivation

Ideally the task would remain in a queued state until there are enough Kubernetes resources to accommodate it, but that feels like a massive change.
So instead I'd propose that the task catch this type of Kubernetes exception and go into deferred mode for a configurable amount of time, retrying until the pod gets created. In this scenario the time spent in deferred mode would count against the task timeout, while time spent queued in Airflow doesn't, but I'd argue that is still better than task failure.
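The catch-and-defer loop I have in mind could be sketched roughly like this. Everything here is illustrative: `CapacityError`, `launch_with_defer`, and `try_create_pod` are hypothetical names, not real Airflow or Kubernetes-provider APIs.

```python
# Hypothetical sketch of the proposed behaviour: when pod creation fails
# because the cluster is over-utilized, defer and retry instead of failing
# the task outright. None of these names are actual Airflow APIs.

class CapacityError(Exception):
    """Stands in for the Kubernetes error returned when the cluster
    cannot accommodate the pod (e.g. quota exceeded / FailedScheduling)."""

def launch_with_defer(try_create_pod, defer, max_attempts=5):
    """Try to create the pod; on a capacity error, call `defer` (which in
    Airflow terms would release the worker slot until a trigger fires)
    and retry, up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return try_create_pod()
        except CapacityError:
            if attempt == max_attempts:
                raise  # give up: the cluster never freed capacity in time
            defer()  # wait out the configurable interval, then resume
```

In a real implementation this logic would presumably live in the operator's `execute` path, with `defer()` replaced by the operator deferring on a time-based trigger, and the capacity check inspecting the actual exception raised by the Kubernetes API client.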

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata


Assignees

No one assigned

    Labels

    kind:feature (Feature Requests), needs-triage (label for new issues that we didn't triage yet), provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)
