Namespace deletion stuck if contains CRs that are watched by the operator #1876

Open
@gyfora

Bug Report

This is likely not a JOSDK bug but based on offline discussion with @csviri I am opening it here to track it.

In our current setup the operator is deployed in namespace x and is watching namespace y. The access to namespace y is controlled by roles and rolebindings (created in namespace y).

If there are CRs present in y and the namespace is deleted before the CRs are individually deleted, we get the following exception during cleanup:

ERROR][flink/basic-example] Error during event processing ExecutionScope{ resource id: ResourceID{name='basic-example', namespace='flink'}, version: 1791281} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
    at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:546)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:566)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:369)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:712)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:172)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:177)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:88)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:39)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher$CustomResourceFacade.updateResource(ReconciliationDispatcher.java:387)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.conflictRetryingUpdate(ReconciliationDispatcher.java:343)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleCleanup(ReconciliationDispatcher.java:297)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:87)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:701)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:681)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:628)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:591)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$5(StandardHttpClient.java:120)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:135)
    ... 3 more

Furthermore, the namespace deletion gets stuck because the finalizer on the CR is never removed. The root problem seems to be that when the namespace deletion is initiated the role and rolebinding is immediately deleted, so the operator can no longer remove the finalizer from the resource.
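
To make the ordering problem concrete, here is a minimal toy model in plain Java. It is not Kubernetes or JOSDK code and only mirrors the behaviour described above: objects without a finalizer disappear as soon as namespace deletion starts, while the CR waits for an operator that has just lost its access. The class name, the Role and RoleBinding names, and the simplified permission check are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the deletion order described above; not real Kubernetes or JOSDK code. */
class StuckNamespaceModel {

    /** A namespaced object; a finalizer blocks its removal until someone clears it. */
    static final class Resource {
        final String kind;
        final String name;
        final boolean hasFinalizer;

        Resource(String kind, String name, boolean hasFinalizer) {
            this.kind = kind;
            this.name = name;
            this.hasFinalizer = hasFinalizer;
        }
    }

    /** Contents of namespace "flink"; the Role and RoleBinding names are made up. */
    static List<Resource> exampleNamespace() {
        return List.of(
                new Resource("Role", "flink-operator-role", false),
                new Resource("RoleBinding", "flink-operator-binding", false),
                new Resource("FlinkDeployment", "basic-example", true));
    }

    /** Returns what is still present right after namespace deletion is requested. */
    static List<Resource> requestNamespaceDeletion(List<Resource> contents) {
        List<Resource> remaining = new ArrayList<>();
        for (Resource r : contents) {
            // Objects without a finalizer go away immediately, which includes the
            // Role and RoleBinding that grant the operator access.
            if (r.hasFinalizer) {
                remaining.add(r);
            }
        }
        return remaining;
    }

    /** The operator can update CRs only while both the Role and the RoleBinding exist. */
    static boolean operatorCanUpdate(List<Resource> remaining) {
        boolean role = remaining.stream().anyMatch(r -> r.kind.equals("Role"));
        boolean binding = remaining.stream().anyMatch(r -> r.kind.equals("RoleBinding"));
        return role && binding;
    }

    public static void main(String[] args) {
        List<Resource> remaining = requestNamespaceDeletion(exampleNamespace());
        for (Resource r : remaining) {
            System.out.println("still present: " + r.kind + "/" + r.name);
        }
        // false here means the finalizer can never be removed, so deletion hangs.
        System.out.println("operator can update: " + operatorCanUpdate(remaining));
    }
}
```

In this model the only object left is the FlinkDeployment, and the operator is no longer allowed to update it, which matches the Forbidden error on the PUT in the stack trace.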

Environment

Kubernetes cluster type: kind

JOSDK version: 4.3.0

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-20T03:36:50Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/arm64"}

Activity

self-assigned this
on Apr 27, 2023
csviri

csviri commented on Apr 27, 2023

@csviri
Collaborator

Yep, this is more of a generic Kubernetes issue, but I will clarify how to handle it here, since we have a closely related feature (dynamic changes of watched namespaces).

rmetzger

rmetzger commented on May 5, 2023

@rmetzger

when the namespace deletion is initiated the role and rolebinding is immediately deleted

Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?

csviri

csviri commented on May 5, 2023

@csviri
Collaborator

when the namespace deletion is initiated the role and rolebinding is immediately deleted

Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?

Yes, this sounds like a good idea.

This was also suggested here:
https://kubernetes.slack.com/archives/CAW0GV7A5/p1682603213236239

But I will also create an issue in Kubernetes to see whether it can eventually be solved at the GC controller level.

csviri

csviri commented on May 5, 2023

@csviri
Collaborator

What JOSDK could do is provide reconcilers (one for role and one for rolebinding) that handle adding and removing the finalizers, and it would be up to the dev to register them, since this also has implications for the permissions of the operator (update permission on role).
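
As a rough illustration of what such a reconciler would have to decide, the following plain-Java sketch models only the decision logic and deliberately leaves out the JOSDK and fabric8 wiring. Everything in it is hypothetical: the finalizer name, the `Decision` enum, and the assumption that the reconciler can count how many watched CRs are still left in the namespace.

```java
import java.util.List;

/** Plain-Java sketch of the finalizer decision; the JOSDK wiring is left out on purpose. */
class RbacFinalizerSketch {

    /** Hypothetical finalizer name, chosen only for this example. */
    static final String FINALIZER = "example.com/rbac-protection";

    enum Decision { ADD_FINALIZER, KEEP_FINALIZER, REMOVE_FINALIZER, NOTHING_TO_DO }

    /**
     * Decides what to do with a Role or RoleBinding.
     *
     * @param markedForDeletion whether the object already has a deletion timestamp
     * @param finalizers        finalizers currently set on the object
     * @param watchedCrsLeft    number of watched CRs still present in the namespace
     */
    static Decision decide(boolean markedForDeletion, List<String> finalizers, int watchedCrsLeft) {
        boolean hasFinalizer = finalizers.contains(FINALIZER);
        if (!markedForDeletion) {
            // Protect the object up front, before any namespace deletion starts.
            return hasFinalizer ? Decision.NOTHING_TO_DO : Decision.ADD_FINALIZER;
        }
        if (!hasFinalizer) {
            // Without our finalizer there is nothing left for us to hold back.
            return Decision.NOTHING_TO_DO;
        }
        // Hold the Role or RoleBinding until the operator has cleaned up every CR.
        return watchedCrsLeft > 0 ? Decision.KEEP_FINALIZER : Decision.REMOVE_FINALIZER;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, List.of(), 2));
        System.out.println(decide(true, List.of(FINALIZER), 1));
        System.out.println(decide(true, List.of(FINALIZER), 0));
    }
}
```

Acting on `ADD_FINALIZER` or `REMOVE_FINALIZER` means updating the Role or RoleBinding itself, which is why the operator would need update permission on those resources.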

added this to the 4.4 milestone on May 9, 2023
modified the milestones: 4.4, 5.0 on Jun 27, 2023
moayad-alyaghshi

moayad-alyaghshi commented on Aug 17, 2023

@moayad-alyaghshi

Hi @csviri

We are facing the same issue of namespace deletion getting stuck, even when the operator is deployed in the same namespace as the CRs, which I did not expect based on what I understood from the Slack thread. I would appreciate any explanation.

Note: The operator has a ClusterRole and ClusterRoleBinding to work with the CRs. We're using Quarkus with quarkus-operator-sdk.

modified the milestones: 5.0, 4.5 on Aug 17, 2023
csviri

csviri commented on Aug 17, 2023

@csviri
Collaborator

Hi @moayad-alyaghshi ,

I checked this briefly in the namespace controller and the garbage collector controller when @gyfora reported it, and it seems (well, as far as I was able to see) that nothing special in K8S prevents this from happening even in the same namespace.

So this is not an issue with JOSDK but rather an issue with K8S. What we can offer is a reconciler that solves this; it just was not a priority until now, so I scheduled it for 4.5.

Maybe it is worth asking again around this on k8s slack: https://kubernetes.slack.com/archives/CAW0GV7A5

removed this from the 4.5 milestone on Oct 3, 2023

23 remaining items

    Participants

    @rmetzger@csviri@gyfora@jessebye@moayad-alyaghshi

      Namespace deletion stuck if contains CRs that are watched by the operator · Issue #1876 · operator-framework/java-operator-sdk