Description
Bug Report
This is likely not a JOSDK bug but based on offline discussion with @csviri I am opening it here to track it.
In our current setup the operator is deployed in namespace x and is watching namespace y. Access to namespace y is controlled by roles and rolebindings (created in namespace y).
If there are CRs present in y and the namespace is deleted before the CRs are individually deleted, we get the following exception during cleanup:
ERROR][flink/basic-example] Error during event processing ExecutionScope{ resource id: ResourceID{name='basic-example', namespace='flink'}, version: 1791281} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:546)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:566)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:369)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:712)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:172)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:177)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:88)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:39)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher$CustomResourceFacade.updateResource(ReconciliationDispatcher.java:387)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.conflictRetryingUpdate(ReconciliationDispatcher.java:343)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleCleanup(ReconciliationDispatcher.java:297)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:87)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:701)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:681)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:628)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:591)
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$5(StandardHttpClient.java:120)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:135)
... 3 more
Furthermore, the namespace deletion gets stuck because the finalizer on the CR is never removed. The root problem seems to be that when the namespace deletion is initiated, the role and rolebinding are deleted immediately, so the operator can no longer remove the finalizer from the resource.
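To make the suspected sequence concrete, here is a minimal toy model in plain Java. It is only a sketch of the hypothesis above, not Kubernetes or JOSDK code: the `Resource` class, the `deleteNamespace` method and the finalizer name `hypothetical.example/cleanup` are all made up for illustration, and it assumes that objects without finalizers vanish as soon as namespace deletion starts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Kubernetes or JOSDK APIs.
class NamespaceDeletionSketch {

    /** A namespaced object, identified by its kind, with a set of finalizers. */
    static final class Resource {
        final String kind;
        final Set<String> finalizers;

        Resource(String kind, String... finalizers) {
            this.kind = kind;
            this.finalizers = new HashSet<>(Arrays.asList(finalizers));
        }
    }

    /** Simulates namespace deletion and returns the kinds that are left behind. */
    static List<String> deleteNamespace(List<Resource> resources) {
        List<Resource> remaining = new ArrayList<>(resources);

        // Assumption: objects without finalizers disappear as soon as deletion starts.
        remaining.removeIf(r -> r.finalizers.isEmpty());

        // The operator can update the CR only while its RoleBinding still exists.
        boolean operatorHasAccess =
            remaining.stream().anyMatch(r -> r.kind.equals("RoleBinding"));
        if (operatorHasAccess) {
            // Cleanup succeeds: the operator removes its finalizer from the CR.
            remaining.stream()
                .filter(r -> r.kind.equals("FlinkDeployment"))
                .forEach(r -> r.finalizers.clear());
            remaining.removeIf(r -> r.finalizers.isEmpty());
        }

        List<String> kinds = new ArrayList<>();
        for (Resource r : remaining) {
            kinds.add(r.kind);
        }
        return kinds;
    }

    public static void main(String[] args) {
        List<Resource> namespace = List.of(
            new Resource("RoleBinding"), // no finalizer, so it is gone first
            new Resource("FlinkDeployment", "hypothetical.example/cleanup"));
        System.out.println(deleteNamespace(namespace)); // prints [FlinkDeployment]
    }
}
```

In this model the CR is the only object left, and nothing remains that could let the operator remove its finalizer, which matches the stuck namespace we observe.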
Environment
Kubernetes cluster type: kind
JOSDK version: 4.3.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-20T03:36:50Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/arm64"}
Activity
csviri commented on Apr 27, 2023
Yep, this is more of a generic Kubernetes issue, but we will clarify how to handle it here, since we have a closely related feature (dynamic changes of watched namespaces).
rmetzger commented on May 5, 2023
Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?
csviri commented on May 5, 2023
yes, this sounds like a good idea.
This was suggested also here:
https://kubernetes.slack.com/archives/CAW0GV7A5/p1682603213236239
But I will create an issue in Kubernetes to see if it can eventually be solved at the GC controller level.
csviri commented on May 5, 2023
What JOSDK could do is provide reconcilers (one for the role and one for the rolebinding) that handle adding and removing finalizers; it would be up to the dev to register them, since this also has implications for the operator's permissions (update permission on roles).
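As a rough illustration of what such a reconciler might decide on each event, here is a sketch in plain Java. It deliberately does not use the JOSDK or fabric8 APIs: the `decide` method, the `Action` enum and the finalizer name `hypothetical.example/operator-access` are made up for this example, and the rule "keep the finalizer until no CRs remain in the namespace" is an assumption about how the idea could work, not an existing feature.

```java
import java.util.Set;

// Sketch of the decision logic only; every name here is hypothetical, not a JOSDK API.
class RbacFinalizerSketch {

    static final String FINALIZER = "hypothetical.example/operator-access";

    enum Action { ADD_FINALIZER, WAIT, REMOVE_FINALIZER, NONE }

    /**
     * Decides what to do with a role or rolebinding.
     *
     * @param markedForDeletion whether the object has a deletion timestamp
     * @param finalizers finalizers currently set on the object
     * @param remainingCustomResources CRs still present in the same namespace
     */
    static Action decide(boolean markedForDeletion,
                         Set<String> finalizers,
                         int remainingCustomResources) {
        boolean hasFinalizer = finalizers.contains(FINALIZER);
        if (!markedForDeletion) {
            // Protect the object up front so it cannot vanish before the CRs do.
            return hasFinalizer ? Action.NONE : Action.ADD_FINALIZER;
        }
        if (!hasFinalizer) {
            return Action.NONE; // nothing of ours is blocking the deletion
        }
        // Release the object only after every CR has been cleaned up.
        return remainingCustomResources > 0 ? Action.WAIT : Action.REMOVE_FINALIZER;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, Set.of(), 1));         // prints ADD_FINALIZER
        System.out.println(decide(true, Set.of(FINALIZER), 1)); // prints WAIT
        System.out.println(decide(true, Set.of(FINALIZER), 0)); // prints REMOVE_FINALIZER
    }
}
```

Both adding and removing the finalizer are updates to the role or rolebinding itself, which is why the operator would need update permission on those resources.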
moayad-alyaghshi commented on Aug 17, 2023
Hi @csviri
We are facing the same issue of namespace deletion getting stuck, even when the operator is deployed in the same namespace as the CRs, which is not what I expected based on my understanding of the Slack thread. I would appreciate any explanation.
Note: The operator has a ClusterRole and ClusterRoleBinding to work with the CRs. We're using Quarkus with quarkus-operator-sdk.
csviri commented on Aug 17, 2023
Hi @moayad-alyaghshi ,
I briefly checked the namespace controller and the garbage collector controller when @gyfora reported this, and it seems (well, as far as I was able to see) there is nothing special in K8S to prevent this from happening even in the same namespace.
So this is not an issue with JOSDK, it's rather an issue with K8S. What we can offer is a reconciler that solves this; it just was not a priority for now, and it is scheduled for 4.5.
Maybe it is worth asking again around this on k8s slack: https://kubernetes.slack.com/archives/CAW0GV7A5