
EC2 host is being created and abandoned during provisioning / scale up process #5512

Open
@markvzm

Description

/kind bug

What steps did you take and what happened:
We have the cluster autoscaler (https://kubernetes.github.io/autoscaler) configured to react to compute demand on the fly. When additional capacity is needed, it triggers a MachineDeployment scale-up that provisions a new EC2 node. In some cases (I could not yet pin down exactly when this happens), I see the chain of events below in the capa-controller-manager logs (I replaced the namespace with aws-namespace, the region with aws-region, and the node role with node-role; the full log is attached):

aws-region-node-role-296hp-hbmpg.log

I0508 13:25:52.345969       1 awsmachine_controller.go:173] "Machine Controller has not yet set OwnerRef" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="e4002ed4-f788-42c1-b81b-ff9ac96f6817"
I0508 13:25:52.362816       1 eksconfig_controller.go:230] "Generating userdata" controller="eksconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="EKSConfig" EKSConfig="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="8139b498-25ed-4ea4-9904-5d0f485d3380"
I0508 13:25:52.374481       1 eksconfig_controller.go:343] "created bootstrap data secret for EKSConfig" controller="eksconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="EKSConfig" EKSConfig="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="8139b498-25ed-4ea4-9904-5d0f485d3380" secret="aws-namespace/aws-region-node-role-296hp-hbmpg"
I0508 13:25:52.391240       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
2025/05/08 13:25:52 http: TLS handshake error from 10.110.47.192:35541: EOF
I0508 13:25:52.417540       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.426121       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.461964       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.463684       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.480710       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.999194       1 awsmachine_controller.go:710] "Creating EC2 instance"
I0508 13:25:53.090255       1 instances.go:135] "Obtained a list of supported architectures for instance type" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="c9a0a0fb-cc13-47a0-8742-9de94c3f7a2f" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"]
I0508 13:25:53.090288       1 instances.go:135] "Chosen architecture" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="c9a0a0fb-cc13-47a0-8742-9de94c3f7a2f" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"] architecture="x86_64"
I0508 13:25:53.549839       1 ami.go:355] "found AMI" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="c9a0a0fb-cc13-47a0-8742-9de94c3f7a2f" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" id="ami-05e2ff94245141e72" version="1.31"
I0508 13:25:55.752052       1 awsmachine_controller.go:569] "EC2 instance state changed" state="pending" instance-id="i-0142b065d1a2fd03b"
I0508 13:25:56.665413       1 awsmachine_controller.go:710] "Creating EC2 instance"
I0508 13:25:56.758880       1 instances.go:135] "Obtained a list of supported architectures for instance type" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="1cea2701-43b7-4373-b773-705bd0fdf49d" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"]
I0508 13:25:56.759810       1 instances.go:135] "Chosen architecture" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="1cea2701-43b7-4373-b773-705bd0fdf49d" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"] architecture="x86_64"
I0508 13:25:56.916564       1 ami.go:355] "found AMI" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="1cea2701-43b7-4373-b773-705bd0fdf49d" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" id="ami-05e2ff94245141e72" version="1.31"
I0508 13:25:59.370289       1 awsmachine_controller.go:569] "EC2 instance state changed" state="pending" instance-id="i-03aa9d4fbd4a1460c"
I0508 13:26:26.945343       1 awsmachine_controller.go:569] "EC2 instance state changed" state="running" instance-id="i-03aa9d4fbd4a1460c"
I0508 13:27:03.861037       1 instances.go:996] "Updating security groups" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="d1de13d8-b44b-429d-bd86-40ab837c07a0" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" groups=["sg-09017c82cf3eb6f03","sg-0a2049172256449c9"]
I0508 13:27:04.303959       1 instances.go:996] "Updating security groups" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="d1de13d8-b44b-429d-bd86-40ab837c07a0" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" groups=["sg-09017c82cf3eb6f03","sg-0a2049172256449c9"]

Interpreting that: one AWSMachine, aws-region-node-role-296hp-hbmpg, is created on the management cluster, and the operator creates the corresponding EC2 host, i-0142b065d1a2fd03b, which reaches the pending state. Within a second, another EC2 instance creation is kicked off for the same AWSMachine CR name, producing i-03aa9d4fbd4a1460c. From then on, the controller "abandons" the resources produced by the first creation (e.g. the Node object on the k8s cluster and the EC2 instance) and continues the process with the second instance, using the same AWSMachine (or at least an AWSMachine CR with the same name). At the end of the second instance's lifetime, i-03aa9d4fbd4a1460c and its related resources are properly deleted, but the first one, i-0142b065d1a2fd03b, leaves the Node object in a half-baked state on the k8s cluster and the host "running" in EC2. This abandoned node never receives the node-role labels, so the cluster never uses it. The resources related to it have to be cleaned up manually.
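The duplicate provisioning is visible in the attached capa-controller-manager log itself: two "Creating EC2 instance" events for the same AWSMachine, followed by two distinct instance IDs entering the pending state. A quick sketch to confirm this from the log (file name taken from the attachment above):

```shell
# Count how many times the controller started creating an EC2 instance
# for this AWSMachine -- anything above 1 indicates the duplicate-provision
# behavior described here.
grep -c 'Creating EC2 instance' aws-region-node-role-296hp-hbmpg.log

# List the distinct instance IDs the controller saw transition to "pending";
# two IDs for a single AWSMachine means one of them was abandoned.
grep 'state="pending"' aws-region-node-role-296hp-hbmpg.log \
  | grep -oE 'i-[0-9a-f]+' | sort -u
```

On the attached log this prints 2 and the two instance IDs i-0142b065d1a2fd03b and i-03aa9d4fbd4a1460c.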

What did you expect to happen:
One AWS host provisioned in EC2 and one node on the k8s cluster properly configured.

Anything else you would like to add:
I have the capi-controller-manager logs as well, although I did not find much relevant information in them for this case.

Environment:

  • Cluster-api-provider-aws version: v2.6.1
  • Kubernetes version (from kubectl version):
    Client Version: v1.31.4
    Kustomize Version: v5.4.2
    Server Version: v1.31.4
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS (Focal Fossa)

Labels

kind/bug, needs-priority, needs-triage
