
EC2 host is being created and abandoned during provisioning / scale up process #5512

Open
@markvzm

Description

/kind bug

What steps did you take and what happened:
We have the cluster autoscaler (https://kubernetes.github.io/autoscaler) configured to react to compute demand on the fly. When additional capacity is needed, it triggers a MachineDeployment scale-up that provisions a new EC2 node. In some cases (I could not yet pin down exactly when this happens), I see the chain of events below in the capa-controller-manager logs (I replaced the namespace with aws-namespace, the region with aws-region, and the node role with node-role; the full log is attached):

aws-region-node-role-296hp-hbmpg.log

I0508 13:25:52.345969       1 awsmachine_controller.go:173] "Machine Controller has not yet set OwnerRef" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="e4002ed4-f788-42c1-b81b-ff9ac96f6817"
I0508 13:25:52.362816       1 eksconfig_controller.go:230] "Generating userdata" controller="eksconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="EKSConfig" EKSConfig="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="8139b498-25ed-4ea4-9904-5d0f485d3380"
I0508 13:25:52.374481       1 eksconfig_controller.go:343] "created bootstrap data secret for EKSConfig" controller="eksconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="EKSConfig" EKSConfig="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="8139b498-25ed-4ea4-9904-5d0f485d3380" secret="aws-namespace/aws-region-node-role-296hp-hbmpg"
I0508 13:25:52.391240       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
2025/05/08 13:25:52 http: TLS handshake error from 10.110.47.192:35541: EOF
I0508 13:25:52.417540       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.426121       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.461964       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.463684       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.480710       1 awsmachine_controller.go:486] "Bootstrap data secret reference is not yet available"
I0508 13:25:52.999194       1 awsmachine_controller.go:710] "Creating EC2 instance"
I0508 13:25:53.090255       1 instances.go:135] "Obtained a list of supported architectures for instance type" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="c9a0a0fb-cc13-47a0-8742-9de94c3f7a2f" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"]
I0508 13:25:53.090288       1 instances.go:135] "Chosen architecture" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="c9a0a0fb-cc13-47a0-8742-9de94c3f7a2f" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"] architecture="x86_64"
I0508 13:25:53.549839       1 ami.go:355] "found AMI" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="c9a0a0fb-cc13-47a0-8742-9de94c3f7a2f" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" id="ami-05e2ff94245141e72" version="1.31"
I0508 13:25:55.752052       1 awsmachine_controller.go:569] "EC2 instance state changed" state="pending" instance-id="i-0142b065d1a2fd03b"
I0508 13:25:56.665413       1 awsmachine_controller.go:710] "Creating EC2 instance"
I0508 13:25:56.758880       1 instances.go:135] "Obtained a list of supported architectures for instance type" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="1cea2701-43b7-4373-b773-705bd0fdf49d" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"]
I0508 13:25:56.759810       1 instances.go:135] "Chosen architecture" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="1cea2701-43b7-4373-b773-705bd0fdf49d" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" instance type="m6a.4xlarge" supported architectures=["x86_64"] architecture="x86_64"
I0508 13:25:56.916564       1 ami.go:355] "found AMI" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="1cea2701-43b7-4373-b773-705bd0fdf49d" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" id="ami-05e2ff94245141e72" version="1.31"
I0508 13:25:59.370289       1 awsmachine_controller.go:569] "EC2 instance state changed" state="pending" instance-id="i-03aa9d4fbd4a1460c"
I0508 13:26:26.945343       1 awsmachine_controller.go:569] "EC2 instance state changed" state="running" instance-id="i-03aa9d4fbd4a1460c"
I0508 13:27:03.861037       1 instances.go:996] "Updating security groups" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="d1de13d8-b44b-429d-bd86-40ab837c07a0" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" groups=["sg-09017c82cf3eb6f03","sg-0a2049172256449c9"]
I0508 13:27:04.303959       1 instances.go:996] "Updating security groups" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="aws-namespace/aws-region-node-role-296hp-hbmpg" namespace="aws-namespace" name="aws-region-node-role-296hp-hbmpg" reconcileID="d1de13d8-b44b-429d-bd86-40ab837c07a0" machine="aws-namespace/aws-region-node-role-296hp-hbmpg" cluster="aws-namespace/aws-region" groups=["sg-09017c82cf3eb6f03","sg-0a2049172256449c9"]

Interpreting that: one AWSMachine, aws-region-node-role-296hp-hbmpg, is created on the management cluster, and the operator creates the corresponding EC2 host, i-0142b065d1a2fd03b, which reaches the pending state. Within a second, another EC2 instance creation is kicked off for the same AWSMachine CR name, producing i-03aa9d4fbd4a1460c. From then on, the controller "abandons" the resources produced by the first creation (e.g. the Node object on the k8s cluster and the EC2 instance) and continues the process with the second instance, using the same AWSMachine (or at least an AWSMachine CR with the same name). At the end of the second instance's lifetime, i-03aa9d4fbd4a1460c and its related resources are properly deleted, but the first one, i-0142b065d1a2fd03b, leaves the Node object in a half-baked state on the k8s cluster and the host "running" in EC2. This abandoned node never receives the node-role labels, so the cluster never uses it. The resources related to it have to be cleaned up manually.
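The duplicate provisioning is visible in the attached capa-controller-manager log itself: two "Creating EC2 instance" events for the same AWSMachine, followed by two distinct instance IDs entering the pending state. A quick sketch to confirm this from the log (file name taken from the attachment above):

```shell
# Count how many times the controller started creating an EC2 instance
# for this AWSMachine -- anything above 1 indicates the duplicate-provision
# behavior described here.
grep -c 'Creating EC2 instance' aws-region-node-role-296hp-hbmpg.log

# List the distinct instance IDs the controller saw transition to "pending";
# two IDs for a single AWSMachine means one of them was abandoned.
grep 'state="pending"' aws-region-node-role-296hp-hbmpg.log \
  | grep -oE 'i-[0-9a-f]+' | sort -u
```

On the attached log this prints 2 and the two instance IDs i-0142b065d1a2fd03b and i-03aa9d4fbd4a1460c.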

What did you expect to happen:
One AWS host provisioned in EC2 and one node on the k8s cluster properly configured.

Anything else you would like to add:
I have the capi-controller-manager logs as well, although I did not find much relevant information in them for this case.

Environment:

  • Cluster-api-provider-aws version: v2.6.1
  • Kubernetes version (from kubectl version):
    Client Version: v1.31.4
    Kustomize Version: v5.4.2
    Server Version: v1.31.4
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS (Focal Fossa)

Labels

kind/bug, needs-priority, needs-triage
