Skip to content

Full rewrite of etcd certificates generation #12180

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

VannTen
Copy link
Contributor

@VannTen VannTen commented May 2, 2025

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

This is a (WIP) full rewrite of the certificates generation for etcd

The main guiding principles are:

  • private key are generated on their use sites and never fetched anywhere else
  • Certificate signing request and certificates are passed around using ansible registered variables
  • avoid relying on scripts / templated openssl.conf and prefer command line arguments where possible

This should have the following benefits:

  • no need for node specific filename, which will work better with kubeadm assumptions (see Unable to upgrade from 1.29.x to 1.30.x with external etcd setup #11500 discussion for instance)
  • moving private keys around is antithetic to the design of PKI and assymetric crypto
  • This will allow working with private keys which cannot be moved, for instance, where the private key is held by a TPM/HSM.

Out of scope for this PR (but possible follow-ups):

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
This is a work in progress. Roughly :

  • Creating CA, CSR, certificates
  • Adding correct CN / SANs
  • Propagate the names into the rest of kubespray (kubeadm and network plugins directly using etcd as a datastore)

@ant31 @tico88612

Once this is ready this should not merge before 2.28. It's a big change so I'd rather have some time in master before that.

Does this PR introduce a user-facing change?:

action required
All etcd certificates (client, member) no longer include the inventory hostname of the node.
Key and certificates are no longer produced on the first etcd, only on the relevant node

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 2, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from MrFreezeex and yankay May 2, 2025 08:22
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 2, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2025
@VannTen VannTen force-pushed the cleanup/etcd_certs_generation branch from c9c3902 to a2246a0 Compare May 16, 2025 13:56
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 16, 2025
@VannTen VannTen force-pushed the cleanup/etcd_certs_generation branch 2 times, most recently from 3205a72 to 9945db9 Compare May 16, 2025 14:51
VannTen added 5 commits May 16, 2025 16:58
We currently do a lock of back and forth to check if certificates are
present on the nodes, whether they need refreshing, etcd.

This is neither readable nor performant, and mostly bypass the Ansible
control flow for little gains.

It also generate all the private keys on the first etcd and then push
them to each node, which undermines the security of the whole PKI and
prevent usage of non-movable secret (HSM, TPM) (these are still not
possible with the new flow but should be easier to add)

Instead of creating all certificates on first etcd, do this:
- create CA on first etcd
- distribute it on every node
- create key and Certificate Signing Request on each node which need one
  (register the CSR as variable)
- delegate for each node the CSR signing to the first etcd master and
  register the certificate as variable for each node
- put the signed certificate on each node

We don't do accounting of the files per-se, instead relying on changed
status to determine the set which needs to be updated.
@VannTen VannTen force-pushed the cleanup/etcd_certs_generation branch from 9945db9 to dc4d400 Compare May 16, 2025 15:00
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aviral-agarwal
Copy link
Contributor

Hey @VannTen
Just a few notes concerning Cilium as CNI and etcd that may be relevant and helpful for the PR

  • Cilium running on worker nodes does not always need access to etcd.
    That was the case before, but with the updated version, the default is no access to etcd required
  • One configuration (which I am aware of) that will configure Cilium (including Cilium agents on worker nodes) to require etcd access is when identityAllocationMode(Cilium helm)/cilium_identity_allocation_mode(kubespray) is kvstore.
  • The current default value, in both Cilium helm charts and kubespray, is crd

This is by no means an exhaustive list, but should be helpful anyway

Feel free to comment/give feedback/or if I can be of any assistance

Thanks and keeping an eye out for this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants