
Use systemd Boot Assessment #28

Open
wants to merge 1 commit into
base: master

danyspin97

Automatic Boot Assessment allows systemd-boot and systemd to mark boot entries as either good or bad, depending on whether they boot successfully.

This PR turns health-checker into a service that takes part in Automatic Boot Assessment by using the special target boot-complete.target. When systemd-boot is in use and /etc/kernel/tries contains a number greater than 0, the current boot entry gets renamed to start the counting.
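For readers unfamiliar with the renaming: systemd's boot counting stores the counters in the entry file name, as `+LEFT` or `+LEFT-DONE` placed right before the `.conf` suffix. Below is a minimal sketch of how a script could read or strip that counter. The function names and the example entry names are hypothetical and are not taken from this PR.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not the code from this PR.

# Print the entry file name without its "+LEFT" or "+LEFT-DONE" counter.
strip_boot_count() {
    printf '%s\n' "$1" | sed -E 's/\+[0-9]+(-[0-9]+)?(\.[^.]+)$/\2/'
}

# Print the number of tries left, or nothing if the name has no counter.
tries_left() {
    printf '%s\n' "$1" | sed -nE 's/.*\+([0-9]+)(-[0-9]+)?\.[^.]+$/\1/p'
}

# Example with made-up entry names:
strip_boot_count "snapshot-42+3.conf"
tries_left "snapshot-42+2-1.conf"
```

Stripping the counter is what makes it possible to compare the booted entry with the default entry, since only one of the two names may carry a counter.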

If the health-checker tests pass without errors, the boot entry is marked as good by systemd-bless-boot. If there is any error, health-checker decides whether to reboot or to start an emergency shell. If the current entry is the default one and still has some tries left (i.e. it has not been marked as bad), health-checker reboots. If the current entry is the default one and there are no tries left, it starts an emergency shell. The default entry is picked among the ones that are known to work or haven't been tested yet, so the emergency shell is only started when all entries have been tried (this could lead to many reboots). If the user chooses an entry instead of letting systemd-boot pick the default one, health-checker will not reboot by default (a reboot can be forced with the argument below).
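The failure handling described above can be summarized as a small decision function. This is only a sketch under assumptions: the function name, its arguments, and the `no-reboot` result are hypothetical, and the description does not say what happens with `force` once no tries are left, so the sketch simply reboots in that case.

```shell
#!/bin/sh
# Hypothetical sketch of the failure handling; not the code from this PR.
# $1: "yes" if the booted entry is the default one, "no" otherwise
# $2: number of tries left for the booted entry
# $3: value of health-checker-reboot from the kernel cmdline (may be empty)
decide_action() {
    # The PR text says "disable", the script excerpt says "disabled";
    # accept both here because the final spelling is not settled in the thread.
    case "${3:-}" in
        disable|disabled) echo "no-reboot"; return 0 ;;
    esac

    if [ "$1" = "yes" ]; then
        if [ "$2" -gt 0 ]; then
            echo "reboot"            # default entry, tries left
        else
            echo "emergency-shell"   # default entry, no tries left
        fi
    elif [ "${3:-}" = "force" ]; then
        echo "reboot"                # manually chosen entry, reboot forced
    else
        echo "no-reboot"             # manually chosen entry
    fi
}

decide_action yes 2 ""
```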

I have also added two kernel cmdline arguments to fix #8 :

  • health-checker-reboot:
    • force: always reboot when health-checker fails and the loaded boot entry is not the default one
    • disable: health-checker never reboots
  • health-checker=disabled: skip all tests and mark health-checker as successful. This breaks systemd Automatic Boot Counting but helps with debugging or some edge cases.
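As an illustration of how such arguments could be read, here is a minimal sketch that extracts the value of a `key=value` word from a kernel command line. The helper name `cmdline_value` is hypothetical, and the sketch does not handle quoted values containing spaces.

```shell
#!/bin/sh
# Hypothetical helper; not the code from this PR.
# $1: key to look up
# $2: command line string (defaults to the contents of /proc/cmdline)
# Prints the value of the first "key=value" word, or nothing if absent.
cmdline_value() (
    key=$1
    cmdline=${2-$(cat /proc/cmdline)}
    set -f  # no glob expansion while splitting the command line into words
    for word in $cmdline; do
        case "$word" in
            "$key"=*)
                printf '%s\n' "${word#*=}"
                exit 0
                ;;
        esac
    done
)

cmdline_value health-checker-reboot "root=/dev/sda2 quiet health-checker-reboot=force"
```

Matching on whole words matters here because `health-checker` is a prefix of `health-checker-reboot`; a plain substring match would confuse the two options.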

Requirements

  • /etc/kernel/tries must contain a number greater than 0. Currently, I am shipping this file in the health-checker package.
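A script could verify this requirement before relying on boot counting. The following is a sketch only; the function name is hypothetical and the check assumes the file holds a single non-negative integer on its first line.

```shell
#!/bin/sh
# Hypothetical check; not the code from this PR.
# $1: path to the tries file (defaults to /etc/kernel/tries)
# Succeeds only if the file is readable and holds a number greater than 0.
boot_counting_enabled() {
    file=${1:-/etc/kernel/tries}
    [ -r "$file" ] || return 1
    tries=$(head -n 1 "$file" | tr -d '[:space:]')
    case "$tries" in
        ''|*[!0-9]*) return 1 ;;  # empty or not a plain number
    esac
    [ "$tries" -gt 0 ]
}

if boot_counting_enabled; then
    echo "boot counting is configured"
fi
```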

Current blockers

  • systemd-bless-boot cannot rename the boot entries because of the SELinux enforcing policy. Bug tracked here.

@Vogtinator
Member

What do we do on platforms without EFI vars? Just declare them unsupported by this mechanism?

@danyspin97
Author

> What do we do on platforms without EFI vars? Just declare them unsupported by this mechanism?

This version uses EFI vars for simplicity, since it makes it easier to retrieve the current and default boot entries. This can work with any bootloader as long as we know this info and Automatic Boot Assessment is supported.

@danyspin97 danyspin97 marked this pull request as ready for review December 2, 2024 14:53
@danyspin97
Author

I replaced the EFI variables by reimplementing the Automatic Boot Assessment logic in health-checker. The default entry is the first one, sorted by name in descending order, that has not been disabled by boot counting. It is quite bare-bones, but it works. One possible issue is having different kernel versions, in which case health-checker would require a more robust parser. To detect the current entry, I check the snapshot version of the currently mounted snapshot.
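A minimal sketch of that selection rule, assuming entry file names arrive one per line on stdin and that an entry counts as disabled when its counter shows zero tries left (`+0` or `+0-DONE`). The function name and the entry names are hypothetical, and plain `sort -r` is not version-aware, which is the limitation mentioned above.

```shell
#!/bin/sh
# Hypothetical sketch; not the code from this PR.
# Reads entry file names from stdin and prints the first one, in descending
# name order, that boot counting has not marked as bad.
pick_default_entry() {
    LC_ALL=C sort -r \
        | grep -Ev '\+0(-[0-9]+)?\.conf$' \
        | head -n 1
}

# Example with made-up entry names:
printf '%s\n' "snap-1.conf" "snap-2+2-1.conf" "snap-3+0-3.conf" | pick_default_entry
```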

@danyspin97
Author

I thought a little more about the approach I have taken in this PR. bootctl can set the default entry by changing the EFI variables, so I'd go back to reading the EFI variables first. I still think the bash implementation of the BLS logic for choosing the default is a good fallback for systems that don't support EFI vars.

@aplanas

aplanas commented Dec 11, 2024

bootctl follows the BLS and also the BLI, which describes the set of EFI variables that the bootloader will follow. Because grub2-bls does not really follow the BLI specification, sdbootutil needed to re-implement set/get-default and set/get-timeout for BLI and non-BLI bootloaders.

For example, if we are using systemd-boot on an architecture that does not have EFI variables, it will set the configuration in the loader.conf file in the ESP, and if we are on a grub2-bls system, it will set the grubenv and the EFI variable (or loader.conf), so bootctl will always read the correct information for both bootloaders.

My recommendation is to follow this path, or use sdbootutil set-default and get-default to abstract this part.

@thkukuk
Contributor

thkukuk commented Mar 28, 2025

Please adjust the README.md and manual page, especially with the now missing state file and the new kernel commandline options.
Otherwise it looks good to me.

Member

@Vogtinator Vogtinator left a comment

grep without -q pollutes stdout
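To illustrate the point: without `-q`, grep prints every matching line, which then ends up in the output of the calling script. The sketch below contrasts the two forms; the function names are hypothetical and only the option string is taken from this PR.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not the code from this PR.

# Prints the matching line as a side effect of the test.
has_option_noisy() {
    printf '%s\n' "$1" | grep -w "health-checker-reboot=disabled"
}

# Only sets the exit status; nothing is written to stdout.
has_option_quiet() {
    printf '%s\n' "$1" | grep -qw "health-checker-reboot=disabled"
}

if has_option_quiet "quiet health-checker-reboot=disabled"; then
    echo "reboot disabled"
fi
```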

README.md Outdated
@@ -2,34 +2,47 @@

Check the state of a openSUSE MicroOS system after a reboot.

## Configure

All services that should be checked, need to be listed in the 'After' section.
Member

Suggested change
- All services that should be checked, need to be listed in the 'After' section.
+ All services that should be checked need to be listed in the 'After' section.

@danyspin97 danyspin97 force-pushed the sdboot branch 4 times, most recently from f66e61f to 1c4cdfb Compare May 16, 2025 10:18

Verified

This commit was signed with the committer’s verified signature.
danyspin97 Danilo Spinella
installed by system packages (and therefore coming through an RPM), the latter includes
plugins installed manually by the system admin. Every plugin is responsible to check
a special service or condition. For this, the plugin is called with the option
*check*. If this fails, the plugin will exit with the return value `1`, else
Contributor

The check option is already mentioned some lines above, please merge this.

`0`. Have a look at the default plugins shipped in
`/usr/libexec/health-checker` for examples.

Its behavior depends if the system is using systemd-boot/grub2-bls (i.e.
Contributor

Better: "is using systemd-boot, grub2-bls or any other bootloader following the Boot Loader Specification (BLS) or legacy..."
I wouldn't mention bootloader-internal details such as /boot/efi/loader/entries; this can change in the future, and a change is already under discussion.

Every new snapshot has a separate boot entry with a boot counter (according to
`/etc/kernel/tries`, which health-checker sets to 3 by default); when that
snapshot is booted for the first time, the bootloader (systemd-boot by
default on MicroOS, but grub2-bls is also supported) will decrease the amount
Contributor

I would remove "(systemd-boot by default on MicroOS, but grub2-bls is also supported)". It is an irrelevant internal implementation detail that can change at any time, and then we forget to adjust the text here.

number of times configured in <filename>/etc/kernel/tries</filename>. If the
system still isn't working, then an emergency shell is started. If it is not
the first boot with the selected snapshot, then an emergency shell is
automatically started.
Contributor

Does this mean we will never do an automated rollback with systemd-boot or grub2-bls?

# Do not reboot by default if the entry has been chosen manually or the reboot has
# been disabled in the kernel cmdline
# selected_entry contains the boot count, remove it before comparing it to the default entry
if ! grep -qw "health-checker-reboot=disabled" /proc/cmdline; then
Contributor

Please document this option in the manual page.

Development

Successfully merging this pull request may close these issues.

Option to disable health-checker in Grub
4 participants