scheduler: basic cluster reconciler safety properties for batch jobs #26172

Merged: 1 commit merged into main on Jul 9, 2025

Conversation

@tgross (Member) commented Jun 30, 2025

Property test assertions for the core safety properties of the cluster reconciler, for batch jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: #26167

Comment on lines 622 to 643
// We can never have more placements than the count
if len(result.Place) > tg.Count {
result.Place = result.Place[:tg.Count]
result.DesiredTGUpdates[tg.Name].Place = uint64(tg.Count)
}
@tgross (Member, Author) commented on Jun 30, 2025:

Test case that discovered this:

  • group count 1
  • alloc with desired=evict and client=lost (alloc was lost and missing heartbeat)
  • alloc with desired=run and client=failed
  • we get stop=2, place=2 with both placements being for the failed alloc

This patch is a bit of a sledgehammer on the test but seems like an obvious thing to assert as well. TBD

@tgross (Member, Author):

The following test shows that this is a problem only in the reconciler and not the scheduler overall:

func TestServiceSched_PlacementOvercount(t *testing.T) {
	ci.Parallel(t)

	h := tests.NewHarness(t)

	lostNode := mock.Node()
	lostNode.Status = structs.NodeStatusDown
	must.NoError(t, h.State.UpsertNode(structs.MsgTypeTestSetup, h.NextIndex(), lostNode))

	node := mock.Node()
	must.NoError(t, h.State.UpsertNode(structs.MsgTypeTestSetup, h.NextIndex(), node))

	job := mock.Job()
	job.TaskGroups[0].Count = 1
	must.NoError(t, h.State.UpsertJob(structs.MsgTypeTestSetup, h.NextIndex(), nil, job))

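	// Two allocations claim the same name index "my-job.web[0]": one lost
	// on the down node, one failed on the healthy node.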
	lostAlloc := mock.AllocForNode(lostNode)
	lostAlloc.Job = job
	lostAlloc.JobID = job.ID
	lostAlloc.Name = "my-job.web[0]"
	lostAlloc.DesiredStatus = structs.AllocDesiredStatusEvict
	lostAlloc.ClientStatus = structs.AllocClientStatusLost

	failedAlloc := mock.AllocForNode(node)
	failedAlloc.Job = job
	failedAlloc.JobID = job.ID
	failedAlloc.Name = "my-job.web[0]"
	failedAlloc.DesiredStatus = structs.AllocDesiredStatusRun
	failedAlloc.ClientStatus = structs.AllocClientStatusFailed

	allocs := []*structs.Allocation{lostAlloc, failedAlloc}
	must.NoError(t, h.State.UpsertAllocs(structs.MsgTypeTestSetup, h.NextIndex(), allocs))

	eval := &structs.Evaluation{
		Namespace:    structs.DefaultNamespace,
		ID:           uuid.Generate(),
		Priority:     job.Priority,
		TriggeredBy:  structs.EvalTriggerAllocStop,
		JobID:        job.ID,
		Status:       structs.EvalStatusPending,
		AnnotatePlan: true,
	}

	must.NoError(t, h.State.UpsertEvals(
		structs.MsgTypeTestSetup, h.NextIndex(), []*structs.Evaluation{eval}))

	err := h.Process(NewServiceScheduler, eval)
	must.NoError(t, err)
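	// The final plan places and stops exactly one alloc for the "web"
	// group, even though two allocs competed for index 0.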
	must.Len(t, 1, h.Plans)
	must.Eq(t, 1, h.Plans[0].Annotations.DesiredTGUpdates["web"].Place)
	must.Eq(t, 1, h.Plans[0].Annotations.DesiredTGUpdates["web"].Stop)
}

However, the check we're doing here isn't correct either, because result.Place holds the results from all task groups. We need to check the specific task group and only trim its own entries from result.Place, not those of other task groups. I'll pull this out into its own PR.

@tgross (Member, Author):

> However, the check we're doing here isn't correct either, because result.Place holds the results from all task groups.

Whoops, this isn't true either! This is computeGroup, so at this point result.Place only holds this group's placements; it gets merged with the other groups' results later.
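For illustration, a rough sketch of the shape being described here, with hypothetical names rather than Nomad's actual code: the reconciler walks the task groups, and each computeGroup call contributes only its own group's placements to the merged result, so a cap applied inside computeGroup is per-group by construction.

// Hypothetical sketch of the per-group merge; computeGroup and
// allocsByGroup stand in for the real reconciler internals.
for _, tg := range job.TaskGroups {
	groupResult := computeGroup(tg, allocsByGroup[tg.Name])
	result.Place = append(result.Place, groupResult.Place...)
}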

@tgross (Member, Author):

if int(tgUpdates.Place) > tgCount {
	t.Fatal("group placements should never exceed group count")
}
if int(tgUpdates.DestructiveUpdate) > tgCount {
	t.Fatal("destructive updates should never exceed group count")
}
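For context, a minimal sketch of how assertions like these sit inside a randomized property-test loop; genTaskGroup and runReconciler are hypothetical stand-ins for this PR's input generators and reconciler entry point, not its actual harness:

// Sketch only: each iteration generates a random task group plus allocs
// and asserts the safety properties on the per-group annotations.
func TestReconciler_BatchSafetyProperties_Sketch(t *testing.T) {
	for i := 0; i < 1000; i++ {
		tg, allocs := genTaskGroup(rand.Int63()) // hypothetical generator
		tgUpdates := runReconciler(tg, allocs)   // hypothetical wrapper

		if int(tgUpdates.Place) > tg.Count {
			t.Fatalf("iteration %d: %d placements exceed group count %d",
				i, tgUpdates.Place, tg.Count)
		}
		if int(tgUpdates.DestructiveUpdate) > tg.Count {
			t.Fatalf("iteration %d: %d destructive updates exceed group count %d",
				i, tgUpdates.DestructiveUpdate, tg.Count)
		}
	}
}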
@pkazmierczak (Contributor):

makes me wonder: shouldn't these also hold for service jobs?

@tgross (Member, Author):

Yeah, I assume there's going to be a whole lot of overlap between the properties we want to test in our two PRs. Once we've got a set we're mostly happy with, we can refactor the tests to pull out the common set.

@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from 60946b9 to 7f71eb9 on July 1, 2025 18:15
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from 7f71eb9 to d51264a on July 1, 2025 18:42
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from d51264a to 4829f2f on July 1, 2025 19:12
@tgross tgross changed the base branch from main to f-prop-testing-reconciler-safety on July 1, 2025 19:31
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from 4829f2f to 874530c on July 1, 2025 19:32
@tgross tgross changed the base branch from f-prop-testing-reconciler-safety to main on July 1, 2025 19:33
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch 2 times, most recently from 3d01dc2 to 31a487a on July 1, 2025 19:35
@tgross tgross marked this pull request as ready for review on July 1, 2025 19:54
@tgross tgross requested review from a team as code owners on July 1, 2025 19:54
@tgross tgross requested a review from pkazmierczak on July 1, 2025 19:54
tgross added a commit that referenced this pull request Jul 1, 2025
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: #26167
Ref: #26172
tgross added a commit that referenced this pull request Jul 2, 2025
tgross added a commit that referenced this pull request Jul 7, 2025
tgross added a commit that referenced this pull request Jul 9, 2025
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.

Ref: #26172
Property test assertions for the core safety properties of the cluster
reconciler, for batch jobs. The changeset includes fixes for any bugs found
during work-in-progress, which will get pulled out to their own PRs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: #26167
@pkazmierczak (Contributor) left a comment:

LGTM!

@tgross tgross merged commit 94e03f8 into main Jul 9, 2025
39 checks passed
@tgross tgross deleted the f-prop-testing-reconciler-safety-batch branch July 9, 2025 18:43