scheduler: basic cluster reconciler safety properties for batch jobs #26172

Merged: 1 commit merged into main on Jul 9, 2025

Conversation

@tgross (Member) commented Jun 30, 2025

Property test assertions for the core safety properties of the cluster reconciler, for batch jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: #26167

Comment on lines 622 to 643
// We can never have more placements than the count
if len(result.Place) > tg.Count {
result.Place = result.Place[:tg.Count]
result.DesiredTGUpdates[tg.Name].Place = uint64(tg.Count)
}
@tgross (Member, Author) commented on Jun 30, 2025:

Test case that discovered this:

  • group count 1
  • alloc with desired=evict and client=lost (alloc was lost and missing heartbeat)
  • alloc with desired=run and client=failed
  • we get stop=2, place=2 with both placements being for the failed alloc

This patch is a bit of a sledgehammer on the test but seems like an obvious thing to assert as well. TBD

@tgross (Member, Author):

The following test shows that this is a problem only in the reconciler and not the scheduler overall:

func TestServiceSched_PlacementOvercount(t *testing.T) {
	ci.Parallel(t)

	h := tests.NewHarness(t)

	lostNode := mock.Node()
	lostNode.Status = structs.NodeStatusDown
	must.NoError(t, h.State.UpsertNode(structs.MsgTypeTestSetup, h.NextIndex(), lostNode))

	node := mock.Node()
	must.NoError(t, h.State.UpsertNode(structs.MsgTypeTestSetup, h.NextIndex(), node))

	job := mock.Job()
	job.TaskGroups[0].Count = 1
	must.NoError(t, h.State.UpsertJob(structs.MsgTypeTestSetup, h.NextIndex(), nil, job))

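	// Two allocations claim the same name index "my-job.web[0]": one lost
	// on the down node, one failed on the healthy node.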
	lostAlloc := mock.AllocForNode(lostNode)
	lostAlloc.Job = job
	lostAlloc.JobID = job.ID
	lostAlloc.Name = "my-job.web[0]"
	lostAlloc.DesiredStatus = structs.AllocDesiredStatusEvict
	lostAlloc.ClientStatus = structs.AllocClientStatusLost

	failedAlloc := mock.AllocForNode(node)
	failedAlloc.Job = job
	failedAlloc.JobID = job.ID
	failedAlloc.Name = "my-job.web[0]"
	failedAlloc.DesiredStatus = structs.AllocDesiredStatusRun
	failedAlloc.ClientStatus = structs.AllocClientStatusFailed

	allocs := []*structs.Allocation{lostAlloc, failedAlloc}
	must.NoError(t, h.State.UpsertAllocs(structs.MsgTypeTestSetup, h.NextIndex(), allocs))

	eval := &structs.Evaluation{
		Namespace:    structs.DefaultNamespace,
		ID:           uuid.Generate(),
		Priority:     job.Priority,
		TriggeredBy:  structs.EvalTriggerAllocStop,
		JobID:        job.ID,
		Status:       structs.EvalStatusPending,
		AnnotatePlan: true,
	}

	must.NoError(t, h.State.UpsertEvals(
		structs.MsgTypeTestSetup, h.NextIndex(), []*structs.Evaluation{eval}))

	err := h.Process(NewServiceScheduler, eval)
	must.NoError(t, err)
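	// The final plan places and stops exactly one alloc for the "web"
	// group, even though two allocs competed for index 0.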
	must.Len(t, 1, h.Plans)
	must.Eq(t, 1, h.Plans[0].Annotations.DesiredTGUpdates["web"].Place)
	must.Eq(t, 1, h.Plans[0].Annotations.DesiredTGUpdates["web"].Stop)
}

However, the check we're doing here isn't correct either, because result.Place holds the results from all task groups. We need to check the specific task group and only trim its own entries from result.Place, not those of other task groups. I'll pull this out into its own PR.

@tgross (Member, Author):

> However, the check we're doing here isn't correct either, because result.Place holds the results from all task groups.

Whoops, this isn't true either! This is computeGroup, so at this point result.Place only holds this group's placements; it gets merged with the other groups' results later.
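For illustration, a rough sketch of the shape being described here, with hypothetical names rather than Nomad's actual code: the reconciler walks the task groups, and each computeGroup call contributes only its own group's placements to the merged result, so a cap applied inside computeGroup is per-group by construction.

// Hypothetical sketch of the per-group merge; computeGroup and
// allocsByGroup stand in for the real reconciler internals.
for _, tg := range job.TaskGroups {
	groupResult := computeGroup(tg, allocsByGroup[tg.Name])
	result.Place = append(result.Place, groupResult.Place...)
}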

@tgross (Member, Author):

if int(tgUpdates.Place) > tgCount {
	t.Fatal("group placements should never exceed group count")
}
if int(tgUpdates.DestructiveUpdate) > tgCount {
	t.Fatal("destructive updates should never exceed group count")
}
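For context, a minimal sketch of how assertions like these sit inside a randomized property-test loop; genTaskGroup and runReconciler are hypothetical stand-ins for this PR's input generators and reconciler entry point, not its actual harness:

// Sketch only: each iteration generates a random task group plus allocs
// and asserts the safety properties on the per-group annotations.
func TestReconciler_BatchSafetyProperties_Sketch(t *testing.T) {
	for i := 0; i < 1000; i++ {
		tg, allocs := genTaskGroup(rand.Int63()) // hypothetical generator
		tgUpdates := runReconciler(tg, allocs)   // hypothetical wrapper

		if int(tgUpdates.Place) > tg.Count {
			t.Fatalf("iteration %d: %d placements exceed group count %d",
				i, tgUpdates.Place, tg.Count)
		}
		if int(tgUpdates.DestructiveUpdate) > tg.Count {
			t.Fatalf("iteration %d: %d destructive updates exceed group count %d",
				i, tgUpdates.DestructiveUpdate, tg.Count)
		}
	}
}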
@pkazmierczak (Contributor):

makes me wonder: shouldn't these also hold for service jobs?

@tgross (Member, Author):

Yeah, I assume there's going to be a whole lot of overlap between the properties we want to test in our two PRs. Once we've got a set we're mostly happy with, we can refactor the tests to pull out the common set.

@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from 60946b9 to 7f71eb9 on July 1, 2025 18:15
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from 7f71eb9 to d51264a on July 1, 2025 18:42
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from d51264a to 4829f2f on July 1, 2025 19:12
@tgross tgross changed the base branch from main to f-prop-testing-reconciler-safety on July 1, 2025 19:31
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch from 4829f2f to 874530c on July 1, 2025 19:32
@tgross tgross changed the base branch from f-prop-testing-reconciler-safety to main on July 1, 2025 19:33
@tgross tgross force-pushed the f-prop-testing-reconciler-safety-batch branch 2 times, most recently from 3d01dc2 to 31a487a on July 1, 2025 19:35
@tgross tgross marked this pull request as ready for review on July 1, 2025 19:54
@tgross tgross requested review from a team as code owners on July 1, 2025 19:54
@tgross tgross requested a review from pkazmierczak on July 1, 2025 19:54
tgross added a commit that referenced this pull request Jul 1, 2025
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: #26167
Ref: #26172
tgross added a commit that referenced this pull request Jul 2, 2025
tgross added a commit that referenced this pull request Jul 7, 2025
tgross added a commit that referenced this pull request Jul 9, 2025
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.

Ref: #26172
Property test assertions for the core safety properties of the cluster
reconciler, for batch jobs. The changeset includes fixes for any bugs found
during work-in-progress, which will get pulled out to their own PRs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: #26167
@pkazmierczak (Contributor) left a comment:

LGTM!

@tgross tgross merged commit 94e03f8 into main Jul 9, 2025
39 checks passed
@tgross tgross deleted the f-prop-testing-reconciler-safety-batch branch July 9, 2025 18:43