
[Mullapudi2016 GPU] Replace or emplace outer dimensions for GPU schedules #8685


Open · wants to merge 12 commits into main from mullapudi-outer-dims

Conversation

@antonysigma (Contributor) commented Jul 10, 2025

In the nested parallelism scheduling algorithm, whenever a dimension is marked for GPU acceleration, e.g. `y -> y_i, y_o`, replace the corresponding variable `y` with `y_o` in `outer_dims`.

This ensures the internal assertion `dims.size() >= outer_dims.size()` is always true for GPU schedules.

The immediate effect: for a downstream stage with a GPU schedule `g.gpu_tile(x, xo, xi, ...)`, the upstream stage now correctly specifies the dimension `xo` via `f.compute_at(g, xo)`.

Previously, the GPU schedules wrongly specified `f.compute_at(g, xi)`, forcing every GPU thread of stage `g` to recompute values of `f` and defeating Mullapudi2016's cache-optimization algorithm. The CPU schedule is not affected by this change.

The fix is now in accordance with the original design intent of the Mullapudi2016 paper.

As a result, the GPU IR correctly synthesizes GPU shared memory to cache the intermediate results of stage `f`.
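
For illustration, a minimal sketch of the schedule structure before and after the fix, using a hypothetical two-stage pipeline (the stages `f` and `g`, and the tile sizes, are ours, not from this PR):

```cpp
#include "Halide.h"
using namespace Halide;

Pipeline build() {
    Var x("x"), y("y"), xo("xo"), xi("xi"), yo("yo"), yi("yi");

    Func f("f"), g("g");
    f(x, y) = x + y;
    g(x, y) = f(x, y) + f(x + 1, y);

    // Downstream stage tiled for the GPU: xo/yo become GPU blocks,
    // xi/yi become GPU threads.
    g.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // Before the fix the autoscheduler effectively emitted
    // f.compute_at(g, xi), recomputing f for every GPU thread.
    // With y_o substituted into outer_dims it emits:
    f.compute_at(g, xo)   // computed once per GPU block -> shared memory
        .gpu_threads(x);  // vectorizable inner dimension as gpu_threads
    return Pipeline(g);
}
```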


Also, for all stages that are `computed_at`, mark all vectorizable inner dimensions as `gpu_threads`.


In the correctness tests at `test/autoscheduler/mullapudi/*.cpp` and the performance regression tests at `apps/*`, lower the estimated GPU shared-memory limit by specifying `autoscheduler.last_level_cache_size <= 10000`. Except for the pipeline `conv_layer`, all pipelines should observe improved caching.
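
For reference, a sketch of how a test might pass this parameter through the C++ API (the plugin library name and the pipeline `p` are placeholders; the generator command line accepts the equivalent `autoscheduler.last_level_cache_size=10000`):

```cpp
#include "Halide.h"
using namespace Halide;

void autoschedule_for_gpu(Pipeline p) {
    // The autoscheduler ships as a plugin; the library name is
    // build-specific.
    load_plugin("autoschedule_mullapudi2016");

    // Lower the estimated shared-memory budget to 10000 bytes.
    AutoschedulerParams params("Mullapudi2016",
                               {{"last_level_cache_size", "10000"}});
    Target target = get_host_target().with_feature(Target::CUDA);
    p.apply_autoscheduler(target, params);
}
```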

@alexreinking (Member)
Not sure what's wrong with LLVM 20 right now. All these bots are failing. I'm also not sure what's going on with mac-arm-worker-1... when I log in to the machine, I see it has 500GB of storage left.

@antonysigma (Contributor, Author)
Thanks @alexreinking for troubleshooting the Buildbots.

In the meantime, the error messages from the other buildbots uncovered a bug in the PR. I just uploaded a new commit.

-Antony

CPU's threading model is distinct from GPU's thread group model: GPU
shared memory is not shared beyond one GPU thread group.

Whenever nested parallelism is enabled in the Mullapudi2016
auto-scheduler, always implement parallelizable loop dimensions as
`gpu_block`. This is done by splitting the dimension by a factor
of 1: `f.split(z, zo, zi, 1)`.

This makes the autoscheduler's `last_level_cache` estimates per GPU warp
more robust against variations in the nested parallelism.

In the folder `*/apps/`, remove all manual overrides of
`last_level_cache_size`. Use the default estimate: 47 kB per thread
group.
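
A minimal sketch of that split-by-one pattern (variable names ours):

```cpp
#include "Halide.h"
using namespace Halide;

// Force a parallelizable dimension z of f to run as a GPU block loop.
void as_gpu_block(Func f, Var z) {
    Var zo("zo"), zi("zi");
    // Split by a factor of 1: zo keeps the full extent, zi is degenerate.
    f.split(z, zo, zi, 1)
        .gpu_blocks(zo);
}
```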
@antonysigma force-pushed the mullapudi-outer-dims branch from d32e5dd to a1aa557 on July 11, 2025 14:44
@antonysigma force-pushed the mullapudi-outer-dims branch from ee44c3b to 34696f9 on July 11, 2025 19:41
@antonysigma (Contributor, Author)
The buildbot error logs indicate that the experimental GPU scheduling is not ready for a few pipelines (e.g. `camera_pipe`). I will modify the CMake files to skip those tests for non-CUDA GPU devices.

@antonysigma force-pushed the mullapudi-outer-dims branch from b81de9f to 997f082 on July 17, 2025 22:49
@antonysigma force-pushed the mullapudi-outer-dims branch from 0ec4904 to 04f563b on July 18, 2025 17:34
@alexreinking (Member)
Looks like CI is passing! Is this ready for review?

@alexreinking self-requested a review on July 22, 2025 03:40
@antonysigma marked this pull request as ready for review on July 22, 2025 16:44
@antonysigma (Contributor, Author)
> Looks like CI is passing! Is this ready for review?

@alexreinking For sure! Honestly, I am not proud of my edits in the `./apps/` folder. A few of the algorithm pipeline apps generate inefficient implementations for non-CUDA targets. To pass the buildbot apps tests, I brute-forced my way through with hand-tuned autoscheduler parameters.

If you can point out a few apps/targets that should be skipped altogether, I am more than happy to oblige via the `*_metadata()->target` trick.
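
For context, a sketch of that trick (the filter name `pipeline` is a placeholder): the generated metadata exposes the compiled target string, which a test harness can inspect at run time.

```cpp
#include <cstring>

#include "HalideRuntime.h"

// Generated alongside the AOT filter; "pipeline" is a placeholder name.
extern "C" const struct halide_filter_metadata_t *pipeline_metadata();

// Skip the GPU autoscheduler test unless the filter was built for CUDA.
static bool skip_gpu_autoschedule_test() {
    return std::strstr(pipeline_metadata()->target, "cuda") == nullptr;
}
```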

@abadams (Member) commented Jul 22, 2025

I don't think we can make these changes to local laplacian. It's supposed to be an example, and hard-coding the input size and making it bail out for non-CUDA targets ruins that. I'd also like to keep any changes to process.cpp as small as possible: the idea of this file is that someone should be able to read it to see how to call into Halide-generated code, so we want to aggressively minimize any lines of code that relate to our testing infra rather than to using Halide.

Maybe the build can define NO_AUTO_SCHEDULE for opencl, metal, and vulkan?
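
A sketch of how a process.cpp call site could honor such a define (the generated function name and wrapper are ours, purely illustrative):

```cpp
#include "HalideBuffer.h"

// Placeholder name for the auto-scheduled AOT pipeline.
extern "C" int pipeline_auto_schedule(halide_buffer_t *in,
                                      halide_buffer_t *out);

int run_auto_scheduled(Halide::Runtime::Buffer<float> &in,
                       Halide::Runtime::Buffer<float> &out) {
#ifdef NO_AUTO_SCHEDULE
    (void)in;
    (void)out;
    return 0;  // build opted out, e.g. for opencl/metal/vulkan
#else
    return pipeline_auto_schedule(in.raw_buffer(), out.raw_buffer());
#endif
}
```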

@mcourteaux (Contributor)
> hard-coding the input size and making it bail out for non-cuda targets

@antonysigma Isn't this exactly the use case for `set_estimates`? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine. If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should either be disabled for the tests where it doesn't produce good/valid results, or fixed so that it does work for those tests.
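
For context, the estimates API being referred to (sizes and names illustrative):

```cpp
#include "Halide.h"
using namespace Halide;

// Tell the autoscheduler what sizes to optimize for, without
// hard-coding them into the algorithm or the test harness.
void set_bounds(ImageParam &input, Func output, int width, int height) {
    input.set_estimates({{0, width}, {0, height}, {0, 3}});

    Var x = output.args()[0], y = output.args()[1], c = output.args()[2];
    output.set_estimate(x, 0, width)
        .set_estimate(y, 0, height)
        .set_estimate(c, 0, 3);
}
```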

@antonysigma PS: try not to force-push, as I mentioned months ago: it makes reviewing harder. GitHub tracks the commit up to which someone has reviewed; rewriting git history forces everyone to look at the entire PR again and makes incremental reviews impossible.

@alexreinking (Member)
> Isn't this exactly the use-case for set_estimates? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine. If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should be disabled for the tests where it doesn't produce good/valid results, or fixed such that it does work for those tests.

I would also prefer to see the estimates used in tandem with (real) hardware parameters to produce a valid schedule. I haven't reviewed the PR yet, but that's what I'd hope to see.

> try to not force-push, like I mentioned months ago

@mcourteaux -- This isn't a policy we have here... when I review, I just look at the final diff, not at the intermediate commits, so force-pushing doesn't bother me. Part of the reason we squash every PR is so we don't have to care about the cleanliness of the git history on branches. If you'd like, you can bring this up at the next dev meeting.

@antonysigma (Contributor, Author) left a comment

@alexreinking @abadams @mcourteaux I reverted all changes to the folder `./apps/local_laplacian/` and edited the CMakeLists.txt to disable the experimental GPU schedule feature.

I would appreciate it if you could also point out other apps that should skip the GPU autoscheduler tests.

--

> I don't think we can do the changes to local laplacian... keep any changes to process.cpp as small as possible.

Done.

> Isn't this exactly the use-case for set_estimates? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine.

> I would also prefer to see the estimates used in tandem with (real) hardware parameters to produce a valid schedule.

I suppose it is a moot point by now, because I just reverted all changes to the `./apps/local_laplacian` folder.

> If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should be disabled for the tests.

I concur.
