
[Mullapudi2016 GPU] Replace or emplace outer dimensions for GPU schedules #8685


Open · wants to merge 12 commits into main from mullapudi-outer-dims

Conversation

@antonysigma (Contributor) commented Jul 10, 2025

In the nested parallelism scheduling algorithm, whenever a dimension is marked for GPU acceleration, e.g. `y -> y_i, y_o`, replace the corresponding variable `y` with `y_o` in `outer_dims`.

This ensures the internal assertion `dims.size() >= outer_dims.size()` is always true for GPU schedules.

The immediate effect: for a downstream stage with a GPU schedule `g.gpu_tile(x, xo, xi, ...)`, the upstream stage now correctly specifies the dimension `xo` via `f.compute_at(g, xo)`.

Previously, the GPU schedules wrongly specified `f.compute_at(g, xi)`, forcing every GPU thread of stage `g` to recompute values of `f` and defeating Mullapudi2016's cache-optimization algorithm. The CPU schedule is not affected by this change.

The fix is now in accordance with the original design intent of the Mullapudi2016 paper.

As a result, the GPU IR correctly synthesizes GPU shared memory to cache the intermediate results of stage `f`.
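
For illustration, a minimal sketch of the schedule structure before and after the fix, using a hypothetical two-stage pipeline (the stages `f` and `g`, and the tile sizes, are ours, not from this PR):

```cpp
#include "Halide.h"
using namespace Halide;

Pipeline build() {
    Var x("x"), y("y"), xo("xo"), xi("xi"), yo("yo"), yi("yi");

    Func f("f"), g("g");
    f(x, y) = x + y;
    g(x, y) = f(x, y) + f(x + 1, y);

    // Downstream stage tiled for the GPU: xo/yo become GPU blocks,
    // xi/yi become GPU threads.
    g.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // Before the fix the autoscheduler effectively emitted
    // f.compute_at(g, xi), recomputing f for every GPU thread.
    // With y_o substituted into outer_dims it emits:
    f.compute_at(g, xo)   // computed once per GPU block -> shared memory
        .gpu_threads(x);  // vectorizable inner dimension as gpu_threads
    return Pipeline(g);
}
```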


Also, for all stages that are `computed_at`, mark all vectorizable inner dimensions as `gpu_threads`.


In the correctness tests at `test/autoscheduler/mullapudi/*.cpp` and the performance regression tests at `apps/*`, lower the estimated GPU shared-memory limit by specifying `autoscheduler.last_level_cache_size <= 10000`. Except for the pipeline `conv_layer`, all pipelines should observe improved caching.
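
For reference, a sketch of how a test might pass this parameter through the C++ API (the plugin library name and the pipeline `p` are placeholders; the generator command line accepts the equivalent `autoscheduler.last_level_cache_size=10000`):

```cpp
#include "Halide.h"
using namespace Halide;

void autoschedule_for_gpu(Pipeline p) {
    // The autoscheduler ships as a plugin; the library name is
    // build-specific.
    load_plugin("autoschedule_mullapudi2016");

    // Lower the estimated shared-memory budget to 10000 bytes.
    AutoschedulerParams params("Mullapudi2016",
                               {{"last_level_cache_size", "10000"}});
    Target target = get_host_target().with_feature(Target::CUDA);
    p.apply_autoscheduler(target, params);
}
```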

@alexreinking (Member)
Not sure what's wrong with LLVM 20 right now. All these bots are failing. I'm also not sure what's going on with mac-arm-worker-1... when I log in to the machine, I see it has 500GB of storage left.

@antonysigma (Contributor, Author)
Thanks @alexreinking for troubleshooting the Buildbots.

In the meantime, the error messages from the other buildbots uncovered a bug in the PR. I just uploaded a new commit.

-Antony

CPU's threading model is distinct from GPU's thread group model: GPU
shared memory is not shared beyond one GPU thread group.

Whenever nested parallelism is enabled in the Mullapudi2016
auto-scheduler, always implement parallelizable loop dimensions as
`gpu_block`. This is done by splitting the dimension by a factor
of 1: `f.split(z, zo, zi, 1)`.

This makes the autoscheduler's `last_level_cache` estimates per GPU warp
more robust against variations in the nested parallelism.

In the folder `*/apps/`, remove all manual overrides of
`last_level_cache_size`. Use the default estimate: 47 kB per thread
group.
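
A minimal sketch of that split-by-one pattern (variable names ours):

```cpp
#include "Halide.h"
using namespace Halide;

// Force a parallelizable dimension z of f to run as a GPU block loop.
void as_gpu_block(Func f, Var z) {
    Var zo("zo"), zi("zi");
    // Split by a factor of 1: zo keeps the full extent, zi is degenerate.
    f.split(z, zo, zi, 1)
        .gpu_blocks(zo);
}
```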
@antonysigma force-pushed the mullapudi-outer-dims branch from d32e5dd to a1aa557 on July 11, 2025 14:44
@antonysigma force-pushed the mullapudi-outer-dims branch from ee44c3b to 34696f9 on July 11, 2025 19:41
@antonysigma (Contributor, Author)
The buildbot error logs indicate that the experimental GPU scheduling is not ready for a few pipelines (e.g. `camera_pipe`). I will modify the CMake files to skip those tests for non-CUDA GPU devices.

@antonysigma force-pushed the mullapudi-outer-dims branch from b81de9f to 997f082 on July 17, 2025 22:49
@antonysigma force-pushed the mullapudi-outer-dims branch from 0ec4904 to 04f563b on July 18, 2025 17:34
@alexreinking (Member)
Looks like CI is passing! Is this ready for review?

@alexreinking self-requested a review on July 22, 2025 03:40
@antonysigma marked this pull request as ready for review on July 22, 2025 16:44
@antonysigma (Contributor, Author)
> Looks like CI is passing! Is this ready for review?

@alexreinking For sure! Honestly, I am not proud of my edits in the `./apps/` folder. A few of the algorithm pipeline apps generate inefficient implementations for non-CUDA targets. To pass the buildbot apps tests, I brute-forced my way through with hand-tuned autoscheduler parameters.

If you can point out a few apps/targets that should be skipped altogether, I am more than happy to oblige via the `*_metadata()->target` trick.
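
For context, a sketch of that trick (the filter name `pipeline` is a placeholder): the generated metadata exposes the compiled target string, which a test harness can inspect at run time.

```cpp
#include <cstring>

#include "HalideRuntime.h"

// Generated alongside the AOT filter; "pipeline" is a placeholder name.
extern "C" const struct halide_filter_metadata_t *pipeline_metadata();

// Skip the GPU autoscheduler test unless the filter was built for CUDA.
static bool skip_gpu_autoschedule_test() {
    return std::strstr(pipeline_metadata()->target, "cuda") == nullptr;
}
```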

@abadams (Member) commented Jul 22, 2025

I don't think we can make these changes to local laplacian. It's supposed to be an example, and hard-coding the input size and making it bail out for non-CUDA targets ruins that. I'd also like to keep any changes to process.cpp as small as possible: the idea of this file is that someone should be able to read it to see how to call into Halide-generated code, so we want to aggressively minimize any lines of code that relate to our testing infra rather than to using Halide.

Maybe the build can define NO_AUTO_SCHEDULE for opencl, metal, and vulkan?
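
A sketch of how a process.cpp call site could honor such a define (the generated function name and wrapper are ours, purely illustrative):

```cpp
#include "HalideBuffer.h"

// Placeholder name for the auto-scheduled AOT pipeline.
extern "C" int pipeline_auto_schedule(halide_buffer_t *in,
                                      halide_buffer_t *out);

int run_auto_scheduled(Halide::Runtime::Buffer<float> &in,
                       Halide::Runtime::Buffer<float> &out) {
#ifdef NO_AUTO_SCHEDULE
    (void)in;
    (void)out;
    return 0;  // build opted out, e.g. for opencl/metal/vulkan
#else
    return pipeline_auto_schedule(in.raw_buffer(), out.raw_buffer());
#endif
}
```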

@mcourteaux (Contributor)
> hard-coding the input size and making it bail out for non-cuda targets

@antonysigma Isn't this exactly the use case for `set_estimates`? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine. If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should either be disabled for the tests where it doesn't produce good/valid results, or fixed so that it does work for those tests.
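
For context, the estimates API being referred to (sizes and names illustrative):

```cpp
#include "Halide.h"
using namespace Halide;

// Tell the autoscheduler what sizes to optimize for, without
// hard-coding them into the algorithm or the test harness.
void set_bounds(ImageParam &input, Func output, int width, int height) {
    input.set_estimates({{0, width}, {0, height}, {0, 3}});

    Var x = output.args()[0], y = output.args()[1], c = output.args()[2];
    output.set_estimate(x, 0, width)
        .set_estimate(y, 0, height)
        .set_estimate(c, 0, 3);
}
```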

@antonysigma PS: try not to force-push, as I mentioned months ago: it makes reviewing harder. GitHub tracks the commit up to which someone has reviewed; rewriting git history forces everyone to look at the entire PR again and makes incremental reviews impossible.

@alexreinking (Member)
> Isn't this exactly the use-case for set_estimates? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine. If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should be disabled for the tests where it doesn't produce good/valid results, or fixed such that it does work for those tests.

I would also prefer to see the estimates used in tandem with (real) hardware parameters to produce a valid schedule. I haven't reviewed the PR yet, but that's what I'd hope to see.

> try to not force-push, like I mentioned months ago

@mcourteaux -- This isn't a policy we have here... when I review, I just look at the final diff, not at the intermediate commits, so force-pushing doesn't bother me. Part of the reason we squash every PR is so we don't have to care about the cleanliness of the git history on branches. If you'd like, you can bring this up at the next dev meeting.

@antonysigma (Contributor, Author) left a comment

@alexreinking @abadams @mcourteaux I reverted all changes to the folder `./apps/local_laplacian/` and edited the CMakeLists.txt to disable the experimental GPU schedule feature.

I would appreciate it if you could also point out other apps that should skip the GPU autoscheduler tests.

--

> I don't think we can do the changes to local laplacian... keep any changes to process.cpp as small as possible.

Done.

> Isn't this exactly the use-case for set_estimates? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine.

> I would also prefer to see the estimates used in tandem with (real) hardware parameters to produce a valid schedule.

I suppose it is a moot point by now, because I just reverted all changes to the `./apps/local_laplacian` folder.

> If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should be disabled for the tests.

I concur.
