[Mullapudi2016 GPU] Replace or emplace outer dimensions for GPU schedules #8685
Conversation
In the nested parallelism scheduling algorithm, whenever a dimension is marked for GPU acceleration, e.g. `y -> y_i, y_o`, replace the corresponding variable `y` with `y_o` in `outer_dims`. This ensures the internal assertion `dims.size() >= outer_dims.size()` always holds for GPU schedules.

The immediate effect is that for a downstream stage with a GPU schedule `g.gpu_tile(x, xi, xo, ...)`, the upstream stage correctly specifies the dimension `xo` via `f.compute_at(g, xo)`. Previously, the GPU schedules wrongly specified `f.compute_at(g, xi)`, forcing the recompute of all values in stage `g` and defeating Mullapudi2016's cache optimization algorithm. The CPU schedule is not impacted by the changes. The fix is in accordance with the original design intent of the Mullapudi2016 paper. As a result, the GPU IR correctly synthesizes shared GPU memory to cache the intermediate results of stage `f`.

---

Also, for all stages that are `computed_at`, mark all vectorizable inner dimensions as `gpu_threads`.

---

In the correctness tests at `test/autoscheduler/mullapudi/*.cpp` and the performance regression tests at `apps/*`, adjust the estimated GPU shared memory limit downward by specifying `autoscheduler.last_level_cache_size <= 10000`. Except for the `conv_layer` pipeline, all pipelines should observe improved caching.
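For illustration, a minimal hand-written sketch of the schedule shape this change targets; the pipeline, variable names, and tile sizes below are assumptions for the example, not output of the autoscheduler:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    Func f("f"), g("g");
    f(x, y) = x + y;                      // upstream (producer) stage
    g(x, y) = f(x, y) + f(x + 1, y + 1);  // downstream (consumer) stage

    // Downstream stage: tile and map to GPU blocks (xo, yo) and threads (xi, yi).
    g.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // Upstream stage: compute at the *outer* (block) dimension xo, so the
    // required window of f is staged once per thread block (and can therefore
    // be placed in GPU shared memory), with its inner dimension on gpu_threads.
    f.compute_at(g, xo).gpu_threads(x);

    // Lower for a CUDA target just to inspect the generated loop nest;
    // no GPU is required for this step.
    g.compile_to_lowered_stmt("g.stmt", {}, Text,
                              get_host_target().with_feature(Target::CUDA));
    return 0;
}
```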
Not sure what's wrong with LLVM 20 right now. All these bots are failing. I'm also not sure what's going on with mac-arm-worker-1... when I log in to the machine, I see it has 500GB of storage left.
Thanks @alexreinking for troubleshooting the buildbots. In the meantime, the error messages from the other buildbots uncover a bug in the PR. I just uploaded a new commit. -Antony
The CPU's threading model is distinct from the GPU's thread group model: GPU shared memory is not shared beyond one GPU thread group. Whenever nested parallelism is enabled in the Mullapudi2016 auto-scheduler, always implement parallelizable loop dimensions as `gpu_block`. This can be done by splitting the dimension by a factor of 1: `f.split(z, zi, zo, 1)`. This makes the autoscheduler's `last_level_cache` estimate per GPU warp more robust against variations in the nested parallelism.

In the folder `*/apps/`, remove all manual overrides of `last_level_cache_size`. Use the default estimate: 47kB per thread group.
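A minimal sketch of the split-by-one pattern described above, written by hand for illustration (the Func, variable names, and sizes are placeholders, not the autoscheduler's actual output):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), z("z"), zo("zo"), zi("zi");

    Func f("f");
    f(x, z) = x * z;

    // Split the parallelizable dimension z by a factor of 1: zo covers the
    // full extent while zi always has extent 1. Mapping zo to gpu_blocks
    // keeps the block-level loop structure uniform no matter how nested
    // parallelism later tiles the stage, which in turn keeps the per-warp
    // last_level_cache estimate stable.
    f.split(z, zo, zi, 1)
        .gpu_blocks(zo)
        .gpu_threads(x);

    // Lower for a CUDA target to inspect the loop nest; no GPU needed here.
    f.compile_to_lowered_stmt("f.stmt", {}, Text,
                              get_host_target().with_feature(Target::CUDA));
    return 0;
}
```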
The buildbot error logs indicate the experimental GPU scheduling is not ready for a few pipelines (e.g. …).
Looks like CI is passing! Is this ready for review?
@alexreinking For sure! Honestly, I am not proud of my edits in the …

If you can point out a few apps/targets that should be skipped altogether, I am more than happy to oblige, via the …
I don't think we can do the changes to local laplacian. It's supposed to be an example, and hard-coding the input size and making it bail out for non-cuda targets ruins that. I'd also like to keep any changes to process.cpp as small as possible, because the idea of this file is that someone should be able to read it to see how to call into Halide-generated code, so we want to aggressively minimize any lines of code that are related to our testing infra rather than to using Halide.

Maybe the build can define NO_AUTO_SCHEDULE for opencl, metal, and vulkan?
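For reference, a tiny hand-written sketch of the kind of compile-time guard being suggested here; how the macro would actually be wired up per-target in the apps' build files is not shown and is left as an assumption:

```cpp
#include <cstdio>

int main() {
#ifndef NO_AUTO_SCHEDULE
    // Without the define, the auto-scheduled variant of the pipeline would be
    // run/benchmarked here (placeholder only, not the real process.cpp code).
    std::printf("running auto-scheduled variant\n");
#else
    // Building with -DNO_AUTO_SCHEDULE (e.g. for opencl/metal/vulkan) skips it.
    std::printf("auto-scheduled variant disabled for this target\n");
#endif
    return 0;
}
```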
@antonysigma Isn't this exactly the use-case for `set_estimates`? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine.

@antonysigma PS: try to not force-push, like I mentioned months ago: it makes reviewing harder. GitHub tracks until which commit someone reviewed. Rewriting git history forces everyone to look at the entire PR again and makes incremental reviews impossible.
I would also prefer to see the estimates used in tandem with (real) hardware parameters to produce a valid schedule. I haven't reviewed the PR yet, but that's what I'd hope to see.
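To make the estimates discussion above concrete, here is a hedged sketch of supplying estimates plus hardware-flavoured parameters to the Mullapudi2016 autoscheduler; the dimensions, parameter values, and plugin path are illustrative assumptions (and the host target is used here for simplicity rather than a GPU target):

```cpp
#include "Halide.h"
#include <cstdio>
using namespace Halide;

int main() {
    ImageParam input(Float(32), 2, "input");
    Var x("x"), y("y");

    Func blur("blur");
    blur(x, y) = (input(x, y) + input(x + 1, y) + input(x, y + 1)) / 3.0f;

    // Estimates stand in for typical input/output sizes (values are made up).
    input.set_estimates({{0, 1536}, {0, 2560}});
    blur.set_estimates({{0, 1536}, {0, 2560}});

    // Load the Mullapudi2016 plugin; the library name/path is an assumption
    // and should be adjusted to the local build.
    load_plugin("libautoschedule_mullapudi2016.so");

    // Hardware-like knobs passed to the autoscheduler; the values here are
    // illustrative only (the PR lowers last_level_cache_size for GPU runs).
    AutoschedulerParams params("Mullapudi2016",
                               {{"parallelism", "128"},
                                {"last_level_cache_size", "10000"}});

    Pipeline p(blur);
    AutoSchedulerResults results = p.apply_autoscheduler(get_host_target(), params);
    std::printf("%s\n", results.schedule_source.c_str());
    return 0;
}
```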
@mcourteaux -- This isn't a policy we have here... when I review, I just look at the final diff, not at the intermediate commits, so force-pushing doesn't bother me. Part of the reason we squash every PR is so we don't have to care about the cleanliness of the git history on branches. If you'd like, you can bring this up at the next dev meeting.
@alexreinking @abadams @mcourteaux I reverted all changes to the folder `./apps/local_laplacian/`, and edited the `CMakeFiles.txt` to disable the experimental GPU schedule feature. I would appreciate it if you could also point out other apps to skip in the GPU autoscheduler tests.

> I don't think we can do the changes to local laplacian... keep any changes to process.cpp as small as possible.

Done.

> Isn't this exactly the use-case for set_estimates? Is the Mullapudi autoscheduler not using the estimates? I think that, for the auto-scheduler branch, setting estimates should be totally fine.

> I would also prefer to see the estimates used in tandem with (real) hardware parameters to produce a valid schedule.

I suppose it is a moot point by now, because I just reverted all changes to the `./apps/local_laplacian` folder.

> If that doesn't fix the launch issues on metal and opencl, then the autoscheduler is simply broken, in which case it should be disabled for the tests.

I concur.