
Fix Vulkan interleave two vectors bug + Vector Legalization lowering pass #8629


Open
wants to merge 12 commits into base: main

Conversation

mcourteaux
Contributor

@mcourteaux mcourteaux commented May 20, 2025

🐛 Bugfixes:

  • Fixed a shuffle bug in the SPIR-V codegen that was introduced two years ago in 4d86539.
  • Deinterleave recursed incorrectly into Shuffle arguments.
  • CodeGen_Hexagon::shuffle_vectors() rewrote shuffles of shuffles incorrectly.

⭐ New feature: wrote a vector legalization lowering pass that runs near the end of lowering. For each for loop, the device API determines the maximal lane count for the expressions inside that loop. Shuffles get lifted into their own variables, such that splitting into groups of lanes is done without recalculation.

  • There are a few unsupported scenarios, which are reported as internal_errors:
    1. VectorReduce with output lanes > 1.
    2. Reinterpret with input/output having different number of bits per element.
  • This lowering pass was stress tested by running all tests in the test suite with an artificial vector lane limit of 4 (this artificial limit is disabled again in this PR).
  • The test error/metal_vector_too_large is converted into correctness/metal_long_vectors as this is now supported.
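The core splitting idea of the pass can be sketched as follows (a minimal standalone C++ analogy, not the actual Halide IR mutator; `add_legalized` and the float-vector representation are assumptions for illustration): a wide elementwise operation is applied per group of at most max_lanes lanes, and the per-group results are concatenated back together.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: legalize a wide elementwise add by slicing both
// operands into groups of at most max_lanes, operating on each slice, and
// concatenating the results. In the real pass the slices become Shuffle
// nodes over Halide IR vectors rather than std::vector ranges.
std::vector<float> add_legalized(const std::vector<float> &a,
                                 const std::vector<float> &b,
                                 int max_lanes) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (size_t base = 0; base < a.size(); base += max_lanes) {
        size_t n = std::min<size_t>(max_lanes, a.size() - base);
        for (size_t i = 0; i < n; i++) {  // one "legal" vector op per group
            out[base + i] = a[base + i] + b[base + i];
        }
    }
    return out;
}
```

With a lane limit of 4, a 5-lane add is performed as one 4-lane group plus one 1-lane remainder, mirroring how `.vectorize(x, 17)` would decompose under a device lane limit.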

🧹 Error-reporting cleanup: no trailing newline needed for HeapPrinter, and a helper macro vk_report_error to print error codes. The trailing newline appears pretty much everywhere in the codebase, with roughly 50% probability, for internal_assert and internal_error; this could use a broader cleanup.


Fixes #8628 (see for details): use OpVectorShuffle instead of OpCompositeInsert.
Fixes #8569 (drive-by)

@mcourteaux mcourteaux added code_cleanup No functional changes. Reformatting, reorganizing, or refactoring existing code. gpu labels May 20, 2025
@@ -37,7 +37,7 @@ extern "C" WEAK void *halide_opencl_get_symbol(void *user_context, const char *n
#ifdef WINDOWS
"opencl.dll",
#else
"libOpenCL.so",
"libOpenCL.so.1",
Contributor Author

@mcourteaux mcourteaux May 20, 2025


Drive-by change, not yet merged in any other PR. Fixes #8569.

@@ -279,6 +279,8 @@ const char *vk_get_error_name(VkResult error) {
return "VK_ERROR_FORMAT_NOT_SUPPORTED";
case VK_ERROR_FRAGMENTED_POOL:
return "VK_ERROR_FRAGMENTED_POOL";
case VK_ERROR_UNKNOWN:
return "VK_ERROR_UNKNOWN";
Contributor Author


More indicative than <Unknown Vulkan Error> in the default branch below.

@mcourteaux
Contributor Author

mcourteaux commented May 20, 2025

Regarding the build failure: some simplification rules in Simplify_Shuffle.cpp rewrite shuffles and thereby exceed the maximal vector length on the GPU backend (triggered by my randomized shuffle indices test). I hate this, as my PR was supposed to be simple, and now I'm uncovering more bugs than I signed up for 😝

I see two possible solutions:

  1. Limit the rules so they never exceed the maximal vector length per backend. I don't like this for several reasons:
    • These simplification rules are embedded in the general simplify() call, which can happen at any point and is not yet aware of the selected GPU APIs.
    • This restricts the simplification rules and might prevent finding better simplifications if the internal rewriting logic cannot exceed some arbitrary vector size.
  2. The GPU backends need to handle arbitrary vector size expressions, just like the LLVM backend does. You can .vectorize(x, 17) and it just works. This might not be easy to achieve, but I feel like this is the better solution, as it keeps the simplifier as is, and brings the scheduling possibilities in line with the LLVM-based codegen backends. This might be one shared lowering pass for GPU backends with a max-vector-size parameter that rewrites these expressions again. Not sure if that's easy to do...

Opinions @abadams @zvookin ?

@abadams
Member

abadams commented May 20, 2025

I think the gpu backends need to handle arbitrary vector sizes. The rest of the compiler is free to make vectors of any size. A shared legalization pass for non-llvm backends that maps from Halide IR to Halide IR with narrower vectors might work, but seems a little tricky to get right for things like VectorReduce nodes. The other approach would be just changing how all ops are printed to handle small bundles of values, but this seems even nastier.

Basically I agree with your option 2.

@mcourteaux
Contributor Author

mcourteaux commented May 22, 2025

I'd consider this to be well out of scope for this bugfix PR. I guess I'll skip that test for now and open an issue?

Looking at the IR that triggers the error:

let t397 = ramp(.thread_id_x + g$8.s0.x.v17.base.s, g$8.extent.0, 4)
let t398 = concat_vectors(f0$8[t397], f1$8[t397])
let t404 = extract_element(t398, 1)
let t405 = extract_element(t398, 0)

which came from this:

g$8(g$8.s0.x) = let t374 = slice_vectors(
    concat_vectors(f0$8(g$8.s0.x, 0), f0$8(g$8.s0.x, 1), f0$8(g$8.s0.x, 2), f0$8(g$8.s0.x, 3)), 
    concat_vectors(f1$8(g$8.s0.x, 4), f1$8(g$8.s0.x, 5), f1$8(g$8.s0.x, 6), f1$8(g$8.s0.x, 7)),
    1, -1, 2)
  in (let t375 = (t374*t374)
     in (extract_element(t375, 0) + extract_element(t375, 1))
  )

It seems the simplifier rules have done a good job simplifying it, but some simplifications are missing, or the rules have gotten stuck in a local minimum. The two ramped loads of size 4 get concatenated into a vector of size 8, only to take elements 0 and 1 out of it. Very inefficient, compared to just doing two loads (or one ramped load). Of course, this is due to the unnatural way the test constructs it with all these explicit shuffles, but perhaps having Halide simplify this further is achievable in this PR? @abadams Any ideas on improving the codegen for this?
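The missing rewrite amounts to forwarding an extract_element through a concat_vectors: a lane of the concatenation always lives in exactly one constituent vector, so the concat never needs to be materialized. A minimal sketch of the index bookkeeping (hypothetical helper, not actual simplifier code):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Given the lane counts of the concatenated vectors and a lane index into
// the concatenation, return (which vector, local lane index). This is the
// mapping a rule like extract_element(concat_vectors(v0, v1), i) ->
// extract_element(v_k, i') would compute.
std::pair<int, int> resolve_concat_lane(const std::vector<int> &lane_counts,
                                        int index) {
    for (int v = 0; v < (int)lane_counts.size(); v++) {
        if (index < lane_counts[v]) {
            return {v, index};  // lane falls inside vector v
        }
        index -= lane_counts[v];  // skip past vector v's lanes
    }
    assert(false && "lane index out of bounds");
    return {-1, -1};
}
```

For the IR above (two 4-lane loads concatenated into t398), lanes 0 and 1 both resolve into the first vector, so both extracts could bypass the concat entirely.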

@mcourteaux mcourteaux added skip_buildbots Do not run buildbots on this PR. Must add before opening PR as we scan labels immediately. and removed skip_buildbots Do not run buildbots on this PR. Must add before opening PR as we scan labels immediately. labels May 22, 2025
@mcourteaux
Contributor Author

Aaarrrghghhhh llvm/llvm-project@735209c

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

  • llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)

Clearly doesn't work for us here... 😢

@alexreinking alexreinking changed the title Fix Vulkan interleave two vectors bug. Fixes #8628. Fix Vulkan interleave two vectors bug May 22, 2025
@alexreinking
Member

n.b. - the "fixes #NN" magic belongs on a single line (one per line if multiple) in either the PR description or the final commit description (in the GitHub interface, which by default concatenates all the intermediate commit messages). It only clutters the PR title.

@mcourteaux
Contributor Author

Specifically asking for review from @abadams as I uncovered another bug, but this time in the Deinterleaver transformation.

This transformation is implemented as a GraphIRMutator, but is not supposed to recurse fully, because its goal is to produce lane-extracted Exprs from other Exprs, such as t340.odd_lanes. The bug in this instance was due to recursing too deeply:

let t99 = f0[ramp(0, 1, 8)]
let t100 = shuffle(t99, 0, 1, 2, 3)
let t101 = shuffle(t100, 0, 1)

When the deinterleaver extracts t101.even_lanes, it realizes that the resulting type has 1 lane (i.e., it is scalar). But then it recurses into t100 and incorrectly assumes t100 has the same type as t101.

So, in my opinion, this Deinterleaver should never recurse into shuffle arguments. But as I didn't write this transformation, nor do I fully know what its purpose is, I am hesitant to just delete the recursion there:

Halide/src/Deinterleave.cpp

Lines 389 to 403 in 85a3b07

// If this is extracting a single lane, try to recursively deinterleave rather
// than leaving behind a shuffle.
if (indices.size() == 1) {
    int index = indices.front();
    for (const auto &i : op->vectors) {
        if (index < i.type().lanes()) {
            ScopedValue<int> lane(starting_lane, index);
            return mutate(i);
        }
        index -= i.type().lanes();
    }
    internal_error << "extract_lane index out of bounds: " << Expr(op) << " " << index << "\n";
}
return Shuffle::make(op->vectors, indices);

I don't understand the purpose of trying to extract odd and even lanes from a shuffle argument when the shuffle is actually just an element extraction.
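For what it's worth, single-lane extraction through nested shuffles can be resolved purely by composing index maps, which is the invariant the buggy recursion violated: extracting lane k of shuffle(v, idx...) is the same as extracting lane idx[k] of v. A standalone sketch (illustrative, not the Deinterleave.cpp code):

```cpp
#include <cassert>
#include <vector>

// Resolve which lane of the innermost vector a lane of the outermost
// shuffle refers to. `shuffles` is ordered outermost-first; each level maps
// its output lane to a lane of the next-inner vector. Note each level has
// its own index vector and lane count -- reusing the outer level's type for
// an inner vector (as the bug did) would read the wrong lane.
int resolve_nested_lane(const std::vector<std::vector<int>> &shuffles, int lane) {
    for (const auto &indices : shuffles) {
        assert(lane >= 0 && lane < (int)indices.size());
        lane = indices[lane];
    }
    return lane;  // lane of the innermost vector
}
```

For the t99/t100/t101 example above, lane 0 of t101 composes to lane 0 of t99 through both shuffle levels, without ever treating t100 as having t101's type.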

@mcourteaux mcourteaux force-pushed the fix-vulkan-interleave branch from 3b6f14d to cf6312e Compare May 24, 2025 13:06
@mcourteaux mcourteaux changed the title Fix Vulkan interleave two vectors bug Fix Vulkan interleave two vectors bug + Vector Legalization lowering pass May 27, 2025
@mcourteaux mcourteaux added enhancement New user-visible features or improvements to existing features. release_notes For changes that may warrant a note in README for official releases. labels May 27, 2025
@mcourteaux
Contributor Author

@derek-gerstmann already reviewed the Vulkan interleave codegen. That part didn't change. This involved the changes in:

  • CodeGen_Vulkan_Dev.cpp
  • vulkan_internal.cpp
  • vulkan_resources.cpp

I worked together with @abadams on a Shuffle simplification bug in Simplify_Shuffle.cpp, so that part is already "reviewed" (pair programmed, you could say).

It'd be great if @abadams could look at the following for review:

  • LegalizeVectors.cpp
  • The small change in CSE.cpp to not lift Calls if they are not pure (with the exception for CallType::Halide, which can be lifted, according to a test case).
  • My bugfix in Deinterleave.cpp

return name + ".lanes_" + std::to_string(lane_start) + "_" + std::to_string(lane_start + lane_count - 1);
}

Expr simplify_shuffle(const Shuffle *op) {
Member


Why is this here rather than in the simplifier?

Contributor Author


Because I can't call the simplifier on just the shuffle and expect it to touch only the shuffle. I can only do simplify(...), which runs ALL of the simplifier logic. It's a bit of a pity/unintuitive that Simplify_Shuffle.cpp is not accessible as-is.

Also, I wasn't too sure I could add that to the general simplifier code either. I could try to merge the two procedures; perhaps other places would then benefit from these simplifier rules too.

Contributor Author


I'll try this tomorrow. It indeed is late.

// user_error << "Cannot legalize vectors when tracing is enabled.";
auto event = as_const_int(op->args[6]);
internal_assert(event);
if (*event == halide_trace_load || *event == halide_trace_store) {
Member


I'm not sure it's a good idea to preserve only trace loads and trace stores, because those are supposed to be nested in other tracing events. Or is the idea that those other events won't see this mutator, because they're scalar?

Contributor Author


The test suite didn't show these to be nested anywhere. They are surrounded by other trace events, such as begin and end of a Func. AFAIK, they weren't nested. To be transparent: I have never ever used the tracing features. I was just looking at IR before and after legalization, to make sure it all seemed reasonable.

Member


Sorry, by nested I meant they should execute after a begin_realization event (or whatever it's called), and before an end_realization, so it would be bad to drop those outer events.

Contributor Author


IIRC, they are never processed here. The begin_realization and end_realization trace calls are never involved in vectorized expressions, so this ExtractLanes mutator will never be run on those IR nodes. Perhaps I should turn this into an internal_assert() to validate my idea.

}
};

class ExtractLanes : public IRMutator {
Member


How is this different to the deinterleave function in Deinterleave.cpp? Should they be unified?

Contributor Author


Hmm, perhaps. I think I didn't understand what Deinterleave was doing, and I'm not too sure I do now. Deinterleaver and Interleaver do a weird dance together that I didn't understand either.

Contributor Author


I'm comparing Deinterleaver and ExtractLanes. Can you have a look at their Load visitors? The alignment gets dropped if the starting lane is not 0. I don't understand why we wouldn't simply update the alignment, like I did in the ExtractLanes version. Is this an oversight in the Deinterleaver, or am I not understanding the rationale behind dropping it?

Member


The alignment is the alignment of the first lane. I think your logic is only correct if the load is of a ramp with stride 1. If it's some gather of some complex expression, that's not the right way to update it. In the cases we can safely update it, the simplifier can reinfer it very easily, so I thought it best to just leave it to the next simplifier pass.
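The two cases can be sketched as follows (hypothetical helper, not Halide's ModulusRemainder API): alignment is tracked as index ≡ remainder (mod modulus) for the first lane. For a dense stride-1 ramp, slicing off the lanes before start_lane just shifts the first-lane remainder; for a gather of arbitrary indices no such update is valid, which is why the Deinterleaver drops it and lets the next simplifier pass reinfer it.

```cpp
#include <cassert>
#include <utility>

// Shift a (modulus, remainder) alignment fact to a new starting lane of a
// dense stride-1 ramp load. Valid ONLY under the stride-1 assumption: lane
// s of ramp(base, 1, n) loads index base + s, so the first-lane remainder
// moves by s modulo the known modulus.
std::pair<int, int> shift_alignment(int modulus, int remainder, int start_lane) {
    assert(modulus > 0);
    return {modulus, (remainder + start_lane) % modulus};
}
```

E.g., a load known to be 8-aligned (index ≡ 0 mod 8) sliced starting at lane 3 yields index ≡ 3 mod 8 for its first lane.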

// This should be possible too. TODO
}

internal_assert(op->type.lanes() == 1) << "Vector legalization currently does not support VectorReduce with lanes != 1: " << Expr(op);
Member


Is this an internal assert or a user assert? If it's something a user might see it should give instructions on how to avoid it (e.g. try vectorizing more narrowly to avoid exceeding the maximum vector width for this device api)

Contributor Author


I went with internal_assert because this legalization pass should conceptually make all vectorization sizes legal. If this particular case is not handled, that's not really the user's fault. My understanding is that internal_error is used when Halide is to blame, and user_error when the user is to blame. Either way, both are reported to the user. I can definitely add that this can be avoided by vectorizing more narrowly.

Contributor Author


Added the legalization_error_guide to the relevant internal_errors.

debug(3) << "After CSE:\n"
<< m1 << "\n";
}
Stmt m2 = LiftLetToLetStmt().mutate(m1);
Member


Unconditionally turning lets into letstmts for all backends scares me a little. I'm worried about cases where, e.g. codegen needs to do analysis on the index of a store, and it would have succeeded if the index had the let statements it depends on included in the Expr, but because we're already inside the letstmts, we don't have access to them. In theory analysis like this should all take a scope of enclosing stuff, but I don't know if that actually happens in practice. So in principle this change should be fine, but it substantially raises the testing bar we need. It might be safer to have an outer mutator that only runs this logic on the bodies of for loops with a device api that has a vector lane limit. It would make compilation faster for code not using such device apis anyway.

Contributor Author


Does this concern apply, given that this legalization pass is pretty much in the very back of the lowering process? I'd assume that all these kinds of analysis are already done by that point?

Member


There are still some things that happen in codegen. E.g. when lowering integer division to LLVM IR, Halide attempts to prove the denominator is non-zero, and if so emits simpler LLVM IR.

Contributor Author

@mcourteaux mcourteaux Jun 4, 2025


I have seen this type of warning that some expression could not be proved before (but not while working on the legalization, iirc. I saw that on the CSE repro I posted yesterday).

The lifting of let to letstmt was mostly needed to handle shuffles of vectors with funny indices. The constituent vectors would become defined in LetStmts. As such, the original shuffle expression, which is legalized by breaking it up into multiple shuffles, can then reuse those constituent vectors for every generated shuffle, after which they get simplified where possible.

I'm wondering how this is different from CSE? Wouldn't generic CSE also make it in the same way impossible to prove certain things?

Member


CSE does sometimes cause problems of this nature, yeah. In principle this transformation should be fine. But my high-level point here is that if we were sure we're not changing the IR at all for LLVM targets like say ARM or hexagon DSP, the PR can be merged without risk of e.g. silently regressing some obscure corner case inside Google due to IR being shuffled around in a way that makes some optimization no longer trigger. It would also save compile time in the common case to only apply this pass to device API loops that need it.

Contributor Author


I guess it's better, then, to also run the CSE pass only inside the affected loops, instead of starting with it as the second step in Stmt legalize_vectors(const Stmt &s)?

Member


I would run the entire current body of legalize_vectors only on the affected loops, by adding a new mutator outside of this. I.e. instead of your current:

Stmt legalize_vectors(const Stmt &s) { ... logic ...}

I'd do:

Stmt legalize_vectors(const Stmt &s) {
  class LegalizeAllDeviceLoops : public IRMutator {
    using IRMutator::visit;
    Stmt visit(const For *op) override {
      if (op->device_api has a lane limit) {
        ... logic ...
      } else {
        return IRMutator::visit(op);
      }
    }
  } mutator;
  return mutator.mutate(s);
}

Contributor Author


Aha, that should reduce some repetitive code checking whether we are in such a loop. Will do that tomorrow.

@abadams
Member

abadams commented Jun 4, 2025

Ok, I have done my pass over the files requested. My only major concern is that I'd prefer it if the pass did not make any changes to IR not inside a loop over a device api with a max lanes restriction.

@mcourteaux mcourteaux force-pushed the fix-vulkan-interleave branch from 059c815 to 0a10052 Compare June 5, 2025 00:05
@abadams abadams added the dev_meeting Topic to be discussed at the next dev meeting label Jun 5, 2025
@abadams
Member

abadams commented Jun 5, 2025

Summary from out-of-band meeting: we need to discuss at the dev meeting what to do about the simplifier now wanting to undo the Hexagon backend's transformation of an unaligned dense load into a slice of two aligned dense loads. The backend calls the simplifier after doing this. Maybe Hexagon only does this to increase opportunities for, e.g., CSE and loop-carry, which are passes that happen before it calls the simplifier. If that's the case, we can just change the test to actually present such an opportunity, by making it a sum of two overlapping unaligned dense loads instead of just one unaligned dense load, so that CSE has something to do. Alternatively, it could be that you just always want to do this on Hexagon, in which case it should probably (also?) happen in the Hexagon codegen load visitor. But we really need someone more familiar with Hexagon to chime in.
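The transformation under discussion can be sketched like this (illustrative C++, not the Hexagon backend's actual code; the 4-lane width and helper names are assumptions): an unaligned dense load is rebuilt as a valign-style select over the two aligned vectors that straddle it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kLanes = 4;  // illustrative vector width (HVX is much wider)

// One aligned dense vector load starting at an aligned base index.
std::array<uint8_t, kLanes> load_aligned(const std::vector<uint8_t> &buf, int base) {
    std::array<uint8_t, kLanes> v{};
    for (int i = 0; i < kLanes; i++) v[i] = buf[base + i];
    return v;
}

// valign-style concatenation: take lanes [shift, shift + kLanes) of lo:hi.
std::array<uint8_t, kLanes> valign(const std::array<uint8_t, kLanes> &lo,
                                   const std::array<uint8_t, kLanes> &hi,
                                   int shift) {
    std::array<uint8_t, kLanes> out{};
    for (int i = 0; i < kLanes; i++) {
        out[i] = (shift + i < kLanes) ? lo[shift + i] : hi[shift + i - kLanes];
    }
    return out;
}

// An unaligned dense load as two aligned loads plus a valign.
std::array<uint8_t, kLanes> unaligned_load(const std::vector<uint8_t> &buf, int off) {
    int base = (off / kLanes) * kLanes;  // round down to alignment
    return valign(load_aligned(buf, base), load_aligned(buf, base + kLanes),
                  off - base);
}
```

The open question is exactly whether this decomposition pays off on its own, or only when the aligned loads it creates are shared across expressions by CSE.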

@abadams
Member

abadams commented Jun 6, 2025

Summary from dev meeting. We need to find out if it's actually a good idea to turn an unaligned vector load into a valign of two aligned vector loads, in the absence of any other optimizations kicking in.

A) If it's not profitable, or if it doesn't matter one way or the other, we just need to change simd_op_check to present CSE opportunities so that valign gets generated and the tests stop failing.

B) If it does matter, we should just do the align_loads mutator again inside the hexagon codegen load visitor when it encounters a misaligned dense load.

I am asking a few people.

@abadams
Member

abadams commented Jun 9, 2025

Based on my queries, it seems very unlikely that it's profitable if no other optimizations kick in, so I think we should just change the tests to present opportunities for CSE.

@mcourteaux
Contributor Author

How do we proceed? Can somebody add a commit to this PR that fixes the Hexagon tests to accept the new generated code?

@abadams
Member

abadams commented Jun 9, 2025

Looks like the branch is in your fork, so I can't commit to it directly, but try this:

$ git diff
diff --git a/test/correctness/simd_op_check_hvx.cpp b/test/correctness/simd_op_check_hvx.cpp
index d07a74989..d7d779f47 100644
--- a/test/correctness/simd_op_check_hvx.cpp
+++ b/test/correctness/simd_op_check_hvx.cpp
@@ -54,16 +54,24 @@ public:
             isa_version = 62;
         }
 
+        auto valign_test_u8 = [&](int off) {
+            return in_u8(x + off) + in_u8(x + off + 1);
+        };
+
+        auto valign_test_u16 = [&](int off) {
+            return in_u16(x + off) + in_u16(x + off + 1);
+        };
+
         // Verify that unaligned loads use the right instructions, and don't try to use
         // immediates of more than 3 bits.
-        check("valign(v*,v*,#7)", hvx_width / 1, in_u8(x + 7));
-        check("vlalign(v*,v*,#7)", hvx_width / 1, in_u8(x + hvx_width - 7));
-        check("valign(v*,v*,r*)", hvx_width / 1, in_u8(x + 8));
-        check("valign(v*,v*,r*)", hvx_width / 1, in_u8(x + hvx_width - 8));
-        check("valign(v*,v*,#6)", hvx_width / 1, in_u16(x + 3));
-        check("vlalign(v*,v*,#6)", hvx_width / 1, in_u16(x + hvx_width - 3));
-        check("valign(v*,v*,r*)", hvx_width / 1, in_u16(x + 4));
-        check("valign(v*,v*,r*)", hvx_width / 1, in_u16(x + hvx_width - 4));
+        check("valign(v*,v*,#7)", hvx_width / 1, valign_test_u8(6));
+        check("vlalign(v*,v*,#7)", hvx_width / 1, valign_test_u8(hvx_width - 7));
+        check("valign(v*,v*,r*)", hvx_width / 1, valign_test_u8(8));
+        check("valign(v*,v*,r*)", hvx_width / 1, valign_test_u8(hvx_width - 8));
+        check("valign(v*,v*,#6)", hvx_width / 1, valign_test_u16(3));
+        check("vlalign(v*,v*,#6)", hvx_width / 1, valign_test_u16(hvx_width - 3));
+        check("valign(v*,v*,r*)", hvx_width / 1, valign_test_u16(4));
+        check("valign(v*,v*,r*)", hvx_width / 1, valign_test_u16(hvx_width - 4));
 
         check("vunpack(v*.ub)", hvx_width / 1, u16(u8_1));
         check("vunpack(v*.ub)", hvx_width / 1, i16(u8_1));

@mcourteaux mcourteaux force-pushed the fix-vulkan-interleave branch from 5e2033b to 3cf15f0 Compare June 14, 2025 07:35
@mcourteaux
Contributor Author

@abadams @vksnk The one failed buildbot concerns correctness_predicated_store_load on the Hexagon target. However, the test doesn't emit any output on stderr/stdout. I tried running the test locally on my machine, but it seems I need the Hexagon SDK/runtime and presumably also such a DSP chip, which I don't have. Any tips on how to debug what's going on with that failing test? SSH'ing into the machine is something I can't do.

@mcourteaux mcourteaux force-pushed the fix-vulkan-interleave branch 2 times, most recently from dc2d85d to e492a80 Compare June 28, 2025 20:15
@mcourteaux mcourteaux force-pushed the fix-vulkan-interleave branch from e492a80 to 05071c0 Compare August 1, 2025 21:30
Labels
code_cleanup No functional changes. Reformatting, reorganizing, or refactoring existing code. dev_meeting Topic to be discussed at the next dev meeting enhancement New user-visible features or improvements to existing features. gpu release_notes For changes that may warrant a note in README for official releases.

Successfully merging this pull request may close these issues.

  • Vulkan shuffle bug: invalid SPIRV.
  • Correct soname for libOpenCL.so?