Description
Here's an example table lookup problem program with assembly from godbolt:
https://godbolt.org/z/9WP19sfq8
Compare the assembly between the "simdless" versions and the result from the template:
```cpp
template<typename T>
T categorize(T& out, T& lo, T& hi, const T& in) {
    T l = in & 0xf;
    //l = xsimd::swizzle(lo,l);
    T h = in >> 4;
    //h = xsimd::swizzle(hi,h);
    return out = lookup(lo,l) & lookup(hi,h);
}
```
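For reference, a minimal driver for the batch version looks something like the sketch below. The buffer loop is mine, not from the godbolt link, and it assumes `lookup` behaves like the commented-out `xsimd::swizzle` calls, i.e. a per-lane table lookup:

```cpp
#include <cstddef>
#include <cstdint>
#include <xsimd/xsimd.hpp>

using batch_t = xsimd::batch<uint8_t, xsimd::sse2>;

// Run categorize() over a whole buffer, one 16-byte batch at a time.
void categorize_buffer(batch_t& lo, batch_t& hi,
                       const uint8_t* in, uint8_t* out, size_t n) {
    batch_t result;
    for (size_t i = 0; i + batch_t::size <= n; i += batch_t::size) {
        batch_t chunk = batch_t::load_unaligned(in + i);
        categorize(result, lo, hi, chunk);
        result.store_unaligned(out + i);
    }
}
```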
The "simdless" version not only produces less instructions to process the same amount of data, it really is just faster! Compiling with MSVC makes the difference worse.
However modeling the roughly equivalent simd-swizzle and using it as a callback:
```cpp
#include <array>
#include <cstddef>
#include <cstdint>

std::array<uint8_t,16> swizzle_like(const std::array<uint8_t,16>& table, const std::array<uint8_t,16>& idxs) {
    std::array<uint8_t,16> out;
    for (size_t i = 0; i < idxs.size(); i++) {
        out[i] = table[idxs[i]&0xf];
    }
    return out;
}
```
produces a similar effect.
There's likely an optimization to be had in not using SIMD except where instructions like `pshufb` are available.
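One way to express that selection is a compile-time check on the architecture tag. This is just a sketch under the assumption that xsimd's x86 arch tags form an inheritance chain (so anything SSSE3-or-later derives from `xsimd::ssse3`); the helper name is mine, not part of xsimd:

```cpp
#include <type_traits>
#include <xsimd/xsimd.hpp>

// Hypothetical helper: treat an arch as "has pshufb" when its tag derives
// from xsimd::ssse3 (sse4_1, sse4_2, avx, ... all sit above it).
template <class Arch>
inline constexpr bool has_pshufb_v = std::is_base_of_v<xsimd::ssse3, Arch>;

// A caller could then pick the kernel at compile time, e.g.:
//   if constexpr (has_pshufb_v<xsimd::default_arch>)
//       /* batch categorize() path */;
//   else
//       /* scalar swizzle_like() loop */;
```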
For context, here are some benchmarks using MSVC:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 325.10 | 3,075,962.91 | 3.0% | 0.01 | `byte lookup (sizzzle_like_but_256_wide)`
| 2,456.13 | 407,144.73 | 13.7% | 0.01 | :wavy_dash: `simd lookup (categorize)` (Unstable with ~288.1 iters. Increase `minEpochIterations` to e.g. b41)
NOTE: The results I'm seeing from godbolt appear to indicate that this is the case when using `sse2`:

```asm
call xsimd::batch<unsigned char, xsimd::sse2> categorize<xsimd::batch<unsigned char, xsimd::sse2>>(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&)
```
When `sse4.1` is enabled the performance is better (I could not find a way to enable `ssse3` with MSVC):
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1,242.30 | 804,958.97 | 0.6% | 0.01 | `simd lookup (categorize)`
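As a sanity check on which instruction set the default batch actually maps to under a given set of compiler flags, printing the arch name works (this assumes `xsimd::default_arch::name()` is available, which it is in the xsimd versions I've used):

```cpp
#include <cstdio>
#include <xsimd/xsimd.hpp>

int main() {
    // Prints something like "sse2", "ssse3" or "sse4.1" depending on which
    // architecture xsimd selected from the compiler flags.
    std::printf("default arch: %s\n", xsimd::default_arch::name());
}
```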
I also wanted to benchmark against an emulated or generic batch for comparison but I'm not sure how to express that type.
Edit: using clang-cl I managed to enable what I think is SSSE3. For reference, clang's performance is:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 38.09 | 26,255,319.15 | 0.3% | 0.01 | `byte lookup (ssse3)`
| 40.52 | 24,681,447.96 | 1.5% | 0.01 | `byte lookup (sse2)`
| 13.57 | 73,690,033.38 | 0.0% | 0.01 | `simd lookup (ssse3)`
| 204.76 | 4,883,829.00 | 7.0% | 0.01 | :wavy_dash: `simd lookup (sse2)` (Unstable with ~5,035.6 iters. Increase `minEpochIterations` to e.g. c4b4)
Take note that there's a massive performance difference between `sse2` and `ssse3` when using clang, such that the SIMD version actually starts outperforming the byte lookup! (Which is what we'd hope.)
Activity
serge-sans-paille commented on Feb 11, 2025
For the 'selecting the right kernel when performance allows it', does the approach used in https://godbolt.org/z/vz6GGz8nx help?
Andersama commented on Feb 11, 2025
@serge-sans-paille well I'm not entirely sure? I don't imagine your goal was to show there'd be a runtime exception when the performance would be poor? If I disable the `ssse3` compiler flag, it appears we would end up calling into a function that just throws. I think what you're trying to show me is that I should be able to specialize what happens when `sse2` or `generic` is used. Let me see what I can do from here.
Andersama commented on Feb 11, 2025
Well I appear to have managed to construct a similar output in assembly by doing the following:
https://godbolt.org/z/9ceEWfvc6
Andersama commented on Feb 11, 2025
Found a variation that improves on the code generated for `sse2`. The odd part is that when the indexes are stored into the array like the above, the assembly is identical to the pessimized `generic` case (at least for clang). May be worth using as a `uint8_t` swizzle for `sse2`.
serge-sans-paille commented on Jun 13, 2025
FYI we now have a generic swizzle for constant mask (#1124)
Andersama commented on Jun 15, 2025
Sweet, now I'll have to remember the context for what I was testing before and see how it shakes out.