-
Notifications
You must be signed in to change notification settings - Fork 32
Description
It turns out that passing free variables that are not of function type as parameters, as opposed to fetching them from the closure may be beneficial for tight loops. One example appears in the benchmark mlkit-bench/benchmarks/mandelbrot.sml. Here the two variables c_re
and c_im
are free in the function loop3
. Two optimisations will almost double the performance of the code:
- passing the variables
c_re
andc_im
as parameters (they will then be unboxed and passed in floating point registers.) - passing the constant
4.0
as a parameter to the function.
It is straightforward to verify the effect of the two optimisations; here is how the function loop3
can be hand-optimised by the programmer:
fun loop3 (count, z_re : real, z_im : real, c_re, c_im, four) =
if (count < maxCount)
then let val z_re_sq = z_re * z_re
val z_im_sq = z_im * z_im
in if (z_re_sq + z_im_sq) > four
then count
else let val z_re_im = z_re * z_im
in loop3 (count+0w1,
z_re_sq - z_im_sq + c_re,
z_re_im + z_re_im + c_im,
c_re,
c_im,
four)
end
end
else count
Notice that this optimisation is not in conflict with the function specialisation optimisation that specializes recursive functions that, invariantly, take other functions as parameters.
Here is the resulting assembler code - labels are slightly shortened:
FLab.loop3:
leaq -8(%rsp),%rsp
movq $0x400,%r11
cmpq %r11,%rax
jae .LLab.k33
movsd %xmm0,%xmm6
mulsd %xmm0,%xmm6
movsd %xmm1,%xmm10
mulsd %xmm1,%xmm10
movsd %xmm6,%xmm8
addsd %xmm10,%xmm8
ucomisd %xmm4,%xmm8
ja .LLab.k33
mulsd %xmm1,%xmm0
addq $0x1,%rax
subsd %xmm10,%xmm6
addsd %xmm2,%xmm6
addsd %xmm0,%xmm0
addsd %xmm3,%xmm0
movsd %xmm0,%xmm1
movsd %xmm6,%xmm0
leaq 8(%rsp),%rsp
jmp FLab.loop3
.p2align 0x4
.LLab.k33:
movq %rax,%rdi
leaq 8(%rsp),%rsp
popq %r11
jmp *%r11
Compared to the original version of mandelbrot, we have also changed the parameter count
to be of type word
, which turns out not to have a huge effect (only one jo __overflow
instruction is saved, after the addq
instruction).