Skip to content

Pass free variables as parameters when possible for tight loops #111

@melsman

Description

@melsman

It turns out that passing free variables that are not of function type as parameters, as opposed to fetching them from the closure may be beneficial for tight loops. One example appears in the benchmark mlkit-bench/benchmarks/mandelbrot.sml. Here the two variables c_re and c_im are free in the function loop3. Two optimisations will almost double the performance of the code:

  1. passing the variables c_re and c_im as parameters (they will then be unboxed and passed in floating point registers.)
  2. passing the constant 4.0 as a parameter to the function.

It is straightforward to verify the effect of the two optimisations; here is how the function loop3 can be hand-optimised by the programmer:

fun loop3 (count, z_re : real, z_im : real, c_re, c_im, four) = 
  if (count < maxCount)
  then let val z_re_sq = z_re * z_re
           val z_im_sq = z_im * z_im
       in if (z_re_sq + z_im_sq) > four
          then count
          else let val z_re_im = z_re * z_im
               in loop3 (count+0w1,
                         z_re_sq - z_im_sq + c_re,
                         z_re_im + z_re_im + c_im,
                         c_re,
                         c_im,
                         four)
               end
       end
  else count

Notice that this optimisation is not in conflict with the function specialisation optimisation that specializes recursive functions that, invariantly, take other functions as parameters.

Here is the resulting assembler code - labels are slightly shortened:

FLab.loop3:
	leaq -8(%rsp),%rsp
	movq $0x400,%r11
	cmpq %r11,%rax
	jae .LLab.k33
	movsd %xmm0,%xmm6
	mulsd %xmm0,%xmm6
	movsd %xmm1,%xmm10
	mulsd %xmm1,%xmm10
	movsd %xmm6,%xmm8
	addsd %xmm10,%xmm8
	ucomisd %xmm4,%xmm8
	ja .LLab.k33
	mulsd %xmm1,%xmm0
	addq $0x1,%rax
	subsd %xmm10,%xmm6
	addsd %xmm2,%xmm6
	addsd %xmm0,%xmm0
	addsd %xmm3,%xmm0
	movsd %xmm0,%xmm1
	movsd %xmm6,%xmm0
	leaq 8(%rsp),%rsp
	jmp FLab.loop3
	.p2align 0x4
.LLab.k33:
	movq %rax,%rdi
	leaq 8(%rsp),%rsp
	popq %r11
	jmp *%r11

Compared to the original version of mandelbrot, we have also changed the parameter count to be of type word, which turns out not to have a huge effect (only one jo __overflow instruction is saved, after the addq instruction).

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions