Add AVX/AVX2 support

Add support for packed vector instructions for floating point and integer operations.

- [ ] Design and implement a generic signature that supports explicit operations (e.g., `mul`, `add`) on packed values, for instance 64-bit floating-point values held in 256-bit packed vector registers.

- [ ] Design and implement structures that match the above signature (e.g., for packed 64-bit floats and for packed 64-bit integers). Use the MLKit `prim` feature for intrinsics.

- [ ] Implement support for the intrinsics in the `Compiler/Lambda/LambdaExp` MLKit intermediate language, to be targeted by the operations in the structures. Carry the support all the way down to the `Compiler/Backend/X64/CodeGenX64` / `Compiler/Backend/X64/CodeGenUtilX64` modules (e.g., by extending the operations in `Compiler/Backend/PrimName.sml`).

- [ ] Implement operations for loading from and storing to memory. We can use the `BlockF64` values for representing and allocating memory.
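To make the first two items concrete, here is a minimal sketch of what such a signature and a matching structure could look like. All names (`PACKED4`, `F64x4Ref`, `pack`, `unpack`, `broadcast`) are hypothetical and not part of MLKit. The structure is a portable reference model built on a tuple of reals; a real implementation would instead bind each operation to an intrinsic through `prim`. Loading and storing are left out.

```sml
(* Hypothetical signature; all names are illustrative, not MLKit's API. *)
signature PACKED4 = sig
  type elem                      (* lane type, e.g. real *)
  type t                         (* four lanes, conceptually one 256-bit value *)
  val pack      : elem * elem * elem * elem -> t
  val unpack    : t -> elem * elem * elem * elem
  val broadcast : elem -> t      (* copy one element into all four lanes *)
  val add : t * t -> t           (* lane-wise addition *)
  val sub : t * t -> t           (* lane-wise subtraction *)
  val mul : t * t -> t           (* lane-wise multiplication *)
end

(* Portable reference model using a tuple of reals. An AVX-backed
   structure would replace these bodies with `prim` intrinsics. *)
structure F64x4Ref :> PACKED4 where type elem = real = struct
  type elem = real
  type t = real * real * real * real
  fun pack v = v
  fun unpack v = v
  fun broadcast x = (x, x, x, x)
  (* Apply a binary scalar operation to each pair of lanes. *)
  fun lift2 f ((a1, a2, a3, a4), (b1, b2, b3, b4)) : t =
    (f (a1, b1), f (a2, b2), f (a3, b3), f (a4, b4))
  val add = lift2 (op + : real * real -> real)
  val sub = lift2 (op - : real * real -> real)
  val mul = lift2 (op * : real * real -> real)
end

(* Usage: scale four lanes by 2.0. *)
val v = F64x4Ref.mul (F64x4Ref.broadcast 2.0, F64x4Ref.pack (1.0, 2.0, 3.0, 4.0))
val (a, b, c, d) = F64x4Ref.unpack v   (* lanes: 2.0, 4.0, 6.0, 8.0 *)
```

Keeping `t` abstract through opaque ascription means client code cannot depend on the representation, so a tuple-based model and an intrinsic-based structure could be swapped freely.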

## Discussion

An important aspect is that the implementation must include boxing operations that implicitly box vector values into memory. The optimiser can then eliminate box-unbox and unbox-box compositions. Boxing is needed because, in general, it is impossible to ensure that a value is not passed to a generic function, stored in a data structure, or captured in a closure, and the compiler assumes that every value can be represented in one 64-bit word (perhaps tagged with the LSB set to 1 if the GC should not traverse the value).
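The elimination step can be illustrated on a toy expression language. This is only a sketch: the `exp` datatype and the `simplify` function below are hypothetical and much simpler than the actual `LambdaExp` representation and `OptLambda` passes. It also assumes that boxed vector values are immutable, so that reusing an existing box is safe.

```sml
(* Toy expression language; not the actual LambdaExp datatype. *)
datatype exp =
    Var of string
  | Box of exp                 (* store a 256-bit value in memory, yield a pointer *)
  | Unbox of exp               (* load a 256-bit value from a boxed pointer *)
  | Prim of string * exp list  (* e.g. Prim ("mul", [e1, e2]) on unboxed values *)

(* Bottom-up simplifier that cancels adjacent box/unbox pairs. *)
fun simplify e =
  case e of
      Var _ => e
    | Prim (p, es) => Prim (p, map simplify es)
    | Unbox e1 => (case simplify e1 of Box e2 => e2 | e1' => Unbox e1')
    | Box e1 => (case simplify e1 of Unbox e2 => e2 | e1' => Box e1')

(* (x * y) + z, where the intermediate product was boxed and unboxed again. *)
val before =
  Box (Prim ("add",
    [Unbox (Box (Prim ("mul", [Unbox (Var "x"), Unbox (Var "y")]))),
     Unbox (Var "z")]))

(* The inner Unbox (Box ...) pair is removed; the product stays unboxed. *)
val after = simplify before
```

After simplification, only the unboxing of the inputs `x`, `y`, `z` and the boxing of the final result remain, which is the shape one would want within a basic block.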

I foresee some issues with implementing register allocation for the `ymm` registers. We must also make sure that the optimiser (i.e., module `Compiler/Lambda/OptLambda`) does not pass wide 256-bit values to generic functions. Nor can such values be passed as function arguments or stored in closures; they are solely for operations within basic blocks. Ideally, these restrictions would be enforced in `Compiler/Lambda/LambdaStatSem`.

An interesting application would be to use these operations to implement some of the operations in the `Real64Array` / `Real64Vector` structures more efficiently.

## References

1.  [Book](https://books.google.dk/books?id=wPt9DwAAQBAJ&pg=PA470&lpg=PA470&dq=VMULPD&source=bl&ots=pSpqjgNLtv&sig=ACfU3U0AjfbP46qW0WzWd8zXvA64FZwXYg&hl=en&sa=X&ved=2ahUKEwikyJnFpYjpAhWLw6YKHdtuDkUQ6AEwBHoECAoQAQ#v=onepage&q=VMULPD&f=false)

2. [Optimizing Subroutines in Assembly Language](https://www.agner.org/optimize/optimizing_assembly.pdf)

3. [x86 and amd64 instruction reference](https://www.felixcloutier.com/x86/)

4. [Formally optimal boxing](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.461.2127&rep=rep1&type=pdf)

5. [Notes on x86-64 Programming](https://www.lri.fr/~filliatr/ens/compil/x86-64.pdf)

6. [Twitter-post on the AVX landscape](https://twitter.com/InstLatX64/status/969560033922035713)
