# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline
The primary public entry point is `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

### Template parameters

Typically only the first three template parameters need to be specified; the
rest are automatically deduced from the function parameters:

* `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
  this must be `std::uint8_t`.
* `OutputScalar`: The scalar type of the result matrix. At the moment, this
  must be `std::uint8_t`.
* `BitDepthParams`: Defines the bit format of the input and output matrices
  and the required accuracy of the computation. At the moment, the only
  non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
  [less-than-8-bit.md](less-than-8-bit.md) for other values, the general idea
  behind this parameter, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

* `LhsOrder`, `RhsOrder`, `ResultOrder`: The storage orders (row-major or
  column-major) of the LHS, RHS, and result matrices. See
  [public/map.h](../public/map.h). As discussed in the performance note below,
  we recommend `RowMajor`, `ColMajor`, and `ColMajor` respectively for optimal
  performance.
* `OutputPipelineType`: The actual `std::tuple` type of the output pipeline.
  See the explanation of the `output_pipeline` parameter below, and
  [output.md](output.md).
* `GemmContextType`: The type of the `context` parameter. At the moment, this
  must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

* `context`: The `gemmlowp::GemmContext` object holding state and resources to
  be used for this gemmlowp call.
* `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
  `MatrixMap` objects, mapping external buffers as matrices, not owning data.
  See [public/map.h](../public/map.h).
* `result`: Pointer to the destination `MatrixMap` object, which must be
  already constructed, wrapping the external destination buffer with the
  wanted destination matrix shape and storage layout. No memory allocation
  will be performed by gemmlowp for the destination buffer. See
  [public/map.h](../public/map.h).
* `lhs_offset`, `rhs_offset`: Constants added to each entry of the LHS and RHS
  matrices respectively, as explained in
  [low-precision.md](low-precision.md). This is the only part of the
  quantization paradigm explained in [quantization.md](quantization.md) that
  needs to be implemented as operations on the operands; everything else is
  operations on the result, see `output_pipeline`.
* `output_pipeline`: A `std::tuple` of output stages (see
  [public/output_stages.h](../public/output_stages.h)), specifying the output
  pipeline (see [output.md](output.md)). This is the part of the quantization
  paradigm explained in [quantization.md](quantization.md) that needs to be
  implemented as operations on the result matrix.
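
Conceptually, before the output pipeline runs, this computes an `int32`
accumulator matrix whose entries are sums of products of offset-adjusted
operand entries. The following plain-C++ sketch (a hypothetical helper, not
gemmlowp's actual implementation, which uses packed blocks and SIMD kernels)
shows that accumulation, using the recommended RowMajor/ColMajor/ColMajor
storage orders:

```cpp
#include <cstdint>
#include <vector>

// Reference accumulation performed before the output pipeline (illustrative
// sketch only). Storage orders follow the recommended combination:
// lhs is row-major, rhs and result are column-major.
std::vector<std::int32_t> ReferenceGemmCore(
    const std::vector<std::uint8_t>& lhs,  // rows x depth, row-major
    const std::vector<std::uint8_t>& rhs,  // depth x cols, column-major
    int rows, int depth, int cols, int lhs_offset, int rhs_offset) {
  std::vector<std::int32_t> result(rows * cols);  // column-major
  for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
      std::int32_t accum = 0;
      for (int k = 0; k < depth; k++) {
        // Each operand entry has its offset added before multiplication.
        accum += (lhs[i * depth + k] + lhs_offset) *
                 (rhs[j * depth + k] + rhs_offset);
      }
      result[j * rows + i] = accum;
    }
  }
  return result;
}
```

For example, with lhs = [[1,2],[3,4]], rhs = [[5,6],[7,8]], `lhs_offset` = 1
and `rhs_offset` = 2, the (0,0) accumulator is (1+1)·(5+2) + (2+1)·(7+2) = 41.
The output pipeline then turns these `int32` accumulators into the final
`OutputScalar` values.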

### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally well optimized.

Because gemmlowp is primarily aimed at neural network inference workloads, the
optimization focus is on this particular combination of storage orders:

* `LhsOrder=RowMajor`
* `RhsOrder=ColMajor`
* `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.
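
The depth-contiguity argument can be checked with simple index arithmetic. The
helpers below are hypothetical, not part of gemmlowp's API; they mirror the
row-major and column-major flat-buffer layouts and show that stepping along the
depth dimension moves by stride 1, i.e. through contiguous memory, in a
RowMajor LHS and in a ColMajor RHS:

```cpp
// Flat-buffer index of entry (row, col) under each storage order.
// Illustrative helpers only, not part of gemmlowp's API.
int RowMajorIndex(int row, int col, int cols) { return row * cols + col; }
int ColMajorIndex(int row, int col, int rows) { return col * rows + row; }

// Stride in the flat buffer when the depth index k advances by 1.
// For the LHS, k is the column index; for the RHS, k is the row index.
int LhsDepthStride(int depth) {
  return RowMajorIndex(0, 1, depth) - RowMajorIndex(0, 0, depth);
}
int RhsDepthStride(int depth) {
  return ColMajorIndex(1, 0, depth) - ColMajorIndex(0, 0, depth);
}
```

Both strides are 1 for any depth, so the innermost accumulation loop reads both
operands sequentially from memory.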

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS and RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", whence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.
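
In reference terms, per-channel offsets change the accumulation so that each
LHS row i gets its own `lhs_offset[i]` and each RHS column j its own
`rhs_offset[j]`. A plain-C++ sketch under the same assumptions as the scalar
case (hypothetical helper, row-major LHS, column-major RHS and result; not
gemmlowp's actual implementation):

```cpp
#include <cstdint>
#include <vector>

// Per-channel variant of the reference accumulation: the offsets are
// vectors, broadcast along LHS rows and RHS columns respectively.
// Illustrative sketch only, not gemmlowp's implementation.
std::vector<std::int32_t> ReferenceGemmCorePC(
    const std::vector<std::uint8_t>& lhs,  // rows x depth, row-major
    const std::vector<std::uint8_t>& rhs,  // depth x cols, column-major
    int rows, int depth, int cols,
    const std::vector<std::int32_t>& lhs_offset,   // one entry per LHS row
    const std::vector<std::int32_t>& rhs_offset) {  // one entry per RHS column
  std::vector<std::int32_t> result(rows * cols);  // column-major
  for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
      std::int32_t accum = 0;
      for (int k = 0; k < depth; k++) {
        // The offset now depends on the row (LHS) or column (RHS) index.
        accum += (lhs[i * depth + k] + lhs_offset[i]) *
                 (rhs[j * depth + k] + rhs_offset[j]);
      }
      result[j * rows + i] = accum;
    }
  }
  return result;
}
```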

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.