
Commit 21db823

Add doc/public.md and make more documentation improvements
1 parent: b39283c

7 files changed: +200 −34 lines

README.md

Lines changed: 7 additions & 5 deletions
@@ -72,11 +72,13 @@ gemmlowp's main public interface is in the `public/` subdirectory.

This is a headers-only library, so there is nothing to link to.

-Usage documentation may be found in [doc/public.md](doc/public.md) .
+Usage documentation, and comments on the deprecation status of each public entry
+point, may be found in [doc/public.md](doc/public.md).

-A full, self-contained usage example, showing how to quantize float matrices
-and perform a quantized matrix multiplication approximating a float matrix
-multiplication, is given in `doc/quantization_example.cc`.
+A full, self-contained usage example, showing how to quantize float matrices and
+perform a quantized matrix multiplication approximating a float matrix
+multiplication, is given in
+[doc/quantization_example.cc](doc/quantization_example.cc).

### Old EightBitIntGemm legacy deprecated interface

@@ -212,7 +214,7 @@ arm-linux-androideabi-g++ that does include NEON.
The main benchmark is

```
-benchmark.cc
+test/benchmark.cc
```

It doesn't need to be linked to any other source file. We recommend building

doc/design.md

Lines changed: 21 additions & 9 deletions
@@ -135,19 +135,31 @@ for (int r = 0; r < rows; r += block_params.l2_rows) {

The files in `internal/` fall into a few categories:

-There are two top-level GEMM implementations, * single_thread_gemm.h *
-multi_thread_gemm.h
+There are two top-level GEMM implementations,
+
+* [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
+* [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)

They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
-[packing.md](packing.md)) implemented in the following files: * pack.h *
-compute.h * unpack.h * unpack.h in turn calls into output.h for the output
-pipeline (see [output.md](output.md))
+[packing.md](packing.md)) implemented in the following files:
+
+* [internal/pack.h](../internal/pack.h)
+* [internal/compute.h](../internal/compute.h)
+* [internal/unpack.h](../internal/unpack.h)
+    * This in turn calls into [internal/output.h](../internal/output.h) for
+      the output pipeline (see [output.md](output.md))

The pack.h and unpack.h files contain generic templated code that can be
-overridden by optimized code in template specializations; see the NEON optimized
-code here: * pack_neon.h * unpack_neon.h
+overridden by optimized code in template specializations; for example, see the
+NEON optimized code here:
+
+* [internal/pack_neon.h](../internal/pack_neon.h)
+* [internal/unpack_neon.h](../internal/unpack_neon.h)
+    * This in turn calls into
+      [internal/output_neon.h](../internal/output_neon.h)

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
-just as struct offering a Run() implementation; see the NEON kernels in: *
-kernel_neon.h
+just a struct offering a Run() implementation; see the NEON kernels in:
+
+* [internal/kernel_neon.h](../internal/kernel_neon.h)
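
To make the "struct offering a Run() implementation" idea concrete, here is a minimal scalar sketch. The struct name, the 4x4 block shape, and the packed-operand layouts are illustrative assumptions only; the real kernels in internal/kernel_neon.h declare a compile-time kernel format and implement Run() with NEON intrinsics.

```cpp
#include <cstdint>

// Minimal scalar sketch of the "kernel is a struct offering a Run()
// implementation" pattern described above. Names and layouts are
// hypothetical, not gemmlowp's actual kernel interface.
struct ReferenceKernel {
  static const int kRows = 4;  // accumulator block height (assumed)
  static const int kCols = 4;  // accumulator block width (assumed)

  // Accumulates a (kRows x depth) packed LHS block against a
  // (depth x kCols) packed RHS block into column-major int32 accumulators.
  void Run(std::int32_t* acc, const std::uint8_t* lhs_packed,
           const std::uint8_t* rhs_packed, int depth) const {
    for (int d = 0; d < depth; ++d) {
      for (int c = 0; c < kCols; ++c) {
        for (int r = 0; r < kRows; ++r) {
          acc[r + c * kRows] +=
              static_cast<std::int32_t>(lhs_packed[r + d * kRows]) *
              static_cast<std::int32_t>(rhs_packed[c + d * kCols]);
        }
      }
    }
  }
};
```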

doc/kernel.md

Lines changed: 9 additions & 7 deletions
@@ -144,18 +144,18 @@ lhs and rhs matrices for optimally efficient traversal by the kernel. This
depends on fine details of the kernel format, in ways that can only be
efficiently handled by knowing these kernel format details at compile-time.

-This is the reason why all the code in `internal/pack.h` is templated in the
-corresponding kernel format.
+This is the reason why all the code in [internal/pack.h](../internal/pack.h) is
+templated in the corresponding kernel format.

The code in internal/pack.h isn't tightly optimized by itself, but it is
structured in such a way that the critical code is in a template,
`PackingRegisterBlock`, that can easily be specialized to override the slow
generic code with fast specific packing code for specific formats, on specific
platforms.

-See `internal/pack_neon.h` which provides NEON specializations of the packing
-code for the particular kernel formats that are used by the NEON kernels in
-`internal/kernel_neon.h`.
+See [internal/pack_neon.h](../internal/pack_neon.h), which provides NEON
+specializations of the packing code for the particular kernel formats that are
+used by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h).

## Wrapping up: how to optimize gemmlowp for a CPU architecture

@@ -166,5 +166,7 @@ dictate its required data layout; each data layout then also needs optimized
packing code. The steps are thus:

1. Freely design a GEMM kernel with a freely chosen data layout.
-2. Implement the GEMM kernel, similar to `internal/kernel_neon.h`.
-3. Implement the optimized packing code, similar to `internal/pack_neon.h`.
+2. Implement the GEMM kernel, similar to
+   [internal/kernel_neon.h](../internal/kernel_neon.h).
+3. Implement the optimized packing code, similar to
+   [internal/pack_neon.h](../internal/pack_neon.h).
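
To make the specialization mechanism concrete, here is a schematic sketch of the pattern. The format types and the empty function bodies are placeholders, not gemmlowp's actual declarations from internal/pack.h or internal/pack_neon.h.

```cpp
// Schematic sketch of the "generic template, overridden by a platform
// specialization" pattern described above. All names are illustrative.

struct GenericFormat {};  // stand-in for an arbitrary kernel side-format
struct NeonFormat {};     // stand-in for a format with a NEON fast path

// Slow but correct generic packing code, valid for any format.
template <typename KernelSideFormat>
struct PackingRegisterBlock {
  static void Pack() { /* generic scalar packing loops */ }
};

// Full specialization overriding the generic code for one specific format,
// the way internal/pack_neon.h does with NEON intrinsics.
template <>
struct PackingRegisterBlock<NeonFormat> {
  static void Pack() { /* hand-written NEON packing */ }
};
```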

doc/low-precision.md

Lines changed: 6 additions & 5 deletions
@@ -25,8 +25,8 @@ mechanism by which gemmlowp becomes generic enough to support multiple 8bit
computation paradigms, by allowing the user to set up a chain of transformations
to be performed on internal 32bit accumulators to obtain the final outputs.

-The public entry point in `public/gemmlowp.h` allowing to set un an arbitrary
-output pipeline is `GemmWithOutputPipeline`.
+The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) allowing one
+to set up an arbitrary output pipeline is `GemmWithOutputPipeline`.

Refer to [quantization.md](quantization.md) for details of how one gets from
first principles to the actual output pipelines to assemble for successful
@@ -51,7 +51,7 @@ int32 accumulators, to obtain the final outputs.

This older paradigm is the one exposed by the following entry points:

-* In `public/gemmlowp.h`, the `Gemm` entry point.
+* In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point.
* The deprecated `eight_bit_int_gemm` directory.

Originally, gemmlowp started an implementation of the (now deprecated)
@@ -171,7 +171,8 @@ In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
to prepare them for efficient repeated traversal by the kernel), we compute the
sum of each row of the lhs block and the sum of each column of the rhs block.

-See in `internal/pack.h`, in the PackedSideBlock class, the following member:
+See in [internal/pack.h](../internal/pack.h), in the PackedSideBlock class, the
+following member:

```
// Handle on the additional buffer backing the vector of sums of slices
@@ -186,4 +187,4 @@ After these rank one updates have been computed at the packing stage, they are
ignored at the compute kernel stage, since that stage is only concerned with the
first of the four terms in (2); they are only used at the unpacking stage. See
the default/reference implementation, `UnpackResultImpl`, in
-`internal/unpack.h`.
+[internal/unpack.h](../internal/unpack.h).
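
To make those two reductions concrete, here is a minimal scalar sketch. The buffer layouts (row-major lhs block, column-major rhs block) are assumptions chosen to keep the sketch self-contained; gemmlowp accumulates these sums inside its packing loops and stores them alongside the packed blocks.

```cpp
#include <cstdint>
#include <vector>

// Sum of each row of a (rows x depth) row-major lhs block.
std::vector<std::int32_t> RowSums(const std::uint8_t* lhs_block, int rows,
                                  int depth) {
  std::vector<std::int32_t> sums(rows, 0);
  for (int r = 0; r < rows; ++r) {
    for (int d = 0; d < depth; ++d) {
      sums[r] += lhs_block[r * depth + d];
    }
  }
  return sums;
}

// Sum of each column of a (depth x cols) column-major rhs block.
std::vector<std::int32_t> ColumnSums(const std::uint8_t* rhs_block, int depth,
                                     int cols) {
  std::vector<std::int32_t> sums(cols, 0);
  for (int c = 0; c < cols; ++c) {
    for (int d = 0; d < depth; ++d) {
      sums[c] += rhs_block[d + c * depth];
    }
  }
  return sums;
}
```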

doc/output.md

Lines changed: 4 additions & 3 deletions
@@ -24,12 +24,13 @@ output pipeline.
## Usage

The gemmlowp entry point allowing use of an arbitrary output pipeline is
-`GemmWithOutputPipeline` in `public/gemmlowp.h`.
+`GemmWithOutputPipeline` in [public/gemmlowp.h](../public/gemmlowp.h).

The output pipeline is specified as a `std::tuple` of "output stages", each of
which defines an elementary arithmetic transformation.

-All available output stages are defined in `public/output_stages.h`.
+All available output stages are defined in
+[public/output_stages.h](../public/output_stages.h).

## Example usage

@@ -49,4 +50,4 @@ TestOutputStages
Separately, a self-contained example showing how to use gemmlowp to compute a
quantized matrix multiplication with a sound quantization paradigm is here:

-`doc/quantization_example.cc`
+[doc/quantization_example.cc](quantization_example.cc)
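
Output stages are plain structs assembled into a `std::tuple`. As an illustration, here is a sketch modeled on the pipeline used in doc/quantization_example.cc; the stage names match public/output_stages.h, but treat the field names as assumptions to double-check against that header.

```cpp
#include <cstdint>
#include <tuple>
#include "public/output_stages.h"  // adjust the include path to your checkout

// Sketch: a two-stage pipeline that scales int32 accumulators down by a
// fixed-point multiplier and then saturating-casts to uint8. Field names
// are assumed from public/output_stages.h.
std::tuple<gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint,
           gemmlowp::OutputStageSaturatingCastToUint8>
MakeOutputPipeline(std::int32_t quantized_multiplier, int right_shift,
                   std::int32_t result_offset_after_shift) {
  gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint quantize_down;
  quantize_down.result_fixedpoint_multiplier = quantized_multiplier;
  quantize_down.result_shift = right_shift;
  quantize_down.result_offset_after_shift = result_offset_after_shift;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast;
  // Stages apply left-to-right in tuple order.
  return std::make_tuple(quantize_down, saturating_cast);
}
```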

doc/public.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```
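
For orientation, here is a hedged sketch of the setup that surrounds such a call: caller-owned buffers wrapped in `MatrixMap` objects and a `GemmContext`, per the function-parameter descriptions below. The helper name, the stride-free layouts, and taking the pipeline as a parameter are illustrative choices, not part of the API.

```cpp
#include <cstdint>
#include "public/gemmlowp.h"  // adjust the include path to your checkout

// Hedged sketch: buffers are wrapped in MatrixMap objects (no copies, no
// ownership transfer) and a GemmContext supplies threads and scratch
// storage. The output pipeline is built as described in output.md.
template <typename OutputPipelineType>
void RunQuantizedGemm(const std::uint8_t* lhs_data,
                      const std::uint8_t* rhs_data, std::uint8_t* result_data,
                      int rows, int depth, int cols, int lhs_offset,
                      int rhs_offset,
                      const OutputPipelineType& output_pipeline) {
  // Storage orders follow the performance note below:
  // LhsOrder=RowMajor, RhsOrder=ColMajor, ResultOrder=ColMajor.
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(
      lhs_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(
      rhs_data, depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor> result(
      result_data, rows, cols);
  gemmlowp::GemmContext context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset, output_pipeline);
}
```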

### Template parameters

Typically, only the first 3 template parameters need to be specified, the rest
being automatically deduced from function parameters:

* `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
  this must be `std::uint8_t`.
* `OutputScalar`: The scalar type of the result matrix. At the moment, this
  must be `std::uint8_t`.
* `BitDepthParams`: Defines the bit format of the input and output matrices
  and the required accuracy of the computation. At the moment, the only
  non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
  [less-than-8-bit.md](less-than-8-bit.md) for other values and the general
  idea of this, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

* `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
  column-major) of the LHS, RHS, result matrices. See
  [public/map.h](../public/map.h). See the below performance note: we
  recommend using respectively RowMajor, ColMajor, ColMajor for optimal
  performance.
* `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
  See below explanation of the `output_pipeline` parameter, and
  [output.md](output.md).
* `GemmContextType`: the type of the `context` parameter. At the moment, this
  must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

* `context`: The `gemmlowp::GemmContext` object holding state and resources to
  be used for this gemmlowp call.
* `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
  `MatrixMap` objects, mapping external buffers as matrices, not owning data.
  See [public/map.h](../public/map.h).
* `result`: pointer to the destination `MatrixMap` object, which must be
  already constructed, wrapping the external destination buffer with the
  wanted destination matrix shape and storage layout. No memory allocation
  will be performed by gemmlowp for the destination buffer. See
  [public/map.h](../public/map.h).
* `lhs_offset`, `rhs_offset` are constants added to each matrix entry in the
  LHS, RHS matrices respectively, as explained in
  [low-precision.md](low-precision.md). This is the only part of the
  quantization paradigm explained in [quantization.md](quantization.md) that
  needs to be implemented as operations on the operands; everything else is
  operations on the result, see `output_pipeline`.
* `output_pipeline` is a `std::tuple` of output stages (see
  [public/output_stages.h](../public/output_stages.h)), specifying the output
  pipeline (see [output.md](output.md)). This is the part of the quantization
  paradigm explained in [quantization.md](quantization.md) that needs to be
  implemented as operations on the result matrix.

### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all are equally optimized for.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

* `LhsOrder=RowMajor`
* `RhsOrder=ColMajor`
* `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS, RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", whence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.

doc/quantization.md

Lines changed: 8 additions & 5 deletions
@@ -1,7 +1,7 @@
# Building a quantization paradigm from first principles

**TLDR:** If you prefer example code over theory, look at
-`doc/quantization_example.cc`.
+[doc/quantization_example.cc](quantization_example.cc).

## Overview

@@ -304,7 +304,8 @@ paradigm, i.e. implementing the precise computation detailed in the previous
section (equation (5)), is
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.

-Please refer to the comment explaining it in `public/output_stages.h`.
+Please refer to the comment explaining it in
+[public/output_stages.h](../public/output_stages.h).

## How this differs from the older legacy gemmlowp quantization paradigm

@@ -315,8 +316,9 @@ implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output
stage implementing the new paradigm,
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.

-Please refer to the comments in `public/output_stages.h` for details about these
-two output stages and how they differ.
+Please refer to the comments in
+[public/output_stages.h](../public/output_stages.h) for details about these two
+output stages and how they differ.

Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:

@@ -341,4 +343,5 @@ Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:
## Example code illustrating the new quantization paradigm

Example code showing how to perform a quantized matrix multiplication in the
-quantization paradigm discussed here is in `doc/quantization_example.cc`.
+quantization paradigm discussed here is in
+[doc/quantization_example.cc](quantization_example.cc).
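
For completeness, here is the kind of helper that turns the real-valued multiplier (typically lhs_scale * rhs_scale / result_scale) into the integer multiplier-and-shift pair consumed by `OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`. This closely follows doc/quantization_example.cc; treat the exact edge-case handling as an assumption to verify against that file.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

void QuantizeMultiplierSmallerThanOne(float real_multiplier,
                                      std::int32_t* quantized_multiplier,
                                      int* right_shift) {
  assert(real_multiplier > 0.f && real_multiplier < 1.f);
  int s = 0;
  // Bring real_multiplier into [0.5, 1) by counting the doublings needed.
  while (real_multiplier < 0.5f) {
    real_multiplier *= 2.0f;
    s++;
  }
  // Express it as a Q31 fixed-point integer.
  std::int64_t q =
      static_cast<std::int64_t>(std::round(real_multiplier * (1ll << 31)));
  assert(q <= (1ll << 31));
  // Rounding can push us to exactly 2^31; renormalize if so.
  if (q == (1ll << 31)) {
    q /= 2;
    s--;
  }
  assert(s >= 0);
  assert(q <= std::numeric_limits<std::int32_t>::max());
  *quantized_multiplier = static_cast<std::int32_t>(q);
  *right_shift = s;
}
```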
