
Commit 5975182

Add installation instructions to README
1 parent 007ad1d commit 5975182


README.md

Lines changed: 43 additions & 12 deletions
@@ -6,17 +6,48 @@ An interactive version of this site is available [here](https://huggingface.gith
***[Movement](https://arxiv.org/abs/2005.07683) [pruning](https://github.com/huggingface/transformers/tree/master/examples/research_projects/movement-pruning)** has proved to be a **very efficient method for pruning networks in an unstructured manner**. High levels of sparsity can be reached with minimal accuracy loss.
The resulting sparse networks can be **compressed heavily**,
saving a lot of permanent storage space on servers or devices, and a lot of bandwidth, an important advantage for edge devices.
**But efficient inference with unstructured sparsity is hard.**
Some degree of structure is necessary to exploit the intrinsically parallel nature of today's hardware.
The **Block Movement Pruning** work extends the original method and explores **semi-structured and structured variants** of Movement Pruning.
You can read more about block sparsity and why it matters for performance in these [blog](https://medium.com/huggingface/is-the-future-of-neural-networks-sparse-an-introduction-1-n-d03923ecbd70) [posts](https://medium.com/huggingface/sparse-neural-networks-2-n-gpu-performance-b8bc9ce950fc).*
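
To make the "semi-structured" idea concrete, here is a minimal, hypothetical PyTorch sketch that scores and removes whole blocks of a weight matrix. It uses weight magnitude as the block score purely for illustration; the actual method learns movement scores during fine-tuning, and none of the names below (`prune_blocks`, the block size, the keep ratio) come from the nn_pruning API:

```
import torch

def prune_blocks(weight: torch.Tensor, block: int = 32, keep_ratio: float = 0.25) -> torch.Tensor:
    """Zero out whole (block x block) tiles of `weight`, keeping the tiles with the
    largest L1 norm. Magnitude is a toy stand-in for learned movement scores."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of tiles and compute one score per tile.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    scores = tiles.abs().sum(dim=(1, 3))
    # Keep the top `keep_ratio` fraction of tiles, zero the rest.
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    mask = (scores >= threshold).to(weight.dtype)[:, None, :, None]
    return (tiles * mask).reshape(rows, cols)

# Example: prune a BERT-base-sized FFN weight matrix (3072 x 768) to ~25% block density.
w = torch.randn(3072, 768)
w_pruned = prune_blocks(w, block=32, keep_ratio=0.25)
print(f"density: {(w_pruned != 0).float().mean():.2f}")
```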

## Documentation

The documentation is [here](docs/HOWTO.md).

## Installation

### User installation

You can install `nn_pruning` using `pip` as follows:

```
python -m pip install -U nn_pruning
```

### Developer installation

To install the latest state of the source code, first clone the repository:

```
git clone https://github.com/huggingface/nn_pruning.git
```

and then install the required dependencies:

```
cd nn_pruning
python -m pip install -e ".[dev]"
```

Once the installation is complete, you can launch the test suite from the root of the repository:

```
pytest nn_pruning
```

## Results

### SQuAD V1
@@ -34,7 +65,7 @@ The "BERT version" column shows which base network was pruned.
The parameter count column is relative to linear layers, which contain most of the model parameters (with the embeddings being most of the remaining parameters).
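
To see what "relative to linear layers" means in practice, here is a small sketch (assuming `torch` and `transformers` are installed) that counts the parameters held in `nn.Linear` modules of BERT-base. It is only an illustration of what the column is measured against, not the exact script used to build the tables:

```
import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Count parameters that live in linear layers vs. the whole model.
linear_params = sum(p.numel() for m in model.modules()
                    if isinstance(m, nn.Linear) for p in m.parameters())
total_params = sum(p.numel() for p in model.parameters())

print(f"linear layers: {linear_params / 1e6:.1f}M parameters")
print(f"whole model:   {total_params / 1e6:.1f}M parameters")
```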

**F1 difference, speedups and parameter counts are all relative to BERT-base to ease practical comparison.**

| Model | Type | Method | Params | F1 | F1 diff | Speedup |
|--------------------------------------------------------------------------------------------------|---------|-------------|---------|---------|---------|---------|
|**[#1](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad)** |**large**|**-** |**+166%**|**93.15**|**+4.65**|**0.35x**|
@@ -56,23 +87,23 @@ The parameter count column is relative to linear layers, which contain most of t
- network #3: pruned from BERT-large, it ends up 40% smaller than BERT-base yet significantly more accurate, and still as fast.

That means that starting from a larger network is beneficial on all metrics, even absolute size, an observation also made in the [Train Large, Then Compress](https://arxiv.org/abs/2002.11794) paper.

- network #4: we can shrink BERT-base by ~60%, speed up inference by 1.8x and still have a ***better*** network
- networks #N: we can select a **tradeoff between speed and accuracy**, depending on the final application.
- last network: pruned using a slightly different "structured pruning" method that gives faster networks, but with a significant drop in F1.

**Additional remarks**
- The parameter reduction of the BERT-large networks is actually higher when compared to their original network: 40% smaller than BERT-base actually means 77% smaller than BERT-large (the table gives BERT-large as +166%, i.e. 2.66x the linear parameters of BERT-base, so 0.60 / 2.66 ≈ 0.23 of BERT-large).
We kept the comparison with BERT-base numbers here, as that is what matters from a practical point of view.
- The "theoretical speedup" is the speedup of the linear layers alone (based on their actual number of FLOPs), which appears to be what some papers report as their measured speedup.
The speedup reported here is measured on an RTX 3090, using the HuggingFace transformers library and PyTorch CUDA timing features, and so is 100% in line with real-world speedups.
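
For reference, the following is a minimal sketch of this kind of CUDA-event timing with `transformers` and PyTorch. It is not the exact benchmark script used for the tables; the checkpoint, batch size (8), sequence length (384) and iteration counts are arbitrary placeholders:

```
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder setup: swap "bert-base-uncased" for any checkpoint you want to time.
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

inputs = tokenizer(["a throwaway question about latency"] * 8,
                   padding="max_length", max_length=384,
                   return_tensors="pt").to(device)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    for _ in range(10):          # warm-up so kernels and caches are initialized
        model(**inputs)
    torch.cuda.synchronize()

    start.record()
    for _ in range(100):
        model(**inputs)
    end.record()
    torch.cuda.synchronize()

print(f"mean latency: {start.elapsed_time(end) / 100:.2f} ms")
```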

### Example "Hybrid filled" Network

Here are some visualizations of the pruned network [#7](https://huggingface.co/madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1).
It uses the "Hybrid filled" method:
- Hybrid: prune using blocks for attention and rows/columns for the two large FFNs.
- Filled: remove empty heads and empty rows/columns of the FFNs, then re-finetune the previous network, letting the zeros in non-empty attention heads evolve and so regain some accuracy while keeping the same network speed (see the sketch below).
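
To give a rough feel for the FFN part of the "Filled" compaction, here is a toy sketch (not the nn_pruning implementation; `shrink_ffn` and the 70% pruning rate are made up) that rebuilds smaller dense linear layers by dropping intermediate units whose weights are entirely zero:

```
import torch
import torch.nn as nn

def shrink_ffn(fc1: nn.Linear, fc2: nn.Linear):
    """Given a pruned FFN (fc1: hidden -> intermediate, fc2: intermediate -> hidden),
    drop intermediate units whose fc1 row and fc2 column are both all-zero,
    and rebuild smaller dense Linear layers."""
    keep = (fc1.weight.abs().sum(dim=1) != 0) | (fc2.weight.abs().sum(dim=0) != 0)
    idx = keep.nonzero(as_tuple=True)[0]

    new_fc1 = nn.Linear(fc1.in_features, idx.numel())
    new_fc1.weight.data = fc1.weight.data[idx].clone()
    new_fc1.bias.data = fc1.bias.data[idx].clone()

    new_fc2 = nn.Linear(idx.numel(), fc2.out_features)
    new_fc2.weight.data = fc2.weight.data[:, idx].clone()
    new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2

# Example: an FFN where ~70% of the intermediate units were pruned away.
fc1, fc2 = nn.Linear(768, 3072), nn.Linear(3072, 768)
dead = torch.rand(3072) < 0.7
fc1.weight.data[dead] = 0.0
fc2.weight.data[:, dead] = 0.0
small_fc1, small_fc2 = shrink_ffn(fc1, fc2)
print(small_fc1)  # e.g. Linear(in_features=768, out_features≈920, bias=True)
```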

You can see that the resulting linear layers are all actually "dense" (hover over the graph to visualize them).
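
A hedged usage sketch, assuming the pruned checkpoint above loads like any standard `transformers` question-answering model (this is plain transformers usage, not an nn_pruning-specific API):

```
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Assumption: the pruned checkpoint behaves like a regular transformers QA model.
name = "madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
answer = qa(question="What was pruned?",
            context="The attention heads and FFN dimensions of a BERT-base model were pruned.")
print(answer)
```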

@@ -83,7 +114,7 @@ You can see here the pruned heads for each layer:
![Hybrid Filled Head Pruning](docs/assets/media/squadv1/models/network_filled/pruning_info.png)


### Comparison with state of the art

If we plot the F1 of the full set of pruned networks against the speedup, we can see that we outperform fine-tuned TinyBERT and DistilBERT by some margin.
MobileBERT seems significantly better, even with the "no OPT" version presented here, which does not contain the LayerNorm optimization used in the much faster version of MobileBERT.
An interesting piece of future work will be to add those optimizations to the pruning tools.
@@ -94,7 +125,7 @@ Even in terms of saved size, we get smaller networks for the same accuracy (exce

![SQuAD fill rate](docs/assets/media/squadv1/graphs/summary_fill_rate.png)

### GLUE/MNLI

The experiments were done on BERT-base.
Significant speedups were obtained, even if the results lag a bit behind the SQuAD ones.
@@ -112,11 +143,11 @@ Here is a selection of networks, with the same rules as for the SQuAd table:


### Comparison with state of the art

(This is WIP: more runs are needed to check the performance versus MobileBERT and TinyBERT at the same level of speed. Better hyperparameters may help too.)

From the following graphs, we see that the speed is a bit lower than TinyBERT's, and roughly in line with MobileBERT's.
In terms of sparsity, the precision is a bit lower than MobileBERT's and TinyBERT's.
On both metrics it is better than DistilBERT by a significant margin.

![MNLI v1 speedup](docs/assets/media/mnli/graphs/summary_speedup.png)
@@ -126,4 +157,4 @@ On both metrics it's better than DistilBERT by some significant margin.

## Related work

[pytorch_block_sparse](https://github.com/huggingface/pytorch_block_sparse) is a CUDA implementation of block-sparse kernels for the forward and backward passes of linear layers.
It is not needed to run the models pruned by the nn_pruning tools, as it is not yet fast enough to be competitive with dense linear layers: just pruning heads is faster, even if those heads still contain some inner sparsity.
