Similarly to Quantization which we saw earlier, Pruning is another useful technique that can be leaveraged to reduce model size and complexity resulting in better latency and reduced inference cost.

Note: Quantization and Pruning are not exclusive, but we can use both to get additional benefits and unlock more performance improvements.

TensorFlow Model Optimization Toolkit also provides support for different Pruning techniques.

Install the toolkit with pip

$ pip install tensorflow_model_optimization

Import it as tfmot

import tensorflow_model_optimization as tfmot

Now we can use many of the pruning functionalities for Keras which are available under tfmot.sparsity.keras

We can prune a model during training by wrapping it with prune_low_magnitude like this

from tfmot.sparsity.keras import PolynomialDecay
from tfmot.sparsity.keras import prune_low_magnitude

pruning_schedule = PolynomialDecay(initial_sparsity=0.5, final_sparsity=0.8, begin_step=2000, end_step=4000)

model = create_model()
model_for_pruning = prune_low_magnitude(model, pruning_schedule=pruning_schedule)
...
model_for_pruning.fit(...)

Learn more about pruning in tf-keras with the following resources:

  • Pruning http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks - arxiv.org