# TensorFlow Distributed Training on Kubeflow

## Overview

Deep learning models keep getting larger (some now exceed 130 billion parameters) and require more and more data for training in order to achieve higher performance.

Training such models is not possible on a single machine; it requires a fleet of machines. Distributed training aims to address this problem with the following approaches.

### Model Parallelism

In Model Parallelism, the model parameters are distributed across multiple machines because the model does not fit on a single one. Each worker performs the forward and backward passes for, and updates, the subset of parameters it is responsible for. In this paradigm, a worker communicates with the subset of workers that hold the layers it depends on during the forward pass, and with those that depend on it during the backward pass.
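This communication pattern can be sketched in plain Python (a toy illustration, not a real framework: the two-layer scalar model, the learning rate, and the hard-coded data point are all assumptions made for the sake of the example):

```python
# Toy sketch of model parallelism: two "workers" each own one scalar
# weight of a two-layer linear model out = w2 * (w1 * x).
# Forward pass: worker 1 sends its activation to worker 2.
# Backward pass: worker 2 sends the gradient of its input back to worker 1.
# Each worker updates only the parameter it owns.

def train_step(w1, w2, x, target, lr=0.01):
    # Forward pass
    a1 = w1 * x           # computed on worker 1, sent to worker 2
    out = w2 * a1         # computed on worker 2
    loss = (out - target) ** 2

    # Backward pass
    d_out = 2 * (out - target)
    d_w2 = d_out * a1     # worker 2's own parameter gradient
    d_a1 = d_out * w2     # sent back to worker 1
    d_w1 = d_a1 * x       # worker 1's own parameter gradient

    return w1 - lr * d_w1, w2 - lr * d_w2, loss

w1, w2 = 0.5, 0.5
for _ in range(100):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=2.0)
print("learned w1 * w2:", w1 * w2)
```

Note that no worker ever holds both weights; they only exchange activations and gradients, which is the defining trait of this paradigm.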

### Data Parallelism

In Data Parallelism, each worker hosts the whole model but is given a subset of the data, potentially different from the subsets given to the other workers. In this paradigm, workers do not need to communicate with each other; instead, a central worker (usually called a Parameter Server) is responsible for:

• aggregating the gradients it receives from each worker’s backward pass and
• replying back to the workers with the updated weights.
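The two steps above can be sketched in plain Python (a toy illustration, not TensorFlow: the one-weight linear model, the data shards, and the learning rate are assumptions made for the example):

```python
# Toy sketch of data parallelism with a parameter server.
# Each worker computes a gradient on its own data shard; the parameter
# server averages the gradients, updates the weight, and broadcasts it back.

def worker_gradient(w, shard):
    # Mean-squared-error gradient for the linear model y = w * x,
    # computed on this worker's shard only.
    grads = [2 * (w * x - y) * x for x, y in shard]
    return sum(grads) / len(grads)

def ps_step(w, shards, lr=0.1):
    # Parameter server: average the workers' gradients, apply the update,
    # and return the new weight (the "reply" to the workers).
    avg_grad = sum(worker_gradient(w, s) for s in shards) / len(shards)
    return w - lr * avg_grad

# Two workers, each holding a different shard of data generated by y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = ps_step(w, shards)
print("learned w:", w)
```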

In TensorFlow, for instance, one could easily train a model with the Data Parallelism paradigm, as illustrated in the following snippet:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
```


## Distributed Training in Kubeflow

The Kubeflow project aims at simplifying the provisioning of Machine Learning infrastructure. It is built on top of Kubernetes and thus reuses k8s core components (pods, services, etc.) and adapts them to ML use cases.

Kubeflow Training is a group of Kubernetes Operators that add support for distributed training of Machine Learning models using different frameworks to Kubeflow. The current release supports:

• TensorFlow through tf-operator (also known as TFJob)
• PyTorch through pytorch-operator
• Apache MXNet through mxnet-operator
• MPI through mpi-operator

See https://www.kubeflow.org/docs/components/training/ for more details.

In the rest of this post, we will use:

• Docker Hub to host a TensorFlow-based container image that contains the model training logic.
• TFJob to describe the processes that will run the training in a distributed fashion.

### Create a training image

Create a repo on Docker Hub called tf-dist-mnist-test and log in locally with docker login.

Clone the Kubeflow tf-operator project and navigate to the mnist example:

```shell
$ git clone https://github.com/kubeflow/tf-operator
$ cd tf-operator/examples/v1/dist-mnist
```


Build the mnist image locally:

```shell
$ docker build -f Dockerfile -t <DOCKER_HUB_USERNAME>/tf-dist-mnist-test:1.0 ./
```

Push the image you just built to Docker Hub:

```shell
$ docker push <DOCKER_HUB_USERNAME>/tf-dist-mnist-test:1.0
The push refers to repository [docker.io/<DOCKER_HUB_USERNAME>/tf-dist-mnist-test]
3d980aca20f2: Pushed
c04a36d9e118: Pushed
d964bb768e1a: Pushed
db582379df14: Pushed
5bb39b263596: Pushed
02efdb75efd8: Pushed
dee07873361c: Pushed
0b029684a0e5: Pushed
6f4ce6b88849: Pushed
92914665e7f6: Pushed
c98ef191df4b: Pushed
9c7183e0ea88: Pushed
ff986b10a018: Pushed
1.0: digest: sha256:28fe6870f37380b065f7cda1d71f9401709c5a2c7d0dca55563cbd1b14d18911 size: 3038
```


### Submit training job

A TFJob is a resource with a YAML representation like the one below (before submitting your job, replace <DOCKER_HUB_USERNAME> with your Docker Hub username).
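A minimal sketch of such a manifest might look as follows (the exact fields of the dist-mnist example may differ; the replica counts and image tag here are assumptions matching the rest of this post):

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-tensorflow-job
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: <DOCKER_HUB_USERNAME>/tf-dist-mnist-test:1.0
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: <DOCKER_HUB_USERNAME>/tf-dist-mnist-test:1.0
```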

Each entry under tfReplicaSpecs defines a set of TensorFlow processes. Under this spec we define the different types of processes, a PS (Parameter Server) and Workers, with their respective replication factors and container images.

Submit the TFJob distributed training job:

```shell
$ kubectl apply -f mnist-tensorflow-job.yaml
tfjob.kubeflow.org/mnist-tensorflow-job created
```

Get all TFJob resources which were previously created:

```shell
$ kubectl get tfjob
NAME                   STATE       AGE
mnist-tensorflow-job   Succeeded   3m26s
```


Check the status of a specific TFJob resource:

```shell
$ kubectl describe tfjob mnist-tensorflow-job
```

The output may look like below. Notice that some of the pods were already deleted; in the YAML manifest we set the number of workers to 2 plus a PS (Parameter Server).

Check all the pods created by this TFJob:

```shell
$ kubectl get pod | grep mnist-tensorflow-job
mnist-tensorflow-job-worker-0   0/1     Completed   0          5m43s
```


To get the logs of any of this TFJob's pods, use the following command:

```shell
$ kubectl logs mnist-tensorflow-job-worker-0
```


The next steps would be to create your own TensorFlow training logic, package it in a container image as described in this post, and submit the job.