Read data into Tensorflow Dataset

In tensorflow tf.data.Dataset represents a collection of data that abstracts the complexity of the underlying pipeline needed to read data.

Read from memory

For quick protyping/testing or just to play with Tensorflow Data API (e.g. batching), we can build a Dataset simply by transforming in-memory objects:

Suppose we have an X and Y tensors, e.g representing synthetic data like this:

import tensorflow as tf

size = 10

X = tf.constant(range(size), dtype=tf.float32)
Y = X * 2 + 1

We can create a Dataset from those tensors using from_tensor_slices

dataset = tf.data.Dataset.from_tensor_slices((X, Y))

Now after creating the collection of examples, we can use tf.data to transform it as needed. For instance:

Repeat the entire dataset multiple times (e.g. same as number of training epochs)
Divide the data into equal size batches and dropping any remaining objects that does not add up to a complete batch
```
dataset.repeat(epochs).batch(batch_size, drop_remainder=True)
```

We can check the output and confrim that each batch of the previous dataset is of same size

for i, (x, y) in enumerate(dataset):
  print("x:", x.numpy(), "y:", y.numpy())

Read from disk

In reality, your data would probably some files (e.g. csv), for instance split into train and test, or multiple files with a pattern like this

$ ls -l ../data/*.csv
-rw-r--r-- 1 jupyter jupyter 13590 Feb 16 11:37 ../data/train-01.csv
-rw-r--r-- 1 jupyter jupyter 79055 Feb 16 11:37 ../data/train-02.csv
-rw-r--r-- 1 jupyter jupyter 23114 Feb 16 11:37 ../data/train-03.csv

We can use make_csv_dataset to load those files into a single dataset as follows

# Define column names in same order as in CSV file
columns = ['x1', 'x2', 'y']
# Define default values for each column
defaults = [['na'], ['na'], [0.0]]
# Define files search pattern
pattern = '../data/train-*.csv'
# Read csv file
trainDS = tf.data.experimental.make_csv_dataset(pattern, 1, columns, defaults)

We can print the schema of the dataset with print(trainDS)

<PrefetchDataset shapes: OrderedDict([(x1, (1,)), (x2, (1,)), (y, (1,))]), types: OrderedDict([(x1, tf.float32), (x2, tf.float32), (y, tf.float32)])>

We can iterate over the first few element of this dataset using dataset.take(2) and print

for data in tempds.take(2):
  pprint({k: v.numpy() for k, v in data.items()})
  print("\n")

{'x1': array([1.], dtype=float32), 'x2': array([1.], dtype=float32), 'y': array([0.], dtype=float32)}

{'x1': array([2.], dtype=float32), 'x2': array([2.], dtype=float32), 'y': array([2.], dtype=float32)}

Read from memory

Read from disk

Related tips

Track your TF model GPU memory consumption during training

English Text to speech with TensorFlowTTS

Improve read performance with TFRecordDataset