In TensorFlow, tf.data.Dataset represents a collection of data and abstracts away the complexity of the underlying input pipeline needed to read it.

Read from memory

For quick prototyping/testing, or just to play with the TensorFlow Data API (e.g. batching), we can build a Dataset simply by transforming in-memory objects:

Suppose we have X and Y tensors, e.g. representing synthetic data like this:

import tensorflow as tf

size = 10

X = tf.constant(range(size), dtype=tf.float32)
Y = X * 2 + 1

We can create a Dataset from these tensors using from_tensor_slices:

dataset = tf.data.Dataset.from_tensor_slices((X, Y))
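
from_tensor_slices slices the input tensors along their first dimension, so each element of this dataset is a single (x, y) pair. A quick check:

for x, y in dataset.take(3):
  print(x.numpy(), y.numpy())

which prints the first three pairs: 0.0 1.0, then 1.0 3.0, then 2.0 5.0.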

Now, after creating the collection of examples, we can use the tf.data API to transform it as needed. For instance, we can:

  • Repeat the entire dataset multiple times (e.g. as many times as the number of training epochs)
  • Divide the data into equal-size batches, dropping any remaining examples that do not add up to a complete batch

    epochs = 2        # example values for illustration
    batch_size = 4
    # Transformations return a new Dataset, so we reassign it
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)


We can check the output and confirm that each batch of the resulting dataset has the same size:

for x, y in dataset:
  print("x:", x.numpy(), "y:", y.numpy())

Read from disk

In reality, your data would probably live in files (e.g. CSV), for instance split into train and test sets, or sharded into multiple files matching a pattern like this:

$ ls -l ../data/*.csv
-rw-r--r-- 1 jupyter jupyter 13590 Feb 16 11:37 ../data/train-01.csv
-rw-r--r-- 1 jupyter jupyter 79055 Feb 16 11:37 ../data/train-02.csv
-rw-r--r-- 1 jupyter jupyter 23114 Feb 16 11:37 ../data/train-03.csv

We can use make_csv_dataset to load those files into a single dataset as follows:

# Define column names in the same order as in the CSV files
columns = ['x1', 'x2', 'y']
# Define a default value (and thus the parsed dtype) for each column
defaults = [[0.0], [0.0], [0.0]]
# Define the file search pattern
pattern = '../data/train-*.csv'
# Read all matching CSV files into a single dataset, one example per batch
trainDS = tf.data.experimental.make_csv_dataset(
    pattern, batch_size=1, column_names=columns, column_defaults=defaults)

We can inspect the schema of the dataset with print(trainDS):

<PrefetchDataset shapes: OrderedDict([(x1, (1,)), (x2, (1,)), (y, (1,))]), types: OrderedDict([(x1, tf.float32), (x2, tf.float32), (y, tf.float32)])>

We can iterate over the first few elements of this dataset using trainDS.take(2) and print them:

from pprint import pprint

for data in trainDS.take(2):
  pprint({k: v.numpy() for k, v in data.items()})
  print("\n")

{'x1': array([1.], dtype=float32), 'x2': array([1.], dtype=float32), 'y': array([0.], dtype=float32)}

{'x1': array([2.], dtype=float32), 'x2': array([2.], dtype=float32), 'y': array([2.], dtype=float32)}
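
As a side note, make_csv_dataset also accepts a label_name argument that splits off the label column, so the dataset yields (features, label) pairs that can be fed directly to training. A minimal sketch, reusing the pattern, columns, and defaults defined above:

trainDS = tf.data.experimental.make_csv_dataset(
    pattern, batch_size=1,
    column_names=columns, column_defaults=defaults,
    label_name='y')

# Each element is now a (features, label) tuple
for features, label in trainDS.take(1):
  print({k: v.numpy() for k, v in features.items()}, label.numpy())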