Some tips to speed up data processing with TFRecordDataset

Concurrent file processing with interleave

Use interleave, which TFRecordDataset and every other tf.data.Dataset provide, to process many input files concurrently.

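import tensorflow as tf

filenames = ["./file-01.csv", "./file-02.csv", "./file-03.csv", "./file-04.csv", ...]
dataset = tf.data.Dataset.from_tensor_slices(filenames)  # assumption: the original right-hand side is missing
def read_file(filename):
    return tf.data.TextLineDataset(filename)  # assumption: the original body is missing; reads one line per record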
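dataset = dataset.interleave(lambda x: read_file(x), cycle_length=2, block_length=4, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)

In this example we process 2 files concurrently with cycle_length=2, interleave blocks of 4 records from each file with block_length=4, and let TensorFlow decide how many parallel calls are needed with num_parallel_calls=tf.data.AUTOTUNE. Setting deterministic=False allows records to be yielded out of order in exchange for speed.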
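To make the effect of cycle_length and block_length concrete, here is a minimal pure-Python sketch of the order in which records come out when ordering is deterministic. It does not use TensorFlow: interleave_order and the toy record names are hypothetical, made up for illustration only, and with deterministic=False the real output order may differ.

```python
from itertools import islice

def interleave_order(inputs, map_fn, cycle_length, block_length):
    """Pure-Python model (an illustration, not TensorFlow code) of the
    deterministic output order of interleave."""
    pending = iter(inputs)          # inputs (e.g. filenames) not opened yet
    slots = [None] * cycle_length   # iterators that are currently open
    inputs_left = True
    while inputs_left or any(s is not None for s in slots):
        for i in range(cycle_length):
            # Fill an empty slot with the next input, if any remain.
            if slots[i] is None and inputs_left:
                try:
                    slots[i] = iter(map_fn(next(pending)))
                except StopIteration:
                    inputs_left = False
            if slots[i] is None:
                continue
            # Take up to block_length consecutive records from this slot.
            block = list(islice(slots[i], block_length))
            if len(block) < block_length:
                slots[i] = None     # this input is exhausted
            yield from block

# Three toy "files" with 6 records each, read 2 at a time in blocks of 4.
order = list(interleave_order(
    ["a", "b", "c"],
    lambda name: [f"{name}{i}" for i in range(6)],
    cycle_length=2,
    block_length=4,
))
print(order)
```

The first two files alternate in blocks of 4 until both are exhausted, and only then is the third file opened: a0 to a3, b0 to b3, a4, a5, b4, b5, then c0 to c5.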

Prefetch data to improve throughput

Use prefetch to improve latency and throughput during training and avoid GPU starvation. Prefetching prepares the next elements in the background while the current one is being consumed.

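dataset = tf.data.Dataset.range(12)  # stand-in dataset (assumption); the original definition is missing
prefetched = dataset.prefetch(2)  # prefetches 2 elements
batched = dataset.batch(3).prefetch(2)  # prefetches 2 batches of 3 elements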

Note that prefetching comes at the cost of additional memory to store the prefetched elements.
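The trade-off between throughput and memory can be illustrated with a small pure-Python sketch of the general idea: a background thread fills a bounded buffer while the consumer works on earlier items. This is only a conceptual model under that assumption, not how TensorFlow implements prefetch, and the prefetch function below is a hypothetical helper.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the source

def prefetch(source, buffer_size):
    """Yield items from `source` in order while a background thread
    prepares upcoming items (conceptual sketch, not TensorFlow code)."""
    buffer = queue.Queue(maxsize=buffer_size)  # holds at most buffer_size ready items

    def producer():
        for item in source:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()    # blocks until an item is ready
        if item is _DONE:
            return
        yield item

# Batches of 3 elements, with up to 2 batches waiting in the buffer.
batches = ([i, i + 1, i + 2] for i in range(0, 9, 3))
result = list(prefetch(batches, buffer_size=2))
print(result)
```

A larger buffer_size lets the producer run further ahead of the consumer, but every buffered item stays in memory until it is consumed.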

More data performance tips here