Comparing Datasets with TFDV

TFDV (TensorFlow Data Validation) is a Python package that is part of the TensorFlow Extended (TFX) ecosystem and implements techniques for data validation and schema generation.

$ pip install tensorflow-data-validation

It is usually used in the data validation step of a TFX pipeline to check the data before it is fed to the data processing and training steps.

It can also be used to compare multiple datasets (e.g. training vs. validation) and helps determine how significantly they differ (e.g. different schemas, missing values, etc.).

In this tip, we will use TFDV in standalone mode to:

Load two datasets from CSV files
Generate statistics for each one
Compare these statistics

import tensorflow_data_validation as tfdv

# Load datasets and generate statistics
ds1_stats = tfdv.generate_statistics_from_csv(
  data_location='data_1.csv',
  delimiter=','
)
ds2_stats = tfdv.generate_statistics_from_csv(
  data_location='data_2.csv',
  delimiter=','
)

# Compare statistics
tfdv.visualize_statistics(
  lhs_statistics=ds1_stats, lhs_name='DS-I',
  rhs_statistics=ds2_stats, rhs_name='DS-II'
)

An example of an interactive visualization of the comparison of two datasets would look like this:

Note how numerical and categorical data are compared differently.
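To make that difference concrete, here is a plain-Python sketch of the idea. It is not how TFDV computes its statistics, and the toy values and helper names (numeric_summary, categorical_summary) are hypothetical. It only illustrates that a numerical feature is summarized by descriptive statistics such as min, max and mean, while a categorical feature is summarized by how often each value occurs.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy columns standing in for one feature from each dataset.
ages_1, ages_2 = [31, 45, 27, 52], [38, 29, 61, 44]
colors_1 = ['red', 'green', 'red', 'green']
colors_2 = ['red', 'purple', 'green', 'purple']


def numeric_summary(values):
    """Summarize a numerical feature with simple descriptive statistics."""
    return {'min': min(values), 'max': max(values), 'mean': mean(values)}


def categorical_summary(values):
    """Summarize a categorical feature by how often each value occurs."""
    return dict(Counter(values))


print('age  ', numeric_summary(ages_1), 'vs', numeric_summary(ages_2))
print('color', categorical_summary(colors_1), 'vs', categorical_summary(colors_2))

# Categories that appear in the second dataset but not in the first
only_in_2 = set(colors_2) - set(colors_1)
print('values only in DS-II:', only_in_2)
```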
