TFDV (TensorFlow Data Validation) is a Python package that is part of the TensorFlow Extended (TFX) ecosystem. It implements techniques for data validation and schema generation.

$ pip install tensorflow-data-validation

It is usually used in the data validation step of a TFX pipeline to check the data before it is fed to the data processing and training steps.

It can also compare multiple datasets (e.g. training vs. validation) and helps determine how significantly they differ (e.g. different schemas, missing values, etc.).

In this TIP, we will use TFDV in a standalone mode to:

  • Load two datasets from CSV files
  • Generate statistics for each one
  • Compare these statistics

import tensorflow_data_validation as tfdv

# Load datasets and generate statistics
ds1_stats = tfdv.generate_statistics_from_csv(
  data_location='data_1.csv',
  delimiter=','
)
ds2_stats = tfdv.generate_statistics_from_csv(
  data_location='data_2.csv',
  delimiter=','
)

# Compare statistics
tfdv.visualize_statistics(
  lhs_statistics=ds1_stats, lhs_name='DS-I',
  rhs_statistics=ds2_stats, rhs_name='DS-II'
)

An example of an interactive visualization of the comparison of two datasets would look like this:

Note the difference in how numerical data and categorical data are compared.
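
Beyond the visual comparison, differences such as a changed schema or missing values can also be checked programmatically. The following is a minimal sketch, assuming the first dataset can be treated as a trustworthy reference: it infers a schema from that dataset and validates the statistics of the second one against it. The helper names find_anomalies and summarize_anomalies are hypothetical and not part of TFDV; infer_schema and validate_statistics are TFDV functions.

```python
def find_anomalies(reference_csv, candidate_csv, delimiter=","):
    """Validate candidate_csv against a schema inferred from reference_csv.

    Hypothetical helper: only the tfdv.* calls below belong to TFDV.
    """
    # Imported lazily: TFDV is a heavy dependency and is only needed here.
    import tensorflow_data_validation as tfdv

    reference_stats = tfdv.generate_statistics_from_csv(
        data_location=reference_csv, delimiter=delimiter
    )
    candidate_stats = tfdv.generate_statistics_from_csv(
        data_location=candidate_csv, delimiter=delimiter
    )

    # Infer feature types, presence and value domains from the reference data.
    schema = tfdv.infer_schema(statistics=reference_stats)

    # Check the candidate statistics against that schema.
    return tfdv.validate_statistics(statistics=candidate_stats, schema=schema)


def summarize_anomalies(anomalies):
    """Map each anomalous feature name to its short description."""
    return {
        name: info.short_description
        for name, info in anomalies.anomaly_info.items()
    }
```

With the files from the example above, summarize_anomalies(find_anomalies('data_1.csv', 'data_2.csv')) would return a dictionary keyed by feature name. An empty dictionary means the second dataset conforms to the inferred schema. In a notebook, tfdv.display_anomalies can render the same result as a table.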