Data Validation with TensorFlow eXtended (TFX)
03 Nov 2020 by dzlab

In a previous article, we discussed how we can ingest data from various sources into a TFX pipeline. In this article, we will discuss the next step of a TFX pipeline, which involves schema generation and data validation.
This step checks the data coming through the pipeline and catches any changes that could impact the next steps (i.e. feature engineering and training). In TFX, this is implemented via the Tensorflow Data Validation (TFDV) Python library, which can be installed with pip:
$ pip install tensorflow-data-validation
TFDV can be used for generating schemas and statistics about the distribution of every feature in the dataset. Such information is useful for comparing multiple datasets (e.g. training vs inference datasets) and reporting:
- Anomalies related to schema changes
- Statistical differences in the feature distributions
TFDV also offers visualization capabilities for comparing datasets based on the Google PAIR Facets project.
Describing data with TFDV
The usual workflow when using TFDV during training is as follows:
- Generate statistics for the data
- Use those statistics to generate a schema for each feature
- Visualize the schema and statistics and manually inspect them
- Update the schema if needed
Then when new data comes in, the workflow becomes:
- Generate statistics and a schema for the new data.
- Visualize the new statistics side by side with the training data statistics.
- Validate the new statistics against the ones from training to detect anomalies.
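Putting the two phases together, here is a minimal sketch of this workflow; the file paths data/train.csv and data/new.csv are hypothetical, and the individual calls are detailed in the sections below.

import tensorflow_data_validation as tfdv

# Phase 1: describe the training data and derive a schema from it
train_stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv')
schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(schema)  # manual inspection, update if needed

# Phase 2: when new data comes in, compare and validate it
new_stats = tfdv.generate_statistics_from_csv(data_location='data/new.csv')
tfdv.visualize_statistics(
    lhs_statistics=new_stats, lhs_name='new',
    rhs_statistics=train_stats, rhs_name='train')
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)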
Generating Statistics
Before any data validation, we need to generate statistics. We can use any of the TFDV helper functions:
- tfdv.generate_statistics_from_csv when data is in a CSV file
- tfdv.generate_statistics_from_dataframe when data is in a Pandas DataFrame
- tfdv.generate_statistics_from_tfrecord when data is in a TFRecord file
Here is an example reading from a CSV file with TFDV and generating statistics for each feature.
stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv', delimiter=',')
We can manually inspect those statistics using tfdv.visualize_statistics
tfdv.visualize_statistics(stats)
Notice that TFDV generates different types of statistics depending on the feature type.
- For numerical features, TFDV computes:
  - Count of records
  - Number of missing values (i.e. nulls)
  - Histogram of values
  - Mean and standard deviation
  - Minimum and maximum values
  - Percentage of zero values
- For categorical features, TFDV provides:
  - Count of values
  - Percentage of missing values
  - Number of unique values
  - Average string length
  - Count for each label and its rank
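The returned statistics are a DatasetFeatureStatisticsList protocol buffer from the tensorflow_metadata package, so they can also be inspected programmatically rather than visually. Here is a minimal sketch, assuming the stats object generated earlier; the field names follow the tensorflow_metadata statistics proto:

# Access the statistics proto directly
dataset = stats.datasets[0]
print('number of examples:', dataset.num_examples)
for feature in dataset.features:
    # a feature carries num_stats or string_stats depending on its type
    if feature.HasField('num_stats'):
        print(feature.path.step, 'mean:', feature.num_stats.mean,
              'std:', feature.num_stats.std_dev)
    elif feature.HasField('string_stats'):
        print(feature.path.step, 'unique values:', feature.string_stats.unique)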
Generating Schema
Once statistics are generated, the next step is to generate a schema for our dataset. This schema maps each feature in the dataset to a type (float, bytes, etc.) and defines feature boundaries (min, max, distribution of values, fraction of missing values, etc.).
With TFDV, we generate the schema from the statistics using tfdv.infer_schema as follows:
schema = tfdv.infer_schema(stats)
For example, a numerical feature may have a schema like this:
feature {
  name: "Num"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
On the other hand, a categorical feature may have a schema that looks like this:
feature {
  name: "Cat"
  type: BYTES
  domain: "Cat"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
string_domain {
  name: "Cat"
  value: "A"
  value: "B"
  value: "C"
}
TFDV provides an API to print a summary of each feature's schema using tfdv.display_schema
tfdv.display_schema(schema)
Feature name | Type | Presence | Valency | Domain |
---|---|---|---|---|
‘Num’ | FLOAT | required | - | - |
‘Cat’ | STRING | required | single | ‘Cat’ |

Domain | Values |
---|---|
‘Cat’ | ‘A’, ‘B’, ‘C’ |
In this visualization, the columns stand for:
- Presence indicates whether the feature must be present in 100% of examples (required) or not (optional).
- Valency indicates the number of values required per training example. In the case of categorical features, single indicates that each training example must have exactly one category for the feature.
Updating schema
TFDV lets you update the schema according to your domain knowledge of the data if you are not satisfied with the auto-generated schema.
The steps to update the schema are as follows:
First, load the schema from its serialized location:
schema = tfdv.load_schema_text(schema_location)
Then, we update a feature; for instance, we make the Num feature required in only 80% of examples instead of 100%:
Num_feature = tfdv.get_feature(schema, 'Num')
Num_feature.presence.min_fraction = 0.8
Another case would be a categorical feature that is missing a possible value. We can add the new label as follows:
Colors_domain = tfdv.get_domain(schema, 'Colors')
Colors_domain.value.insert(3, 'Yellow')
Colors_domain.value
# ['Red', 'Green', 'Blue', 'Yellow']
Finally, we can store the schema back as follows:
tfdv.write_schema_text(schema, schema_location)
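Another common manual edit is to declare schema environments, for example when a label column is present in training data but absent at serving time. Here is a short sketch based on TFDV's environment support; the feature name 'Label' and the serving_stats object are hypothetical:

# All features belong to both environments by default
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')
# The label is only expected in the TRAINING environment
tfdv.get_feature(schema, 'Label').not_in_environment.append('SERVING')
# Validate serving data against the SERVING view of the schema
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')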
Spotting issues with TFDV
In the previous sections, we introduced the basics of TFDV and how to generate statistics and a schema for a dataset. In this section, we will see how to use TFDV to spot issues in the data.
Comparing Datasets
Suppose we have two datasets, one for training and the other for evaluation. TFDV lets you determine how representative the evaluation dataset is of the training one. More precisely, it helps you answer questions like:
- Does the evaluation data have a similar schema as the training dataset?
- Does the distribution of values for every feature match in both datasets?
The following example illustrates the interactive tool that TFDV provides for comparing two datasets:
dataset1_stats = tfdv.generate_statistics_from_csv(
    data_location='data/X_1.csv', delimiter=',')
dataset2_stats = tfdv.generate_statistics_from_csv(
    data_location='data/X_2.csv', delimiter=',')
tfdv.visualize_statistics(
    lhs_statistics=dataset1_stats, lhs_name='DS-I',
    rhs_statistics=dataset2_stats, rhs_name='DS-II')
In this example, we can easily see that the distributions of the two datasets are very different. In most cases, the same feature follows a normal distribution in both datasets, but with clearly different means and standard deviations. We can also see that in the second dataset, the numerical feature X4 has a lot more missing values than it does in the first dataset. Also, the categorical feature X3 seems to have an additional label in the second dataset which is not present in the first dataset.
We can use the earlier generated schema for comparison and spot any mismatches present in the second dataset:
anomalies = tfdv.validate_statistics(statistics=dataset2_stats, schema=schema, previous_statistics=dataset1_stats)
anomalies
An example of the output of the anomaly report:
anomaly_info {
  key: "X3"
  value {
    description: "Examples contain values missing from the schema: D (~13%). "
    severity: ERROR
    short_description: "Unexpected string values"
    reason {
      type: ENUM_TYPE_UNEXPECTED_STRING_VALUES
      short_description: "Unexpected string values"
      description: "Examples contain values missing from the schema: D (~13%). "
    }
    path {
      step: "X3"
    }
  }
}
We can display a summary of the anomalies using tfdv.display_anomalies
tfdv.display_anomalies(anomalies)
Feature name | Anomaly short description | Anomaly long description |
---|---|---|
‘X1’ | Column dropped | The feature was present in fewer examples than expected. |
‘X3’ | Unexpected string values | Examples contain values missing from the schema: D (~13%). |
‘X4’ | Column dropped | The feature was present in fewer examples than expected. |
‘X5’ | Unexpected string values | Examples contain values missing from the schema: D (~16%). |
We can see that the reported anomalies match our earlier observations, for instance:
- The presence of a new label D for the categorical feature X3
- The higher rate of missing values for the feature X4
Comparing slices
In addition to comparing entire datasets, TFDV can also be used to compare slices of the same dataset on a particular feature. This is very useful when inspecting the data for bias, for example when missing values are not uniformly spread over the different labels.
As an example, we will look at feature X3 from the first dataset, and slice this dataset to get the statistics for label B using the following snippet.
from tensorflow_data_validation.utils import slicing_util
# slice dataset on label B of feature X3
slice_fn1 = slicing_util.get_feature_value_slicer(features={'X3': [b'B']})
slice_options = tfdv.StatsOptions(slice_functions=[slice_fn1])
slice_stats = tfdv.generate_statistics_from_csv(
    data_location='data/X_1.csv',
    stats_options=slice_options)

# helper code for visualization
from tensorflow_metadata.proto.v0 import statistics_pb2

def display_slice_keys(stats):
    # print the name of every slice present in the statistics
    print(list(map(lambda x: x.name, stats.datasets)))

def get_sliced_stats(stats, slice_key):
    # extract the statistics of a single slice into its own proto
    for sliced_stats in stats.datasets:
        if sliced_stats.name == slice_key:
            result = statistics_pb2.DatasetFeatureStatisticsList()
            result.datasets.add().CopyFrom(sliced_stats)
            return result
    print(f'Invalid Slicing key: {slice_key}')

# Visualize both statistics
lhs_stats = get_sliced_stats(slice_stats, 'X3_B')
rhs_stats = get_sliced_stats(slice_stats, 'All Examples')
tfdv.visualize_statistics(lhs_stats, rhs_stats)
The resulting TFDV visualization of the slice vs. the full dataset will look like this:
Comparing Datasets for Skew
TFDV provides a skew_comparator to inspect the statistics of two datasets and detect any significant differences. TFDV defines skew as the L-infinity norm of the difference between the statistics of the two datasets (e.g. training vs. serving). A threshold on the L-infinity norm is used for reporting an anomaly.
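For a categorical feature, the L-infinity distance is simply the largest absolute difference in value frequencies between the two datasets. For instance, if a label D accounts for 0% of the examples in the training data but roughly 16.5% of the serving data, the L-infinity distance for that feature is roughly 0.165, and any threshold below that value triggers an anomaly.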
The following example illustrates how to generate the Skew anomaly report and then visualize it.
tfdv.get_feature(schema, 'X5').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(
    statistics=dataset1_stats,
    schema=schema,
    serving_statistics=dataset2_stats)
skew_anomalies
After generating the Skew anomaly report, we can visualize it using display_anomalies
tfdv.display_anomalies(skew_anomalies)
Feature name | Anomaly short description | Anomaly long description |
---|---|---|
‘X5’ | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.1648 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: D |
Comparing Datasets for Drift
TFDV also provides a drift_comparator for comparing the statistics of two datasets collected at different points in time (e.g. different months).
To use the drift_comparator, as illustrated in the following snippet, simply pick a feature to analyze and call validate_statistics, supplying a baseline (e.g. last month's dataset) and a comparison dataset (e.g. this month's dataset).
tfdv.get_feature(schema, 'X3').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=dataset2_stats,
    schema=schema,
    previous_statistics=dataset1_stats)
drift_anomalies
After generating the Drift anomaly report, we can visualize it using display_anomalies
tfdv.display_anomalies(drift_anomalies)
Feature name | Anomaly short description | Anomaly long description |
---|---|---|
‘X5’ | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.1648 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: D |
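Outside of a notebook, we usually want to act on these reports rather than just display them. Here is a minimal sketch that fails fast when a report is not empty; the anomaly_info field of the report is a map keyed by feature name:

# Abort if any anomaly was reported
if drift_anomalies.anomaly_info:
    raise ValueError(
        f'Anomalies detected for features: {list(drift_anomalies.anomaly_info)}')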
Using TFDV with TFX
In the previous sections, we have seen how to use TFDV for exploring and validating datasets as a standalone tool. This is very handy for manual inspection of the data, but this can also be automated in TFX using the following components:
Generating statistics with StatisticsGen
For generating statistics, TFX provides the pipeline component StatisticsGen. This component accepts the output of the ExampleGen component as input. It can be used as follows:
from tfx.components import StatisticsGen
statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
When the TFX pipeline is run in an interactive context, we can visualize the output statistics using Facets as follows:
context.show(statistics_gen.outputs['statistics'])
Generating Schema with SchemaGen
For generating a schema for our data, TFX provides the SchemaGen component, which takes as input the statistics previously generated by StatisticsGen.
from tfx.components import SchemaGen
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)
context.run(schema_gen)
Note: if a schema already exists in the metadata store, SchemaGen will not generate a new one. In case the schema has to be updated (e.g. due to the presence of a new feature), you may have to update it manually as explained earlier.
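If you maintain such a manually curated schema, one way to feed it into the pipeline instead of the SchemaGen output is TFX's ImporterNode. Here is a sketch, assuming the curated schema was written to the hypothetical location schema/:

from tfx.components import ImporterNode
from tfx.types import standard_artifacts

# Import the curated schema as a pipeline artifact
schema_importer = ImporterNode(
    instance_name='import_user_schema',
    source_uri='schema/',  # hypothetical path to the curated schema
    artifact_type=standard_artifacts.Schema)
context.run(schema_importer)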
Validating examples with ExampleValidator
For validating examples from a new dataset, TFX provides the ExampleValidator component, which takes as input the outputs of the two previous components, i.e. the schema and the statistics.
from tfx.components import ExampleValidator
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
context.run(example_validator)
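In an interactive context, we can display the detected anomalies in the same tabular form we saw with the standalone tfdv.display_anomalies:

context.show(example_validator.outputs['anomalies'])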
Note: if the ExampleValidator component detects any anomalies (i.e. mismatches in statistics or schema), it will set the status of the pipeline in the metadata store to failed, which will eventually stop the pipeline. Otherwise, the pipeline will proceed to the next step, for instance data preprocessing.