Data Validation with TensorFlow eXtended (TFX)
03 Nov 2020 by dzlab

In a previous article, we discussed how we can ingest data from various sources into a TFX pipeline. In this article, we will discuss the next step of a TFX pipeline, which involves schema generation and data validation.
This step checks the data coming through the pipeline and catches any changes that could impact the next steps (i.e. feature engineering and training). In TFX, this is implemented via the Tensorflow Data Validation (TFDV) Python library, which can be installed with pip:
$ pip install tensorflow-data-validation
TFDV can be used for generating schemas and statistics about the distribution of every feature in the dataset. Such information is useful for comparing multiple datasets (e.g. training vs inference datasets) and reporting:
- Anomalies related to schema changes
- Statistical differences in the feature distributions
TFDV also offers visualization capabilities for comparing datasets based on the Google PAIR Facets project.
Describing data with TFDV
The usual workflow when using TFDV during training is as follows:
- Generate statistics for the data
- Use those statistics to generate a schema for each feature
- Visualize the schema and statistics and manually inspect them
- Update the schema if needed
Then when new data comes in, the workflow becomes:
- Generate statistics and a schema for the new data.
- Visualize the new statistics side by side with the training data statistics.
- Validate the new statistics against the ones from training to detect anomalies.
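Putting the two phases together, here is a minimal sketch of this workflow; the file paths data/train.csv and data/new.csv are hypothetical, and the individual calls are detailed in the sections below.

import tensorflow_data_validation as tfdv

# Phase 1: describe the training data and derive a schema from it
train_stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv')
schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(schema)  # manual inspection, update if needed

# Phase 2: when new data comes in, compare and validate it
new_stats = tfdv.generate_statistics_from_csv(data_location='data/new.csv')
tfdv.visualize_statistics(
    lhs_statistics=new_stats, lhs_name='new',
    rhs_statistics=train_stats, rhs_name='train')
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)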
Generating Statistics
Before any data validation, we need to generate statistics. We can use any of the TFDV helper functions:
- tfdv.generate_statistics_from_csv when data is in a CSV file
- tfdv.generate_statistics_from_dataframe when data is in a Pandas DataFrame
- tfdv.generate_statistics_from_tfrecord when data is in a TFRecord file
Here is an example reading from a CSV file with TFDV and generating statistics for each feature.
stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv', delimiter=',')
We can manually inspect those statistics using tfdv.visualize_statistics
tfdv.visualize_statistics(stats)
Notice that TFDV generates different types of statistics depending on the feature type.
- For numerical features, TFDV computes:
  - Count of records
  - Number of missing values (i.e. nulls)
  - Histogram of values
  - Mean and standard deviation
  - Minimum and maximum values
  - Percentage of zero values
- For categorical features, TFDV provides:
  - Count of values
  - Percentage of missing values
  - Number of unique values
  - Average string length
  - Count for each label and its rank
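The returned statistics are a DatasetFeatureStatisticsList protocol buffer from the tensorflow_metadata package, so they can also be inspected programmatically rather than visually. Here is a minimal sketch, assuming the stats object generated earlier; the field names follow the tensorflow_metadata statistics proto:

# Access the statistics proto directly
dataset = stats.datasets[0]
print('number of examples:', dataset.num_examples)
for feature in dataset.features:
    # a feature carries num_stats or string_stats depending on its type
    if feature.HasField('num_stats'):
        print(feature.path.step, 'mean:', feature.num_stats.mean,
              'std:', feature.num_stats.std_dev)
    elif feature.HasField('string_stats'):
        print(feature.path.step, 'unique values:', feature.string_stats.unique)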
Generating Schema
Once statistics are generated, the next step is to generate a schema for our dataset. This schema maps each feature in the dataset to a type (float, bytes, etc.) and defines feature boundaries (min, max, distribution of values, fraction of missing values, etc.).
With TFDV, we generate the schema from the statistics using tfdv.infer_schema as follows:
schema = tfdv.infer_schema(stats)
For example, a numerical feature may have a schema like this:
feature {
  name: "Num"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
On the other hand, a categorical feature may have a schema that looks like this:
feature {
  name: "Cat"
  type: BYTES
  domain: "Cat"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
string_domain {
  name: "Cat"
  value: "A"
  value: "B"
  value: "C"
}
TFDV provides an API to print a summary of each feature's schema using tfdv.display_schema
tfdv.display_schema(schema)
Feature name | Type | Presence | Valency | Domain |
---|---|---|---|---|
‘Num’ | FLOAT | required | - | - |
‘Cat’ | STRING | required | single | ‘Cat’ |

Domain | Values |
---|---|
‘Cat’ | ‘A’, ‘B’, ‘C’ |
In this visualization, the columns stand for:
- Presence indicates whether the feature must be present in 100% of examples (required) or not (optional).
- Valency indicates the number of values required per training example. In the case of categorical features, single indicates that each training example must have exactly one category for the feature.
Updating schema
TFDV lets you update the schema according to your domain knowledge of the data if you are not satisfied with the auto-generated schema.
The steps to update the schema are as follows:
First, load the schema from its serialized location:
schema = tfdv.load_schema_text(schema_location)
Then, we update a feature; for instance, we make the Num feature required in only 80% of examples instead of 100%:
Num_feature = tfdv.get_feature(schema, 'Num')
Num_feature.presence.min_fraction = 0.8
Another case would be a categorical feature that is missing a possible value. We can add the new label as follows:
Colors_domain = tfdv.get_domain(schema, 'Colors')
Colors_domain.value.insert(3, 'Yellow')
Colors_domain.value
# ['Red', 'Green', 'Blue', 'Yellow']
Finally, we can store the schema back as follows:
tfdv.write_schema_text(schema, schema_location)
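Another common manual edit is to declare schema environments, for example when a label column is present in training data but absent at serving time. Here is a short sketch based on TFDV's environment support; the feature name 'Label' and the serving_stats object are hypothetical:

# All features belong to both environments by default
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')
# The label is only expected in the TRAINING environment
tfdv.get_feature(schema, 'Label').not_in_environment.append('SERVING')
# Validate serving data against the SERVING view of the schema
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')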
Spotting issues with TFDV
In the previous sections, we introduced the basics of TFDV and how to generate statistics and a schema for a dataset. In this section, we will see how to use TFDV to spot issues in the data.
Comparing Datasets
Suppose we have two datasets, one for training and the other for evaluation. TFDV lets you determine how representative the evaluation dataset is of the training one. More precisely, it helps you answer questions like:
- Does the evaluation data have a similar schema as the training dataset?
- Does the distribution of values for every feature match in both datasets?
The following example illustrates the interactive tool that TFDV provides for comparing two datasets:
dataset1_stats = tfdv.generate_statistics_from_csv(
    data_location='data/X_1.csv', delimiter=',')
dataset2_stats = tfdv.generate_statistics_from_csv(
    data_location='data/X_2.csv', delimiter=',')
tfdv.visualize_statistics(
    lhs_statistics=dataset1_stats, lhs_name='DS-I',
    rhs_statistics=dataset2_stats, rhs_name='DS-II')
In this example, we can easily see that the distributions of the two datasets are very different. In most cases, the same feature follows a normal distribution in both datasets, but with clearly different means and standard deviations. We can also see that in the second dataset, the numerical feature X4 has a lot more missing values than it does in the first dataset. Also, the categorical feature X3 seems to have an additional label in the second dataset which is not present in the first dataset.
We can use the earlier generated schema for comparison and spot any mismatches present in the second dataset:
anomalies = tfdv.validate_statistics(statistics=dataset2_stats, schema=schema, previous_statistics=dataset1_stats)
anomalies
An example of the output of the anomaly report:
anomaly_info {
  key: "X3"
  value {
    description: "Examples contain values missing from the schema: D (~13%). "
    severity: ERROR
    short_description: "Unexpected string values"
    reason {
      type: ENUM_TYPE_UNEXPECTED_STRING_VALUES
      short_description: "Unexpected string values"
      description: "Examples contain values missing from the schema: D (~13%). "
    }
    path {
      step: "X3"
    }
  }
}
We can display a summary of the anomalies using tfdv.display_anomalies
tfdv.display_anomalies(anomalies)
Feature name | Anomaly short description | Anomaly long description |
---|---|---|
‘X1’ | Column dropped | The feature was present in fewer examples than expected. |
‘X3’ | Unexpected string values | Examples contain values missing from the schema: D (~13%). |
‘X4’ | Column dropped | The feature was present in fewer examples than expected. |
‘X5’ | Unexpected string values | Examples contain values missing from the schema: D (~16%). |
We can see that the reported anomalies match our earlier observations, for instance:
- The presence of a new label D for the categorical feature X3
- The higher rate of missing values for the feature X4
Comparing slices
In addition to comparing entire datasets, TFDV can also be used to compare slices of the same dataset on a particular feature. This is very useful when inspecting the data for bias, for example when missing values are not uniformly spread over the different labels.
As an example, we will look at feature X3 from the first dataset, and slice this dataset to get the statistics for label B using the following snippet.
from tensorflow_data_validation.utils import slicing_util
# slice dataset on label B of feature X3
slice_fn1 = slicing_util.get_feature_value_slicer(features={'X3': [b'B']})
slice_options = tfdv.StatsOptions(slice_functions=[slice_fn1])
slice_stats = tfdv.generate_statistics_from_csv(
    data_location='data/X_1.csv',
    stats_options=slice_options)

# helper code for visualization
from tensorflow_metadata.proto.v0 import statistics_pb2

def display_slice_keys(stats):
    # print the name of every slice present in the statistics
    print(list(map(lambda x: x.name, stats.datasets)))

def get_sliced_stats(stats, slice_key):
    # extract the statistics of a single slice into its own proto
    for sliced_stats in stats.datasets:
        if sliced_stats.name == slice_key:
            result = statistics_pb2.DatasetFeatureStatisticsList()
            result.datasets.add().CopyFrom(sliced_stats)
            return result
    print(f'Invalid Slicing key: {slice_key}')

# Visualize both statistics
lhs_stats = get_sliced_stats(slice_stats, 'X3_B')
rhs_stats = get_sliced_stats(slice_stats, 'All Examples')
tfdv.visualize_statistics(lhs_stats, rhs_stats)
The resulting TFDV visualization of the slice vs. the full dataset will look like this:
Comparing Datasets for Skew
TFDV provides a skew_comparator to inspect the statistics of two datasets and detect any significant differences. TFDV defines skew as the L-infinity norm of the difference between the statistics of the two datasets (e.g. training vs. serving). A threshold on the L-infinity norm is used for reporting an anomaly.
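For a categorical feature, the L-infinity distance is simply the largest absolute difference in value frequencies between the two datasets. For instance, if a label D accounts for 0% of the examples in the training data but roughly 16.5% of the serving data, the L-infinity distance for that feature is roughly 0.165, and any threshold below that value triggers an anomaly.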
The following example illustrates how to generate the Skew anomaly report and then visualize it.
tfdv.get_feature(schema, 'X5').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(
    statistics=dataset1_stats,
    schema=schema,
    serving_statistics=dataset2_stats)
skew_anomalies
After generating the Skew anomaly report, we can visualize it using display_anomalies
tfdv.display_anomalies(skew_anomalies)
Feature name | Anomaly short description | Anomaly long description |
---|---|---|
‘X5’ | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.1648 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: D |
Comparing Datasets for Drift
TFDV also provides a drift_comparator for comparing the statistics of two datasets collected at different points in time (e.g. different months).
To use the drift_comparator, as illustrated in the following snippet, simply pick a feature to analyze and call validate_statistics, supplying a baseline (e.g. last month's dataset) and a comparison dataset (e.g. this month's dataset).
tfdv.get_feature(schema, 'X3').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=dataset2_stats,
    schema=schema,
    previous_statistics=dataset1_stats)
drift_anomalies
After generating the Drift anomaly report, we can visualize it using display_anomalies
tfdv.display_anomalies(drift_anomalies)
Feature name | Anomaly short description | Anomaly long description |
---|---|---|
‘X5’ | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.1648 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: D |
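Outside of a notebook, we usually want to act on these reports rather than just display them. Here is a minimal sketch that fails fast when a report is not empty; the anomaly_info field of the report is a map keyed by feature name:

# Abort if any anomaly was reported
if drift_anomalies.anomaly_info:
    raise ValueError(
        f'Anomalies detected for features: {list(drift_anomalies.anomaly_info)}')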
Using TFDV with TFX
In the previous sections, we have seen how to use TFDV for exploring and validating datasets as a standalone tool. This is very handy for manual inspection of the data, but this can also be automated in TFX using the following components:
Generating statistics with StatisticsGen
For generating statistics, TFX provides the pipeline component StatisticsGen. This component accepts the output of the ExampleGen component as input. It can be used as follows:
from tfx.components import StatisticsGen
statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
When the TFX pipeline is run in an interactive context, we can visualize the output statistics using Facets as follows:
context.show(statistics_gen.outputs['statistics'])
Generating Schema with SchemaGen
For generating a schema for our data, TFX provides the SchemaGen component, which takes as input the statistics previously generated by StatisticsGen.
from tfx.components import SchemaGen
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)
context.run(schema_gen)
Note: if a schema already exists in the metadata store, SchemaGen will not generate a new one. In case the schema has to be updated (e.g. due to the presence of a new feature), you may have to update it manually as explained earlier.
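If you maintain such a manually curated schema, one way to feed it into the pipeline instead of the SchemaGen output is TFX's ImporterNode. Here is a sketch, assuming the curated schema was written to the hypothetical location schema/:

from tfx.components import ImporterNode
from tfx.types import standard_artifacts

# Import the curated schema as a pipeline artifact
schema_importer = ImporterNode(
    instance_name='import_user_schema',
    source_uri='schema/',  # hypothetical path to the curated schema
    artifact_type=standard_artifacts.Schema)
context.run(schema_importer)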
Validating examples with ExampleValidator
For validating examples from a new dataset, TFX provides the ExampleValidator component, which takes as input the outputs of the two previous components, i.e. the schema and the statistics.
from tfx.components import ExampleValidator
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
context.run(example_validator)
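In an interactive context, we can display the detected anomalies in the same tabular form we saw with the standalone tfdv.display_anomalies:

context.show(example_validator.outputs['anomalies'])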
Note: if the ExampleValidator component detects any anomalies (i.e. mismatches in statistics or schema), it will set the status of the pipeline in the metadata store to failed, which will eventually stop the pipeline. Otherwise, the pipeline will proceed to the next step, for instance data preprocessing.