pip install tf-nightly -q

The release of TF 2.1 introduced many new features, among them the TextVectorization layer. This layer preprocesses raw text: text normalization/standardization, tokenization, n-gram generation, and vocabulary indexing.

This class takes the following arguments:

| Argument | Default | Description |
| --- | --- | --- |
| max_tokens | None (no limit) | Maximum size of the vocabulary. |
| standardize | lower_and_strip_punctuation | Function to call for text standardization; it can be None (no standardization). |
| split | whitespace | Function to use for splitting. |
| ngrams | None | Integer or tuple indicating how many n-grams to create. |
| output_mode | int | Output of the layer. int: token indices. binary: an array of 1s where each 1 means the token is present in the text. count: like binary, except the array contains token counts instead of 1s. tf-idf: like binary, except the values are computed with the TF-IDF algorithm. See the sketch below the table. |
| output_sequence_length | None | Only valid in int mode; the output is padded up to this length. |
| pad_to_max_tokens | True | Only valid in binary, count, and tf-idf modes. A flag indicating whether or not to pad the output up to max_tokens. |
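
To make the output modes concrete, here is a minimal sketch; the two toy sentences are made up, and the exact indices depend on the adapted vocabulary:

import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

texts = np.array(['the cat sat', 'the dog sat on the cat'])

# 'int' mode: one index per token, padded to the longest sequence.
int_layer = TextVectorization(output_mode='int')
int_layer.adapt(texts)
print(int_layer(texts[:, None]))

# 'count' mode: one row per example with per-token counts over the vocabulary.
count_layer = TextVectorization(max_tokens=10, output_mode='count')
count_layer.adapt(texts)
print(count_layer(texts[:, None]))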

HowTo

First, look at the raw data (in the training set) to figure out what kind of normalization and tokenization is needed, and check that they produce the expected results.
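
For example, assuming the raw training set is a tf.data.Dataset of strings named raw_train_ds (a hypothetical name), you can print a few samples:

for text in raw_train_ds.take(3):
  print(text.numpy())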

Second, define a function that takes raw text as input and cleans it, e.g. removing punctuation and any HTML tags it contains.

import re
import string
import tensorflow as tf

def normalize(text):
  # Punctuation characters, escaped for use inside a regex character class.
  remove_regex = f'[{re.escape(string.punctuation)}]'
  # Assumed pattern for HTML tags (the original pattern was elided); tags are
  # replaced with a space so adjacent words don't merge.
  space_regex = '<[^>]*>'
  result = tf.strings.lower(text)
  result = tf.strings.regex_replace(result, space_regex, ' ')
  result = tf.strings.regex_replace(result, remove_regex, '')
  return result
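
A quick sanity check on a made-up sample helps confirm the regexes behave as intended:

print(normalize(tf.constant(['Hello, <br /> World!!'])))
# -> something like [b'hello   world']: tags and punctuation are gone.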

Third, define a TextVectorization layer that uses the previously defined normalize function and specifies the shape of the output.

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

MAX_TOKENS_NUM = 5000   # Maximum vocab size.
MAX_SEQUENCE_LEN = 40   # Sequence length to pad the outputs to.

vectorize_layer = TextVectorization(
  standardize=normalize,
  max_tokens=MAX_TOKENS_NUM,
  output_mode='int',
  output_sequence_length=MAX_SEQUENCE_LEN)

Fourth, call the vectorization layer's adapt method to build the vocabulary.

vectorize_layer.adapt(text_dataset)
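
Here text_dataset is assumed to be a tf.data.Dataset (or a NumPy array) of raw strings; a minimal sketch with made-up sentences:

import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices([
    'The movie was great!',
    'The movie was okay.',
    'The movie was terrible...',
])
vectorize_layer.adapt(text_dataset.batch(2))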

Finally, the layer can be used in a Keras model just like any other layer.

EMBEDDING_DIMS = 100  # Dimensionality of the token embeddings.

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))

One thing to note here is that the input layer needs to have a shape of (1,) so that there is one string per item in a batch. Also, the embedding layer takes an input dimension of MAX_TOKENS_NUM + 1 because we are counting the padding token.
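
From here, one possible way to finish the model for, say, binary sentiment classification; the pooling and dense head below are illustrative choices, not part of the TextVectorization API:

model.add(tf.keras.layers.GlobalAveragePooling1D())
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# train_ds is a hypothetical tf.data.Dataset yielding (text, label) batches.
model.fit(train_ds, epochs=5)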

Check the TF 2.1.0 release notes here: https://github.com/tensorflow/tensorflow/releases/tag/v2.1.0