Building models with tf.text

The field of NLP is going through a renaissance, with spectacular advances in tasks like search, autocomplete, translation, and chatbots (see The Economist interview with a bot). Those achievements were made possible by SOTA models like Google's BERT and OpenAI's GPT-2, and in particular by the transfer learning capabilities of those models. Thanks to tools like TensorFlow and TensorFlow Hub, it is becoming easier to build models for your own task from pre-trained ones and achieve SOTA results.

An important step in building such a model is text preprocessing. In a nutshell, basic preprocessing consists of the following steps:

## Given an input text
['Never tell me the odds!', "It's not my fault.", "It's a trap!"]

## Split each sentence into tokens
[['Never', 'tell', 'me', 'the', 'odds!'], ["It's", 'not', 'my', 'fault.'], ["It's", 'a', 'trap!']]

## Look up each token in the vocabulary to get token IDs
[[1, 3, 4, 10, 17], [16, 5, 7, 18], [16, 2, 11]]
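
A minimal sketch of these two steps with tf.text (introduced below); the vocabulary mapping here is a toy table chosen to reproduce the IDs above:

import tensorflow as tf
import tensorflow_text as text

# Toy vocabulary: token -> ID (made up for illustration)
vocab = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=['Never', 'tell', 'me', 'the', 'odds!', "It's", 'not', 'my', 'fault.', 'a', 'trap!'],
        values=[1, 3, 4, 10, 17, 16, 5, 7, 18, 2, 11]),
    default_value=0)

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Never tell me the odds!', "It's not my fault.", "It's a trap!"])
token_ids = tf.ragged.map_flat_values(vocab.lookup, tokens)
# [[1, 3, 4, 10, 17], [16, 5, 7, 18], [16, 2, 11]]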

This step is very important, as it can dramatically influence the performance of the final model. It can also be tedious and error-prone despite the availability of excellent tools like spaCy, NLTK, and Gensim. In fact, those same tools can lead to training/serving skew, because the preprocessing is performed outside the TensorFlow graph.

Training/serving skew usually shows up during model serving, when the preprocessing is re-implemented in a different language or library and ends up behaving slightly differently than it did at training time.

tf.text aims to make text a first-class citizen in TensorFlow by providing built-in support for text processing inside the TensorFlow graph. Thus, there is no longer any need to rely on the client for preprocessing during serving.

Tokenizer API

Based on RFC 98, the tokenization API introduces two main classes: Tokenizer, which exposes a tokenize method, and TokenizerWithOffsets, which additionally exposes tokenize_with_offsets to return the byte offsets of each token within the original string.

For instance, when using WhitespaceTokenizer (which implements both APIs):

>>> tokenizer = text.WhitespaceTokenizer()
>>> (tokens, offset_starts, offset_limits) = tokenizer.tokenize_with_offsets(["I know you're out there.", "I can feel you now."])

(<tf.RaggedTensor [[b'I', b'know', b"you're", b'out', b'there.'], [b'I', b'can', b'feel', b'you', b'now.']]>,
 <tf.RaggedTensor [[0, 2, 7, 14, 18], [0, 2, 6, 11, 15]]>,
 <tf.RaggedTensor [[1, 6, 13, 17, 24], [1, 5, 10, 14, 19]]>)
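
For comparison, the plain tokenize method returns just the tokens:

>>> tokenizer.tokenize(["I know you're out there.", "I can feel you now."])
<tf.RaggedTensor [[b'I', b'know', b"you're", b'out', b'there.'], [b'I', b'can', b'feel', b'you', b'now.']]>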

Currently available tokenizers include WhitespaceTokenizer, UnicodeScriptTokenizer, WordpieceTokenizer, SentencepieceTokenizer, and BertTokenizer.
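
For instance, UnicodeScriptTokenizer also splits on Unicode script boundaries, so punctuation ends up as separate tokens (a quick sketch):

>>> text.UnicodeScriptTokenizer().tokenize(["It's a trap!"])
<tf.RaggedTensor [[b'It', b"'", b's', b'a', b'trap', b'!']]>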

RaggedTensors

A RaggedTensor is a special kind of Tensor that stores variable-length sequences (of text or numbers) efficiently and does not require padding.

They can be created as follows:

>>> tf.ragged.constant([['Everything', 'not', 'saved', 'will', 'be', 'lost.'], ["It's", 'a', 'trap!']])

<tf.RaggedTensor [[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b"It's", b'a', b'trap!']]>

>>> values = ['Everything', 'not', 'saved', 'will', 'be', 'lost.', "It's", 'a', 'trap!']
>>> row_splits = [0, 6, 9]
>>> tf.RaggedTensor.from_row_splits(values, row_splits)

<tf.RaggedTensor [[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b"It's", b'a', b'trap!']]>

RaggedTensors can also be created from row splits, value row IDs, or row lengths; the three calls below all build the same ragged tensor:

>>> tf.RaggedTensor.from_row_splits([3, 1, 4, 1, 5, 9, 2], [0, 4, 4, 6, 7])
>>> tf.RaggedTensor.from_value_rowids([3, 1, 4, 1, 5, 9, 2], [0, 0, 0, 0, 2, 2, 3])
>>> tf.RaggedTensor.from_row_lengths([3, 1, 4, 1, 5, 9, 2], [4, 0, 2, 1])
# each of the three calls above yields [[3, 1, 4, 1], [], [5, 9], [2]]

Many standard tensor operations also work on RaggedTensors:

x = tf.ragged.constant([['a', 'b', 'c'], ['d']])

tf.rank(x) # 2
x.shape # [2, None], where None denotes the ragged dimension (which is not necessarily the last one)

tf.gather(x, [1, 0]) # [['d'], ['a', 'b', 'c']]
tf.gather_nd(x, [1, 0]) # 'd'

y = tf.ragged.constant([['e'], ['f', 'g']])
tf.concat([x, y], axis=0) # [['a', 'b', 'c'], ['d'], ['e'], ['f', 'g']]
tf.concat([x, y], axis=1) # [['a', 'b', 'c', 'e'], ['d', 'f', 'g']]

cp = tf.strings.unicode_decode(x, 'UTF-8') # [[[97], [98], [99]], [[100]]]
tf.strings.unicode_encode(cp, 'UTF-8')

b = tf.ragged.constant([[True, False, True], [False, True]])
x = tf.ragged.constant([['A', 'B', 'C'], ['D', 'E']])
y = tf.ragged.constant([['a', 'b', 'c'], ['d', 'e']])
tf.where(b, x, y) # [['A', 'b', 'C'], ['d', 'E']]

They can also be created from dense or sparse tensors:

tf.RaggedTensor.from_tensor(dense_tensor)    # dense_tensor: a regular tf.Tensor
tf.RaggedTensor.from_sparse(sparse_tensor)   # sparse_tensor: a tf.SparseTensor

They can also easily be converted to other forms:

x.to_tensor()
x.to_sparse()
x.to_list()
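
For example, to_tensor pads the shorter rows with a default value (a quick sketch):

rt = tf.ragged.constant([['a', 'b', 'c'], ['d']])
rt.to_tensor(default_value='')  # [[b'a', b'b', b'c'], [b'd', b'', b'']]
rt.to_list()                    # [[b'a', b'b', b'c'], [b'd']]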

Currently, only a handful of Keras layers accept RaggedTensors natively.

The tensorflow_text.keras.layers.ToDense layer can be used to convert a RaggedTensor into a regular Tensor in case your model contains layers that do not support them. For instance:

# assumes: from tensorflow.keras.layers import InputLayer, Lambda, LSTM, Dense
# and: from tensorflow.keras import backend as K
model = tf.keras.Sequential([
  InputLayer(input_shape=(None,), dtype='int64', ragged=True),
  tensorflow_text.keras.layers.ToDense(pad_value=pad_val, mask=True),
  Lambda(lambda x: K.one_hot(K.cast(x, 'int64'), vocab_size)),
  LSTM(lstm_output_1),
  Dense(vocab_size, activation='softmax')
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
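
To see what the layer does on its own, it can be applied directly to a RaggedTensor (a small sketch; the values and pad value are arbitrary):

rt = tf.ragged.constant([[1, 2, 3], [4]])
tensorflow_text.keras.layers.ToDense(pad_value=0)(rt)  # [[1, 2, 3], [4, 0, 0]]

With mask=True, the layer also propagates a Keras mask so that downstream layers such as LSTM can ignore the padded positions.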

TensorFlow Text

tf.text can be installed using pip as follows:

pip install tensorflow_text

Then imported as follows

import tensorflow as tf
import tensorflow_text as text

Preprocessing

Using Tensorflow Text, a preprocessing function should look like this

def basic_preprocess(text_input, labels):
  # Tokenize and encode the text
  tokenizer = text.WhitespaceTokenizer()
  rt = tokenizer.tokenize(text_input)

  # Look up the token strings in the vocabulary table
  # (`table` is built elsewhere, e.g. with tf.lookup; see the sketch below)
  features = tf.ragged.map_flat_values(table.lookup, rt)

  return features, labels
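
Here, table is assumed to be a vocabulary lookup table created beforehand, for instance with tf.lookup (a minimal sketch; the vocabulary contents are hypothetical):

vocab = ['Never', 'tell', 'me', 'the', 'odds!']  # hypothetical vocabulary
init = tf.lookup.KeyValueTensorInitializer(
    keys=vocab, values=tf.range(len(vocab), dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)  # OOV tokens hash into an extra bucket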

Create a dataset for training and pass each sample to the previous preprocessing function

## Set up a data pipeline to preprocess the input
dataset = tf.data.Dataset.from_tensor_slices((text_input, labels))
dataset = dataset.map(basic_preprocess)

## Create a model to classify the sentence sentiment
model = keras.Sequential([..., keras.layers.LSTM(32), ...])
model.compile(...)

## Train the classifier on the input data
model.fit(dataset, epochs=30)

Ngrams

TensorFlow Text provides an API to create n-grams (i.e. groupings of a fixed width over a sequence), which can be used, for instance, in a character bigram model.

Group Reductions

The available reductions (ways to combine each group) are STRING_JOIN, SUM, and MEAN.

For instance, the MEAN reduction can be useful for numeric values, e.g. temperature readings grouped into windows of 2 or 3 readings.
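
For example, a sliding mean over windows of three readings (the values here are made up):

readings = tf.constant([18.0, 20.5, 21.0, 19.5, 22.0])
text.ngrams(readings, 3, reduction_type=text.Reduction.MEAN)  # ≈ [19.83, 20.33, 20.83]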

t = tf.constant(["Fate, it seems, is not without a sense of irony."])
tokens = tokenizer.tokenize(t)
# [['Fate,', 'it', 'seems', 'is', 'not', 'without', 'a', 'sense', 'of', 'irony.']]

text.ngrams(tokens, width=2, reduction_type=text.Reduction.STRING_JOIN)
# [['Fate, it', 'it seems,', 'seems, is', 'is not', 'not without', 'without a', 'a sense', 'sense of', 'of irony.']]

t = tf.constant(["It's a trap!"])
chars = tf.strings.unicode_split(t, 'UTF-8') # [['I', 't', "'", 's', ' ', 'a', ' ', 't', 'r', 'a', 'p', '!']]
text.ngrams(chars, width=3, reduction_type=text.Reduction.STRING_JOIN, string_separator='')
# [["It'", "t's", "'s ", 's a', ' a ', 'a t', ' tr', 'tra', 'rap', 'ap!']]

t = tf.constant([2, 4, 6, 8, 10])
text.ngrams(t, 2, reduction_type=text.Reduction.SUM)  # [6 10 14 18]
text.ngrams(t, 2, reduction_type=text.Reduction.MEAN) # [3 5 7 9]

Preprocessing with ngrams

The ngrams API can be used in a preprocessing function like this

def preprocess(record):
  # `record` is assumed to be a dict of parsed features (see the parsing sketch at the end of this section)
  # record['raw']: ['Looks good.', 'Thanks!', 'Okay'], shape (3,)
  # Convert characters into codepoints
  codepoints = tf.strings.unicode_decode(record['raw'], 'UTF-8')
  # [[76, 111, 111, 107, 115, 32, 103, 111, 111, 100, 46], [84, 104, 97, 110, 107, 115, 33], [79, 107, 97, 121]] shape(3, ?)
  codepoints = codepoints.merge_dims(outer_axis=-2, inner_axis=-1)  # flatten the batch into a single 1-D sequence of codepoints

  # Generate bigrams
  bigrams = text.ngrams(codepoints, 2, reduction_type=text.Reduction.SUM)
  values = tf.cast(bigrams, dtype=tf.float32)
  labels = record.pop('attack')

  return values, labels

This preprocessing function can then be used in a pipeline to generate the training dataset for model training:

## Set up a data pipeline to preprocess the input
dataset = tf.data.TFRecordDataset('.../*.tfrecord')
dataset = dataset.map(preprocess)

## Create a model to classify the text
model = keras.Sequential([..., keras.layers.LSTM(32), ...])
model.compile(...)

## Train the classifier on the input data
model.fit(dataset, epochs=30)
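
Note that tf.data.TFRecordDataset yields serialized tf.Example protos, so a parsing step is needed before preprocess. A minimal sketch, assuming the text is stored under the key 'raw' and the label under 'attack' (filenames is a placeholder):

feature_spec = {
    'raw': tf.io.FixedLenFeature([], tf.string),
    'attack': tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
  # Parse a batch of serialized tf.Example protos into a dict of features
  return tf.io.parse_example(serialized, feature_spec)

dataset = tf.data.TFRecordDataset(filenames).batch(3)
dataset = dataset.map(parse).map(preprocess)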