Building models with tf.text

The field of NLP is going through a renaissance, with spectacular advances in tasks like search, autocomplete, translation, and chatbots (see The Economist interview with a bot). Those achievements were made possible by SOTA models like Google's BERT and OpenAI's GPT-2, and in particular by the transfer learning capabilities of those models. Thanks to tools like TensorFlow and TensorFlow Hub, it is becoming easier to build models for your own task from pre-trained ones and achieve SOTA results.

An important step in building such a model is text preprocessing. In a nutshell, basic preprocessing consists of the following steps:

## Given an input text
['Never tell me the odds!', "It's not my fault.", "It's a trap!"]

## Split each sentence into tokens
[['Never', 'tell', 'me', 'the', 'odds!'], ["It's", 'not', 'my', 'fault.'], ["It's", 'a', 'trap!']]

## Look up each token in the vocabulary to get token IDs
[[1, 3, 4, 10, 17], [16, 5, 7, 18], [16, 2, 11]]
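
A minimal sketch of these two steps with tf.text (introduced below); the vocabulary mapping here is a toy table chosen to reproduce the IDs above:

import tensorflow as tf
import tensorflow_text as text

# Toy vocabulary: token -> ID (made up for illustration)
vocab = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=['Never', 'tell', 'me', 'the', 'odds!', "It's", 'not', 'my', 'fault.', 'a', 'trap!'],
        values=[1, 3, 4, 10, 17, 16, 5, 7, 18, 2, 11]),
    default_value=0)

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Never tell me the odds!', "It's not my fault.", "It's a trap!"])
token_ids = tf.ragged.map_flat_values(vocab.lookup, tokens)
# [[1, 3, 4, 10, 17], [16, 5, 7, 18], [16, 2, 11]]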

This step is very important, as it can dramatically influence the performance of the final model. It can also be tedious and error-prone despite the availability of excellent tools like spaCy, NLTK, and Gensim. In fact, those same tools can lead to training/serving skew, because the preprocessing is performed outside the TensorFlow graph.

Training/serving skew usually shows up during model serving, when the preprocessing is re-implemented in a different language or library and ends up behaving slightly differently than it did at training time.

tf.text aims to make text a first-class citizen in TensorFlow by providing built-in support for text processing inside the TensorFlow graph. Thus, there is no longer any need to rely on the client for preprocessing during serving.

Tokenizer API

Based on RFC 98, the tokenization API introduces two main classes: Tokenizer, which exposes a tokenize method, and TokenizerWithOffsets, which additionally exposes tokenize_with_offsets to return the byte offsets of each token within the original string.

For instance, when using WhitespaceTokenizer (which implements both APIs):

>>> tokenizer = text.WhitespaceTokenizer()
>>> (tokens, offset_starts, offset_limits) = tokenizer.tokenize_with_offsets(["I know you're out there.", "I can feel you now."])

(<tf.RaggedTensor [[b'I', b'know', b"you're", b'out', b'there.'], [b'I', b'can', b'feel', b'you', b'now.']]>,
 <tf.RaggedTensor [[0, 2, 7, 14, 18], [0, 2, 6, 11, 15]]>,
 <tf.RaggedTensor [[1, 6, 13, 17, 24], [1, 5, 10, 14, 19]]>)
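
For comparison, the plain tokenize method returns just the tokens:

>>> tokenizer.tokenize(["I know you're out there.", "I can feel you now."])
<tf.RaggedTensor [[b'I', b'know', b"you're", b'out', b'there.'], [b'I', b'can', b'feel', b'you', b'now.']]>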

Currently available tokenizers include WhitespaceTokenizer, UnicodeScriptTokenizer, WordpieceTokenizer, SentencepieceTokenizer, and BertTokenizer.
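
For instance, UnicodeScriptTokenizer also splits on Unicode script boundaries, so punctuation ends up as separate tokens (a quick sketch):

>>> text.UnicodeScriptTokenizer().tokenize(["It's a trap!"])
<tf.RaggedTensor [[b'It', b"'", b's', b'a', b'trap', b'!']]>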

RaggedTensors

A RaggedTensor is a special kind of Tensor that stores variable-length sequences (of text or numbers) efficiently and does not require padding.

They can be created as follows:

>>> tf.ragged.constant([['Everything', 'not', 'saved', 'will', 'be', 'lost.'], ["It's", 'a', 'trap!']])

<tf.RaggedTensor [[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b"It's", b'a', b'trap!']]>

>>> values = ['Everything', 'not', 'saved', 'will', 'be', 'lost.', "It's", 'a', 'trap!']
>>> row_splits = [0, 6, 9]
>>> tf.RaggedTensor.from_row_splits(values, row_splits)

<tf.RaggedTensor [[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b"It's", b'a', b'trap!']]>

RaggedTensors can also be created from row splits, value row IDs, or row lengths; the three calls below all build the same ragged tensor:

>>> tf.RaggedTensor.from_row_splits([3, 1, 4, 1, 5, 9, 2], [0, 4, 4, 6, 7])
>>> tf.RaggedTensor.from_value_rowids([3, 1, 4, 1, 5, 9, 2], [0, 0, 0, 0, 2, 2, 3])
>>> tf.RaggedTensor.from_row_lengths([3, 1, 4, 1, 5, 9, 2], [4, 0, 2, 1])
# each of the three calls above yields [[3, 1, 4, 1], [], [5, 9], [2]]

Many standard tensor operations also work on RaggedTensors:

x = tf.ragged.constant([['a', 'b', 'c'], ['d']])

tf.rank(x) # 2
x.shape # [2, None], where None denotes the ragged dimension (which is not necessarily the last one)

tf.gather(x, [1, 0]) # [['d'], ['a', 'b', 'c']]
tf.gather_nd(x, [1, 0]) # 'd'

y = tf.ragged.constant([['e'], ['f', 'g']])
tf.concat([x, y], axis=0) # [['a', 'b', 'c'], ['d'], ['e'], ['f', 'g']]
tf.concat([x, y], axis=1) # [['a', 'b', 'c', 'e'], ['d', 'f', 'g']]

cp = tf.strings.unicode_decode(x, 'UTF-8') # [[[97], [98], [99]], [[100]]]
tf.strings.unicode_encode(cp, 'UTF-8')

b = tf.ragged.constant([[True, False, True], [False, True]])
x = tf.ragged.constant([['A', 'B', 'C'], ['D', 'E']])
y = tf.ragged.constant([['a', 'b', 'c'], ['d', 'e']])
tf.where(b, x, y) # [['A', 'b', 'C'], ['d', 'E']]

They can also be created from dense or sparse tensors:

tf.RaggedTensor.from_tensor(dense_tensor)    # dense_tensor: a regular tf.Tensor
tf.RaggedTensor.from_sparse(sparse_tensor)   # sparse_tensor: a tf.SparseTensor

They can also easily be converted to other forms:

x.to_tensor()
x.to_sparse()
x.to_list()
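
For example, to_tensor pads the shorter rows with a default value (a quick sketch):

rt = tf.ragged.constant([['a', 'b', 'c'], ['d']])
rt.to_tensor(default_value='')  # [[b'a', b'b', b'c'], [b'd', b'', b'']]
rt.to_list()                    # [[b'a', b'b', b'c'], [b'd']]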

Currently, only a handful of Keras layers accept RaggedTensors natively.

The tensorflow_text.keras.layers.ToDense layer can be used to convert a RaggedTensor into a regular Tensor in case your model contains layers that do not support them. For instance:

# assumes: from tensorflow.keras.layers import InputLayer, Lambda, LSTM, Dense
# and: from tensorflow.keras import backend as K
model = tf.keras.Sequential([
  InputLayer(input_shape=(None,), dtype='int64', ragged=True),
  tensorflow_text.keras.layers.ToDense(pad_value=pad_val, mask=True),
  Lambda(lambda x: K.one_hot(K.cast(x, 'int64'), vocab_size)),
  LSTM(lstm_output_1),
  Dense(vocab_size, activation='softmax')
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
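
To see what the layer does on its own, it can be applied directly to a RaggedTensor (a small sketch; the values and pad value are arbitrary):

rt = tf.ragged.constant([[1, 2, 3], [4]])
tensorflow_text.keras.layers.ToDense(pad_value=0)(rt)  # [[1, 2, 3], [4, 0, 0]]

With mask=True, the layer also propagates a Keras mask so that downstream layers such as LSTM can ignore the padded positions.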

TensorFlow Text

tf.text can be installed using pip as follows:

pip install tensorflow_text

Then imported as follows

import tensorflow as tf
import tensorflow_text as text

Preprocessing

Using Tensorflow Text, a preprocessing function should look like this

def basic_preprocess(text_input, labels):
  # Tokenize and encode the text
  tokenizer = text.WhitespaceTokenizer()
  rt = tokenizer.tokenize(text_input)

  # Look up the token strings in the vocabulary table
  # (`table` is built elsewhere, e.g. with tf.lookup; see the sketch below)
  features = tf.ragged.map_flat_values(table.lookup, rt)

  return features, labels
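
Here, table is assumed to be a vocabulary lookup table created beforehand, for instance with tf.lookup (a minimal sketch; the vocabulary contents are hypothetical):

vocab = ['Never', 'tell', 'me', 'the', 'odds!']  # hypothetical vocabulary
init = tf.lookup.KeyValueTensorInitializer(
    keys=vocab, values=tf.range(len(vocab), dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)  # OOV tokens hash into an extra bucket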

Create a dataset for training and pass each sample to the previous preprocessing function

## Set up a data pipeline to preprocess the input
dataset = tf.data.Dataset.from_tensor_slices((text_input, labels))
dataset = dataset.map(basic_preprocess)

## Create a model to classify the sentence sentiment
model = keras.Sequential([..., keras.layers.LSTM(32), ...])
model.compile(...)

## Train the classifier on the input data
model.fit(dataset, epochs=30)

Ngrams

TensorFlow Text provides an API to create n-grams (i.e. groupings of a fixed width over a sequence), which can be used, for instance, in a character bigram model.

Group Reductions

The available reductions (ways to combine each group) are STRING_JOIN, SUM, and MEAN.

For instance, the MEAN reduction can be useful for numeric values, e.g. temperature readings grouped into windows of 2 or 3 readings.
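
For example, a sliding mean over windows of three readings (the values here are made up):

readings = tf.constant([18.0, 20.5, 21.0, 19.5, 22.0])
text.ngrams(readings, 3, reduction_type=text.Reduction.MEAN)  # ≈ [19.83, 20.33, 20.83]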

t = tf.constant(["Fate, it seems, is not without a sense of irony."])
tokens = tokenizer.tokenize(t)
# [['Fate,', 'it', 'seems', 'is', 'not', 'without', 'a', 'sense', 'of', 'irony.']]

text.ngrams(tokens, width=2, reduction_type=text.Reduction.STRING_JOIN)
# [['Fate, it', 'it seems,', 'seems, is', 'is not', 'not without', 'without a', 'a sense', 'sense of', 'of irony.']]

t = tf.constant(["It's a trap!"])
chars = tf.strings.unicode_split(t, 'UTF-8') # [['I', 't', "'", 's', ' ', 'a', ' ', 't', 'r', 'a', 'p', '!']]
text.ngrams(chars, width=3, reduction_type=text.Reduction.STRING_JOIN, string_separator='')
# [["It'", "t's", "'s ", 's a', ' a ', 'a t', ' tr', 'tra', 'rap', 'ap!']]

t = tf.constant([2, 4, 6, 8, 10])
text.ngrams(t, 2, reduction_type=text.Reduction.SUM)  # [6 10 14 18]
text.ngrams(t, 2, reduction_type=text.Reduction.MEAN) # [3 5 7 9]

Preprocessing with ngrams

The ngrams API can be used in a preprocessing function like this

def preprocess(record):
  # `record` is assumed to be a dict of parsed features (see the parsing sketch at the end of this section)
  # record['raw']: ['Looks good.', 'Thanks!', 'Okay'], shape (3,)
  # Convert characters into codepoints
  codepoints = tf.strings.unicode_decode(record['raw'], 'UTF-8')
  # [[76, 111, 111, 107, 115, 32, 103, 111, 111, 100, 46], [84, 104, 97, 110, 107, 115, 33], [79, 107, 97, 121]] shape(3, ?)
  codepoints = codepoints.merge_dims(outer_axis=-2, inner_axis=-1)  # flatten the batch into a single 1-D sequence of codepoints

  # Generate bigrams
  bigrams = text.ngrams(codepoints, 2, reduction_type=text.Reduction.SUM)
  values = tf.cast(bigrams, dtype=tf.float32)
  labels = record.pop('attack')

  return values, labels

This preprocessing function can then be used in a pipeline to generate the training dataset for model training:

## Set up a data pipeline to preprocess the input
dataset = tf.data.TFRecordDataset('.../*.tfrecord')
dataset = dataset.map(preprocess)

## Create a model to classify the text
model = keras.Sequential([..., keras.layers.LSTM(32), ...])
model.compile(...)

## Train the classifier on the input data
model.fit(dataset, epochs=30)
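
Note that tf.data.TFRecordDataset yields serialized tf.Example protos, so a parsing step is needed before preprocess. A minimal sketch, assuming the text is stored under the key 'raw' and the label under 'attack' (filenames is a placeholder):

feature_spec = {
    'raw': tf.io.FixedLenFeature([], tf.string),
    'attack': tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
  # Parse a batch of serialized tf.Example protos into a dict of features
  return tf.io.parse_example(serialized, feature_spec)

dataset = tf.data.TFRecordDataset(filenames).batch(3)
dataset = dataset.map(parse).map(preprocess)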