Text Preprocessing with Keras

By @dzlab on Jan 10, 2020

Preprocessing can be very tedious depending on the data format (e.g. JSON, XML, binary) and how your model expects it (e.g. a fixed sequence length). Keras provides an API for preprocessing different kinds of raw data, such as images or text, that is well worth knowing.

Sequence Preprocessing

The keras.preprocessing.sequence module provides helpers for preprocessing sequence data, whether text or time series.

Sequence Padding

You can use pad_sequences to pad (or truncate) your sequences so that they all have the same length, given by maxlen.

from keras.preprocessing.sequence import pad_sequences

# X_train_raw / X_test_raw: lists of integer sequences of varying length
X_train = pad_sequences(X_train_raw, maxlen=80)
X_test  = pad_sequences(X_test_raw, maxlen=80)
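
To make the effect concrete, here is a plain-Python sketch of what pad_sequences does with its default settings (padding='pre', truncating='pre', value=0). The helper pad_pre is a hypothetical name used only for illustration; it is not part of Keras, and it returns lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with its default settings."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

raw = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pad_pre(raw, maxlen=4))
# [[0, 1, 2, 3], [0, 0, 4, 5], [7, 8, 9, 10]]
```

Short sequences get zeros at the front, and the long one loses its first element. With Keras itself, padding='post' and truncating='post' switch both operations to the end of the sequence.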

Skip Grams

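You can use skipgrams to turn a sequence of word indices into skip-gram word pairs: (word, context word) couples labelled 1, plus randomly sampled negative couples labelled 0.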
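
As a rough illustration of the idea, the following plain-Python sketch builds only the positive pairs, i.e. each word paired with its neighbours inside a window. The function positive_skipgrams is a hypothetical helper, not the Keras API; the real skipgrams additionally samples negative pairs, returns labels, and shuffles the result by default.

```python
def positive_skipgrams(sequence, window_size=2):
    """Hypothetical helper: (word, context) pairs within window_size."""
    pairs = []
    for i, word in enumerate(sequence):
        if word == 0:  # index 0 is treated as a non-word and skipped
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j] != 0:
                pairs.append((word, sequence[j]))
    return pairs

print(positive_skipgrams([1, 2, 3, 4], window_size=1))
# [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
```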

Sampling

You can use make_sampling_table to generate a word rank-based probabilistic sampling table, which can be passed to skipgrams through its sampling_table argument.
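
The sketch below shows how such a table can be computed in plain Python. It assumes the approach described in the Keras documentation: word frequencies are approximated from their rank with a Zipf-like formula, and the word2vec subsampling rule is applied on top. The function name sampling_table is hypothetical, and the exact constants should be treated as an assumption rather than the library's definitive implementation.

```python
import math

def sampling_table(size, sampling_factor=1e-5):
    """Hypothetical pure-Python sketch of a rank-based sampling table."""
    gamma = 0.577  # approximation of the Euler-Mascheroni constant
    table = []
    for i in range(size):
        rank = max(i, 1)  # rank 0 is treated like rank 1 to avoid log(0)
        # Approximate inverse frequency of the word with this rank
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table

table = sampling_table(10)
print(table)
```

Entry i is the probability of keeping the i-th most common word, so frequent words (low rank) are sampled less often than rare ones.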

Text Preprocessing

The keras.preprocessing.text module provides tools specific to text processing, built around one main class, Tokenizer. In addition, it has the following utilities:

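one_hot to encode a text as a list of word indices in a vocabulary of size n (despite its name, it is a wrapper around hashing_trick, so two words may share an index)
hashing_trick to convert a text to a sequence of indices in a fixed-size hashing space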
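
The idea behind the hashing trick can be sketched in a few lines of plain Python. The helper hash_words is hypothetical and only approximates what hashing_trick does with hash_function='md5'; in particular, this sketch splits on whitespace only and does not strip punctuation.

```python
import hashlib

def hash_words(text, n):
    """Hypothetical helper: map each word to an index in [1, n - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)  # index 0 stays reserved
    return indices

print(hash_words("the cat sat on the mat", n=50))
```

No vocabulary has to be stored, and the same word always maps to the same index. The price is possible collisions, which become more likely as n gets smaller relative to the number of distinct words.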

Tokenization

Use fit_on_texts to update the tokenizer's internal vocabulary based on a list of texts.
Use fit_on_sequences to update the tokenizer's internal vocabulary based on a list of sequences.
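
Conceptually, fitting on texts amounts to counting words and ranking them by frequency. The plain-Python sketch below illustrates this with a hypothetical helper, build_word_index; it lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation and supports further options.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map words to indices by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # The most frequent word gets index 1; index 0 is left unused
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(["the cat sat", "the dog sat down", "the end"])
print(word_index["the"], word_index["sat"])
# 1 2
```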

Numericalization

Use texts_to_sequences to transform each string in a list of strings into a sequence of integers.
Use sequences_to_matrix to convert a list of sequences into a NumPy matrix.
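
The two steps can be illustrated in plain Python as follows. The vocabulary word_index and the helpers to_sequences and to_binary_matrix are hypothetical stand-ins; the matrix step mirrors sequences_to_matrix with mode='binary', using nested lists instead of a NumPy array.

```python
word_index = {"the": 1, "sat": 2, "cat": 3, "dog": 4}  # hypothetical vocabulary
NUM_WORDS = 5  # columns 0..4; column 0 is never assigned to a word

def to_sequences(texts):
    # Words missing from the vocabulary are silently dropped
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def to_binary_matrix(sequences):
    # One row per sequence, with 1.0 in the column of every word present
    matrix = [[0.0] * NUM_WORDS for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            row[idx] = 1.0
    return matrix

sequences = to_sequences(["the cat sat", "the dog barked"])
print(sequences)                    # [[1, 3, 2], [1, 4]]
print(to_binary_matrix(sequences))  # [[0.0, 1.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0, 1.0]]
```

Note that the matrix form discards word order and repetition, while the sequence form keeps both.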

One-Hot Encoding

The keras.utils package has helpers for encoding categorical data. For example, you can use to_categorical to transform an integer representing a class into a one-hot vector, with zeros everywhere except at the index of the class.

from keras.utils import to_categorical

num_classes = 10

# y_train_raw / y_test_raw: arrays of integer class ids in [0, num_classes)
Y_train = to_categorical(y_train_raw, num_classes)
Y_test = to_categorical(y_test_raw, num_classes)
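
For intuition, the same transformation can be written in plain Python. The helper one_hot_rows is a hypothetical name for this sketch; to_categorical itself returns a NumPy array rather than nested lists.

```python
def one_hot_rows(labels, num_classes):
    """Hypothetical helper: one row per label, 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

print(one_hot_rows([0, 2, 1], num_classes=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```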

Examples

Example 1: dealing with already pre-processed text

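from keras.datasets import reuters
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

NUM_WORDS = 10000  # vocabulary size to keep (illustrative value)
NUM_CLASSES = 46   # the Reuters dataset has 46 topic classes

# load_data returns (train, test) tuples of already-tokenized integer sequences
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=NUM_WORDS)

tokenizer = Tokenizer(num_words=NUM_WORDS)
# encode each sequence as a binary vector of length NUM_WORDS
X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
y_train = to_categorical(y_train, NUM_CLASSES)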

Example 2: vectorizing raw text into a 2D integer tensor

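import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

NUM_WORDS = 20000           # vocabulary size to keep (illustrative value)
MAX_SEQUENCE_LENGTH = 1000  # length of every padded sequence (illustrative value)

texts  = []  # list of text samples (fill in before running)
labels = []  # list of label ids (fill in before running)

tokenizer = Tokenizer(num_words=NUM_WORDS)

# build the vocabulary, then map each text to a sequence of word indices
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# pad/truncate to a fixed length to get a 2D integer tensor
X_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y_train = to_categorical(np.asarray(labels))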