Setup

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass
TensorFlow 2.x selected.

Install dependencies

%%capture
%%bash

pip install -U tensorflow-text

Import modules

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub
import tensorflow_datasets as tfds

from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model

Set default options for modules

pd.set_option('display.max_colwidth', None)  # show full review text in DataFrame cells (-1 is deprecated)

GPU check

num_gpus_available = len(tf.config.experimental.list_physical_devices('GPU'))
print("Num GPUs Available: ", num_gpus_available)
assert num_gpus_available > 0
Num GPUs Available:  1

config = {
  'seed': 31,
  'batch_size': 64,
  'epochs': 10,
  'max_seq_len': 128
}
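
The config also defines a seed that is never applied above; a minimal sketch for seeding NumPy and TensorFlow with it (full determinism on GPU may require additional settings):

np.random.seed(config['seed'])    # seed NumPy-based shuffling
tf.random.set_seed(config['seed'])  # seed TensorFlow initializers and ops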

Data

Download the pretrained BERT model

BERT_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"
bert_layer = hub.KerasLayer(BERT_URL, trainable=False)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
print(f'BERT vocab is stored at     : {vocab_file}')
print(f'BERT model lowercases input : {do_lower_case}')
BERT vocab is stored at     : b'/tmp/tfhub_modules/03d6fb3ce1605ad9e5e9ed5346b2fb9623ef4d3d/assets/vocab.txt'
BERT model lowercases input : True

Load the vocab file that corresponds to the pretrained BERT

def load_vocab(vocab_file):
  """Load a vocabulary file into a list."""
  vocab = []
  with tf.io.gfile.GFile(vocab_file, "r") as reader:
    while True:
      token = reader.readline()
      if not token: break
      token = token.strip()
      vocab.append(token)
  return vocab

vocab = load_vocab(vocab_file)
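
As a quick sanity check, inspect the loaded vocabulary (the uncased BERT release is expected to ship roughly 30,522 WordPiece tokens, though the exact count depends on the downloaded asset):

print(f'Vocab size  : {len(vocab)}')
print(f'First tokens: {vocab[:3]}')  # [PAD] and the reserved [unusedN] tokens come first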

Use the BERT vocab to create a word-to-index lookup table

def create_vocab_table(vocab, num_oov=1):
  """Create a lookup table for a vocabulary"""
  vocab_values = tf.range(tf.size(vocab, out_type=tf.int64), dtype=tf.int64)
  init = tf.lookup.KeyValueTensorInitializer(keys=vocab, values=vocab_values, key_dtype=tf.string, value_dtype=tf.int64)
  vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov, lookup_key_dtype=tf.string)
  return vocab_table

vocab_lookup_table = create_vocab_table(vocab)

Use the BERT vocab to create an index-to-word lookup table

def create_index2word(vocab):
  # Create a lookup table mapping a token index back to its token string
  vocab_values = tf.range(tf.size(vocab, out_type=tf.int64), dtype=tf.int64)
  init = tf.lookup.KeyValueTensorInitializer(keys=vocab_values, values=vocab)
  # default to BERT's unknown token for out-of-range indices
  return tf.lookup.StaticHashTable(initializer=init, default_value=tf.constant('[UNK]'), name="index2word")

index2word = create_index2word(vocab)

Check out the indices for the following tokens

vocab_lookup_table.lookup(tf.constant(['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']))
<tf.Tensor: shape=(5,), dtype=int64, numpy=array([  0, 100, 101, 102, 103])>

Check out the token corresponding to an index

index2word.lookup(tf.constant([0], dtype='int64')).numpy()
[b'[PAD]']

Create a BERT tokenizer using TF Text

tokenizer = text.BertTokenizer(
    vocab_lookup_table,
    token_out_type=tf.int64,
    lower_case=do_lower_case
  )
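
A quick look at what the tokenizer produces on a toy sentence (a sketch; the exact word-piece split depends on the vocab):

sample = tf.constant(['This movie was surprisingly good!'])
sample_tokens = tokenizer.tokenize(sample)  # RaggedTensor of shape [batch, words, wordpieces]
print(sample_tokens)
print(tf.ragged.map_flat_values(index2word.lookup, sample_tokens))  # map IDs back to strings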

Look up the BERT token IDs for the padding token and the start/end-of-sequence markers.

PAD_ID = vocab_lookup_table.lookup(tf.constant('[PAD]')) # padding token
CLS_ID = vocab_lookup_table.lookup(tf.constant('[CLS]')) # class token
SEP_ID = vocab_lookup_table.lookup(tf.constant('[SEP]')) # sequence separator token

Preprocessing

Define the logic to preprocess data and format it as required by BERT

def preprocess(record):
  review, label = record['text'], record['label']
  # process review to calculate BERT input
  ids, mask, type_ids = preprocess_bert_input(review)
  return (ids, mask, type_ids), label

def preprocess_bert_input(review):
  # compute the token IDs
  ids = tokenize_text(review, config['max_seq_len'])
  # compute the input mask (1 for real tokens, 0 for padding)
  mask = tf.cast(ids > 0, tf.int64)
  mask = tf.reshape(mask, [-1, config['max_seq_len']])
  # compute the token type IDs (all zeros, since there is only one segment)
  zeros_dims = tf.stack(tf.shape(mask))
  type_ids = tf.fill(zeros_dims, 0)
  type_ids = tf.cast(type_ids, tf.int64)

  return (ids, mask, type_ids)

def tokenize_text(review, seq_len):
  # convert text into token ids
  tokens = tokenizer.tokenize(review)
  # flatten the output ragged tensors
  tokens = tokens.merge_dims(1, 2)[:, :seq_len]
  # Add start and end token ids to the id sequence
  start_tokens = tf.fill([tf.shape(review)[0], 1], CLS_ID)
  end_tokens = tf.fill([tf.shape(review)[0], 1], SEP_ID)
  tokens = tokens[:, :seq_len - 2]
  tokens = tf.concat([start_tokens, tokens, end_tokens], axis=1)
  # truncate sequences longer than seq_len
  tokens = tokens[:, :seq_len]
  # pad shorter sequences with the pad token id
  tokens = tokens.to_tensor(default_value=PAD_ID)
  pad = seq_len - tf.shape(tokens)[1]
  tokens = tf.pad(tokens, [[0, 0], [0, pad]], constant_values=PAD_ID)

  # finally, reshape the token IDs to a fixed [batch_size, seq_len] shape
  return tf.reshape(tokens, [-1, seq_len])
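
A quick sanity check of the preprocessing on a toy batch (a sketch; the token values depend on the vocab):

sample_reviews = tf.constant(['A wonderful little film.', 'Terrible. Do not watch.'])
sample_ids, sample_mask, sample_type_ids = preprocess_bert_input(sample_reviews)
print(sample_ids.shape, sample_mask.shape, sample_type_ids.shape)  # each should be (2, 128)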

Dataset

Download the dataset with TensorFlow Datasets and process it

train_ds, valid_ds = tfds.load('imdb_reviews', split=['train', 'test'], shuffle_files=True)

train_ds = train_ds.shuffle(1024).batch(config['batch_size']).prefetch(tf.data.experimental.AUTOTUNE)
valid_ds = valid_ds.shuffle(1024).batch(config['batch_size']).prefetch(tf.data.experimental.AUTOTUNE)

train_ds, valid_ds = train_ds.map(preprocess), valid_ds.map(preprocess)
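
Optionally, pull one batch to confirm the pipeline yields the shapes the model expects (a sketch):

(batch_ids, batch_mask, batch_type_ids), batch_labels = next(iter(train_ds))
print(batch_ids.shape, batch_mask.shape, batch_type_ids.shape, batch_labels.shape)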

Model

input_ids = Input(shape=(config['max_seq_len'],), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(config['max_seq_len'],), dtype=tf.int32, name="input_mask")
input_type_ids = Input(shape=(config['max_seq_len'],), dtype=tf.int32, name="input_type_ids")

pooled_output, sequence_output = bert_layer([input_ids, input_mask, input_type_ids])
drop_out = Dropout(0.3, name="dropout")(pooled_output)
output = Dense(1, activation='sigmoid', name="linear")(drop_out)

model = Model(inputs=[input_ids, input_mask, input_type_ids], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]                  
                                                                 input_mask[0][0]                 
                                                                 input_type_ids[0][0]             
__________________________________________________________________________________________________
dropout (Dropout)               (None, 768)          0           keras_layer[1][0]                
__________________________________________________________________________________________________
linear (Dense)                  (None, 1)            769         dropout[0][0]                    
==================================================================================================
Total params: 109,483,010
Trainable params: 769
Non-trainable params: 109,482,241
__________________________________________________________________________________________________

Training

model.fit(train_ds, validation_data=valid_ds, epochs=config['epochs'])
Epoch 1/10
391/391 [==============================] - 499s 1s/step - loss: 0.6654 - accuracy: 0.6016 - val_loss: 0.5977 - val_accuracy: 0.7028
Epoch 2/10
391/391 [==============================] - 510s 1s/step - loss: 0.6063 - accuracy: 0.6712 - val_loss: 0.5650 - val_accuracy: 0.7282
Epoch 3/10
391/391 [==============================] - 510s 1s/step - loss: 0.5839 - accuracy: 0.6969 - val_loss: 0.5494 - val_accuracy: 0.7362
Epoch 4/10
391/391 [==============================] - 511s 1s/step - loss: 0.5730 - accuracy: 0.7025 - val_loss: 0.5388 - val_accuracy: 0.7455
Epoch 5/10
391/391 [==============================] - 510s 1s/step - loss: 0.5696 - accuracy: 0.7058 - val_loss: 0.5376 - val_accuracy: 0.7417
Epoch 6/10
391/391 [==============================] - 510s 1s/step - loss: 0.5613 - accuracy: 0.7146 - val_loss: 0.5268 - val_accuracy: 0.7517
Epoch 7/10
391/391 [==============================] - 510s 1s/step - loss: 0.5608 - accuracy: 0.7130 - val_loss: 0.5233 - val_accuracy: 0.7544
Epoch 8/10
391/391 [==============================] - 510s 1s/step - loss: 0.5625 - accuracy: 0.7106 - val_loss: 0.5217 - val_accuracy: 0.7555
Epoch 9/10
391/391 [==============================] - 510s 1s/step - loss: 0.5603 - accuracy: 0.7125 - val_loss: 0.5199 - val_accuracy: 0.7535
Epoch 10/10
391/391 [==============================] - 510s 1s/step - loss: 0.5567 - accuracy: 0.7159 - val_loss: 0.5150 - val_accuracy: 0.7591
<tensorflow.python.keras.callbacks.History at 0x7f2fffddba58>
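
seaborn and matplotlib are imported above but never used; a sketch for plotting the learning curves, assuming the return value of model.fit is captured in a variable (e.g. history = model.fit(...)):

history_df = pd.DataFrame(history.history)  # assumes `history = model.fit(...)` above
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.lineplot(data=history_df[['loss', 'val_loss']], ax=axes[0])
sns.lineplot(data=history_df[['accuracy', 'val_accuracy']], ax=axes[1])
axes[0].set(xlabel='epoch', ylabel='loss')
axes[1].set(xlabel='epoch', ylabel='accuracy')
plt.show()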

Evaluation

test_text_ds = tfds.load('imdb_reviews', split='unsupervised', shuffle_files=True)
test_ds = test_text_ds.shuffle(1024).batch(config['batch_size']).prefetch(tf.data.experimental.AUTOTUNE)
test_ds = test_ds.map(preprocess)

Check how test text is tokenized

test_text = [record['text'].numpy() for record in test_text_ds.take(10)]
ids = tokenize_text(test_text, config['max_seq_len'])
tokens = [b' '.join(tokens_array) for tokens_array in index2word.lookup(ids).numpy()]
pd.DataFrame({'tokens': tokens})
tokens
0 b"[CLS] spoil ##er - now knowing the ending i find it so clever that the whole movie takes place in a motel and each character has a different room . even sane people have many different aspects to their personality , but they don ' t let them become dominant - - they are controlled . malcolm ' s various personalities and needs were person ##ified in each character . the prostitute mother ( amanda pee ##t ) , the part of him who hated her for being a prostitute ( larry ) , the loving mother he wish he had , the loving father he wish he had , the selfish part of himself ( actress ) , the violent part of his personality ( ray [SEP]"
1 b"[CLS] i knew about this film long before i saw it . in fact , i had to buy the dvd in order to see it because no video store carried it . i didn ' t mind spending the $ 12 to buy it used because i collect off the wall movies . the new limited edition double dvd has great sound and visually not bad . i found myself laughing much more then < br / > < br / > jolt ##ing in fear , although there were a few scenes were i was startled . < br / > < br / > if you enjoy off the wall 70s sci - fi / horror movies , you probably will eat this one [SEP]"
2 b"[CLS] this movie is really really awful . it ' s as bad as zombie 90 well maybe not that bad but pretty close . if your a fan of the italian horror movies then you might like this movie . i thought that it was dam near un ##watch ##able of course i ' m not a fan of the italian movies . the only italian movie that was ok was jungle holocaust . which is one over ##rated movie . this film is way over ##rated . but let ' s get started with how horrible this film really is shall we . the acting is goofy and horrible . the effects suck . no plot with this movie . little gore which is the [SEP]"
3 b'[CLS] wait a minute . . . yes i do . < br / > < br / > the director of \' the breed \' has obviously seen terry gill ##iam \' s \' brazil \' a few too many times and asked himself the question , " if \' brazil \' had been an ill - conceived tale about vampires in the near future , what would it be like ? " well , i \' ll tell ya , it \' d be like 91 minutes of a swedish whore kicking you in the groin , only not as satisfying . the dialogue was laced with gr ##at ##uit ##ous curse words and tri ##te one - liner ##s , and whoever edited this [SEP]'
4 b"[CLS] this is the type of movie that ' s just barely involving enough for one viewing , but i don ' t think i could stand to watch it again . it looks and plays like a mid - seventies tv movie , only with some gr ##at ##uit ##ous sex and violence thrown in . < br / > < br / > i agree with several previous posters - - her ##ve ville ##chai ##ze is not very menacing , and at times even comes off as un ##int ##ended comedy . at least the other two villains make up for that . also , it was jolt ##ing to see jonathan fr ##id is such a pedestrian role , which definitely under - [SEP]"
5 b"[CLS] i like sci - fi movies and everything ' bout it and aliens , so i watched this flick . nothing new , nothing special , average acting , typical h . b . davenport ' story , weak and che ##es ##y fx ' s , bad ending of movie , but still the author idea is good . the marines on lost island find the truth about alien landing there and truth about past - experiments on them . they die one after one , some of them were killed by lonely alien , and others by human enemies . ufo effects , when it flees and crush ##es are bad , too . the voices of angry alien are funny , too . [SEP]"
6 b"[CLS] i was lucky enough to see a preview of this film tonight . this was a very cool , eerie film . well acted , especially by ska ##rs ##gard who played his role of terry glass perfectly . sob ##ies ##ki did a very good job too as it seems to me that she has a bright future ahead of her . the music was well placed but was fairly standard . the use of shadows was quite interesting as well . overall , this was quite a nice surprise considering i ' m not much a fan of this genre . 7 / 10 . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"
7 b'[CLS] my kids and i love this movie ! ! we think that richard pry ##or and the whole cast did a wonderful job in the movie . it means more to us now since the passing of richard ! ! we will miss his sense of humor . but his movies and shows will stay with us forever ! ! we especially love the parts of brad , frank crawford and ar ##lo pear ! ! they had some one liner ##s in the movie that were great ! ! my son and i love to quote those one liner ##s when we see each other and my daughter will join us when we discuss the movie . we thought the moving guys were terrific . [SEP]'
8 b"[CLS] somehow the an ##ima ##trix shorts with the most interesting premises have the worst outcome . mat ##ric ##ulated is the worst of the bunch ( although it ' s a close call with program ) , as it takes a great idea ( showing the machines the beauty of mankind by plug ##ging them in ) and turns it into the worst experience of the 9 . < br / > < br / > as i said , the story begins promising and interesting , but ends with a long , long , long sequence of ' weird ' images , a cross between the famous scenes from 2001 and v ##ga - rain ( who can remember it ) , but not as [SEP]"
9 b"[CLS] while holiday ##ing in the basque region of spain , two couples discover a child whose hands are severely miss ##ha ##pen . the child has been gravely mist ##reate ##d , and , as a result , cannot communicate . the two couples reluctantly decide to rescue her and report her circumstances to the authorities . however , severe weather and the dense ##ness of the forest surrounding their holiday home make it impossible for them to make a quick get ##away . soon , the local inhabitants become aware that the girl is missing , and they right ##ly suspect the holiday - makers of taking her . suspicions and paranoia begin to fest ##er , and it isn ' t long before violence [SEP]"

Run prediction on test reviews

result = model.predict(test_ds)
result.shape
(50000, 1)
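
The model emits one sigmoid probability per review; a sketch for turning those into discrete sentiment labels (the 0.5 cutoff is an assumption, not tuned here):

pred_labels = (result[:, 0] > 0.5).astype(int)  # 1 = predicted positive, 0 = predicted negative
print(pd.Series(pred_labels).value_counts())
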
result_df = pd.DataFrame({'label': tf.squeeze(result[:10]).numpy(), 'text': test_text})
result_df.head()
label text
0 0.464566 b"SPOILER - Now knowing the ending I find it so clever that the whole movie takes place in a motel and each character has a different room. Even sane people have many different aspects to their personality, but they don't let them become dominant -- they are controlled. Malcolm's various personalities and needs were personified in each character. The prostitute mother (Amanda Peet), the part of him who hated her for being a prostitute (Larry), the loving mother he wish he had, the loving father he wish he had, the selfish part of himself (actress), the violent part of his personality (Ray Liotta and Busey), the irrational emotions he feels and his need to be loved (Ginnie) and his attempts to control those feelings (Lou), the hurt little boy who sees far too many traumatic things in his life, and of course, John Cusack who seems to represent Malcolm himself trying to analyze and understand all the craziness in his mind, tries to follow the rules (accepting responsibility for the car accident), help others (giving Amanda Peet a ride, and stitching up the mother). Very cleverly done!"
1 0.252326 b'I knew about this film long before I saw it. In fact, I had to buy the DVD in order to see it because no video store carried it. I didn\'t mind spending the $12 to buy it used because I collect off the wall movies. The new limited edition double DVD has great sound and visually not bad. I found myself laughing much more then<br /><br />jolting in fear, although there were a few scenes were I was startled.<br /><br />If you enjoy off the wall 70s sci-fi/horror movies, you probably will eat this one up. I was a little dissapointed at how abrubtly it ended. I wanted the movie to keep going, see how things pan out. The DVD revolution has brought so many<br /><br />lost clasics back to life, it is truly wonderful. Blue Sunshine is one of those lost "missing links" of the cinema. Enjoy!'
2 0.485239 b"This movie is really really awful. It's as bad as Zombie 90 well maybe not that bad but pretty close. If your a fan of the Italian horror movies then you might like this movie. I thought that it was dam near unwatchable of course I'm not a fan of the Italian movies. The only Italian movie that was OK was Jungle holocaust. Which is one overrated movie. This film is way overrated. But let's get started with how horrible this film really is shall we. The acting is goofy and horrible. The effects suck. No plot with this movie. Little gore which is the only good thing in the film isn't showed nearly enough to be worth watching this wreck. The zombies are very fake looking. It looks like it's a bunch of dudes wearing cheap dollar store masks. Please avoid this film at all costs."
3 0.251897 b'Wait a minute... yes I do.<br /><br />The director of \'The Breed\' has obviously seen Terry Gilliam\'s \'Brazil\' a few too many times and asked himself the question, "If \'Brazil\' had been an ill-conceived tale about vampires in the near future, what would it be like?" Well, I\'ll tell ya, it\'d be like 91 minutes of a Swedish whore kicking you in the groin, only not as satisfying. The dialogue was laced with gratuitous curse words and trite one-liners, and whoever edited this piece of crap should be shot. I have no real idea of exactly how the whole thing ended because I\'m not really sure what happened during the first part of the film. With so many subplots your head begins to hurt and so much bad acting your head wants to explode this movie should only be viewed with large quantities of beer and at least two other people you can MST3K with. The only thing that made me not stab myself in the eye with a dirty soup spoon was this line: Evil Doctor Guy: "That\'s it, you are not James Bond, and I am not Blofeld. No more explanations!" Dude From Jason\'s Lyric: "I\'m getting paid scale!" The cinematography was shaky at best and the acting was putrid. Also, what was with all the pseudo-1984 posters and PA announcements? The costumes were from the 50\'s, the cars were from the 60\'s, the music was from the 90\'s and I wish I were dead. This movie sucks.'
4 0.274131 b'This is the type of movie that\'s just barely involving enough for one viewing, but I don\'t think I could stand to watch it again. It looks and plays like a mid-Seventies TV movie, only with some gratuitous sex and violence thrown in.<br /><br />I agree with several previous posters -- Herve Villechaize is NOT very menacing, and at times even comes off as unintended comedy. At least the other two villains make up for that. Also, it was jolting to see Jonathan Frid is such a pedestrian role, which definitely under-utilized his enormous talents.<br /><br />But I think the basic problem with "Seizure" is in the storyline. The evil trio that are conjured up from Frid\'s mind are seen too early and too often. They appear to everyone at once, and announce their (murky) plans too early in the picture. In fact, Stone takes this idea and literally shoves it in the viewer\'s face, with a series of challenges for the guests; challenges that it doesn\'t seem like they have any chance of winning, anyway. How much more effective would have been keeping the evil ones in the shadows, preying on each house guest in turn, sowing confusion and doubt among the remaining house guests, who don\'t know who or what is causing the carnage. By having the trio appear early on, to all the "assembled guests", and announcing their plan (confusing as that plan is), much potential for tension and suspense are lost.<br /><br />Also, a more gradual appearance of the evil ones would indicate Frid is slowing losing control of his subconscious. To have Frid subconsciously conjure up these baddies, because he\'s got hidden grudges against his wife and friends, would have been a far more logical plot device. Instead of having Frid play an intended victim from the get-go, it would have worked better to have him slowing becoming helpless to control the menace he\'s created, with mixed feelings of guilt and satisfaction as his shallow, superficial friends are killed off. The plot Stone offers up is confusing as to the origins and, most importantly, the motivations of the evil trio, and never gives any explanation why Frid, from whose mind they came from, can exercise absolutely no control over them. Confusing is the word that best sums up the whole picture, and the end feels like a total cheat. Better to have some great showdown in which Frid is finally able to banish the creations of his own tormented mind.<br /><br />Oliver Stone has done some notable work in his career, but sadly "Seizure" is not among them.'