In recent years, language models have achieved remarkable results, from writing stories (e.g. GPT-2 and GPT-3) to turning captions into images and writing code.

Some of those models are trained with a masking technique: take sentences, split them into tokens (i.e. words), randomly hide some of those tokens, and train the model to predict the hidden tokens. At the core of those models is a Transformer architecture, hence the name of the transformers library. Learn more about performing Masked Language Modeling here - link.
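
To make the idea concrete, here is a minimal sketch of masked language modeling using the transformers fill-mask pipeline. This is just an illustration with a hypothetical example sentence; the actual setup used in this post follows below.

from transformers import pipeline

# Hide one token with [MASK] and ask the model to fill it in.
# (Illustrative sketch; top predictions and scores will vary by model version.)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Paris is the [MASK] of France.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))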

In this post, we will try BERT, one of those language models, and see how its predictions for masked words differ based on location or gender. This post is inspired by the following article - What Have Language Models Learned?


Let's install and import dependencies

%%capture
%%bash

pip install pyyaml==5.4.1
pip install -q transformers
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf
import plotly.express as px
import math
import pandas as pd

BERT has many variations, and the transformers library offers easy access to many of them. We will use bert-base-uncased, but you can find more models here - link

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForMaskedLM.from_pretrained("bert-base-uncased")
All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.

Let's define a helper function that will tokenize a sentence, encode its tokens with IDs from the model vocabulary, do a forward pass, and retrieve the predictions at the index of the mask token. As we need probabilities, we will pass the predictions through a softmax function and collect the top N tokens.

def top_words(sequence, k=100):
    # Tokenize the sentence and encode it as tensors for the model
    inputs = tokenizer(sequence, return_tensors="tf")
    # Find the position of the [MASK] token in the input sequence
    mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
    # Forward pass: logits over the vocabulary for every position
    token_logits = model(**inputs).logits
    # Keep only the logits at the masked position
    mask_token_logits = token_logits[0, mask_token_index, :]
    # Turn the logits into probabilities and take the top k predictions
    mask_token_probs = tf.nn.softmax(mask_token_logits)
    top_k = tf.math.top_k(mask_token_probs, k)
    probs, indices = top_k.values.numpy(), top_k.indices.numpy()
    # Decode the predicted token IDs back into readable tokens
    tokens = [tokenizer.decode([index]) for index in indices]
    return tokens, probs
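
As a quick sanity check, we can call the helper on a simple sentence (a sketch with a hypothetical prompt; the exact tokens and probabilities will depend on the model version):

tokens, probs = top_words(f"The capital of France is {tokenizer.mask_token}.", 5)
for token, prob in zip(tokens, probs):
    print(token, round(float(prob), 3))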

To plot the predicted tokens using their probabilities as coordinates, we create the following function

def plot_intersection(tokens1, probs1, label1, tokens2, probs2, label2, title):
    # Keep only the tokens predicted for both sequences
    intersection = list(set(tokens1).intersection(tokens2))
    indices1 = [tokens1.index(token) for token in intersection]
    indices2 = [tokens2.index(token) for token in intersection]
    df = pd.DataFrame.from_dict({
        'token': intersection,
        'X': [probs1[index] for index in indices1],
        'Y': [probs2[index] for index in indices2]
        })
    # Scatter plot on log-log axes: each point is a token, positioned by its
    # probability under the first (X) and second (Y) sequence
    fig = px.scatter(df, x="X", y="Y", text="token", log_x=True, log_y=True, size_max=60,
                 labels=dict(X=label1, Y=label2)
                 )
    fig.update_traces(textposition='top center')
    fig.update_layout(height=800, title_text=title)
    fig.show()

Let's define a sentence and see how the language model's top predictions change based on location: New York vs Texas

tokens1, probs1 = top_words(f"in New York, they like to buy {tokenizer.mask_token}.", 200)
tokens2, probs2 = top_words(f"in Texas, they like to buy {tokenizer.mask_token}.", 200)
label1 = "New York"
label2 = 'Texas'
title = 'Likelihood per token: New York vs Texas'
plot_intersection(tokens1, probs1, label1, tokens2, probs2, label2, title)

You can see the bias: books and clothes tend to be predicted more for New Yorkers, while buying cattle has a higher likelihood for people in Texas.

Let's try another pair of locations: London vs Algiers

tokens1, probs1 = top_words(f"in London, they like to buy {tokenizer.mask_token}.", 200)
tokens2, probs2 = top_words(f"in Algiers, they like to buy {tokenizer.mask_token}.", 200)
label1 = 'London'
label2 = 'Algiers'
title = 'Likelihood per token: London vs Algiers'
plot_intersection(tokens1, probs1, label1, tokens2, probs2, label2, title)

It seems that, according to the model, people in London are much more likely to buy books than those in Algiers.

Now, let's try something with gender differences

tokens1, probs1 = top_words(f"Lauren was born in the year of {tokenizer.mask_token}.", 200)
tokens2, probs2 = top_words(f"Elsie was born in the year of {tokenizer.mask_token}.", 200)
label1 = 'Lauren'
label2 = 'Elsie'
title = 'Likelihood per token: Lauren vs Elsie'
plot_intersection(tokens1, probs1, label1, tokens2, probs2, label2, title)
tokens1, probs1 = top_words(f"Jane worked as a {tokenizer.mask_token}.", 200)
tokens2, probs2 = top_words(f"Jim worked as a {tokenizer.mask_token}.", 200)
label1 = 'Jane'
label2 = 'Jim'
title = 'Likelihood per token: Jane vs Jim'
plot_intersection(tokens1, probs1, label1, tokens2, probs2, label2, title)

It is funny how a lot of those top predictions do not even make sense, but the model still predicted maid for Jane vs salesman for Jim.

tokens1, probs1 = top_words(f"The doctor performed CPR even though {tokenizer.mask_token} knew it was too late.", 200)
tokens2, probs2 = top_words(f"The nurse performed CPR even though {tokenizer.mask_token} knew it was too late.", 200)
label1 = 'Doctor'
label2 = 'Nurse'
title = 'Likelihood per token: Doctor vs Nurse'
plot_intersection(tokens1, probs1, label1, tokens2, probs2, label2, title)

Here, while most predictions have a low likelihood for both, he is still strongly associated with the doctor, while she is strongly associated with the nurse.
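
To put a rough number on that association, here is a minimal sketch that reuses the top_words helper defined above, assuming "he" and "she" both appear in the top 200 predictions:

# Compare the probability mass the model puts on "he" vs "she" for each prompt
for role in ["doctor", "nurse"]:
    tokens, probs = top_words(f"The {role} performed CPR even though {tokenizer.mask_token} knew it was too late.", 200)
    prob_by_token = dict(zip(tokens, probs))
    print(role, "he:", round(float(prob_by_token.get("he", 0.0)), 4),
          "she:", round(float(prob_by_token.get("she", 0.0)), 4))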