PyTorch Text (torchtext) is a PyTorch package with a collection of text data processing utilities; it makes it easy to perform basic NLP tasks within PyTorch. It provides the following capabilities:

  • Defining a text preprocessing pipeline: tokenization, lowercasing, etc.
  • Building batches and datasets, and splitting them into (train, validation, test)
  • Data loaders for custom NLP datasets
import torchtext as tt

Text processing

torchtext.data.Field is the base datatype of PyTorch Text that handles text preprocessing: tokenization, lowercasing, padding, numericalization, and building a vocabulary.

import spacy

spacy_en = spacy.load('en')  # or 'en_core_web_sm' on newer spaCy versions
spacy_tokenizer = lambda text: [token.text for token in spacy_en.tokenizer(text)]
seq_len = 12  # fixed sequence length, counting the <bos> and <eos> tokens

TEXT = tt.data.Field(
  tokenize    = spacy_tokenizer,
  lower       = True,
  batch_first = True,
  init_token  = '<bos>',
  eos_token   = '<eos>',
  fix_length  = seq_len
)

Tokenizing and Lowercasing

minibatch = [ 'The Brown Fox Jumped Over The Lazy Dog' ]
minibatch = list(map(TEXT.preprocess, minibatch))
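TEXT.preprocess tokenizes and lowercases each string; the special <bos>/<eos> tokens are added later, during padding. The preprocessed minibatch looks like this:

minibatch
# [['the', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]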

Padding the text sequences to match the fixed sequence length

minibatch = TEXT.pad(minibatch)
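With fix_length = 12 and both init_token and eos_token set, pad wraps each example in <bos>/<eos> and fills the remaining positions with <pad>:

minibatch
# [['<bos>', 'the', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '<eos>', '<pad>', '<pad>']]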

Before we can numericalize, we first need to build the vocabulary:

1- Count the frequency of each token across all documents, then build a vocabulary from these counts

import functools
import operator
from collections import Counter

# flatten the minibatch into a single list of tokens and count them
tokens = functools.reduce(operator.concat, minibatch)
counter = Counter(tokens)
counter

TEXT.vocab = TEXT.vocab_cls(counter)
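The resulting Vocab object maps tokens to indices through its stoi attribute and back through itos:

TEXT.vocab.stoi['fox']  # integer index assigned to the token 'fox'
TEXT.vocab.itos[:4]     # first tokens by index; the specials '<unk>' and '<pad>' come first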

It is also possible to build the vocabulary directly, as follows:

TEXT.build_vocab(minibatch)

2- Finally, numericalize using the constructed vocabulary

TEXT.numericalize(minibatch)
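Since the field was defined with batch_first = True, the result is a LongTensor of token indices with shape (batch_size, fix_length):

TEXT.numericalize(minibatch).shape
# torch.Size([1, 12])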

Data Loader

Build datasets from training and validation text files, using the previously built text processing pipeline.

# PATH points to the directory containing the CSV files
train_ds, valid_ds = tt.data.TabularDataset.splits(
  path       = PATH,
  train      = 'train.csv',
  validation = 'valid.csv',
  format     = 'csv',
  fields     = [('text', TEXT)]
)
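Note that on a real corpus the vocabulary would be built from the training split rather than from a toy minibatch; a typical call looks like this (the min_freq cutoff is an illustrative choice, not from the original):

TEXT.build_vocab(train_ds, min_freq=2)  # drop tokens that appear fewer than 2 times
len(TEXT.vocab)                         # vocabulary size, including the special tokens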

Data Loader for Language Modeling

This dataset can be used to build iterators that produce batches for multiple NLP tasks. For instance, torchtext.data.BPTTIterator builds the samples used for language modeling. It expects a single continuous stream of text, so we first concatenate all of a dataset's examples into one long example:

def dataset2example(dataset, field):
  # wrap each example with <bos>/<eos> markers
  examples = list(map(lambda example: ['<bos>'] + example.text + ['<eos>'], dataset.examples))
  # flatten all examples into one continuous stream of tokens
  examples = [token for example in examples for token in example]
  example = tt.data.Example()
  setattr(example, 'text', examples)
  return tt.data.Dataset([example], fields={'text': field})

train_example = dataset2example(train_ds, TEXT)
valid_example = dataset2example(valid_ds, TEXT)
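As a quick sanity check, each resulting dataset should hold exactly one example whose text attribute is the full token stream:

len(train_example.examples)          # 1
len(train_example.examples[0].text)  # total number of tokens in the training stream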

batch_size = 32  # chosen for illustration

train_iter, valid_iter = tt.data.BPTTIterator.splits(
  (train_example, valid_example),
  batch_size = batch_size,
  bptt_len   = 30
)

The resulting train_iter and valid_iter iterate over batches whose text and target attributes hold the input tokens and the same tokens shifted by one position; they can be used directly in a training loop.
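Below is a minimal sketch of such a loop, assuming a small hypothetical language model and optimizer (TinyLM, its dimensions, and the Adam settings are illustrative and not part of the original):

import torch.nn as nn
import torch.optim as optim

vocab_size = len(TEXT.vocab)

class TinyLM(nn.Module):
  # a deliberately small LSTM language model, for illustration only
  def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
    super().__init__()
    self.embed = nn.Embedding(vocab_size, emb_dim)
    self.rnn   = nn.LSTM(emb_dim, hid_dim, batch_first=True)
    self.head  = nn.Linear(hid_dim, vocab_size)

  def forward(self, text):                  # text:   (batch_size, bptt_len)
    output, _ = self.rnn(self.embed(text))  # output: (batch_size, bptt_len, hid_dim)
    return self.head(output)                # logits: (batch_size, bptt_len, vocab_size)

model     = TinyLM(vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

for batch in train_iter:
  optimizer.zero_grad()
  logits = model(batch.text)
  # flatten the batch and time dimensions before computing the loss
  loss = criterion(logits.view(-1, vocab_size), batch.target.view(-1))
  loss.backward()
  optimizer.step()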

Notebook - link