Load datasets with TorchText
import torch
from torchtext import data
from torchtext import datasets
With TorchText, using an included dataset like IMDb is straightforward, as shown in the following example:
TEXT = data.Field()
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
train_data, valid_data = train_data.split()
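As a quick sanity check we can look at the size of each split and print one example (a short sketch using the standard Dataset API on the train_data, valid_data and test_data objects created above):
# number of examples in each split
print(len(train_data), len(valid_data), len(test_data))
# one raw example: a dict mapping field names to their (tokenized) values
print(vars(train_data[0]))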
We can also load other data formats with TorchText, such as csv/tsv or json.
CSV / TSV
TorchText can read CSV/TSV files where each line represents a data sample (optionally with a header as the first row), e.g.:
author location age tweet
John Rome 23 another lovely day.
Mary London 19 what a rainy day!
Assume we have a data directory containing the tab-separated files train.tsv, valid.tsv, and test.tsv. The following snippet loads this example dataset:
# create Field objects
AUTHOR = data.Field()
AGE = data.Field()
TWEET = data.Field()
LOCATION = data.Field()
# create tuples representing the columns
fields = [
    ('author', AUTHOR),
    ('location', LOCATION),
    (None, None),  # ignore age column
    ('tweet', TWEET)
]
# load the dataset in tsv format
train_ds, valid_ds, test_ds = data.TabularDataset.splits(
    path = 'data',
    train = 'train.tsv',
    validation = 'valid.tsv',
    test = 'test.tsv',
    format = 'tsv',
    fields = fields,
    skip_header = True
)
# check an example
print(vars(train_ds[0]))
First, we define a list of tuples named fields, where in each tuple the first element is the name to use as a batch object's attribute and the second element is a Field object.
Note that the tuples have to be in the same order as the columns of the tsv file. In the example we use (None, None) to skip the age column. If we only wanted the author and location columns, we could provide just those two tuples in fields and TorchText would ignore the remaining columns (see the sketch below).
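For instance, a fields list that keeps only the first two columns could look like the following sketch (reusing the AUTHOR and LOCATION objects defined above); the trailing age and tweet columns would then be dropped:
# keep only the first two columns; trailing columns are ignored
fields = [
    ('author', AUTHOR),
    ('location', LOCATION)
]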
Second, we use TabularDataset.splits to load the .tsv files into train, validation, and test sets. We set the skip_header flag to True to ignore the first row of each file (by default it is False).
Finally, we can check one sample of the training dataset and see how tokenization is applied.
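Assuming train.tsv contains the example rows shown earlier and each Field keeps its default whitespace tokenizer, the printed example would look roughly like this (illustrative output):
{'author': ['John'], 'location': ['Rome'], 'tweet': ['another', 'lovely', 'day.']}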
JSON
TorchText needs the json file to have one object per line, as follows:
{"author": "John", "location": "Rome", "age": 23, "tweet": ["another", "lovely", "day", "."]}
{"author": "Mary", "location": "London", "age": 19, "tweet": ["what", "a", "rainy", "day", "!"]}
Assume we have a data directory containing the files train.json, valid.json, and test.json. The following snippet loads this example dataset:
# create Field objects
AUTHOR = data.Field()
AGE = data.Field()
TWEET = data.Field()
LOCATION = data.Field()
# create a dictionary representing the dataset
fields = {
    'author': ('author', AUTHOR),
    'age': ('age', AGE),
    'location': ('location', LOCATION),
    'tweet': ('tweet', TWEET)
}
# load the dataset in json format
train_ds, valid_ds, test_ds = data.TabularDataset.splits(
    path = 'data',
    train = 'train.json',
    validation = 'valid.json',
    test = 'test.json',
    format = 'json',
    fields = fields
)
# check an example
print(vars(train_ds[0]))
The way the fields are defined is a bit different from csv/tsv. Instead of a list of tuples, we create a Python dictionary fields where:
- the keys are the same keys as in the original json objects, i.e. author, age, location, tweet;
- the values are tuples where the first element will be used as an attribute in each data batch and the second element is a Field object.
Then, we use TabularDataset.splits to create the train/validation/test datasets by specifying the file for each split and the file format (json in this case).
Finally, we can check one sample of the training dataset and see how tokenization is applied.
In a JSON file, TorchText tokenizes string fields, but when given a field containing a list of strings it assumes that the field is already tokenized.
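For the first JSON object shown above this means the author and location strings become single-token lists while the already-tokenized tweet list is left untouched, so the printed example would look roughly like this (illustrative output, omitting the age field):
{'author': ['John'], 'location': ['Rome'], 'tweet': ['another', 'lovely', 'day', '.']}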
Iterators
Before creating iterators over the datasets, we need to build the vocabulary for each Field object:
AUTHOR.build_vocab(train_ds)
LOCATION.build_vocab(train_ds)
TWEET.build_vocab(train_ds)
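Each Field now carries a vocab object that we can inspect, for example (a small sketch using the TWEET field built above):
# vocabulary size (includes the default <unk> and <pad> tokens)
print(len(TWEET.vocab))
# most frequent tokens seen while building the vocabulary
print(TWEET.vocab.freqs.most_common(5))
# index-to-token mapping (stoi gives the reverse lookup)
print(TWEET.vocab.itos[:10])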
To create iterators, we use BucketIterator.splits, specifying the datasets, the batch size, and a lambda that tells TorchText which key to use when sorting the validation/test sets (the training set is shuffled every epoch).
Finally, we can iterate over batches of the datasets using those iterators.
# determine what device to use
device = torch.device(
    'cuda' if torch.cuda.is_available() else 'cpu'
)
# create iterators for train/valid/test datasets
train_it, valid_it, test_it = data.BucketIterator.splits(
    (train_ds, valid_ds, test_ds),
    sort_key = lambda x: x.author,
    batch_size = 32,
    device = device
)
# iterate over training
for batch in train_it:
    pass
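Each batch exposes the fields as attributes named after the first element of the corresponding fields tuple. A minimal sketch of using them, assuming the dataset from the tsv example above (where the age column was skipped):
for batch in train_it:
    # each attribute is a LongTensor of token indices,
    # shaped [sequence length, batch size] with the default Field settings
    tweets = batch.tweet
    authors = batch.author
    print(tweets.shape, authors.shape)
    break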