Train ULMFiT Language Model with Wikipedia
22 Nov 2018 by dzlab

Language Modeling (LM) is one of the main tasks in natural language processing (NLP). Put simply, it aims to predict the next word in a sequence. For example, given the sentence “I am writing a …”, the word coming next could be “email” or “blog post”. Put formally, given a sequence of words x(1), x(2), …, x(t), a language model computes the probability distribution P(x(t+1) | x(1), …, x(t)) of the next word x(t+1). This problem can be solved by many different algorithms.
Before anything else, the first thing to do in any Machine Learning task is to gather the right data and clean it. In the case of NLP tasks, the data is a collection of texts (also called a corpus) that can be in a single language (e.g. for building a language model) or span multiple languages. The Internet is filled with text data; for instance, Wikipedia is a great text source and is freely available.
In the following, we will first build an Arabic corpus from Wikipedia articles, then train a Language Model on it, and finally use it to predict sentences starting with some initial words.
Data
Starting from a raw Wikipedia dump file, we construct a corpus for training a Language Model.
Download the Wikipedia Dump File
A Wikipedia database dump file is quite large (e.g. English dumps are more than 10GB), so downloading, storing, and processing such a file can be tricky.
In the following, the Arabic language dump for 2018-11-01 is used (around 800MB). More dumps for Arabic can be found on the Wikipedia dumps page - link. First download the data (no need to un-compress it) and have a look at the different files.
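As a quick sketch, the dump can be fetched with Python’s standard library. The URL below follows the usual dumps.wikimedia.org layout for the 2018-11-01 Arabic dump; older dumps are periodically removed from the main server, so the exact URL may no longer be live.

```python
import os
from urllib.request import urlretrieve

# Assumes the standard dumps.wikimedia.org layout for the 2018-11-01 Arabic dump.
url = ('https://dumps.wikimedia.org/arwiki/20181101/'
       'arwiki-20181101-pages-articles-multistream.xml.bz2')
dump_file = 'arwiki-20181101-pages-articles-multistream.xml.bz2'

# Download the compressed dump as-is; there is no need to un-compress it.
urlretrieve(url, dump_file)
print(os.path.getsize(dump_file) / 1e6, 'MB')  # roughly 800MB for this dump
```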
Create a Corpus
The `arwiki-20181101-pages-articles-multistream.xml.bz2` file is written in the Wikipedia markup language; it contains a mix of page contents, links to other pages or translated versions, images, etc. It needs to be cleaned, which can be done with a topic modeling library like gensim. The following Python script uses gensim’s WikiCorpus class to construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and store it in multiple text files, each containing the same number of articles.
Make sure the gensim library is installed (e.g. with `pip install gensim`).
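What follows is a rough sketch of such a script; the file naming scheme, the number of articles per file, and the command-line interface are illustrative rather than the exact code from the original post.

```python
"""Extract article text from a Wikipedia dump into multiple plain-text files.

Usage: python make_corpus.py <dump.xml.bz2> <output_dir>
"""
import sys
from pathlib import Path

from gensim.corpora.wikicorpus import WikiCorpus


def make_corpus(dump_path, out_dir, articles_per_file=10000):
    """Stream articles out of the compressed dump and write them into text
    files holding `articles_per_file` articles each (one article per line)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Passing dictionary={} skips building a vocabulary, which is not needed
    # here and would make the pass over the dump even slower.
    wiki = WikiCorpus(dump_path, dictionary={})

    file_idx, article_count = 0, 0
    out = open(out_dir / f'wiki_{file_idx:03d}.txt', 'w', encoding='utf-8')
    for tokens in wiki.get_texts():  # one tokenized article at a time
        # Older gensim releases yield bytes instead of str tokens.
        tokens = [t.decode('utf-8') if isinstance(t, bytes) else t for t in tokens]
        out.write(' '.join(tokens) + '\n')
        article_count += 1
        if article_count % articles_per_file == 0:
            out.close()
            file_idx += 1
            out = open(out_dir / f'wiki_{file_idx:03d}.txt', 'w', encoding='utf-8')
    out.close()
    print(f'Wrote {article_count} articles to {out_dir}')


if __name__ == '__main__':
    make_corpus(sys.argv[1], sys.argv[2])
```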
Turn the above script into an executable and run it against `arwiki-20181101-pages-articles-multistream.xml.bz2`:
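For example, assuming the sketch above was saved as `make_corpus.py` (a hypothetical name) and the output should go into a `corpus/` directory:

```bash
python make_corpus.py arwiki-20181101-pages-articles-multistream.xml.bz2 corpus/
```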
Note that the extraction of the texts from the `.bz2` file is a very slow operation.
Model
PreProcessing
First, load the raw text files and tokenize them using the appropriate Tokenizer.
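As a rough sketch using the fastai v1 text API that was current at the time (method names changed slightly across v1 releases, and the `corpus/` path and batch size here are illustrative):

```python
from fastai.text import *  # fastai v1 text API

path = Path('corpus')  # directory holding the plain-text files produced earlier

# Build a language-model databunch: the texts are tokenized and numericalized,
# and the targets are the same token sequences shifted by one position.
# The default tokenizer can be swapped for a language-specific one by passing
# a custom processor to TextList.
data_lm = (TextList.from_folder(path)
           .split_by_rand_pct(0.1)   # named random_split_by_pct in older 1.0.x releases
           .label_for_lm()
           .databunch(bs=48))

data_lm.show_batch()  # sanity-check a few tokenized samples
```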
Training
Once the data is in the right shape, instantiate a learner, find a suitable learning rate, and train it for a couple of epochs.
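A sketch, again assuming fastai v1; the exact signature of `language_model_learner` changed between releases, and the dropout multiplier, learning rate, and number of epochs below are illustrative.

```python
# Train the AWD-LSTM language model from scratch (the pretrained English
# weights shipped with fastai are of no use for Arabic).
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)

learn.lr_find()          # learning-rate range test
learn.recorder.plot()    # pick a value where the loss is still clearly decreasing

learn.fit_one_cycle(2, 1e-2)  # train for a couple of epochs with the 1cycle policy
```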
Prediction
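With the trained learner, fastai’s `learn.predict` generates text one token at a time from a seed string; a minimal sketch, where the seed words, number of generated words, and sampling temperature are all illustrative:

```python
# Hypothetical Arabic seed text; the model appends n_words tokens to it.
seed_text = 'كان يا ما كان'
print(learn.predict(seed_text, n_words=20, temperature=0.8))
```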
The full Jupyter notebook can be found here - link.
Additional resources: