Markup languages such as XML are handy for storing and exchanging structured data. For NLP tasks (e.g. text classification), however, we may want to work with pandas DataFrames, as they are more practical. The following illustrates an example of parsing XML data, in particular the Reuters-21578 collection, whose documents appeared on the Reuters newswire in 1987. A detailed description of this dataset can be found at this link.
Downloading the data
First download the data, uncompress it and have a look at the different files.
The lewis.dtd file contains, unsurprisingly, a DTD describing the structure of the XML files. The *.sgm files contain the data to be extracted; below is a snippet of one of these files.
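A minimal sketch of this step in Python, assuming the data comes as a gzip-compressed tar archive. The function name `download_and_extract` and the local archive file name are made up for illustration, and the archive URL is deliberately left as a parameter: use the one given in the dataset description.

```python
import os
import tarfile
import urllib.request


def download_and_extract(url, target_dir):
    """Download a .tar.gz archive, unpack it and return the file names found."""
    os.makedirs(target_dir, exist_ok=True)
    # Hypothetical local name for the downloaded archive.
    archive_path = os.path.join(target_dir, "reuters21578.tar.gz")
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))


# Usage (the URL is not filled in here on purpose):
# for name in download_and_extract(archive_url, "/data/reuters21578"):
#     print(name)
```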
Parsing a document
Unsurprisingly, working with a text dataset that was created manually is a tedious task, and a lot of unexpected problems can be encountered. The following is a list of the issues in this dataset and how to solve them.
1. Unicode decode errors
When trying to read a file into a string as UTF-8, in order to parse it later as XML, the following error is encountered (for file reut2-017.sgm):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1519554: invalid start byte
What’s happening is that Python, with open('path', 'r').read(), tries to convert the bytes in this file (assuming they are a UTF-8-encoded string) to a Unicode string (str). It then encounters a byte sequence which is not allowed in UTF-8-encoded strings (namely this 0xfc at position 1519554).
What we can do is read the file in binary mode, then iterate over the lines and decode each of them as UTF-8 as follows:
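One way to sketch this, assuming it is acceptable to silently drop the few bytes that are not valid UTF-8 (`errors="ignore"`). A strict decode would fail on the same byte as before, and `errors="replace"` is an alternative if you prefer to keep a visible marker. The function name `read_lines` is made up for illustration.

```python
def read_lines(path):
    """Read a file in binary mode and decode it line by line as UTF-8."""
    lines = []
    with open(path, "rb") as handle:
        for raw in handle:
            # errors="ignore" skips bytes such as 0xfc instead of raising
            # UnicodeDecodeError.
            lines.append(raw.decode("utf-8", errors="ignore"))
    return lines
```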
2. Special characters
In addition to the invalid UTF-8 bytes, the files (especially in the <UNKNOWN> tag) contain invalid characters that make the XML parsing of the file fail:
File "/data/reuters21578/reut2-016.sgm", line 11
XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 5, line 11, column 5
In this case, we have to remove those characters. The following simple RegEx pattern removes every character reference of this kind.
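A possible version of such a pattern, assuming the offending sequences are numeric character references like `&#5;` (inferred from the `xmlParseCharRef: invalid xmlChar value 5` message). The sketch only removes references to control characters that XML 1.0 does not allow and leaves valid ones untouched. The names `CHAR_REF` and `strip_invalid_char_refs` are made up for illustration.

```python
import re

# Decimal character references such as "&#5;" or "&#65;".
CHAR_REF = re.compile(r"&#(\d+);")

# Control characters below 32 that XML 1.0 does allow: tab, LF, CR.
ALLOWED_CONTROL = {9, 10, 13}


def strip_invalid_char_refs(text):
    """Remove references to control characters that an XML parser rejects."""
    def _replace(match):
        code = int(match.group(1))
        if code < 32 and code not in ALLOWED_CONTROL:
            return ""
        return match.group(0)  # keep valid references unchanged

    return CHAR_REF.sub(_replace, text)
```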
3. Dates mixed with text
Dates in the <DATE> tag have the general shape dd-MMM-yyyy hh:MM:ss.SS, but on some occasions I encountered dates that look like this:
27-MAR-1987 13:49:54.59E RM
27-MAR-1987 13:53:00.39C M
27-MAR-1987 13:58:01.19E A RM
27-MAR-1987 14:01:21.93V RM
27-MAR-1987 14:01:56.71C M
27-MAR-1987 14:02:56.54V RM
9-APR-1987 00:00:00.00 # date added by S Finch as guesswork
In this case a simple RegEx can be used to extract the date, ignoring the unwanted text.
The previous snippets are grouped together into a helper class for parsing the Reuters dataset.
This class can be used as follows to transform the raw data into a pandas DataFrame:
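As a sketch, the pattern below matches only the leading timestamp and ignores whatever follows it. The month names are mapped by hand so that the result does not depend on the system locale. `DATE_PATTERN`, `MONTHS` and `parse_date` are illustrative names.

```python
import re
from datetime import datetime

# Matches e.g. "27-MAR-1987 13:49:54.59"; trailing text is ignored.
DATE_PATTERN = re.compile(
    r"(\d{1,2})-([A-Z]{3})-(\d{4})\s+(\d{2}):(\d{2}):(\d{2})\.(\d{2})"
)
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_date(raw):
    """Return the datetime found in raw, or None if there is none."""
    match = DATE_PATTERN.search(raw)
    if match is None or match.group(2) not in MONTHS:
        return None
    day, month, year, hour, minute, second, centis = match.groups()
    return datetime(
        int(year), MONTHS[month], int(day),
        int(hour), int(minute), int(second),
        int(centis) * 10000,  # hundredths of a second to microseconds
    )
```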
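What such a helper could look like is sketched below. It uses only the standard library and is not the original class: the name `ReutersParser`, its method names, the synthetic `ROOT` wrapper and the set of extracted fields (`NEWID`, `DATE`, `TOPICS/D`, `TEXT/TITLE`, `TEXT/BODY`) are assumptions about the file layout. Real files may need more cleaning, or a more lenient parser, than shown here.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime


class ReutersParser:
    """Hypothetical helper combining the three fixes described above."""

    # Numeric character references such as "&#5;" (assumed shape).
    CHAR_REF = re.compile(r"&#(\d+);")
    # Leading timestamp such as "27-MAR-1987 13:49:54.59".
    DATE = re.compile(
        r"(\d{1,2})-([A-Z]{3})-(\d{4})\s+(\d{2}):(\d{2}):(\d{2})\.(\d{2})"
    )
    MONTHS = {
        "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
        "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
    }

    @staticmethod
    def _drop_control_ref(match):
        """Remove references to control characters that XML 1.0 forbids."""
        code = int(match.group(1))
        return "" if code < 32 and code not in (9, 10, 13) else match.group(0)

    def read_text(self, path):
        """Decode line by line, skipping bytes that are not valid UTF-8."""
        with open(path, "rb") as handle:
            lines = [raw.decode("utf-8", errors="ignore") for raw in handle]
        # The DOCTYPE line is dropped because the documents are wrapped in a
        # synthetic root element before parsing.
        lines = [line for line in lines if not line.startswith("<!DOCTYPE")]
        return self.CHAR_REF.sub(self._drop_control_ref, "".join(lines))

    def parse_date(self, raw):
        """Extract the timestamp from a DATE value, ignoring trailing text."""
        match = self.DATE.search(raw or "")
        if match is None or match.group(2) not in self.MONTHS:
            return None
        day, month, year, hour, minute, second, centis = match.groups()
        return datetime(int(year), self.MONTHS[month], int(day),
                        int(hour), int(minute), int(second),
                        int(centis) * 10000)

    def parse(self, path):
        """Yield one dict per REUTERS element found in the file."""
        root = ET.fromstring("<ROOT>" + self.read_text(path) + "</ROOT>")
        for doc in root.iter("REUTERS"):
            yield {
                "id": doc.get("NEWID"),
                "date": self.parse_date(doc.findtext("DATE")),
                "topics": [d.text for d in doc.findall("TOPICS/D")],
                "title": doc.findtext("TEXT/TITLE"),
                "body": doc.findtext("TEXT/BODY"),
            }
```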
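A hedged sketch of that last step, assuming a parser object that exposes a `parse(path)` method yielding one dict per document. Both that method name and the function `reuters_to_dataframe` are illustrative, not the original API.

```python
import glob
import os

import pandas as pd


def reuters_to_dataframe(parser, data_dir):
    """Parse every *.sgm file in data_dir and collect the records in a DataFrame."""
    records = []
    for path in sorted(glob.glob(os.path.join(data_dir, "*.sgm"))):
        # parser.parse(path) is assumed to yield one dict per document.
        records.extend(parser.parse(path))
    # Each dict becomes a row; the dict keys become the columns.
    return pd.DataFrame(records)


# Usage, with a parser object of your own:
# df = reuters_to_dataframe(parser, "/data/reuters21578")
# print(df.head())
```

Each dict key ends up as a column, so later steps such as filtering by topic can rely on the usual pandas operations.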