Topic Modelling on Wikipedia

We are witnessing tremendous growth in the volume of data, which makes processing it and extracting value from it a real challenge.

Data can take one of the following three forms: structured, semi-structured and unstructured.

The trickiest form of data to analyze is unstructured data, such as text files, images and videos. An increasingly common problem is sifting through a large collection of texts and understanding the general themes present in them. Imagine being able to do this without even having to look at the texts yourself. It turns out there is a way to achieve this using machine learning techniques, for example Latent Dirichlet Allocation (LDA), first presented by David M. Blei, Andrew Y. Ng and Michael I. Jordan in an article published in 2003.

LDA is a generative probabilistic model for a collection of texts, sometimes called a document corpus, which attempts to mimic the way humans generate text. This is done by constructing a three-level hierarchical Bayesian model. One of the main aspects discussed in the article is an efficient technique for estimating the model’s parameters.
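
The generative story behind LDA can be illustrated with a toy sketch. It is not the estimation procedure from the 2003 article: the two topics, their word probabilities and the function names are made up for this example, and the Dirichlet draw is done by normalising gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # Level 1 (corpus): alpha and the topic-word distributions are fixed.
    # Level 2 (document): draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Level 3 (word): pick a topic, then pick a word from that topic.
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[topic])
        probs = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics (word -> probability), not taken from any trained model.
TOY_TOPICS = [
    {"doctor": 0.5, "patient": 0.3, "treatment": 0.2},
    {"election": 0.4, "parliament": 0.4, "vote": 0.2},
]

theta, words = generate_document(
    TOY_TOPICS, alpha=[0.5, 0.5], length=10, rng=random.Random(0)
)
```

Training runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's topic mixture.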

Given a previously unseen text, the trained model returns topic probabilities, which in turn serve as an explicit representation of the text itself. In other words, LDA is a tool which converts a text into a vector of topic probabilities. Since topics carry certain meanings, we can compare texts by calculating the distance between the corresponding vectors. This is the main idea behind content-based recommender systems, where the similarity of articles is defined entirely by their perceived meanings and not by tags or any other metadata. Topic modelling can also be a component of search engine algorithms and online advertising systems. In addition, it is a powerful tool for discovering hidden topics and relations in very large corpora. Topic models have also recently been applied successfully to biological and medical text mining.
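
As a rough illustration of comparing texts through their topic vectors, the sketch below uses cosine similarity. The choice of measure, the three-topic vectors and the function names are assumptions made for this example. The text above does not say which distance was used.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def most_similar(query, candidates):
    """Return the name of the candidate whose topic vector is closest to query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

# Hypothetical topic probabilities over (Medicine, Politics, History).
article_a = [0.80, 0.15, 0.05]
article_b = [0.70, 0.20, 0.10]
article_c = [0.05, 0.25, 0.70]

recommendation = most_similar(article_a, {"b": article_b, "c": article_c})
```

Here article b is recommended for article a because both are dominated by the same topic, without any tags or metadata being consulted.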

The aim of our project was to obtain a model which can identify general topics, such as Medicine, Politics, Finance, History, Law, etc., in any previously unseen text. The first important step to achieving this was to gather a collection of documents, covering a wide variety of areas. English Wikipedia is one of the best available choices for the following reasons:

  • Over 5.5 million articles (~14GB of compressed data) are available.
  • Articles cover almost every possible area.
  • Each article focuses mostly on one main area.
  • Free to download and use.

The only drawback of Wikipedia is that it comes in the XML format and some effort is required to parse it, in order to get the text in a readable format.

The next step was building a parsing procedure. Since Wikipedia comes as one large file, it would take up to 10 hours to parse, even if the parsing were parallelised within one local machine. Taking into account the whole pipeline, from parsing and preprocessing to model training, we had to make sure that each step was scalable. To achieve high performance, we split Wikipedia into a set of smaller chunks. This allowed us to test and debug the code for each step quickly, and it made it possible to use Apache Spark for the parsing procedure, since Spark cannot read a single file in parallel mode. With Apache Spark, every step of the pipeline can be sped up by a factor limited only by our budget.
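
One way to split a dump into chunks is sketched below. It assumes that each `<page>` and `</page>` tag sits on its own line, and the function name and chunk size are hypothetical. It is not the splitting code used in the project.

```python
def split_pages(lines, pages_per_chunk):
    """Yield lists of lines, each holding up to pages_per_chunk complete <page> elements."""
    chunk, pages_in_chunk, inside_page = [], 0, False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            inside_page = True
        if inside_page:
            # Lines outside <page> elements (root tag, site info) are skipped.
            chunk.append(line)
        if stripped == "</page>":
            inside_page = False
            pages_in_chunk += 1
            if pages_in_chunk == pages_per_chunk:
                yield chunk
                chunk, pages_in_chunk = [], 0
    if chunk:
        # Emit the last, possibly smaller, chunk.
        yield chunk

# Tiny in-memory stand-in for a dump with five pages.
sample = ["<mediawiki>\n"]
for i in range(5):
    sample += ["  <page>\n", f"    <title>Article {i}</title>\n", "  </page>\n"]
sample.append("</mediawiki>\n")

chunks = list(split_pages(sample, pages_per_chunk=2))
```

On a real dump, `lines` could be a file object opened with `bz2.open(path, "rt", encoding="utf-8")`, with each chunk written to its own file so that Spark can assign files to different workers.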

Parsing was carried out using Wikiextractor, an open-source program written in Python. Some customizations were made in order to use it with Spark. Here’s an example of what the text looked like before and after being processed by the parser:

After having parsed all of the chunks, the text needed to be preprocessed prior to being fed into the model. During the preprocessing, the following manipulations were made:

  • Removed punctuation
  • Converted all letters to lowercase
  • Replaced diacritics with the corresponding letters
  • Removed words with 2 letters or fewer
  • Removed stopwords
  • Lemmatized each word
  • Kept only lemmatized words present in an English dictionary (UK, US, AU) of around 40K words, excluding proper names
  • Removed words appearing in more than 10% of articles
  • Removed words appearing in fewer than 20 articles
  • Removed articles with fewer than 150 processed words
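
The steps above can be sketched in plain Python. This is a simplified illustration, not the SparkML code used in the project. The function names are hypothetical, the lemmatizer is a pass-through placeholder, and the regular expression drops digits as well as punctuation.

```python
import re
import unicodedata
from collections import Counter

def strip_diacritics(text):
    """Replace accented letters with their base letters (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text, stopwords, dictionary, lemmatize=lambda w: w):
    """Apply the per-document steps; lemmatize defaults to a placeholder identity."""
    text = strip_diacritics(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation (and digits)
    tokens = []
    for word in text.split():
        if len(word) <= 2 or word in stopwords:
            continue
        lemma = lemmatize(word)
        if lemma in dictionary:
            tokens.append(lemma)
    return tokens

def filter_corpus(docs, min_docs=20, max_doc_share=0.10, min_words=150):
    """Apply the corpus-level thresholds listed above to tokenized documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    max_docs = max_doc_share * len(docs)
    keep = {w for w, n in doc_freq.items() if min_docs <= n <= max_docs}
    filtered = ([w for w in doc if w in keep] for doc in docs)
    return [doc for doc in filtered if len(doc) >= min_words]

tokens = tokenize(
    "The Café treated a patient!",
    stopwords={"the"},
    dictionary={"cafe", "treated", "patient"},
)
```

The default thresholds mirror the values in the list. In a real run the stopword list, the dictionary and a proper lemmatizer would have to be supplied.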

Checking whether words appeared in the dictionary was necessary because Spark methods remain precise for vocabularies not exceeding 2^18 = 262,144 words. Without doing this check, we would have had a vocabulary of around 2 million words. All of the thresholds mentioned above were selected according to best practices.

In order to feed the processed corpus of texts into the LDA model, we first needed to convert it to a Spark DataFrame in which each text is represented by a sparse vector. This, as well as all other preprocessing, was carried out in SparkML using the CountVectorizer, Tokenizer and StopWordsRemover classes.
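
To show what "represented by a sparse vector" means, here is a plain-Python sketch of a count-based vectorizer. It only mimics the idea and is not Spark's implementation. The function names and the (size, indices, values) tuple layout are assumptions for this example.

```python
from collections import Counter

def build_vocabulary(docs, vocab_size):
    """Map the vocab_size most frequent words to integer indices."""
    counts = Counter(word for doc in docs for word in doc)
    # Most frequent first; ties are broken alphabetically to stay deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: index for index, word in enumerate(ranked[:vocab_size])}

def to_sparse_vector(doc, vocabulary):
    """Return (size, indices, values), storing only the non-zero word counts."""
    counts = Counter(vocabulary[w] for w in doc if w in vocabulary)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

docs = [["war", "king", "war"], ["king", "law"]]
vocabulary = build_vocabulary(docs, vocab_size=2 ** 18)
vector = to_sparse_vector(docs[0], vocabulary)
```

With a vocabulary of hundreds of thousands of words and only a few hundred distinct words per article, storing just the non-zero entries keeps each document small.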

Now we were ready to train the model. However, this was not as simple as executing a line of code. A number of parameters needed to be correctly set to achieve proper results. The main parameters are presented below:

To sum up, the optimizer is one of the main parameters: it defines the algorithm used to find an approximate solution. The online optimizer is the only option when dealing with relatively large corpora (>100MB) and in cases when the prior distribution parameters are unknown. The number of topics and the prior distribution parameters reflect our prior view of the dataset. Another challenging task was tuning the number of topics so that it described the dataset in the best possible way from a probabilistic standpoint.
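
The text does not say which criterion was used to tune the number of topics. A common choice, assumed here purely for illustration, is held-out perplexity: the exponential of the negative log-likelihood per token, where lower is better. The function names and the log-likelihood values below are made-up placeholders, not results from the project.

```python
import math

def perplexity(log_likelihood, num_tokens):
    """Per-token perplexity of a held-out set; lower means a better fit."""
    return math.exp(-log_likelihood / num_tokens)

def best_topic_count(scores, num_tokens):
    """scores maps a candidate number of topics to its held-out log-likelihood."""
    return min(scores, key=lambda k: perplexity(scores[k], num_tokens))

# Placeholder log-likelihoods for three candidate topic counts (not real measurements).
held_out_scores = {10: -760000.0, 20: -720000.0, 30: -740000.0}
chosen = best_topic_count(held_out_scores, num_tokens=100000)
```

In practice the log-likelihoods would come from models trained with each candidate number of topics and evaluated on articles held out from training.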

All computationally intensive steps were carried out on custom-configured AWS EMR clusters, which significantly reduced the overall computation time. The cluster configurations for preprocessing and model training are presented below:

  • CPU-optimized cluster for preprocessing and parsing
    • M4 large master node
    • 2 C5 2xlarge (8 CPUs, 16 GB RAM) on-demand slave instances
    • 18 C5 2xlarge (8 CPUs, 16 GB RAM) spot slave instances
  • RAM-optimized cluster for LDA training
    • M4 large master node
    • 2 R3 2xlarge (8 CPUs, 61 GB RAM) on-demand slave instances
    • 8 R3 2xlarge (8 CPUs, 61 GB RAM) spot slave instances

In the case of Wikipedia, we ended up with 50 topics, which we considered the best number for capturing the general meaning of texts. Having obtained these 50 topics, we analyzed their most frequent words and the articles assigned to them with the highest probabilities, and manually named each topic.
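
Listing a topic's most frequent words, as a first step towards naming it, could look like the sketch below. The function name, the vocabulary and the weights are toy values chosen for illustration. They do not come from the trained model.

```python
def top_words(topic_word_weights, vocabulary, n=3):
    """Return the n highest-weighted words of one topic."""
    ranked = sorted(
        range(len(topic_word_weights)),
        key=lambda i: topic_word_weights[i],
        reverse=True,
    )
    return [vocabulary[i] for i in ranked[:n]]

# Hypothetical vocabulary and per-topic word weights.
vocabulary = ["king", "doctor", "war", "patient"]
topic_weights = [0.05, 0.50, 0.05, 0.40]

label_hint = top_words(topic_weights, vocabulary, n=2)
```

A person reading "doctor, patient" would likely name this toy topic Medicine, which is the kind of manual step described above.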

Here’s an example of what our trained model returned when given a previously unseen text. A random block of text from Machiavelli’s The Prince was used as input.

We will dive deeper into the whole model optimization pipeline, as well as other aspects, including cluster configuration, in other articles to follow.
