When faced with a huge body of text, sometimes we may just want to see if there are common threads running through it.

Topic modelling is a good way to do that. Essentially, the algorithms search for clusters of words that may (or may not) form logical common threads (topics) in the text.

Latent Dirichlet Allocation (LDA) is one of the most commonly used methods for extracting groups of topics from text. Latent Semantic Indexing (LSI) is another possibility. There are good posts on LDA and LSI on the interwebs, so I shall not go into the math or theory behind these methods. Rather, I shall just demonstrate what LDA and LSI are good for - discovering groups of topics in a body of text (or what most term the corpus).
First, we need to find something interesting to run LDA/LSI on. Ideally, the larger the corpus, the more interesting the results. But as this is just a demo, we will use ‘Ulysses’ from the Project Gutenberg site.
```
source = 'https://www.gutenberg.org/files/4300/4300-0.txt'
```
Download the data with the ‘requests’ library, as we have done for the other scripts.
```
import requests
import re

corpus = requests.get(source)
```
Using a combination of string functions and the ‘re’ library, we strip out some of the special characters (e.g. newlines)
```
corpus_text = corpus.text.strip('\ufeff')
regex = re.compile(r'[\n\r\t]')
corpus_text = regex.sub(' ', corpus_text)
corpus_text = corpus_text.replace('\\', '')
corpus_text[:1000]  # First 1000 characters
```
For this exercise, we will use gensim to do LDA and LSI, so let’s import the libraries first
```
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
```
The following function is a simple one to process the text we downloaded, stripping out the stopwords (words like ‘is’, ‘are’ …)
```
def get_tokens(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]

tokens = get_tokens(corpus_text)
```
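As a quick check of what the function does, here is a made-up sentence run through it (the exact output depends on gensim’s STOPWORDS list, so take the result as indicative):

```
# A hypothetical example sentence, not part of the corpus
get_tokens('This is a sample sentence about topic modelling in Ulysses')
# Stopwords such as 'this', 'is', 'a', 'about' and 'in' should be dropped,
# leaving something like ['sample', 'sentence', 'topic', 'modelling', 'ulysses']
```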
We then convert this into a bag of words - basically preparing the text so that we can feed it into the LDA/LSI model.
```
dictionary = gensim.corpora.Dictionary([tokens])
# We treat the whole book as one document, hence wrapping tokens in a list;
# note this re-uses the name 'corpus', replacing the requests response above
corpus = [dictionary.doc2bow(token) for token in [tokens]]
```
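To see what the bag-of-words representation actually holds, we can peek at it - a small inspection sketch, not part of the original walkthrough. Each entry is a (token_id, count) pair, and the dictionary maps the ids back to words:

```
# First few (token_id, count) pairs for our single document
print(corpus[0][:5])
# Map the first token id back to the word it represents
print(dictionary[corpus[0][0][0]])
```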
Now to build the LDA model for this text. The main parameters are the bag-of-words corpus, the number of topics to extract (num_topics - 5 here is an arbitrary choice for this demo), and the dictionary that maps token ids back to words (id2word):

```
lda_model = gensim.models.LdaModel(corpus, num_topics=5, id2word=dictionary)
lda_model.show_topics(5)
```
And the LSI equivalent:

```
lsi_model = gensim.models.LsiModel(corpus, id2word=dictionary)
lsi_model.print_topics()
```
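With the models trained, we can also ask how the document itself scores against the extracted topics - a short sketch using standard gensim calls:

```
# Topic mixture of our single document under each model
print(lda_model.get_document_topics(corpus[0]))
print(lsi_model[corpus[0]])
```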
The topics generated from such a small text body are obviously not satisfactory, but the script shows how easy it would be to do this for any other collection of text.
We must however understand that these methods can only show us collections of words that sit close to each other and could be grouped as topics. We will still have to eyeball the results to see if they make sense, and decide how the words in each of these clusters link together.
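If we want something more quantitative than eyeballing, gensim also provides a topic coherence score. Below is a rough sketch, assuming the lda_model, tokens and dictionary from earlier; the ‘c_v’ measure is just one common choice, and scores on a single-document corpus should be taken with a large pinch of salt:

```
from gensim.models import CoherenceModel

# Higher coherence generally indicates more interpretable topics
coherence = CoherenceModel(model=lda_model, texts=[tokens],
                           dictionary=dictionary, coherence='c_v')
print(coherence.get_coherence())
```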
The Jupyter notebook with the code is here