What’s in a Song? LDA Topic Modeling of over 120,000 Lyrics

How to find underlying topics in song lyrics by implementing Latent Dirichlet Allocation in Python using gensim, NLTK, and pyLDAvis

Tim Denzler
7 min read · May 14, 2021

Have you ever had the feeling that most popular songs on the radio or in the charts cover the same topics? Well… I did, which is why I applied topic modeling to a corpus containing the lyrics of over 120,000 songs in order to discover underlying themes.

Photo by C D-X on Unsplash

Here are the main Python libraries used for this project:

  • langdetect for filtering songs that are not primarily in English
  • NLTK for preprocessing our song lyrics
  • gensim for generating our topic model
  • pyLDAvis for visualizing our topic model

Dataset

The main source for our topic model is the AZLyrics dataset from Kaggle [1]. It contains approximately 147,000 songs scraped from AZLyrics, spread across multiple CSV files that include artist names, artist URLs, song names, song URLs, and lyrics. If you want, you can also rerun the code provided by the dataset’s author to scrape your own data [2].

Topic Modeling and Latent Dirichlet Allocation

A topic model is a generative model that aims to discover the underlying topics in a collection of documents and each document’s assumed affinity to these topics. A popular and well-established topic modeling algorithm is Latent Dirichlet Allocation (LDA), a probabilistic generative model built on the assumption that every document in a corpus is a mixture of latent topics and that each of these topics is itself a probability distribution over words. For more information, you can check out the original paper by Blei et al. [3], or these blog posts [4],[5].
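Put slightly more formally, LDA’s mixture assumption means that the probability of observing a word w in a document d decomposes over the latent topics:

p(w | d) = Σₖ p(w | topic k) · p(topic k | d)

where p(topic k | d) is the document’s topic mixture and p(w | topic k) is the topic’s word distribution; LDA infers both from the observed words alone, which is why the topics are called latent.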

Step 1: Reading the CSV Files and Language-Based Filtering

We start by importing the csv library to read our csv files, the os library for interacting with the operating system, and the nltk library for preprocessing. Optionally, you can import pickle to store interim results and tqdm to track progress in your script.

As our data is stored in separate CSV files, we go through each file and append each song’s lyrics to our lyric_corpus list. We use this opportunity to skip songs that are missing lyrics or that are not primarily in English. For language detection, we leverage the Python library langdetect, which uses a Naïve Bayes classifier pre-trained on Wikipedia data, and we require a detection probability of at least 0.95 for English.

import csv
import os
import nltk
import pickle  # optional: for storing interim results
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

lyric_corpus = []
lang_filtered = 0
for filename in os.listdir('azlyrics-scraper'):
    path = './azlyrics-scraper/' + filename
    with open(path, 'r') as f:
        data = csv.reader(f)
        headers = next(data)  # skip the header row
        for row in data:
            if len(row) < 5:   # skip malformed rows
                continue
            if not row[4]:     # skip songs without lyrics
                continue
            try:
                langlist = detect_langs(row[4])
                for l in langlist:
                    if l.prob < 0.95 or l.lang != 'en':
                        lang_filtered += 1
                    else:
                        lyric_corpus.append(row[4])
            except LangDetectException:
                continue

Step 2: Lyric Tokenization

Next, we tokenize our lyrics using NLTK’s RegexpTokenizer. After this step, each song’s lyrics, previously stored as a single string, become a list of single-word strings, also referred to as tokens.

from nltk.tokenize import RegexpTokenizer

lyric_corpus_tokenized = []
tokenizer = RegexpTokenizer(r'\w+')  # keep alphanumeric word tokens, drop punctuation
for lyric in lyric_corpus:
    tokenized_lyric = tokenizer.tokenize(lyric.lower())
    lyric_corpus_tokenized.append(tokenized_lyric)

Step 3: Removing Numeric Tokens and Tokens with Fewer than 3 Characters

We then remove numeric tokens as well as tokens with fewer than three characters (melodic sounds such as ‘oh’, ‘na’, ‘la’, and ‘da’ distorted the topic model in earlier iterations).

for s, song in enumerate(lyric_corpus_tokenized):
    filtered_song = []
    for token in song:
        if len(token) > 2 and not token.isnumeric():
            filtered_song.append(token)
    lyric_corpus_tokenized[s] = filtered_song

Step 4: Token Lemmatization

To further improve comparability, we use NLTK’s WordNetLemmatizer, which groups together inflected forms of words [6],[7].

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for s, song in enumerate(lyric_corpus_tokenized):
    lemmatized_tokens = []
    for token in song:
        lemmatized_tokens.append(lemmatizer.lemmatize(token))
    lyric_corpus_tokenized[s] = lemmatized_tokens

Step 5: Removing Stop Words and Profanities

In the final preprocessing step, we remove all words that carry little to no additional information for topic modeling in order to further reduce dimensionality. For this, NLTK’s stopwords corpus, a list of common English stop words, is imported and extended with terms specific to song lyrics.

from nltk.corpus import stopwords

stop_words = stopwords.words('english')
new_stop_words = ['ooh', 'yeah', 'hey', 'whoa', 'woah', 'ohh', 'was', 'mmm',
                  'oooh', 'yah', 'yeh', 'hmm', 'deh', 'doh', 'jah', 'wa']
stop_words.extend(new_stop_words)
for s, song in enumerate(lyric_corpus_tokenized):
    filtered_text = []
    for token in song:
        if token not in stop_words:
            filtered_text.append(token)
    lyric_corpus_tokenized[s] = filtered_text

In addition, I decided to filter out profanities for the purposes of this article using a curated list. As this may distort the results, feel free to skip this step.

profanities = []
with open('profanity.txt', 'r') as file:
    prof_string = file.read().replace('\n', '')
    prof_tokens = prof_string.split(", ")
    for token in prof_tokens:
        profanities.append(token)

for s, song in enumerate(lyric_corpus_tokenized):
    filtered_text = []
    for token in song:
        if token not in profanities:
            filtered_text.append(token)
    lyric_corpus_tokenized[s] = filtered_text

Step 6: Dictionary Creation and Occurrence-Based Filtering

In order to perform Latent Dirichlet Allocation, we use the popular and well-established Python library gensim, which requires a dictionary representation of the documents. This means each token is mapped to a unique ID, which reduces the overall dimensionality of the corpus. In addition, we filter out tokens that occur in fewer than 100 songs, as well as tokens that occur in more than 80% of songs.

from gensim.corpora import Dictionary

dictionary = Dictionary(lyric_corpus_tokenized)
dictionary.filter_extremes(no_below=100, no_above=0.8)

Step 7: Bag-of-Words and Index to Dictionary Conversion

Each song (at this point a list of tokens) is converted into the bag-of-words format, which stores only each unique token’s ID and its count per song.

from gensim.corpora import MmCorpus

gensim_corpus = [dictionary.doc2bow(song) for song in lyric_corpus_tokenized]
temp = dictionary[0]  # accessing an entry forces gensim to build the id2token mapping
id2word = dictionary.id2token
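
If you want to sanity-check the conversion, you can peek at a song’s first few (token ID, count) pairs and map an ID back to its token; the exact output shown in the comments is illustrative and will vary with your data.

print(gensim_corpus[0][:5])  # e.g. [(0, 2), (1, 1), (2, 3), ...]
print(id2word[0])            # the token behind ID 0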

Step 8: Setting the Model Parameters

Latent Dirichlet Allocation in gensim allows us to set several modeling parameters: (1) the number of topics k, (2) the chunk size, (3) the number of passes through the corpus during training, (4) the maximum number of iterations, (5) the alpha hyperparameter, and (6) the eta hyperparameter.

  • num_topics (k) = how many topics should be discovered; this has to be set by us, and we decide on 6 topics
  • chunksize = the number of documents considered in each training cycle
  • passes = the number of passes through the corpus during training
  • iterations = the maximum number of iterations
  • alpha and eta = hyperparameters that shape the Dirichlet priors: a lower alpha favors documents composed of a few dominant topics, while a lower eta (or beta) favors topics composed of a few dominant words (see the short sketch after the parameter values below)

chunksize = 2000
passes = 20
iterations = 400
num_topics = 6
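
For the alpha and eta bullet above: the training call in Step 9 lets gensim learn both priors from the data (alpha='auto', eta='auto'). If you prefer to experiment with fixed priors instead, a minimal sketch might look like the following; the values 0.1 and 0.01 are purely illustrative and not the ones used for this article’s model.

# Hypothetical alternative to the 'auto' priors used in Step 9: fixed scalar priors.
fixed_alpha = 0.1  # lower alpha -> each song concentrates on fewer topics
fixed_eta = 0.01   # lower eta -> each topic concentrates on fewer words
# lda_model = LdaModel(corpus=gensim_corpus, id2word=id2word, num_topics=num_topics,
#                      chunksize=chunksize, passes=passes, iterations=iterations,
#                      alpha=fixed_alpha, eta=fixed_eta)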

Step 9: Execute Model Training

Finally, we train an LDA topic model with 6 topics. Due to the large corpus, this may take some time (up to half an hour).

from gensim.models import LdaModel

lda_model = LdaModel(
    corpus=gensim_corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',  # learn an asymmetric document-topic prior from the data
    eta='auto',    # learn the topic-word prior from the data
    iterations=iterations,
    num_topics=num_topics,
    passes=passes
)
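
If the runtime is an issue on your machine, gensim also ships LdaMulticore, which parallelizes training across CPU cores. Note that it does not support the learned alpha='auto' prior, so the sketch below falls back to the default symmetric alpha and its results will differ slightly from the single-core model above; the choice of 3 workers is arbitrary.

from gensim.models import LdaMulticore

# Hypothetical speed-up: parallel LDA training on several CPU cores.
lda_model_mc = LdaMulticore(
    corpus=gensim_corpus,
    id2word=id2word,
    chunksize=chunksize,
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    workers=3  # number of additional worker processes
)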

Optionally, we can calculate a coherence score for our model, a measure used to assess model quality, which is notoriously difficult to evaluate in the topic modeling domain. Topic coherence indicates whether the different terms within a topic belong together. The gensim library offers four coherence measures: (1) UMass, (2) UCI, (3) NPMI, and (4) Cv. As demonstrated by Röder et al. [8], Cv correlates best with human ratings, although other measures outperform it in terms of runtime. Cv scores topic coherence in the range [0, 1], with lower scores implying less coherence within a topic.

from gensim.models.coherencemodel import CoherenceModel

coherencemodel = CoherenceModel(model=lda_model, texts=lyric_corpus_tokenized,
                                dictionary=dictionary, coherence='c_v')
print(coherencemodel.get_coherence())
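
Since the number of topics has to be chosen up front, a common extension is to train several candidate models and compare their Cv scores. A rough sketch, with an arbitrary set of candidate values for k, might look like this (keep in mind that it multiplies the training time):

# Hypothetical model selection: one LDA model per candidate number of topics.
for k in [4, 6, 8, 10]:
    candidate = LdaModel(corpus=gensim_corpus, id2word=id2word, chunksize=chunksize,
                         alpha='auto', eta='auto', iterations=iterations,
                         num_topics=k, passes=passes)
    cm = CoherenceModel(model=candidate, texts=lyric_corpus_tokenized,
                        dictionary=dictionary, coherence='c_v')
    print(k, cm.get_coherence())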

Step 10: Visualizing the LDA Model using pyLDAvis

As we would like to visualize our generated topic model, we use pyLDAvis, the Python implementation of the LDAvis method proposed by Sievert & Shirley [9].

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis_data = gensimvis.prepare(lda_model, gensim_corpus, dictionary)
# pyLDAvis.display(vis_data)  # uncomment to render the visualization inline in a notebook
pyLDAvis.save_html(vis_data, './Lyrics_LDA_k_' + str(num_topics) + '.html')

Saliency describes how much information a word provides about its topic association. In pyLDAvis, the displayed relevance of a word depends on the factor λ. A value of λ = 1 ranks terms purely by their topic-specific probability, while a value of λ = 0 ranks them purely by their lift, which down-weights terms that are frequent throughout the whole corpus. Even though Sievert & Shirley [9] suggest a default of λ = 0.6 based on a user study, varying λ can help to narrow down and label a suggested topic’s true theme.
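
For reference, Sievert & Shirley [9] define the relevance of a term w to a topic t as

relevance(w, t | λ) = λ · log p(w | t) + (1 − λ) · log [ p(w | t) / p(w) ]

so λ simply interpolates between a term’s topic-specific probability and its lift, i.e. its topic-specific probability relative to its overall frequency in the corpus.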

The Final Result

You can find an interactive visualization of our generated Song Lyrics Topic Model on GitHub Pages. In addition, the figure below may give you an idea of the six topics prominent in our song lyric corpus.

Interactive Topic Model Visualization of Song Lyrics

If we examine our six topics, we can label them based on relevant keywords as follows:

  • Topic 1: Obstacles & Time (e.g., ’never’, ’could’, ’time’)
  • Topic 2: Religion (e.g., ’heaven’, ’lord’, ’heart’)
  • Topic 3: Romance (e.g., ’love’, ’baby’, ’want’)
  • Topic 4: Wealth & Flexing (e.g., ’money’, ’check’, ’club’)
  • Topic 5: Home & Nature (e.g., ’home’, ’town’, ’road’)
  • Topic 6: Violence (e.g., ’kill’, ’dead’, ’war’)

Of course, this labeling is subjective, and more distinct topics may emerge when changing the granularity and increasing the number of topics to be modeled. Nevertheless, this project and article should provide a first glimpse of the main topics covered in songwriting.

If you want to rerun the code or generate your own topic model, check out my GitHub repository.

References

[1] A. Suarez, AZLyrics song lyrics (2021)

[2] A. Suarez, azlyrics-scraper (2021)

[3] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet Allocation (2003)

[4] T. Ganegedara, Intuitive Guide to Latent Dirichlet Allocation (2018)

[5] R. Kulshrestha, A Beginner’s Guide to Latent Dirichlet Allocation(LDA) (2019)

[6] H. Jabeen, Stemming and Lemmatization in Python (2018)

[7] D. Tunkelang, Stemming and Lemmatization (2017)

[8] M. Röder, A. Both and A. Hinneburg, Exploring the Space of Topic Coherence Measures (2015)

[9] C. Sievert and K. Shirley, LDAvis: A Method for Visualizing and Interpreting Topics (2014)

Tim Denzler

Data Engineer focusing on NLP, Research, Data Analytics, and the Semantic Web.