


Kosta Zaitsev - CS Department, Ben-Gurion University

The main objective of this project was to practice topic modelling on a large corpus and to process, evaluate and visualize the results.

The topic modelling was performed with the MALLET tool on a corpus of more than 500,000 song lyrics. I decided to divide the lyrics into 50 topics: on the one hand, this is few enough for each topic to generalize over many songs, and on the other hand, it is enough to keep the topics reasonably accurate.

The next step was to cluster the results by artist, calculate the average topic proportions across each artist's songs, and decide which topic is most closely related to each artist.
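
The aggregation step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `dominant_topic_per_artist`, the input layout (one pair of artist and topic proportions per song) and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def dominant_topic_per_artist(songs):
    """Average the topic proportions of each artist's songs and pick the top topic.

    `songs` is an iterable of (artist, proportions) pairs, where `proportions`
    is a list with one value per topic. This layout is hypothetical.
    Returns {artist: (topic_index, mean_proportion)}.
    """
    sums = {}
    counts = defaultdict(int)
    for artist, proportions in songs:
        if artist not in sums:
            sums[artist] = [0.0] * len(proportions)
        for k, p in enumerate(proportions):
            sums[artist][k] += p
        counts[artist] += 1

    result = {}
    for artist, totals in sums.items():
        means = [t / counts[artist] for t in totals]
        # Index of the topic with the highest average proportion.
        best = max(range(len(means)), key=means.__getitem__)
        result[artist] = (best, means[best])
    return result


# Toy data with 3 topics instead of 50 (made up for illustration).
songs = [
    ("Artist A", [0.7, 0.2, 0.1]),
    ("Artist A", [0.5, 0.4, 0.1]),
    ("Artist B", [0.1, 0.3, 0.6]),
]
dominant = dominant_topic_per_artist(songs)
```

With 50 topics the same logic applies unchanged; only the length of each proportions list differs.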

At this point, the topics are identified only as groups 0-49. Therefore, the most important step was to label each topic group with meaningful text that describes it. This was done with NLP techniques and manually, by analyzing the most common words of each topic.
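
As a rough sketch of how the most common words of each topic could be listed before choosing a label by hand: the function name `top_words`, the nested-dictionary input and the toy counts below are assumptions for illustration, not the format the project actually used.

```python
def top_words(topic_word_counts, n=10):
    """Return the n most frequent words for each topic.

    `topic_word_counts` maps a topic id to a {word: count} dictionary
    (a hypothetical layout). Ties are broken alphabetically.
    """
    return {
        topic: [
            word
            for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        ]
        for topic, counts in topic_word_counts.items()
    }


# Toy counts for two topics (made up for illustration).
toy_counts = {
    0: {"love": 120, "heart": 80, "baby": 95, "night": 40},
    1: {"road": 60, "home": 75, "town": 30},
}
candidates = top_words(toy_counts, n=2)
```

The resulting word lists are only a starting point; the descriptive label itself is still a manual judgment.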

Lastly, some statistics and visualizations were produced.



Some of the decisions made may not be optimal in this context, and part of the data set was created and analyzed manually, so the results may be inaccurate and may not correctly reflect reality.





Digital Humanities course website:
