Topic modeling is an analysis method in which you apply an algorithm to a large group of texts; by detecting which words often appear together, the algorithm attempts to tell you what topics the texts consist of. It's best used on large groups of texts, since it needs a large word count to work from to reliably see which words are used most frequently in proximity to each other. When you run a topic model analysis on a group of texts, the output is generally two things:
Most methods for topic modeling will let you customize how many words you expect to be in a topic and how many topics you expect to be in your document corpus. You can adjust these settings until you get a list of words per topic that seems to match a coherent subject. Topic modeling isn't very useful for documents with a high degree of standardization, like court documents or song lyrics. It also doesn't tell you a lot about a small group of documents, since without a large word count its results will be skewed. It's best when you want to put in a vast amount of information: all available letters to the editor in the New York Times in 1901, say, rather than a single day's letters to the editor from 1901. If you want to use topic modeling in a project, it's recommended that you do further reading on the statistical reasoning behind it, so you can be sure your results are significant.
For a good overview of topic modeling, see Ted Underwood's "Topic Modeling Made Just Simple Enough".
With the Voyant Topics function, you can upload your own materials or provide a web address for the documents you want it to create a topic model from. You can customize how many topics it will select (though the sliding scale doesn't let you just type in a number, which is inconvenient since it jumps right from 1 to 4 to 7), how many terms it will place in each topic, and how many iterations it will run through your corpus to refine the topics. You can also add or remove stopwords.
Voyant's topic modeling tool does have some disadvantages. It is limited in how precisely you can customize it. There is no visualization beyond a list of the words associated with each topic and a line tracking how prevalent that topic is throughout the corpus. The information about how much each topic is represented in each document in the corpus isn't available in a more quantitative way: you can only see the percentage for each document by hovering over each point in the trends line, and you have to move the mouse around until you find the one you want.
While this is interesting as a skimming tool, for a deeper dive you'd want something like MALLET or a programming language with text analysis components. But if you just want a general gist of what's contained in a large corpus, without needing to know which documents contain the majority of which topic, and don't want to go through too much additional programming to get it, this will work.
We'll be getting into this data more deeply when we use it within MALLET in the next module, but it's a collection of letters to the editor in a Chicago newspaper in 1913 and 1914, with each day's letter (or, a few times, letters) as its own separate text file. There are only 47 documents, but this is the kind of situation where you might want a topic modeler rather than just a word frequency count. Forty-seven letters would already be a lot to read through, and if this were 400 or 4,000 letters to the editor from a time period or paper you were interested in, you wouldn't want to read all of them; you'd want to see the broader trends in what letters the local paper was printing. A frequency count might tell you 'chief' got used many times, but not what the word was used in regard to. If you did a topic model, however, and 'chief' was included in a topic alongside words regarding law enforcement like 'crime', 'arrest', and 'police', you'd know it was being used to refer to a police chief and not a chief executive officer or fire chief, or as an adjective.
I found these letters to the editor using the Library of Congress's Chronicling America newspaper website. This site offers the advantage of displaying newspaper pages not just as images but also as OCR-ed text files (which I still recommend cleaning up), which makes them easier to analyze.
While Voyant's Topics tool is very basic, it can help you see an overview of not just the common words within your corpus but which ones appear together. However, it doesn't allow for a lot of customization and only offers simple visualizations. It's probably better for an initial or simple look than as the primary tool you use to topic model.
MALLET (MAchine Learning for LanguagE Toolkit) is a program created at the University of Massachusetts Amherst, designed to use an algorithm to analyze large collections of text to determine the topics discussed within them. Note that it will involve using the command prompt on your computer and changing some of your computer's settings, so you'll need to be somewhat comfortable with the command prompt and be on a machine where you can make those kinds of settings changes.
This is a more hands-on method of topic modeling, and it requires more processing at the end to give you a visual guide to use, but it will give you more detailed information on which topics appear the most in which sources.
Here are the Mac instructions:
You have now installed MALLET on your computer and added an environment variable, so MALLET is set up for you to access and use.
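As a minimal sketch, on a Mac that environment variable step usually amounts to adding MALLET's bin folder to your PATH in your shell profile; the install location below is an assumption, so adjust it to wherever you unzipped MALLET:

# assumed install location: ~/mallet-2.0.8
export PATH=$PATH:~/mallet-2.0.8/bin

After opening a new terminal window, typing mallet with no arguments should print MALLET's list of available commands, which confirms the variable took effect.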
The command prompt is a more direct way of using the programs on your computer. It strips out things like icons to click on or text boxes to fill out; you're simply telling your computer to access a folder and execute a command. It takes some getting used to but isn't all that hard.
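For example, here's all it takes to move into the MALLET folder and run the program (the folder name is an assumption; use whatever folder you unzipped MALLET into):

cd ~/mallet-2.0.8
./bin/mallet

The first line opens the MALLET folder, and the second runs MALLET with no arguments, which simply lists the commands it understands. On Windows, the equivalents would be cd C:\mallet and bin\mallet, assuming you unzipped MALLET to C:\mallet.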
There are sample data sets in MALLET's sample-data folder, but we'll be working with the data below: Letters to the Editor columns from a Chicago newspaper in 1913 and 1914. We'll use the topic modeling software to see which topics come up throughout the letters and whether they change over time.
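Before MALLET can model the letters, you import the folder of text files into MALLET's own format and then train a topic model on the result. Here's a minimal sketch of those two steps; the folder name letters_to_editor and the output file names are assumptions, so substitute your own:

bin/mallet import-dir --input letters_to_editor --output letters.mallet --keep-sequence --remove-stopwords
bin/mallet train-topics --input letters.mallet --num-topics 10 --output-topic-keys letters_keys.txt

The --keep-sequence option preserves the order of words within each document, and --remove-stopwords strips out very common words like 'the' and 'of' that would otherwise dominate every topic. --num-topics 10 asks for ten topics; that number is one of the settings you'll experiment with. The letters_keys.txt file is where lists of topic words like the ones below come from.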
There are overlaps in some of the words used in these topics, and some topics are clearer than others. In this output, the first number on each line is the topic's ID and the second is a weighting parameter; the rest of the line lists the topic's most prominent words. For instance:
0 5 judge office american state county election medical candidates owens candidate illinois law boss owned nelson elected lawrence board sheriff
Seems to be pretty clearly about elections and various local offices, but something like:
6 5 men read st street war put papers night letter av fellow justice chief car back cars reply white real
Is less clear.
Between the two topics below, there seems to be a lot of overlap in terms of words concerning newspapers:
5 5 day book editor chicago people don free class paper press present men letters mr workers find long called brought
7 5 time things advertising book club meet newspaper ago ing newspapers editorial store fear conclusion daily months win united thinking
To try to make sure the topics coming out of the topic model are ones you can interpret, there are some changes you can make to the MALLET commands. You can experiment with picking fewer topics in the hope that they will be narrower. You can also create a file showing which documents are composed of which topics, so you can better see what a topic could mean by reading documents with a high percentage of that topic. To do this, you'd add some more options to the default train-topics command; some of the options you can try out are below.
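Here's a sketch of what that looks like in practice (the file names are assumptions carried over from the earlier commands):

bin/mallet train-topics --input letters.mallet --num-topics 8 --optimize-interval 20 --output-topic-keys letters_keys.txt --output-doc-topics letters_composition.txt --output-state topic-state.gz --word-topic-counts-file letters_topic_counts.txt

Lowering --num-topics is the fewer-topics experiment mentioned above. --output-doc-topics writes the file showing how much of each document belongs to each topic, --output-state saves the full model as a .gz file, and --word-topic-counts-file writes the per-topic word counts you'll need for the Lexos visualization later. --optimize-interval lets MALLET give some topics more weight than others, which often makes the topics easier to interpret.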
The program you just ran had a few different outcomes: it determined a topic model for your documents, created a sheet showing which documents contained which topics, and created a .gz file containing all the information about word frequencies within the topics and the corpus.
This is only a small sample size, but from it you can see what topic modeling software can tell you about what is being discussed in a large group of documents. If this were your project and you found these results useful, you could then scale up your research and look at more years of the newspaper if you were interested in how topics changed over time. You could also compare a liberal and a conservative newspaper's letters to the editor, or letters from different geographic areas, if you were interested in how those letters overlapped in the topics they discussed.
MALLET will provide you with the words used in each topic it finds in your corpus, and the percentage that each document is associated with each topic, but it doesn't do so in a very visually interesting way. Nor is there a really easy way to see how frequent each word is within a topic. To visualize the information about the words used in a topic, you'll take one of the other files you created with your program, the topic counts text file, and feed it into Lexos, an open-source site created at Wheaton College, to generate a tag cloud for each topic. That's what you'll be doing in this tutorial.
To be clear, Lexos won't create the topics or do the analysis for you; it will just take the analysis you've done with MALLET and render it into an informative visualization to be included with your project.
If you did the last exercise, you'll already have this file saved to your MALLET folder; if you didn't, it's below.
When you run MALLET on a corpus, be sure that you've included the option telling it to create a topic-counts file; otherwise you won't be able to use Lexos to visualize the counts for the most popular words within a topic. The short list of words displayed in your console doesn't represent the full extent of the words within a topic.
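That option is --word-topic-counts-file, shown in the train-topics sketch above. Each line of the file it produces gives a word's index, the word itself, and a series of topic:count pairs; as a hypothetical illustration:

40 police 2:41 6:3

This would mean 'police' was assigned to topic 2 forty-one times and to topic 6 three times across the corpus, which is exactly the frequency information Lexos uses to size the words in each topic's cloud.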
If you are working with anything you're concerned about uploading directly to Lexos's website (though this is simply topic counts and so shouldn't contain anything proprietary), you can download an instance of Lexos to run on your own computer.
When you ran MALLET you created a text file containing the information about how much each document was composed of each topic. With very little work you can open that file in Excel or Google Sheets and get a general idea as to which topics are most popular and which go together, but it's not yet in a very friendly format for creating graphs. If you didn't run the last exercise and just want to see how a topic model file can be prepared for graphing, you can download the MALLET file with these counts below.
Now that you have your data better organized, you can make it into a chart to visualize what you are seeing.
This is a relatively small sample size, but you can see how, if you were looking to compare one year's worth of letters to the editor to another's, or one paper's to another's, this could help you look at a large mass of text and get a sense of what's being discussed in it in a more comprehensive way than a simple word frequency chart might.