
DAsH

Research Guide for DAsH (or digital humanities) resources and tools

Topic Modeling: an Intro

Topic modeling is an analysis method in which you apply an algorithm to a large group of texts; by detecting which words often appear together, the algorithm attempts to tell you what topics the texts consist of. It's best used on large groups of texts, since a larger word count gives the algorithm more to work from in determining which words are used most frequently in proximity to each other. When you run a topic model analysis on a group of texts, the output generally consists of two things:

  1. A list of words that the "topic" consists of. For instance, if you are looking at a group of magazine articles from the 1940s, you might see words like 'ration', 'nylon', 'garden', 'gas card' together in Topic 1, and 'batter', 'series', 'diamond', 'run' together in Topic 2, and be able to discern that the subject being discussed in Topic 1 is wartime rationing and in Topic 2 it's baseball. Note that the analysis will not spit out what the subject of the topic is for you; it just gives you the list of the words in it. If you've chosen the number of topics and number of words per topic well, you'll be able to use those words to figure out the subject being discussed in each topic. 
  2. A ranking of which documents contain which topic in which proportion. This can tell you about the documents, and something about the topic as well. Maybe you're not sure from context what Topic 2 is about, but then you look and see that magazine articles from October contain words from Topic 2 40% of the time, while in January it's only 10%. That suggests your idea that Topic 2 is about baseball is probably right, since by January baseball season is over. 

Most methods for topic modeling will let you customize how many words you expect to be in a topic and how many topics you expect to be in your document corpus. You can shuffle these settings around until you get a list of words per topic that seems to match a subject. Topic modeling isn't very useful for documents with a high degree of standardization, like court documents or song lyrics. It also doesn't tell you a lot about a small group of documents, since without a large word count its results will be skewed. It's best when you want to put in a vast amount of information, like all available letters to the editor in the New York Times in 1901, rather than one day's letters to the editor from 1901. It's recommended that you do further reading on the statistical reasoning behind topic modeling if you want to use it in a project, so you can be sure your results are significant.
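If you're curious what running such an algorithm looks like in code, here is a minimal sketch in Python using the gensim library (the choice of library and the toy word lists are illustrative assumptions, not something the tools in this guide require). It trains an LDA model, the same family of algorithm the tools below use, on the 1940s magazine example:

    from gensim import corpora
    from gensim.models import LdaModel

    # Toy stand-ins for the 1940s magazine example above; a real corpus
    # would be far larger, which is why topic modeling wants many texts.
    docs = [["ration", "nylon", "garden", "gas", "card", "ration"],
            ["batter", "series", "diamond", "run", "batter"]]

    dictionary = corpora.Dictionary(docs)              # word <-> id lookup
    corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words counts
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
                   passes=20, random_state=42)

    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)                  # output 1: words per topic
    print(lda.get_document_topics(corpus[0]))   # output 2: topic mix per doc

The two print statements correspond to the two outputs described above: the list of words per topic, and the proportion of each document assigned to each topic.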

For a good overview of topic modeling, see Ted Underwood's "Topic Modeling Made Just Simple Enough".

When to use Voyant?

With the Voyant Topics function you can upload your own materials or provide a web address for the documents you want it to create a topic model from. You can customize how many topics it will select (though the slider doesn't allow you to just type in a number, which is inconvenient since it jumps right from 1 to 4 to 7), how many terms it will place in each topic, and how many iterations it will run through your corpus to refine the topics. You can also add or remove stopwords. 

Voyant's topic modeling tool does have some disadvantages. It is limited in how precisely you can customize it. There is no visualization beyond a list of the words associated with each topic and a line tracking how prevalent that topic is throughout the corpus. The information about how much each topic is represented in each document isn't available in a more quantitative form: you can only see the percentage for each document by hovering over each point in the trends line, and you have to move the mouse around until you find the one you want.

While this is interesting as a skimming tool, for a deeper dive you'd want something like MALLET or a programming language with text analysis components. But if you just want a general gist of what is contained in a large corpus, don't need to know which documents contain the majority of which topic, and don't want to do much additional programming to get that, this will work.

Data

We'll be getting into this data more deeply when we use it within MALLET in the next module, but this is a collection of letters to the editor in a Chicago newspaper in 1913 and 1914, with each day's letter (or, a few times, letters) as its own separate text file. There are only 47 documents, but this is a situation where you might want to use a topic modeler rather than just a word frequency count. 47 is already a lot to read through, and if this were 400 or 4000 letters to the editor from a time period or paper you were interested in, you wouldn't want to read all of them; you'd want to see the broader trends in what letters the local paper was printing. While a frequency count might tell you 'chief' got used many times, you don't know what that word was used in reference to. However, if you did a topic model and it was included in a topic alongside words regarding law enforcement like 'crime', 'arrest' and 'police', you'd know that it was being used to refer to a police chief and not a chief executive officer or fire chief, or as an adjective.

I found these letters to the editor using the Library of Congress's Chronicling America newspaper website. This site offers the advantage of displaying newspaper pages not just as images but also as OCR-ed text files (which I still recommend cleaning up), which makes the pages easier to analyze.

Getting Started

  • Download the .zip file with the letters to the editor and extract it to your computer. 
  • Go to Voyant Tools and upload the folder into Voyant. The default interface will appear. If you are not familiar with Voyant's interface, please consult the tutorial for Voyant on this page
  • Click on the options icon (the circle on a line that appears when you hover at the top of a panel) and choose Corpus Tools > Topics
  • Move the topics slider to 7 and the terms to 20. I get the results below. You will probably get something similar, but there is a random element to how the topic analysis works, so your topics may be in a different order than the ones in my screenshot below.
    [Image: Topic modeling display in Voyant; each line shows a different set of terms that are used close together]
  • Since the newspaper's file names are in numerical order, the trends line does actually represent the rise and fall of each topic, and if you hover over each spot it will show you the document name and the percentage of that document made up of that topic. However, while some of these topics seem easy to interpret, like the 2nd and 3rd ones down pertaining to elections, others, like the first one, seem vaguer.
  • To better see what topic 1 may be about, I hover over the highest point in the Scores line and see which document contains a high percentage of that topic.
  • This tells me that the May 25th letter to the editor contains a lot of topic 1. Use the documents pane to scroll to and read that document. In this case it's about a reader starting a club with fellow readers where they discuss the media. Check whether any of the other letters scoring highly on this topic also have content about media or citizen groups. If you're not sure about a topic, you can always check which documents score as having a high percentage of that topic and see what they seem to be discussing.  
  • Play with different numbers of topics or terms per topic and see whether that makes the topics seem any more clearly defined. 

While Voyant's Topics tool is very basic, it can help you see an overview of not just the common words within your corpus but which ones appear together. However, it doesn't allow for a lot of customization and only offers simple visualizations. It's probably better for an initial or simple look than as the primary tool you use to topic model.

What is MALLET?

MALLET (MAchine Learning for LanguagE Toolkit) is a program created at UMass Amherst designed to use an algorithm to analyze large collections of text to determine the topics discussed within them. You should note that using it will involve the command prompt on your computer and changing some of your computer's settings, so you'll need to be somewhat comfortable with the command prompt and be on a machine where you can make those kinds of settings changes. 

This is a more hands-on method of using topic modeling, and requires more processing at the end to give you a visual guide to use, but will give you more detailed information on which topics appear the most in what sources. 

Getting Started on PC

Here are the Mac Instructions 

  • Go to MALLET's webpage at UMASS and click on Download.
  • Make sure that you have the Java Development Kit (JDK) installed on your computer. This isn't the same as the default Java that is installed on most computers, so you'll most likely need to install it. 
  • Extract the zip file you got from the MALLET website straight to your C: directory. Rename the folder mallet so that it will be easier to find while operating in the command prompt window.
  • You'll need to go to the Control Panel and add what is called an environment variable for MALLET to use when it's running on your computer. Basically, the people writing MALLET's program didn't know exactly where everyone using it would install it, so they made a variable in the code called MALLET_HOME to stand in for the folder where users put the MALLET code, which the program uses to access the tools and data it needs. By adding an environment variable on your computer that says MALLET_HOME is C:\mallet (where you installed the program), you're making it so that when the program reaches that variable in the code, it knows where to look for the tools and data it needs.
  • Go to Control Panel (it's accessible from your Start Menu), then to System, Advanced System settings. Click on the Environment Variables button, and then choose New
  • Type in the Variable Name as MALLET_HOME and Variable Value as C:\mallet then click on OK.
    [Image: the New User Variable dialog with the name and value described above filled in]
  • When it takes you back to the System Settings window, you should now see the environment variable you've created listed among any others created by programs you've installed on your computer. 
    [Image: the User Variables window, with the variable you just added included among the others]

You have now installed MALLET on your computer and added an environment variable, so MALLET is set for you to access and use. 

Using the Command Prompt to access MALLET

The command prompt is a more direct way of using the programs on your computer. It strips out things like icons to click on or text boxes to fill out; you're simply telling your computer to access a folder and execute a command. It takes some getting used to but isn't all that hard. 

  • Access the Command Prompt by typing that into your search bar or by going to Accessories > Command Prompt in your start menu, and it will open a window.
  • It opens by default to your user folder, so you'll need to navigate up to the C drive and then to your mallet folder. You'll start by using change directory or cd .. 
  • Type in cd .. and press enter, which will move you up one level, then type it once more to get to C:\>. Then type in cd mallet, which will move you into the mallet folder. 
    [Image: after typing in the proper commands, the command prompt window now starts with C:\mallet>]
  • Type in bin\mallet and press enter. If everything is right, you'll be given a list of commands that you can give MALLET.
    [Image: the list of commands for MALLET]

Giving Commands to MALLET

  • First, you need to make sure that you are accessing the part of MALLET that takes commands by adding bin\mallet to the beginning of whatever command you are building. 
  • You know that what you eventually want to do is import the directory that contains the documents you want analyzed for topics, which is import-dir according to the help menu. If you want to see what options are available with that command, type the command followed by --help.
  • Type in bin\mallet import-dir --help into the command prompt and press enter.
  • It will list the different options you can use with the command import-dir, like removing stopwords, telling it what to save the output file as, and so on.

There are sample data sets in the sample-data folder in MALLET, but we'll be working with the data below: Letters to the Editor columns from a Chicago newspaper in 1913 and 1914. We'll use the topic modeling software to see which topics appear throughout the letters and whether they change over time. 

Data

Importing Data

  • Take the data above and extract it into your mallet folder, in a sub-folder named LettersToTheEditor.
  • In the command prompt window type in bin\mallet import-dir --input LettersToTheEditor --output letters.mallet --keep-sequence --remove-stopwords and hit enter to execute the command.
  • This tells the MALLET program to import the directory that has your Letters to the Editor files, keep the words in sequence, remove the stopwords, and convert everything to a mallet file, which reduces each document to a grab bag of words while preserving the order they're in, including the file order.
  • If you look, you'll notice a new file has been created within your mallet folder called letters.mallet. (If you'd rather script this step than retype it, see the sketch below.)
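For reference, the same import step can be run from Python with the standard subprocess module. This is only a sketch, assuming the C:\mallet layout used above; mallet.bat is the Windows launcher that ships in MALLET's bin folder:

    import subprocess

    # Run MALLET's import-dir step; paths assume MALLET was extracted
    # to C:\mallet with the letters in C:\mallet\LettersToTheEditor.
    subprocess.run(
        [r"C:\mallet\bin\mallet.bat", "import-dir",
         "--input", "LettersToTheEditor",
         "--output", "letters.mallet",
         "--keep-sequence", "--remove-stopwords"],
        cwd=r"C:\mallet", check=True)   # check=True raises if MALLET errors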

Training Your Topic Model

  • Now that you've simplified the different works in your corpus into a mallet file, you'll tell MALLET to train on that file to find topics within your documents.
  • You make the input the letters.mallet file you just created and tell MALLET to use it to create a model of what topics are involved in the files in the LettersToTheEditor folder. The command for this is train-topics.
  • Type in bin\mallet train-topics --input letters.mallet and press enter.
  • There will be a pause and a flurry of text onto the screen. It's printing out the topics it has found along with the words in each topic. Since you gave it only the minimum options, it used the default of creating 10 topics and didn't create any additional files containing information on which topics are in which document. The topics I received look like the below; however, there is a random element involved in creating these topics, so if you're following along yours may vary slightly. 

[Image: the console output, labeled 0-9, showing 10 topics made up of various words]
Some of the words used in these topics overlap, and some topics are clearer than others. For instance:
0       5       judge office american state county election medical candidates owens candidate illinois law boss owned nelson elected lawrence board sheriff

Seems to be pretty clearly about elections and various local offices, but something like:
6       5       men read st street war put papers night letter av fellow justice chief car back cars reply white real

Is less clear. 

Between the two topics below, there seems to be a lot of overlap in terms of words concerning newspapers
5       5       day book editor chicago people don free class paper press present men letters mr workers find long called brought
7       5       time things advertising book club meet newspaper ago ing newspapers editorial store fear conclusion daily months win united thinking

To try to make sure the topics coming out of the topic model are ones you can interpret, there are some changes you can make to the MALLET commands. You can experiment with picking fewer topics in the hope that they will be narrower. You can create a file showing which documents are composed of which topics, so you can better see what the topics could mean by reading documents with a high percentage of each topic. To do this, you'd add some more options to the default train-topics command; some of the options you can try are below.

Altering Topic Model Options

  • You can see the options available by typing in bin\mallet train-topics --help and pressing enter
  • A list of options you can add to your command to train the topic model will come up. Each entry lists what the option means, the input it can take from you (decimal? integer? text? filename?), and what the default value is. The ones we'll be using are:
    • --num-topics In this case, you'll be reducing the number of topics from the default of 10 and seeing whether a smaller number like 5 yields better results.
    • --optimize-interval This tells the program how often, in iterations, to re-optimize its parameters as it runs, so it keeps refining what the topics are. We'll give it the value of 20.
    • --output-state This tells it to write a file containing all the information about what the topics are, where they are within the documents, and their overall frequency. The input you give it is the name that you want to give the file. It's a .gz file, which we'll get into later. We'll use the value letters5-state.gz.
    • --random-seed Since the program has a random element, it starts from the seed of a random number each time it executes, which means you won't be able to completely reproduce any given run. If you give it a seed number, someone else can run the program on your sample and get the same results, and if you want to run it the same way but just change a file name or topic number, you can do that too. We'll use the seed 42.
    • --output-topic-keys The information spit out into the command prompt window about the words in each topic can be saved to a text file. The input you put in here is the name of the file, in our example letters_5keys.txt
    • --output-doc-topics The information about what percentage of each document contains terms from each topic is stored here, and you can open it later in Excel if you want to graph the information. This can also be useful if you're not sure what a topic is: if there is a document that is high in a certain topic, skimming it could give you an idea of what the topic is. We'll be giving it the name letters_5composition.txt
    • --word-topic-counts-file Creates a text file with the counts of each word within each topic. This will be useful for graphing in some programs, and we'll give it the name letters_5topics_counts.txt
  • To run your topic model again with these configurations added, go to the command prompt and type: bin\mallet train-topics --input letters.mallet --num-topics 5 --optimize-interval 20 --output-state letters5-state.gz --random-seed 42 --output-topic-keys letters_5keys.txt --output-doc-topics letters_5composition.txt --word-topic-counts-file letters_5topics_counts.txt
    Press enter to execute it. (A scripted version of this command appears in the sketch after this list.)
  • As the results start being printed to the console, you'll see these topics are a bit more interpretable. Here's the readout of the five topics:
    0       0.07488 advertising club things newspaper fear white chief man time society carried father thinking giving back happiness demand mark crime
    Topic seems to be about advertising and the emotions associated with it, but is a little vague.
    1       0.0766  workers work officials butte members history employers fact priest organization europe france revolution labor company book miners gerente politicians
    Topic seems to be about labor organizations and how they fit in with other power structures
    2       0.16012 man years public girls life love time children live schools laws boys business good habit woman body human editor
    Topic is a bit clearer than others: it's about civic society and how children and adults fit within it. 
    3       0.11452 judge office editor american chicago state county candidates election medical public owens labor day letter candidate illinois city law
    Topic has words to discuss city officials, office holders, elections and the law. 
    4       0.48976 day book editor letter men people don chicago make free good give read money workers kind class street war
    This topic could be about the newspaper itself, its readers, and who gains from its publication? This one seems a bit more ambiguous.
  • The results will show up in the console but will also be in those text files that you've just created. You can use the information in them to either support or disprove the hypotheses you've developed about what these topics mean.
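As with the import step, the full training command can be scripted so a run is easy to reproduce or tweak. This is a sketch, again assuming the C:\mallet layout and the letters.mallet file created earlier:

    import subprocess

    # Run MALLET's train-topics step with the options discussed above;
    # the fixed --random-seed 42 makes the run reproducible.
    subprocess.run(
        [r"C:\mallet\bin\mallet.bat", "train-topics",
         "--input", "letters.mallet",
         "--num-topics", "5",
         "--optimize-interval", "20",
         "--output-state", "letters5-state.gz",
         "--random-seed", "42",
         "--output-topic-keys", "letters_5keys.txt",
         "--output-doc-topics", "letters_5composition.txt",
         "--word-topic-counts-file", "letters_5topics_counts.txt"],
        cwd=r"C:\mallet", check=True)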

Reading Results

The program you just ran had a few different outcomes: it determined a topic model for your documents, created a sheet showing which documents contain which topics, and created a .gz file that contains all the information about word frequencies within the topics and the corpus. 

  • You'll find the topics in the txt file that you typed in after --output-topic-keys, in this case letters_5keys.txt. Navigate to that file in Explorer and right-click to open it in Notepad. Type below each topic what subject you think this group of words might be used to discuss.
  • You'll see the counts of each word within the topics in the file letters_5topics_counts.txt. If you right-click and open it, you'll see that after each word there's a number (0-4), which is the topic, then a colon, and then the count for that word within that topic. This isn't all that useful to you right now, but in the next module I'll show you how to use the program Lexos to turn it into a tag cloud.
  • You'll find the distribution of topics in each document in the text file that you typed in after --output-doc-topics, in this case letters_5composition.txt. Right-click on it in Explorer and open it in Excel. I'll be using screenshots of this program, but the functions I'm going to use are also available in Google Sheets.
  • The sheet will contain a list of document names, and the topics that are associated with them in order of how much that document is associated with that topic. This is a good way to both see which topics are the most popular at different points and to check to make sure you have the right guess about what those bags of words really mean in terms of the topics of discussion in the letters.
    [Image: an Excel document listing file names and the percentage each topic is associated with them]
  • For instance, you'll see that the July 26, 1913 letter is very associated with topic 4, which seemed to be about the newspaper itself, its readers, and who gains from its publication, but had a meaning harder to pin down from the words associated with the topic. When you look at the letter with this high percentage, it's about a man whose wife was injured by a street car and who is looking for advice on what to do about it after the officers did not arrest the operator.
  • When you look at the other letters highly associated with Topic 4, May 11, 1914 and June 12, 1914, they are also people writing in about a specific incident they want to let people know about through the letters column: a crooked-seeming employer and a stall that sold faulty meat. So it seems like this topic features highly in letters where people treat the letters to the editor column as a method for informing other people of personal or local issues rather than as a method to comment on news stories.
  • You can also use this trick to confirm items that you think have a clearly understandable topic from the words associated with them, like Topic 1 (the one that seemed to be about labor), which has its highest percentage in a July 8, 1914 letter about a suppressed miners' strike. 
  • When your data is in a spreadsheet program, you can use filters to see when there is a more even split, where a letter consists of two topics close to equally. Since this is a relatively small corpus (47 documents), this can be something you just eyeball rather than utilizing some other kind of analysis. (A scripted version of this filtering is sketched after this list.)
    • First, add headers to make clear what data you're looking at: make column A Doc #, column B Doc name, column C 1st ranked topic, D 1st ranked topic %, E 2nd ranked topic, F 2nd ranked topic %, G 3rd ranked topic, and H 3rd ranked topic %. I didn't proceed beyond that, since the 4th and 5th ranked topics do not have a very high percentage, so any results in those columns probably just consist of words that overlap between topics. 
    • If you want to see just the letters where the 1st ranked topic is 75% or more, you can use Sort and Filter: click on Filter, choose Number Filters, select Greater Than Or Equal To, and type in .75
    • Choose Sort, check that your data has headers, and sort by the 1st ranked topic % column. After it is sorted, it will show you that Topic 4 is most commonly the topic that a letter primarily consists of.
    • Let's say instead you're interested in letters that are divided between two different topics, and you want to see which combinations of two topics occur together a lot. You can change your filter for the 1st ranked topic % to be between .40 and .70 and then sort the 2nd ranked topic % from largest to smallest, so the letters with the most even divide between the first and second topics will be at the top of the list.
    • The ones with the more even divide (2nd rank at .40 or above) include topic 4, which from the keywords looks like language about using the letter to the editor to do freelance news reporting or advocacy; it appears to be the topic most likely to be combined near-equally with one of the other topics. You can always use the file name to check back in on any of these letters and see if the topic areas are easy to figure out, or if you want to change the parameters to try to find more defined topics. 
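The same filtering can be done in a few lines of pandas. This is a sketch assuming the ranked-pairs layout described above (doc #, file name, then alternating topic-number and proportion columns for the five topics); adjust it if your MALLET version writes the columns differently:

    import pandas as pd

    # Column names matching the headers added in the spreadsheet above
    cols = ["Doc #", "Doc name"]
    for i in range(1, 6):
        cols += [f"Rank {i} topic", f"Rank {i} %"]

    raw = pd.read_csv("letters_5composition.txt", sep="\t",
                      header=None, comment="#")
    raw = raw.dropna(axis=1, how="all")   # drop any trailing empty column
    raw.columns = cols                    # adjust if the column count differs

    # Letters dominated by a single topic (>= 75%), like the number filter
    dominated = raw[raw["Rank 1 %"] >= 0.75]
    print(dominated["Rank 1 topic"].value_counts())

    # Letters split fairly evenly between their top two topics
    split = raw[raw["Rank 1 %"].between(0.40, 0.70)]
    print(split.sort_values("Rank 2 %", ascending=False)
               [["Doc name", "Rank 1 topic", "Rank 2 topic", "Rank 2 %"]])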

This is only a small sample, but from it you can see what topic modeling software can tell you about what is being discussed in a large group of documents. If this were your project and you found these results useful, you could then scale up your research and look at more years of the newspaper if you were interested in how topics changed over time. You could also compare a liberal and a conservative newspaper's letters to the editor, or those from different geographic areas, if you were interested in how those letters overlapped in the topics that they discussed. 

Learning Goals

MALLET will provide you with the words used in each topic it finds in your corpus, and the percentage that each document is associated with each topic, but it doesn't do so in a very visually interesting way. Nor is there a really easy way to see how frequent each word is within a topic. To visualize the information about the words used in a topic, you'll take one of the other files that you created with your program, the topic counts text file, and feed it into Lexos, an open-source site created by Wheaton College, to generate a tag cloud for each topic. In this tutorial, that's what you'll be doing. 

To be clear, Lexos won't create the topics or do the analysis for you; it will just take the analysis that you've done with MALLET and render it into an informative visualization to be included with your project. 

Data

If you did the last exercise, you'll already have this file saved to your mallet folder; if you didn't, it's below. 

Getting Started

When you run MALLET on a corpus, be sure that you've included a command telling it to create a topic-counts file; otherwise you won't be able to use Lexos to visualize the counts for the most popular words within a topic. The list of words displayed in your console doesn't represent the full extent of the words within a topic.

  • Go to Lexos's website; it opens by default onto the upload page.
  • From the menu that opens when you click Visualize at the top of the page, choose Multicloud
  • A page will load containing a bar that says Document Cloud, click on that bar to change it to Topic Cloud. The page will change its configuration to prompt you to upload a MALLET topic file.
  • Click on Upload File and then navigate to where your topic counts file, letters_5topics_counts.txt, is and select it. Its name will appear below the Upload File button when it's set. 
  • Click on Get Graphs, and click OK on the Warning where it tells you that it might take a while. If you are loading in a large file, it will take a lot longer -  but these were short letters so it's a relatively small word count. 
  • When the file finishes loading you'll have a series of images that consist of the words associated with a topic. The words in a larger font occur with more frequency within that topic. 
    [Image: the tag clouds for each topic]
  • You can take a screenshot of these and include them in your paper or presentation of a project using topic modeling. You might also do this earlier in your research process to better visualize the results of your topic model and help you decide whether the way you've configured it is finding intelligible topics. While you're on the site itself, hovering over any word will give you its word count within its topic.

If you are working with anything that you're concerned about uploading directly to Lexos's website (though this is simply topic counts and so shouldn't contain anything proprietary), you can download an instance of Lexos to run on your computer.
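You can also skip the upload entirely and read the counts file yourself. Here is a sketch that parses it using the layout described earlier (word index, the word, then topic:count pairs); the file name matches the one created above:

    from collections import defaultdict

    # topic id -> {word: count}, built from MALLET's word-topic counts file
    topic_words = defaultdict(dict)
    with open("letters_5topics_counts.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:       # skip any blank or malformed lines
                continue
            word = parts[1]
            for pair in parts[2:]:
                topic, count = pair.split(":")
                topic_words[int(topic)][word] = int(count)

    # Ten most frequent words in topic 4
    top = sorted(topic_words[4].items(), key=lambda kv: kv[1], reverse=True)
    print(top[:10])

Most word-cloud libraries will accept a word-to-frequency dictionary like topic_words[4] if you want to generate the clouds offline instead.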

Preparing Your Data

When you ran MALLET you created a text file containing the information about how much each document was composed of each topic. Though with very little work you can look at that file in Excel or Google Sheets and get a general idea of which topics are most popular and which go together, it's not yet in a very friendly format for creating graphs. If you didn't do the last exercise and just want to see how a topic model file can be prepared for graphing, you can download the file with these counts below.

  • Right-click on the file where you've downloaded it and open it in Notepad++ (you can download it from their site if you don't have it). It should automatically open with the columns clearly defined, like so
    [Image: a sheet with clearly defined columns, though it is a text file]
  • Highlight all the data and copy it. Open a new Google Sheets document in your Google drive and paste the data in. 
  • There's no really good way to do this, since the data has been spit out in a way that isn't all that conducive to graphing, but I recommend inserting 5 new columns into your sheet and calling them Topic 0, Topic 1, Topic 2, Topic 3 and Topic 4. Leave their rows blank for now. Rename the first 2 columns Doc # and File name.
  • Rename the next columns, 1st ranked topic, 1st ranked %, 2nd ranked topic, 2nd ranked % and so on to 5th ranked topic, 5th ranked %
  • Because the data that you received is set up with the topic # in one column, and the topic percentage in the one next to it, you'll need to move the percentage manually into the column that corresponds with its topic number. Fortunately, by using sort, you can do this systematically.
  • Select the whole sheet, go to the Data tab, and select Sort Range. Use the menu to indicate that your data has header rows and that you want to sort by 1st ranked topic, A-Z.
  • Copy and paste the values in 1st ranked % below their corresponding topics according to 1st ranked topic. (If 1st ranked topic says 0, paste the values in 1st ranked % below Topic 0; if it says 1, paste them into Topic 1.)
  • Hide the columns 1st ranked topic and 1st ranked %, then do the same with the next columns, 2nd ranked topic and 2nd ranked %, and so on until the proportions are organized by topic for each file. Make sure that you're always sorting the whole sheet by these columns when you do this. It's a bit of a pain, but it doesn't take more than 5 minutes. (If you'd rather script it, a pandas sketch after this list automates the whole reshaping.)
  • Once you have it all sorted out and it looks something like this, you'll be ready to move on to adding a column for the date, since you'll want to be graphing the topic proportions by date. 
    [Image: the sheet with percentages filled in for each topic]
  • Insert a new column to the right of the File name column and name it Date.
  • The date is in your file name column so to add the date you can just use the RIGHT formula to pull a certain number of characters from the right side of the string value that is in your File name column. It'll look like =RIGHT(B2,14)
    [Image: the preview of what the RIGHT function does]
  • When you've confirmed the formula works, take that same formula and paste it down the rest of the Date column. It should pull the date portion of each file name.
  • Highlight that whole column and copy it. Then choose Edit > Paste Special > Paste Values only. This way, instead of the formulas being in that column, it'll just be the text.
  • Highlight the column and press Ctrl+H to open Find and Replace. Tell it to Find .txt and leave the Replace with field blank. Click Replace All and the column will now have .txt taken off the ends of the file names you copied over.
  • Next, do another replace, replacing each underscore with a /.
  • Leave the column highlighted and go to Format > Number and choose Date
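All of this manual sorting, pasting, and date cleanup can be automated. Here is a pandas sketch that does the same reshaping, assuming the ranked-pairs layout and file names that end in MM_DD_YYYY.txt (which is what the =RIGHT(B2,14) trick above relies on):

    import pandas as pd

    raw = pd.read_csv("letters_5composition.txt", sep="\t",
                      header=None, comment="#")

    records = []
    for _, row in raw.iterrows():
        rec = {"Doc #": row[0],
               # last 14 characters are MM_DD_YYYY.txt; strip the .txt
               "Date": pd.to_datetime(str(row[1])[-14:-4],
                                      format="%m_%d_%Y")}
        for c in range(2, 12, 2):          # 5 topic/proportion pairs
            rec[f"Topic {int(row[c])}"] = row[c + 1]
        records.append(rec)

    tidy = pd.DataFrame(records).sort_values("Date")
    tidy.to_csv("letters_5_tidy.csv", index=False)   # one column per topic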

Creating a Stacked Bar Chart in Sheets

Now that you have your data better organized, you can make it into a chart to visualize what you are seeing. 

  • Change the topic headers to match what the topics seem to mean. You may have a different idea of what the topics are based on the Lexos diagram that you made, but I'll be going with: 
    • Topic 0 proportion - Advertising & Crime
    • Topic 1 proportion - Labor 
    • Topic 2 proportion - Children & Family
    • Topic 3 proportion - Officials & Elections
    • Topic 4 proportion - Letter to editor as citizen platform
  • Since each letter is a composition of several different topics, if we want to model how much each letter is made up of each topic over time, what we'd want is a stacked bar chart. Each bar will represent a letter, and each topic percentage is a smaller or larger amount of that bar depending on what percent of the letter uses words within that topic group.
  • Highlight the columns C through H to tell it that these are the columns you want to use to make a chart.
  • Click the Insert tab, then click on Chart. Select 100% Stacked Column chart.
  • At first the bars will be too small for you to see much of anything. That's because the chart is indicating all the dates on the gridline, rather than just the ones we have letters for. The newspaper didn't have a letter to the editor printed every day, so to make the chart legible, you'll need to click on the box that says Aggregate column C.
  • Click on the title and change it to Topics per letter
    [Image: a stacked bar chart]
  • From this a few things are apparent. The language used by the letter writers to express their intention to use the column as a platform for citizen advocacy or journalism makes up over 25% of the letter (usually far more) in almost all cases. 
  • You can see that letters about children and the family increase towards the later section of the chart, as do ones about officials and elections. Since it's organized by date, you can look at events occurring in Chicago at this time and see if there was a news event that might have motivated this shift.
  • You'll want to add some text to the bottom (X) axis clarifying that the dates aren't spaced to scale. (For a scripted version of this chart, see the sketch below.)
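If you'd rather produce the same kind of chart in code, here is a matplotlib sketch that reads the tidy CSV written by the earlier pandas sketch (the file name and topic labels are the ones assumed above):

    import pandas as pd
    import matplotlib.pyplot as plt

    labels = {"Topic 0": "Advertising & Crime",
              "Topic 1": "Labor",
              "Topic 2": "Children & Family",
              "Topic 3": "Officials & Elections",
              "Topic 4": "Letter as citizen platform"}
    tidy = pd.read_csv("letters_5_tidy.csv", parse_dates=["Date"])
    plot_df = tidy.sort_values("Date").rename(columns=labels)

    # One categorical bar per letter, stacked by topic proportion; as in
    # the Sheets chart, the dates are evenly spaced rather than to scale.
    ax = plot_df.plot(x="Date", y=list(labels.values()),
                      kind="bar", stacked=True, figsize=(12, 5), width=0.9)
    ax.set_title("Topics per letter")
    ax.set_ylabel("Proportion of letter")
    ax.set_xticklabels(plot_df["Date"].dt.strftime("%m/%d/%Y"), rotation=90)
    plt.tight_layout()
    plt.show()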

This is a relatively small sample, but you can see how, if you were looking to compare one year's worth of letters to the editor to another, or one paper's to another's, this could help you look at a large mass of text and get a sense of what's being discussed in it in a more comprehensive way than a simple word frequency chart might.