The Hathitrust Digital Research Center allows you to analyze the 5.7 million volumes kept in Hathitrust's repository, which contains digitizations of the works in many research libraries. You can read the full content of a book remotely only if it is in the public domain or if your library is a member of the archive (sorry, ours is not), but you can use the research center's tools to analyze the data of books that you don't necessarily have full reading access to.
In this exercise you'll see how the different built-in functions of the Hathitrust Research Center can allow you to explore texts in the database from a distant-reading perspective. You'll create wordclouds that show word frequencies within a corpus, and produce a list of all places, times or characters mentioned within a work.
Hathitrust offers the advantage of presenting you with a ready-made collection of sources, along with the algorithms that can be used to process them. However, if you have a different, more modern collection of items that you want to analyze, you cannot upload your own materials for analysis with Hathitrust's algorithms. Keep this in mind while deciding whether this tool is right for your project.
Corpus is the technical term for the group of works that you are analyzing. You may want to look at all of one author's works, at all of the works in a given genre, or at the works that came out in a given time range and area. In this exercise you'll be creating a workset that consists of the works of Jane Austen. But first you'll need to create a collection at the Hathitrust Digital Library that includes the works you are interested in.
Hathitrust has analysis algorithms built into its research center so that users who aren't at partner institutions, or who want to work with texts whose full text isn't available for download, can still create visualizations and analyses.
Hathitrust has made a list of algorithms and their purposes to help you decide which one to use.
These are the words used most often in the Jane Austen novels you picked (excluding stopwords). The larger a word is, the more often it is mentioned. The color of a word doesn't mean anything; it only serves to make the words distinct from each other. If you hover over a word, its total count will appear in a box.
You'll see that across all novels, familial relationships are mentioned often (family, father, mother, sister), as are terms of rank and address such as captain, general, sir and lady. If you are familiar with Jane Austen, are all of the results as you'd expect, or are any words surprisingly missing or present?
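To make the idea behind the wordcloud concrete, here is a minimal Python sketch of counting word frequencies while skipping stopwords. It is not Hathitrust's own code: the stopword list, the function name and the sample sentence are all made up for illustration, and the real tool uses a much longer stopword list.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was", "that"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword appears, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Made-up sample text, not a quotation from any novel.
sample = "The family and the sister of the captain. The sister was a lady."
counts = word_frequencies(sample)

# The words with the highest counts are the ones a wordcloud would draw largest.
print(counts.most_common(3))
```

The count for each word is what appears in the hover box; the wordcloud simply scales the font size by that number.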
This algorithm uses the Stanford NLP (Natural Language Processing) model for entities. Basically, it applies a model trained to recognize names of people and places, along with expressions of time, money or percentages, then records which of them are used in the corpus and where. This search lets you pull the number of times that a given entity (location, person, time, date, monetary amount) is used within your corpus. This can be helpful if you want to map locations in a large group of works, if you want to see how often money is mentioned in novels in one era versus another, or if you want to generate a character name list to search for in another analysis of the text.
When the job completes, the output will look as follows. It is only a partial list, but you can download the whole thing by clicking the link to download entities.csv.
Once that is downloaded, open it in Excel or in Google Sheets. I'll be using Excel for the screenshots below, but both programs have similar functions even if they sometimes go by different names.
I'll just be using Excel's built-in sorting capabilities in this next part of the exercise. You can create better graphs in Tableau, so this is mainly to help you see the kind of data that you have.
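If you would rather script this step than sort by hand, the same kind of tally can be sketched in Python. This is only an illustration: the column names "type" and "entity" are assumptions, so check the header row of your own entities.csv and pass the real names in. The sample rows are stand-in data, not actual output.

```python
import csv
import io
from collections import Counter

def count_entities(csv_file, type_col="type", entity_col="entity"):
    """Tally rows per (type, entity) pair from an open CSV file.

    type_col and entity_col are assumed column names; adjust them to
    match the header row of the file you downloaded.
    """
    reader = csv.DictReader(csv_file)
    return Counter((row[type_col], row[entity_col]) for row in reader)

# Stand-in data for illustration. With a real download, use instead:
#   with open("entities.csv", newline="", encoding="utf-8") as f:
#       counts = count_entities(f)
sample = io.StringIO(
    "type,entity\n"
    "LOCATION,Bath\n"
    "PERSON,Elizabeth\n"
    "LOCATION,Bath\n"
    "LOCATION,London\n"
)
counts = count_entities(sample)

# Keep only locations and list them from most to least mentioned.
locations = {name: n for (kind, name), n in counts.items() if kind == "LOCATION"}
for name, n in sorted(locations.items(), key=lambda item: item[1], reverse=True):
    print(name, n)
```

Filtering on a different type value (for example people instead of locations) gives you the character name list mentioned earlier.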
Voyant is a web-based text analysis tool that can do multiple kinds of measurements and visualizations, from word clouds to graphs to network analysis charts. You'll explore how to use Voyant to find out more about the text file provided (Dracula), use Voyant's different settings to interpret the results, and export your findings as an image or as a webpage.
Voyant offers several options for how you can input the text that you are planning on using. You can upload a document to analyze, point it to a URL if you are interested in a webpage, or even use one of the two existing corpora (Shakespeare and Jane Austen) that it already has. In this case, you'll be uploading a plain text file of Dracula, provided below. If you have another work you're more interested in, please feel free to upload it instead, although note that it will need to be in plain text format (.txt). Project Gutenberg is a good place to look if you want an older work.
Once you load in a text, Voyant will take you to its default interface, which is a group of panels that each display a different visualization or chart of information for the text you've loaded in.
It's important to note that when tracking these characters you are only looking at mentions of their name as a term. You'd need some kind of entity recognition programming to track, say, Mina when she is referred to as 'she' instead of Mina. As an overview, though, the graph can give you avenues to investigate.
You can change the method of graphing and analysis at any time in the different boxes in Voyant. You can also change which terms are being displayed and some of their parameters. Suggestions for common alternatives appear as separate tabs at the top of each section. For example, in the top left panel, which shows a tag cloud by default, you can switch to a Links view, which draws lines between words that are used in conjunction with each other.
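The limitation is easy to see in a small Python sketch of what a trends-style count does. This is an approximation for illustration, not Voyant's implementation; the function name, the segment count and the sample sentence are all assumptions.

```python
import re

def mentions_by_segment(text, name, segments=10):
    """Split the text into equal runs of words and count exact matches of name in each."""
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, -(-len(words) // segments))  # ceiling division, at least 1
    target = name.lower()
    return [words[i:i + size].count(target) for i in range(0, len(words), size)]

# Made-up sample text, not a quotation from Dracula.
sample = "Mina wrote to Lucy. She waited. Mina read the letter she had kept."

# Only the literal string "mina" is counted; the two uses of "she" are not.
print(mentions_by_segment(sample, "Mina", segments=2))
```

Because the match is on the literal term, every 'she' that refers to Mina is invisible to the count, which is exactly the gap described above.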
Links: If you hover over a word, the connections it has to other words will appear in bold. The blue words are the keywords, the orange are the words used near them.
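As a rough picture of where those connections come from, the sketch below counts which words occur within a small window around a keyword. Voyant's own collocation measure may differ; the window size, function name and sample text here are assumptions chosen for illustration.

```python
import re
from collections import Counter

def collocates(text, keyword, window=2, stopwords=frozenset()):
    """Count words appearing within `window` words of each occurrence of keyword."""
    words = re.findall(r"[a-z']+", text.lower())
    keyword = keyword.lower()
    nearby = Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue
        # Words just before and just after this occurrence of the keyword.
        neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        nearby.update(w for w in neighbours if w not in stopwords and w != keyword)
    return nearby

# Made-up sample text for illustration.
sample = "The count smiled. The old count bowed, and the count smiled again."
links = collocates(sample, "count", window=1, stopwords={"the", "and"})
print(links.most_common())
```

In the terms of the Links view, "count" would be the blue keyword and the words returned here would be the orange words drawn around it.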
As previously mentioned, you aren't limited to the default options. If you want to see the other analysis or visualization tools that you can use in a panel, just hover at the top, and then click on the image of 4 boxes that is the second icon to the left.
You can use visualization tools, grid tools that display results for pure analysis, and, if you have multiple documents involved, tools that do analysis across that corpus.
You may have encountered the concept of stopwords elsewhere in text analysis. They are words like 'the, that, those, a' that are used so commonly that they throw off relative frequency counts, and that have relatively little significance to the text on their own. Voyant filters these words out automatically, but you can re-introduce them to the analysis in individual panels or in your view as a whole. Voyant is still counting these words, just not including them as objects of analysis in the visualizations.
Voyant can create visualizations of trends, patterns or other information within the work you want analyzed, and even more usefully it makes it easy for you to take those visualizations and make them permanently viewable. You may want to use this so you can create a bunch of visualizations to reference later while doing other work with your corpus, or you may want to make interactive examples that will be available to those reading your arguments within your project.
In the last module, you learned how Voyant can help you look at one document, but its real strength comes from its ability to map trends, connections and associations across large groups of text. In this example you'll upload several novels and see how the results differ when you look at a larger group of documents rather than just one.
The text files below are from a corpus of gothic literature that I've downloaded from Project Gutenberg. Since they represent the classic first or first modern appearances of some important horror and gothic archetypes and cover a large swath of time (nearly 100 years), I'll be analyzing them using Voyant to see what they have in common and how they differ. Note that they are all in the public domain, so using the web-based Voyant tool is okay. If you want to use Voyant for items that have copyright concerns or contain confidential information, you'll want to set up your own Voyant Server, which Voyant helpfully provides instructions for.
As when you uploaded a single document, the Voyant page is now showing you different panels that contain by default (starting from the top left, clockwise): a word cloud, the text being analyzed, a trends graph, keyword in context, and a summary. The difference from the previous tutorial is that when you uploaded one document, Voyant showed you the information for that one document; now it's showing you information for all the documents combined. The Reader has a different colored bar for the part of the text made up by each document, the x-axis of the Trends graph is made up of the different documents, and the Summary panel has the statistics for each document.
In this case since there is a chronological dimension to these documents, I've made sure the graph goes in chronological order by putting the year of each at the beginning of the file name. If you are working with documents that should be graphed in a certain order you can do the same.
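If you have many files to prepare, a short Python sketch can confirm that a year prefix really does put them in chronological order. The file names below are hypothetical examples of the naming pattern, not the actual files used in this exercise, and the function name is made up.

```python
import re

def sort_by_year_prefix(filenames):
    """Order file names by a leading four-digit year; names without one go last."""
    def year(name):
        match = re.match(r"(\d{4})", name)
        return int(match.group(1)) if match else float("inf")
    return sorted(filenames, key=year)

# Hypothetical file names following the "year first" convention.
files = ["1897_Dracula.txt", "1818_Frankenstein.txt", "1794_Udolpho.txt"]
for name in sort_by_year_prefix(files):
    print(name)
```

With four-digit years at the start of every name, an ordinary alphabetical sort gives the same order, which is why the renaming trick works.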
In the Summary section, you can see the most frequently used words, along with the unique words for each document. By moving the radio button marked items you can increase the number shown for each measure. Keep in mind that the vocabulary density isn't weighted by the size of the document, so documents with a longer word count will have less dense vocabularies. The most distinctive words for each document are often character names, but you can occasionally find surprises, such as the word feelings being used distinctively often in Frankenstein compared to all the other works.
You can change the visualization, graph or table that's in a given panel by clicking one of the options at the top, or by hovering until the window-shaped button appears and then clicking on it. Corpus Tools contains the visualizations or other tools that can be applied across all the documents that you've loaded in, rather than just one. With Document Tools you can use the tool on just one of the documents, the same as you were able to in the previous tutorial.
You can also limit other displays to a single document instead of the whole corpus.
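The effect of length on vocabulary density can be seen in a minimal Python sketch, assuming density means the number of unique words divided by the total number of words. The function name and the sample phrases are made up for illustration.

```python
import re

def vocabulary_density(text):
    """Return unique words divided by total words (0.0 for an empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

# Made-up sample phrases: the longer one mostly repeats words from the shorter one.
shorter = "the dark castle stood on the hill"
longer = shorter + " and the dark castle stood in the dark"

print(round(vocabulary_density(shorter), 3))
print(round(vocabulary_density(longer), 3))
```

The longer text scores lower because repeated words add to the total without adding to the unique count, so compare densities only between documents of similar length.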
Stopwords are words that Voyant will count but not include in its analyses because they are so commonly used as to be meaningless in helping you see the unique qualities of the work you are studying. You can alter Voyant's lists of Stopwords by adding or removing stopwords.
The word emily keeps appearing among the most frequently used words; however, when you look at the trends graph of its frequency, you see that it's only used in one work. If you want it taken out of the words being analyzed, you can add it as a stopword, either for one of your panels or across all of them.
In this case I'll be going to the Terms Radio panel, since including that word has made it show up as one of the bold lines across the whole corpus, and I want the bold lines to all be words that aren't names.
Instead of adding a new stopword, you may wish to remove one. Included among stopwords are pronouns, so if you want to get a sampling of words used around men and words used around women by checking the links to the words he and she, you're out of luck until you revise your stopwords to take out these pronouns.
In this example, we'll be using the Links tool, but if you'd rather see what happens in a different tool when you remove some of the stopwords, feel free to use that one instead.
You can right click on each and choose Centralize to just see the words used in context around that word. Are there any differences that you notice?
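Both kinds of stopword revision, adding a name and taking pronouns out, amount to simple set operations. The Python sketch below is an illustration only: the default list is a tiny made-up stand-in for Voyant's real list, and the function name is hypothetical.

```python
# A tiny stand-in for a default stopword list (an assumption for illustration).
DEFAULT_STOPWORDS = {"the", "a", "and", "of", "he", "she", "it"}

def revise_stopwords(stopwords, add=(), remove=()):
    """Return a new stopword set with some words added and others taken out."""
    added = {w.lower() for w in add}
    removed = {w.lower() for w in remove}
    return (set(stopwords) | added) - removed

# Filter out a character name that dominates the frequency counts.
names_filtered = revise_stopwords(DEFAULT_STOPWORDS, add=["emily"])

# Bring pronouns back so that links around "he" and "she" can be examined.
pronouns_kept = revise_stopwords(DEFAULT_STOPWORDS, remove=["he", "she"])

print(sorted(names_filtered))
print(sorted(pronouns_kept))
```

The original list is left unchanged, which mirrors the choice between revising stopwords for one panel or for all of them.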
There are other visualization items that you can use to explore and illustrate your findings across a corpus, though some of them need more investigation and use of statistical knowledge.
Sometimes you may be working with data that you don't want stored on Voyant's servers. Maybe you want to analyze a series of confidential interviews. Maybe the materials you are using are protected by copyright, and though it is perfectly okay for you to analyze them in this way (see court cases here), it is safer not to upload such materials to the internet, however temporarily. Or maybe you know that you'll be doing your analysis work when you can't connect to the internet reliably. In these cases you can download an instance of Voyant that does its processing on your computer rather than on Voyant's webservers. This lets you work with Voyant's analysis tools without the material you are analyzing ever leaving your own computer.
This is something you'll need to install on your own computer, since the library computers can't have anything installed on them. Check and make sure that you have enough free space on your hard drive. Unzipped, it's about 600 MB since it contains all of the same analysis tools available on the site, just kept on your computer.
The interface and panels are the same as described in the first and second modules; the processing just happens on your computer. You'll still be able to access internet content by pasting in a URL if you are connected to the internet. It is important to note, though, that you won't be able to use the export function to create an interactive embedded view of your data this way, only static images. You will still be able to use the export function to expand a panel into a new window.