DAsH

Research Guide for DAsH (or digital humanities) resources and tools

Learning Goals

The HathiTrust Research Center lets you analyze the 5.7 million volumes kept in HathiTrust's repository, which contains digitized works from many research libraries. You can only read the full text of a book remotely if it is in the public domain or if your library is a member of the archive (sorry, ours is not), but you can use the Research Center's tools to analyze data from books that you don't have full reading access to.

In this exercise you'll see how the different built-in functions of the Hathitrust Research Center can allow you to explore texts in the database from a distant-reading perspective. You'll create wordclouds that show word frequencies within a corpus, and produce a list of all places, times or characters mentioned within a work.

HathiTrust offers the advantage of presenting you with a ready-made collection of sources, along with the algorithms that can be used to process them. But if you have a different, more modern collection of items that you want to analyze, you cannot upload your own materials for analysis with HathiTrust's algorithms. Keep this in mind when deciding whether this tool is right for your project.

Getting Started

  • Go to analytics.hathitrust.org and choose Sign Up at the top of the page. Pick a username and password, and enter your email address so that a confirmation email can be sent to you.
  • Once you have signed up and confirmed your account, you can sign in using the Sign In link at the top of the page.

Creating Your Corpus: Making a Collection

A corpus is the technical term for the group of works that you are analyzing. You might want to look at all of one author's works, at all of the works in a given genre, or at the works published in a given time range and area. You'll be creating a workset that consists of the works of Jane Austen. But first you'll need to create a collection that includes the works you are interested in at the HathiTrust Digital Library.

  • Go to HathiTrust Digital Library and click Log In.
  • Click See options to Login as Guest. Since Manhattan College is not a HathiTrust partner institution, you'll need to use a different login here.
  • Click Login With Google and choose your Manhattan College account.
  • Next, go up to the search bar and choose Advanced Full Text Search, type in Austen, Jane, choose Author, and click Search.
  • When the search results appear, click Author: Austen, Jane 1775-1817 on the left-hand side. Use the dropdown to change the number of items on a page to 100. Locate each of Jane Austen's novels, and put a check next to the version that consists of just one volume.
    The novels of Jane Austen are: 
    Sense and Sensibility (1811)
    Pride and Prejudice (1813)
    Mansfield Park (1814)
    Emma (1815)
    Northanger Abbey (1818, posthumous)
    Persuasion (1818, posthumous)
    Lady Susan (1871, posthumous)
  • When you've selected one for each, scroll back up to the top, click on the dropdown that says Select Collection, choose CREATE NEW COLLECTION, and then choose Add Selected.
  • In the window that pops up, give your collection the name JaneAustenNovels and the description The novels of Jane Austen. You can make your collection Public or Private. Public collections are helpful to other researchers, because people whose research area overlaps with yours don't have to reinvent the wheel. Set yours to private though, so that you'll be able to find it more easily when it comes time to analyze.
  • Click on Save Changes and a notification will appear at the top that items were added to your JaneAustenNovels collection. Click on the link within that notification and you'll go to your collection. You can also click on the My Collections link at the top of the page to do so. 
  • On the page for JaneAustenNovels, you'll see all the novels you've selected listed. Make sure that you only have seven novels and that their titles don't mention introductions or other extra materials. If possible, since you don't have an opportunity to further edit the text before you analyze, you want to only have the text of the novels themselves, not critical works about them too. 
  • If you have added an extra novel, you can just click on the box next to it and choose Remove Selected. If you realize you missed one, you can just do a search, check its box, scroll to the top, select JaneAustenNovels from the collection dropdown at the top of the search results and add it. 
  • In order to turn this collection into a workset, you first need to download its metadata.
  • You can do this with the Download Metadata button. You'll receive the metadata as a text file, which you can then upload into the Workset Creator.

Creating a Corpus: Uploading Workset

  • Log into your account on the HathiTrust Research Center - analytics.hathitrust.org - and click on Worksets.
  • Choose Create Workset. For the name and description, use the same ones you gave your collection.
  • For File, click on Choose File and then select the text file you just downloaded containing the metadata for your collection.
  • Click on Create Workset.
  • It will bring you to a screen showing your workset, which you can now analyze.

Visualizing with Algorithms: Tag Cloud

HathiTrust has analysis algorithms built into its research center so that users who aren't at partner institutions, or who want to create visualizations and analyses for works whose full text isn't available for download, can still analyze them.

HathiTrust has made a list of algorithms and their purposes to help you decide which one to use.

  • Click on Execute under Token Count and Tag Cloud Creator.
  • Select JaneAustenNovels as the workset. It will appear with your username in front.
  • Give the job a name - JaneAustenTagCloud.
  • When this algorithm cleans up your text, it can also remove stopwords (he, the, i, etc.) and correct common misspellings caused by bad OCR. It uses two default lists for stopwords and common misspellings, but you can paste in a URL where you've hosted your own stoplist if you wish. In this case, click on Use Default for both the list of stopwords and the replacement rules.
  • Keep the default setting that puts all tokens (or words) in lowercase. Otherwise the program will treat a word capitalized at the beginning of a sentence as different from the same word in lowercase in the middle of a sentence.
  • Don't alter the text in the section Display only tokens that match this regular expression. The default only keeps words made up of letters or containing a hyphen. If you were using a text with numerals that you wanted counted, you could alter this to include numbers too, but for now we'll stick with the default.
  • You can also choose how many tokens will be displayed. In this case, just leave the default (200) and hit Submit.
    Your configuration should look something like this, though your job number and collection name will probably be different. 
  • It will take you to the Jobs page, where you'll see the job name appear in Active Jobs. Its status will change from Staging to Queued to Running, and then the job will drop down into Completed Jobs when it finishes.
  • If you click on the job name it will open a page with the results from that job. 
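
The settings in the steps above amount to a short text-processing pipeline. As a rough illustration only, here is a minimal Python sketch of the same idea. It is not HathiTrust's actual code: the tiny stopword list, the regular expression, the sample sentence, and the function name count_tokens are all assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stoplist; the real default list is much longer.
STOPWORDS = {"he", "the", "i", "a", "and", "of", "to", "was", "her", "she"}

# Keep only tokens made of letters and hyphens (an assumed stand-in
# for the algorithm's default regular expression).
TOKEN_PATTERN = re.compile(r"^[a-z-]+$")


def count_tokens(text, top_n=200):
    """Lowercase, drop stopwords and non-matching tokens, return top counts."""
    tokens = text.lower().split()
    # Strip surrounding punctuation so "sister." and "sister" match.
    tokens = [t.strip(".,;:!?\"'()") for t in tokens]
    kept = [
        t for t in tokens
        if t and t not in STOPWORDS and TOKEN_PATTERN.match(t)
    ]
    return Counter(kept).most_common(top_n)


sample = "The sister thought of her sister. Sister and mother thought 2 letters."
print(count_tokens(sample, top_n=3))
```

Lowercasing first is what makes "Sister" at the start of a sentence count together with "sister" elsewhere, and the pattern is what drops the numeral.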

The tag cloud created by HathiTrust, with words sized according to how often they are used.

These are the words used most often in the Jane Austen novels you picked (excluding stopwords). The larger a word is, the more often it is mentioned. The color of a word doesn't mean anything; it's just there to make the words distinct from each other. If you hover over a word, its total count will appear in a box.

The hovering action revealing that the word thought was used 945 times.

You'll see that across all novels, familial relationships seem to be mentioned often (family, father, mother, sister), and so are other personal descriptions of rank and address such as captain, general, sir and lady. If you are familiar with Jane Austen, are all of the results as you'd expect, or are there any words that are missing or that come as a surprise?

  • Click on the tab token_counts.csv and you'll see a list of each of the words with its count. You can download the full listing by clicking on the button Click here to download token_counts.csv. You might want to graph the counts in a different program, or the file can just be a handy reference if you want to see how much more one word is used than another. Maybe you have another author whose works you'd like to compare hers to, to see what is similar and what is different.

Generating Data with Algorithms: Named Entity Recognizer

This algorithm uses the Stanford NLP (Natural Language Processing) model for entities. Basically, it applies a model trained to recognize names of people and places and words that express time, money or percentages, then records which ones appear in the corpus and where. It lets you pull the number of times that a given entity (location, person, time, date, monetary amount) is used within your corpus. This can be helpful if you want to map locations in a large group of works, if you want to see how often money is mentioned in novels in one era versus another, or if you want to generate a character name list to search for in another analysis of the text.

  • From the Algorithms page with your workset, click on Named Entity Recognizer.
  • On that page, name your job JaneAustenNLP, and in the dropdown saying Please Select workset for analysis, select the JaneAustenNovels workset that you've created.
  • Specify that the language in your workset is English.
  • Click Submit and it will become one of your active jobs. When it finishes, it will move from the Active Jobs section of your page to Completed Jobs. Click on its entry in the Job Name column to open its results.

The output will be as follows when it completes. It is only a partial list, but you can download the whole thing by clicking to download entities.csv
Once that is downloaded, open it in Excel or in Google Sheets. I'll be using Excel for the below screenshots, but both have similar functions even if they sometimes have different titles. 

The output, there are columns for the named entity it has found, what type of entity it is, what volume it is in, and where it is located in that book

  • This sheet has four columns: vol_id, which is the unique identifier of the book that was analyzed; page_seq, which is the page number the entity appeared on; entity, which is the person, place, date, time, organization or other item that was recognized; and type, which lists the category of entity it is.
  • Since it's hard to tell which book is meant when there's a volume ID instead of a title, you can make your document more readable by replacing each vol_id with the name of the book.
  • Go to your Worksets and click on JaneAustenNovels. You'll see the names of the novels and their volume IDs listed. This is what you'll use to change the names in your spreadsheet.
  • Then go to Edit > Find and Replace. Where it says Find, paste in the volume ID for the first book, and where it says Replace, paste in its name, Lady Susan.
  • Click on Replace All and then repeat this with each vol_id and title until your vol_id column contains the titles of the books. Change the header to say Novel. Ordinarily I'd say you should create a separate column rather than change your data, but the original data will remain stored with your profile on HathiTrust's website so you can get it back if you need to.
  • If you want to just proceed to the next step and see how you can graph this data, the completed item is here. 
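
If you would rather keep the original identifiers and add the titles as a separate column, the replacement can also be scripted. The following is a minimal sketch in Python, assuming the four column names described above; the volume IDs, the sample rows, and the function name add_novel_column are hypothetical placeholders, so substitute the real IDs from your workset page.

```python
import csv
import io

# Hypothetical volume IDs; copy the real ones from your workset page.
TITLES = {
    "example.vol001": "Lady Susan",
    "example.vol002": "Emma",
}

# Stand-in for the contents of entities.csv (invented rows).
SAMPLE_CSV = """vol_id,page_seq,entity,type
example.vol001,12,Bath,LOCATION
example.vol002,40,Miss Fairfax,LOCATION
example.vol002,41,Emma,PERSON
"""


def add_novel_column(csv_text, titles):
    """Return rows with an added 'novel' column; vol_id is left unchanged."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Fall back to the raw ID if a volume is missing from the mapping.
        row["novel"] = titles.get(row["vol_id"], row["vol_id"])
        rows.append(row)
    return rows


for row in add_novel_column(SAMPLE_CSV, TITLES):
    print(row["novel"], row["entity"], row["type"])
```

To run this on a real download, you would read the file's text instead of SAMPLE_CSV; the original vol_id values stay in place next to the new titles.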

What can you do with Entities Data?

I'll just be using Excel's built-in sorting and filtering capabilities in this next part of the exercise. You can create better graphs in Tableau, so this is mainly to help you see the kind of data that you have.

  • To see what kinds of data you got for each novel, click to highlight the header row and then click on Data > Filter.
  • Click on the Type header (column D) to see the different categories of entities available to you.
    Each entry in the Type column is a filter that you can choose in order to see only the rows of that type.
  • Try selecting the different types one at a time to see what that data looks like for different novels. PERSON will give you an idea of character names, and how frequently each is referred to. It doesn't count each time a pronoun such as she or her refers to a particular character though, so it will only be an incomplete estimate. MONEY might be interesting if you are looking at different novels of a given era, or about a given class, to see what denominations of money are referred to, but in this instance it doesn't seem to work due to OCR errors. DATE or TIME might be interesting types to look at if you wanted to see whether the author you are writing about predominantly sets things in one season, or whether novels before a certain date have mostly daytime settings.
  • Select only LOCATION, and then using the Novel filter (the former vol_id column), select only Emma.
  • If you scroll through the chart you'll see all of the different locations mentioned in this book. This could be handy if you wanted to map the locations in the book to compare it to other books of the period, or to Jane Austen's other novels.
  • As you scroll you'll see some misidentifications, like Miss Fairfax, a character misidentified as a location, so you'll want to do some cleaning before you use this in a real project.
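
The filtering steps above can be sketched in a few lines of Python as well. This is only an illustration: the rows are invented stand-ins for the spreadsheet after the titles have been filled in, and KNOWN_MISTAKES and count_entities are hypothetical names for a hand-checked exclusion list and a helper function.

```python
from collections import Counter

# Invented rows shaped like the spreadsheet after titles are filled in.
ROWS = [
    {"novel": "Emma", "entity": "Highbury", "type": "LOCATION"},
    {"novel": "Emma", "entity": "Highbury", "type": "LOCATION"},
    {"novel": "Emma", "entity": "Miss Fairfax", "type": "LOCATION"},
    {"novel": "Emma", "entity": "Emma", "type": "PERSON"},
    {"novel": "Persuasion", "entity": "Bath", "type": "LOCATION"},
]

# Entities you have checked by hand and decided are not really places.
KNOWN_MISTAKES = {"Miss Fairfax"}


def count_entities(rows, entity_type, novel, exclude=frozenset()):
    """Count each entity of one type in one novel, skipping known errors."""
    return Counter(
        r["entity"]
        for r in rows
        if r["type"] == entity_type
        and r["novel"] == novel
        and r["entity"] not in exclude
    )


print(count_entities(ROWS, "LOCATION", "Emma", KNOWN_MISTAKES).most_common())
```

Keeping the exclusions in a separate list, rather than deleting rows, leaves a record of the cleaning decisions you made.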

Learning Goals

Voyant is a web-based text analysis tool that can do multiple kinds of measurements and visualizations, from word clouds to graphs to network analysis charts. You'll explore how to use Voyant to find out more about the text file provided, Dracula, use Voyant's different settings to interpret the results and export your findings as an image or as a webpage.

Inputting Data

Voyant offers a bunch of different options for how you can input the text that you are planning on using. You can upload a document to analyze, point it to a URL if you are interested in a webpage, or even use one of the two existing corpora (Shakespeare and Jane Austen) that it already has. In this case, you'll be uploading a plain text file of Dracula, provided below. If you have another work you're more interested in, please feel free to upload it instead, although note that it will need to be in plain text format (.txt). Project Gutenberg is a good place to look if you want an older work.

  • Choose Upload and navigate to where you have saved the Dracula text file provided below, then click Open

Voyant's Interface

Once you load in a text, Voyant will take you to its default interface which is a group of different panels that each display a different visualization or chart of information for the text you've loaded in.

  • On the top left side of the screen will be a tag cloud of the most frequently used words in the work (known as Cirrus), and you can adjust the number of words by dragging the slider next to terms upwards.
  • In the middle of the top section will be the text of the document (Reader), and by hovering over or clicking on a word you can get statistics on its use.
  • On the top right of the screen is a graph (Trends) that by default shows the top terms in the document, though you can search and pick different terms as you please.
  • Below that are the terms in context (Contexts), which will initially display the top term, but you can change which term you see by using the search box or by clicking on the term in the reader.
  • To the left of that is a box containing a summary of statistics about the document (Summary).

The Voyant interface is segmented into different graphs and pieces of information

  • You can switch these boxes to have different displays and visualizations in them if you'd like, but more on that later.
  • In the tag cloud interface, click on the circle next to terms and drag it over until 125 tags are being shown. Are there any results that surprise you? It seems strange that 'van' is such a frequently used word in a book from the 1890s. You can use a different panel to track the context a word is used in and see why 'van' is used so much.
  • Context: Go to the Contexts window and type in 'van'.
    The context tool has van typed in as a search term
  • When the results load, you'll see that 'van' comes before the name Helsing, who is one of the characters, so it's used whenever his full name is mentioned. When you do this, the Trends window will also change to depict the use of 'van', which can tell you something about when the character is most present in the novel.
    The Trends graph of the word Van showing the frequency that the character of Van Helsing is mentioned throughout the work
  • Correlations: This display shows how terms are correlated with each other. This is a measure of which other terms are more or less likely to be used alongside a term in a given portion of the text. For instance in a novel where a house burns down, when 'fire' is mentioned, 'smoke' probably will be too, but 'cold' probably will not be. So the term 'fire' is likely positively correlated with 'smoke' but inversely correlated with 'cold'.
    • Click on Correlations at the top of the Contexts panel, and the view will switch.
    • Type in 'van' in the search bar and you'll see that one of the first matches is 'helsing', but just because two words are put together on this chart doesn't mean they're correlated. There are two measures you'll need to look at, Correlation and Significance
       a chart with headings term 1 term 2 correlation and significance
    • The Correlation field measures how closely the frequencies of term 1 and term 2 rise and fall together across the text. The closer it is to 1, the more strongly they are positively correlated; values near -1 mean that one term tends to drop when the other rises. Significance is the p-value: roughly, how likely it is that you would see a correlation this strong by chance alone if the terms were actually unrelated. A value of .05 or lower is conventionally treated as statistically significant.
    • For the first one, to no surprise, van and helsing have a correlation of about .99, with a significance so small that it has to be written in scientific notation since it's a decimal point followed by 12 zeros and then a nine. If you keep scrolling down you'll see lower correlation values and eventually higher significance values, which means those words don't have any discernible relationship.
    • If you want to see words that have a negative correlation with van (which would mean if van is in a paragraph or page, the other term would likely not be near it, and vice versa) you can click on the header Correlation which will then sort it in ascending order.
      The Correlation chart but now it is sorted so that the words with the lowest correlation are at the top of the list.
    • The trend lines for each term give you a visual indication of what this means. Notice that the lines on the left, for term 1, show those terms to be more frequent at the beginning and at the start of the last third of the book, which is exactly where the term van dips. None of these correlations is as strong as the one for helsing, but they are a possible avenue for future investigation. Maybe ruined or high are repeated descriptions for a place where Van Helsing never was?
  • Trends: The trends graph is a convenient way to track the use of any term in a book or corpus over its length. It can be really useful to track trends of characters' mentions or appearances over the course of a document since a character name is just a term that can be tracked like any other. You can change the terms being tracked on a trend graph by typing the words that you want tracked into the box below the graph, adding more as you wish.
    The term lucy is appearing as a search term
  • Add the character names that appear often in the tag cloud as search terms: 'count', 'helsing', 'lucy', 'jonathan', 'mina', 'arthur', 'quincey', 'doctor'
    The trends graph for different main characters in Dracula.
  • You can turn terms off and back on in the graph by clicking on them at the top. From this trend graph, we can see that the title character (referred to mostly as the count) is mentioned by name mostly at the beginning and the end of the novel (dracula is used rarely). We can also see that the 50% mark of the book contains high and comparable numbers of mentions for most characters, and that there's a sharp drop-off in most characters' mentions between the 60% and 70% marks of the novel, after which Helsing, Mina, Jonathan and the count are mentioned again.

It's important to note when tracking these characters you are only looking at mentions of their name as a term and you'd need some kind of entity recognition programming to track, say, Mina when she is referred to as 'she' instead of Mina, but as an overview the graph can give you avenues to investigate. 
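
To make the Trends and Correlations ideas concrete, here is a minimal Python sketch. It assumes that a trend is a term's relative frequency in each of several roughly equal segments of the text, and that the correlation is a Pearson coefficient computed over those per-segment frequencies. Voyant's own calculation may differ in its details, the p-value is not computed here, and the sample text and function names are invented for the example.

```python
import math
import re


def segment_frequencies(text, term, segments=10):
    """Split the text into roughly equal segments (at most `segments`) and
    return the term's relative frequency (occurrences / words) in each."""
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, math.ceil(len(words) / segments))
    freqs = []
    for start in range(0, len(words), size):
        chunk = words[start:start + size]
        freqs.append(chunk.count(term) / len(chunk))
    return freqs


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # A flat line has no defined correlation; report 0.
    return cov / math.sqrt(var_x * var_y)


# Invented text: "van" and "helsing" travel together, "ruined" does not.
text = ("van helsing spoke. " * 5
        + "the castle was ruined. " * 5
        + "van helsing came. " * 5)
van = segment_frequencies(text, "van", segments=5)
helsing = segment_frequencies(text, "helsing", segments=5)
ruined = segment_frequencies(text, "ruined", segments=5)
print(round(pearson(van, helsing), 2), round(pearson(van, ruined), 2))
```

The first number comes out strongly positive and the second negative, which mirrors the pattern described above: terms that rise and fall in the same segments correlate, and terms that occupy different stretches of the text correlate inversely.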

Changing Displays

You can change the method of graphing and analysis at any time in the different boxes in Voyant. You can also change which terms are being displayed and some of their parameters. There are suggestions made for common ones as separate tabs at the top of each section. For example for the one at the top left that shows a tag cloud by default, you can switch to a Links view, that connects with lines which words are used in conjunction with each other.

Links: If you hover over a word, the connections it has to other words will appear in bold. The blue words are the keywords, the orange are the words used near them.

Words are connected with links depending on how often they are used connected with each other.

  • Click on Links and explore the connections between words that seem interesting to you. Double-clicking on one of the orange words (collocates) will switch it to a keyword and let you explore the words connected to it. You can click and drag a box to better see how the connections between it and other terms work. The graph will let you move boxes and even zoom in and out to better display all the terms.
  • You can change the view to only display the words linked to a given term by right-clicking on that term and choosing Centralize. Right-click on lucy and choose that option.
  • The graph will change to contain your key term and the other terms which are often used close to it
    the word lucy surrounded by words associated with it.
  • If you use this for a character you can see at a glance a bit about what events are associated with them. In this case, words associated with correspondence (dear, p.s., diary, unopened)  but also illness and death (asleep, breathing, loss, death, illness, dead, coffin). You can swap it out for any of the other terms (if you are interested in a place, or who or what else is described as sweet) by right-clicking on them and choosing Centralize.
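
The idea behind a collocate can be sketched as counting the words that fall within a few words of each occurrence of a keyword. The Python below is a simplified illustration, not Voyant's implementation: the window size, the small stopword set, the sample sentence, and the function name collocates are all assumptions chosen for the example.

```python
import re
from collections import Counter


def collocates(text, keyword, window=5, stopwords=frozenset()):
    """Count words appearing within `window` words of each keyword occurrence."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue
        start = max(0, i - window)
        # Words just before and just after this occurrence of the keyword.
        neighbours = words[start:i] + words[i + 1:i + 1 + window]
        counts.update(
            w for w in neighbours if w not in stopwords and w != keyword
        )
    return counts


sample = "Dear Lucy was asleep. Poor Lucy, so pale and asleep."
print(collocates(sample, "lucy", window=2, stopwords={"was", "so", "and"}).most_common())
```

A wider window picks up looser associations, while a narrow one stays close to the immediate phrasing around the name.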

As previously mentioned, you aren't limited to the default options available. If you want to see the other analysis or visualization tools that you can use in a panel, just hover at the top, and then click on the image of 4 boxes that is the second icon to the left.

The options available when you hover at the top of a panel

You can use visualization tools, grid tools which display results for pure analysis, and if you have multiple documents involved, you can do analysis across that corpus. 

  • Using the different options available, try out items that look interesting to you such as Word Tree
  • BubbleLines
    Bubbles representing different frequencies of mentions in the work

Changing Stopwords

You may have encountered the concept of stopwords elsewhere in text analysis. They are words like 'the, that, those, a' that are so common that they throw off relative frequency counts while having relatively little significance to the text on their own. Voyant filters these words out automatically, but you can reintroduce them to the analysis in individual panels or across your whole view. Voyant is still counting these words, just not including them as objects of analysis in the visualizations.

  • To see an example, go into the trends panel and type in the word she. You'll see that the word is included as a term with a given frequency when you type it in
    the word she is included as a term
  • However, if you press enter, the graph containing it is blank. This is because 'she' is one of the stopwords and so isn't available as a key term in visualizations. You can change this in the options for this tool, which you get to by hovering at the top of the panel and clicking on the second icon from the right.
  • The options window will pop up, and one of the two things you can change is what is used for stopwords. You can apply these changes just to this tool, or across all of your Voyant analysis, depending on whether or not you check apply globally. In my case, I'll uncheck apply globally since I only want male and female pronouns to be included for this one trends graph.
  • If you wanted all words in the document to be eligible for analysis, you could change Stopwords to None. However, you can also remove just a few words from the stopword list, leaving the rest of it intact, by clicking on Edit List.
  • Click on this option and it'll open the Edit Stoplist window, which shows the terms that are on the stoplist. Scroll to she and he and delete them from the list. Get rid of other words if you'd like as well.
  • When you click Save, it'll return you to the options window, and the selection for stopwords will now read keywords- followed by a string of letters and numbers. That's the alias assigned to the new stoplist that you've created.
  • Click on confirm and you'll see that the word she is now included on your trends graph. Add he as well and you'll see how the use of the two terms compares throughout the book.
    the terms he and she graphed for use
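
The difference between counting a word and displaying it can be shown with a short Python sketch. The stopword set and sample sentence here are invented, and visible_counts is a hypothetical helper; the point is only that editing the stoplist changes what is shown, not what is counted.

```python
from collections import Counter

# A tiny stand-in for a default stoplist.
DEFAULT_STOPWORDS = {"the", "a", "that", "those", "he", "she", "and"}


def visible_counts(words, stopwords):
    """Count every word, then hide the ones that are on the stoplist."""
    counts = Counter(words)  # stopwords are still counted here
    return {w: c for w, c in counts.items() if w not in stopwords}


words = "she said that he and she saw the count".split()

# Edited list: the same stoplist with he and she removed from it.
edited = DEFAULT_STOPWORDS - {"he", "she"}

print(visible_counts(words, DEFAULT_STOPWORDS))
print(visible_counts(words, edited))
```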
     

Exporting Your Work

Voyant can create visualizations of trends, patterns or other information within the work you want analyzed, and even more usefully it makes it easy for you to take those visualizations and make them permanently viewable. You may want to use this so you can create a bunch of visualizations to reference later while doing other work with your corpus, or you may want to make interactive examples that will be available to those reading your arguments within your project.

  • In the panel containing your trends graph (or any of your graphs that you are most interested in), hover at the top and then click on the icon of a box with an arrow coming out of it 
  • When you click on this, you'll get a few different options for how your graph will be exported. You can get it as a citation, a URL that will connect people to your view, HTML code that will embed it, or a static PNG image. The first three options are under Export View, and the last is under Export Visualization.
  • To see what the URL option looks like, select that option, and click on Export.
  • Voyant will open a new window that contains just the panel with your graph. Copy and paste the URL into an incognito or private-browsing window and note that it still works. This means that you can give this URL to others and they'll see what you see. My exported URL is here.
  • Users will be able to add terms and change the configuration of the graph. It still contains the information available about your corpus, so people looking at the results of your project will be able to see your findings and also use the data you've amassed to look at comparisons of their own.
  • If you are planning on using a static version of your graph within a paper, printout or poster, you can also save your graph as a PNG image. Click on the export icon and then choose export a PNG image of this visualization. By moving the circle along the scaling line you can change the size of the exported image. However, this doesn't increase the resolution, so larger sizes will look blurrier. It is also worth noting that for items like this graph where a legend is needed, exporting a PNG exports just the visualization and not the legend, so you may need to add labels after you've created the image.
  • A window will appear called Export Image and prompt you to right click on the image then save it to a folder on your computer.  When it is done downloading, click at the bottom of your browser window to open your file.
  • Your image will be different based on what you chose to save, but it will probably look something like the below file.

Learning Goals

In the last module, you learned how Voyant can help you look at one document, but its real strength comes from its ability to map trends, connections and associations across large groups of texts. In this example you'll upload these novels and see how the results differ when you look at a larger group of documents rather than just one.

Data

The below text files are from a corpus of gothic literature that I've downloaded from Project Gutenberg. Since they include the first, or first modern, appearances of some important horror and gothic archetypes and span a large swath of time (nearly 100 years), I'll be analyzing them using Voyant to see what they have in common and how they differ. Note that they are all in the public domain, so using the web-based Voyant tool is okay. If you want to use Voyant for items that have copyright concerns or contain confidential information, you'll want to set up your own Voyant Server, which Voyant helpfully provides instructions for.

Getting Started

  • Go to voyant-tools.org and click on Upload
  • In the window that opens up, select the text files in this gothic fiction corpus, and then click Reveal

Voyant Interface

As when you uploaded a single document, the Voyant page now shows you different panels that contain by default (starting from the top left, clockwise): a word cloud, the text being analyzed, a trends graph, keywords in context, and a summary. The difference from the previous tutorial is that when you uploaded one document, Voyant showed you the information for that one document; now it's showing you information for all of the documents combined. The Reader has a different colored bar for the part of the text made up by each document, the x-axis of the Trends graph is made up of the different documents, and the Summary panel has the statistics for each document.

In this case since there is a chronological dimension to these documents, I've made sure the graph goes in chronological order by putting the year of each at the beginning of the file name. If you are working with documents that should be graphed in a certain order you can do the same.
The Voyant window for working with multiple documents with ways included to switch between documents, and indications about which text a comparison comes from.

 

In the Summary section, you can see the most frequently used words, along with the unique words for each document. By moving the slider marked items you can increase the number shown for each measure. Keep in mind that vocabulary density isn't adjusted for the size of the document, so documents with longer word counts will tend to have less dense vocabularies. The most unique words for each document are often character names, but you can occasionally find surprises, like the fact that the word feelings is used uniquely often in Frankenstein compared to all the other works.

Grid Options

  • Documents: This contains information on each individual document similar to what appears on the Summary page, containing the word counts, unique word forms, and average words per sentence of each document.
  • Phrases: You can use this to look up what the most popular multi-word phrases (referred to at times as n-grams) are across the corpus. The Trend column will plot in which works the phrase is used most often. You can see if a phrase is simply used many times in only one document or fairly constantly throughout.
    • Limit the terms to 2-4 words by dragging over the right circle next to Length until the floating label says 4
    • Click on the Count column twice so that it lists the phrases in descending order. Stopwords aren't excluded in the calculation of this measure, so you'll see most of the most frequently used phrases contain the.
    • You can search for a phrase, or even search for multiple terms at once. Click on the question mark next to the search box to see different options for how to set up searches, including wildcards (*, ?) or multiple terms.
    • To capture words around feel and feelings you can set up the search term like so: feel|feeling* The | means that it will search for both terms, and the * after feeling means that it will capture both feeling and feelings.
    • To get further information on how phrases that you are interested in are used, you can move over to the next panel to use the context tool where you can search for both 
  • Context: You can enter in a single search term or multiple and see the context in a sentence that it is used in
    • To further explore your result, enter in "feelings of" into the search bar and see the context in each document that it is used in.
  • Correlations: You can enter in a search term and see if there are any words whose use rises and falls with it. Just like in the last tutorial regarding correlations, the measure is not related to the two terms' proximity to each other in the document but to their comparative frequencies. Two terms are positively correlated when, if one is used frequently in a document, the other is too; the correlation number will then be closer to 1. If the correlation is a negative number, it means that when one term is often used in a document the other one tends to be used less. The lower the p value (.05 or less is ideal), the lower the likelihood that the correlation is due to chance. Minimum coverage is the percentage of documents that will need to contain a term before it can be analyzed.
    For instance, if one of the characters lives in a castle in the mountains, you'd expect castle and mountain to have a high positive correlation number, but castle and desert to have a high negative correlation number.
    • In the default listing when you scroll through you'll see that room and chamber  have a high negative correlation number, which makes it look like chamber is used when room is not. 
    • To see if it is a statistically significant finding, type in chamber as the search term and make the minimum coverage 50 to make sure all the books are covered.
    • You'll see that there is a negative correlation between the two, but it's a small one, and the significance value is high, so there is a high possibility that this is just by chance.
      Though room and chamber do seem to appear at opposite ends of the corpus, the significance number is too high for the finding to seem valid.
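The idea behind the correlation number can be sketched as follows. This assumes a Pearson correlation computed over each term's relative frequency per document; whether that is exactly the statistic Voyant uses is an assumption on my part, and the frequencies are made up to mirror the castle, mountain and desert example.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a term with identical frequency everywhere has no correlation")
    return covariance / (spread_x * spread_y)


# Made-up relative frequencies, one value per document in a four-book corpus.
castle = [0.004, 0.003, 0.0005, 0.0001]
mountain = [0.002, 0.0015, 0.00025, 0.00005]  # rises and falls with castle
desert = [0.0, 0.0001, 0.002, 0.003]          # common where castle is rare

print(pearson(castle, mountain))  # close to 1: used in the same documents
print(pearson(castle, desert))    # negative: one is used where the other is not
```

Notice that nothing here looks at where the words sit inside a document, only at how often each one appears per document, which is why correlation says nothing about proximity.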

Visualizations: Corpus

You can change the visualization, graph or table that's in a given panel by clicking one of the options at the top, or by hovering until the window-like button appears and then clicking on it. Corpus Tools contains the visualizations or other tools that can be applied across all the documents that you've loaded in, rather than just one. With Document Tools you can use the tool on just one of the documents, the same as you were able to in the previous tutorial.

  • Cirrus: This is the one that shows up by default. You can change the number of words appearing by moving the sliding button on the bottom. If you want to only see the words for one of the documents you can switch it from corpus to document and then select that document. Remember, by default this tool doesn't include stopwords, so terms like the, and, I, they, etc. will probably be more frequent than the ones displayed. Additionally, the placement of words next to each other doesn't mean that they are used together more often.
    • Scroll up to 125. What words do you see now that you didn't before?
  • Trends: This graphs the use of a term or phrase across, in this case, the entire corpus, since you have set it up under Corpus Tools.
    • Continuing what we did above, type in the terms "feelings of", feel and feeling then hit enter to graph them across the corpus
      The trends graph for the relative frequency of feel, feeling, and feelings of across the whole corpus. It spikes a little with Frankenstein, goes up and down, then reaches a new high with Yellow Wallpaper
    • The default being plotted here is the relative frequency, so not just how many times a word is used in a particular work, but how that relates to the total number of words in that work.
      Raw frequency would just be how many times the word is used without taking into account how long the work is. To see how different the two are, hover over the top of the section until the options icon appears, and click on it. 

      In the option box that appears, uncheck the box that says apply globally, and then click on the button next to Frequencies that says Raw.
    • By selecting this, your graph will shift to display this new measure. 
      The raw frequency graph of feel, feeling and feelings of. On this one the peaks are Frankenstein, Varney (hugely) and Dracula
    • You'll see this totally changes which books look like the peak usage of the term you're interested in. This graph would be misleading if you presented it to the reader as proof that a word was used exceptionally more often in one work than another when one is a short story and the other is a novel. But if you were looking for lots of examples of the word used in context, this graph could tell you where to look to find the most examples.
  • Links: This creates a graph of collocations, or words often used together. By default it starts out with frequent terms. You can select your own term using the search bar. You can increase the number of words allowed between your keyword and the words it is linked to by using the Context button; it will look at the same number of words in both directions. Double-clicking on a word will add more words that it is located near, since at first it will only show the top terms.
    • Since you're working with a larger corpus, you'll want a larger panel to operate in. Click on the export button (that box with an arrow that appears when you hover) and choose a URL for this view and click Export. This will open up the graph in a window all its own so you have more space.
    • We'll be looking at the difference between saw and heard, so first Clear the graph. Increase the Context words to 5 so that we get a larger list of words.
    • In the search bar enter in these two terms. At first, they'll only have one word in common, thought, which you can tell is used by both since thought has lines connecting it to both saw and heard, as opposed to face, which is only connected to saw.
    • To add additional collocates for each word, right click on each term, and click on Fetch collocates 10 times. The graph will populate with more words. It is interactive, so you can click and drag the terms apart to make it clearer which words are associated with saw, which with heard and which with both.
    • If you want to see just the terms associated with one of the words, you can right-click on it and choose Centralize. This can be particularly interesting for a character or a place. Shrink the context to 3, then type castle into the search bar. 
    • A smaller group of links will appear, right-click on castle and choose Centralize
      The word castle surrounded by the many words that are used near it.
    • Take a look at the graph. There are words associated with character names but also with titles a person might have like porter or ma'amselle. You can also see the parts of a castle, and verbs or adjectives associated with it. The words do not differ in size based on how frequently they appear, however, you can hover over them to see how often they appear with castle.
  • Microsearch: You can type in a search term and see where in each book it appears. This can help you with tracking names or concepts to see where they appear in each work (and which works they don't appear in.)
    • Click on the window icon at the top of the panel you want to use for it, and under Corpus Tools select Microsearch
    • To further explore which items have feel or feeling or feelings in them like you did with phrases and correlation, click on the search box and write in feel|feeling*
    • The resulting graph gives each document its own bar. Each bar is made up of lines that represent the word count of the book, with each line standing for around 3700 words of the document. Each line is divided into segments, and a segment is either not colored in at all (search term not used) or colored in red to the degree that the term is used within that segment.
      a bunch of bars with red squares in them representing the use of feel or feeling.
    • This lets you see things at a glance like that feel, feeling and feelings are used throughout Frankenstein, especially in the second half, that Varney the Vampire also contains these words a lot though mostly in the first half, and that other works also have feel, feeling or feelings more often in their first half. If you are interested in the differing linguistics of the eras of Gothic Fiction, this could be useful information for you, to see trends, and look for outliers.
  • Terms Radio: The Terms Radio is more than just a static visualization, it is also animated and so is very useful when you are trying to get an idea of the use of terms over time.
    • Right click on the tool options button and choose Corpus Tools > Terms Radio
    • What will appear by default is the three most used terms as lines on the graph with a corresponding thumbnail at the top of how they are represented throughout the corpus. Additionally there are other frequently used words and if you hover over them, you'll see the term's frequency in different documents highlighted across the graph. The corresponding line will also appear on the thumbnail at the top to show its position throughout the corpus.
    • Hover until you find some terms you are interested in and then click on them to add them to the graph up top. You can remove terms by clicking on them on the graph
    • To see how terms fare across the document click on the play button and the graph will scroll through the corpus
    • To show more of the corpus at once on the initial graph, use the Visible adjustment to change it to a higher number. To make it so that more words appear on the screen at once, slide over the adjuster for Terms to a higher number. 
    • The search box seems to be malfunctioning however, so don't look for terms on the graph that way.
  • Mandala: This is a visualization that is useful if you just want to see at a glance which works contain your word of interest. If it isn't used, then there is no line connecting our search term to the work. Maybe instead of having around a dozen gothic novels loaded in, I have a hundred, and I just want to talk about castles in my paper so I want to know which books mention it so I can read them without having to read all the rest to find out.
    • Click on the Tools option for a panel, and choose Corpus Tools > Mandala
    • It loads with the most used words by default, but since in this case those words, with the exception of emily, are in all the books, let's add less common terms. Click on Clear and you'll see the board clear and the boxes representing each document go to the margin of the circle.
    • Click Add and type in castle, then choose Update. That item will be within the circle and you'll see that there are connections made to the works that it is in. Hovering over castle will highlight the lines between it and the works that it is connected to. Works it is not connected to will fade out.
      the word castle with some connections to it
    • You can add in other terms you're interested in, for my search I've chosen ghost*, night, creature*, haunt*, monster and fiend since they're popularly associated with gothic fiction. Note that you'll have to enter in each term separately, or they will appear in the same box. Whenever you hover over a term it will highlight the documents that contain that term, and whenever you hover over a document it will highlight the terms associated with it. For instance, if you hover over the box for 1819_TheVampyre, you'll see that though vampire lore may later have involved castles, in the case of 1819's The Vampyre, it did not since the term castle is not connected to it.
  • Word Tree: This is a way to represent more illustratively how a word appears in context within your corpus, rather than just as a list of sentences or phrases. You can look at what occurs before or after a given term you're interested in.
    • Go to the tool option panel and choose Corpus Tools > Word Tree. It will load with the most popular term as the root term by default. The root term will have words on both sides, showing either how a sentence leads up to the root term or how it continues after it.
    • Try the word know by typing it into the search bar. You can click on the question mark on the side of the search window to see the different options that you have 
      the word know, with branches off of it to other terms
    • If a word connected to your key word is in large letters, that means it's used more often before or after your root term than other terms are. You can click on each word to see the sentence before or after the word.
    • You can change the number of branches and the amount of context given using the sliding scale on the bottom, as well as the number of terms included by adjusting the scale next to pool.
  • BubbleLines: This visualization will illustrate with a bubble the distinct use of each term of interest. It divides each work into sections, and if your keyword is used in a section, it will place a bubble on that section whose size is related to how often that keyword is used. You can see at a glance which works are similar to each other in the mentions of certain concepts and where they are spaced out in the work. But it's important to note that it's not weighted: all documents are divided into the same number of segments, regardless of how many words are in them, which means the actual word count of each segment can differ a lot. So you may want to look at each document in the context of other documents of the same word count.
    • Click on Bubblelines which should be one of the options on the Context panel. The default search terms should be the ones most frequently used within the corpus
      The Bubblelines visualization. Circles on the line denote when and how often a term is used within a book. Each book has its own line.
    • Scroll through the visualization and you'll see that some of the lines use all of the terms quite a lot, others not a bunch. By adjusting Granularity you can change how many segments each book is divided into. 
    • Click the box for Separate Lines for Terms in order to be better able to see where the terms are used with less overlap between them.
    • In order to see how different documents match up to each other you can click and drag them into whatever order is best for the comparisons that you are most interested in.
    • If you want to be sure your results aren't misleading you, it's a good idea to use works of the same size. In a document where there are 100 words in a segment, said appearing 5 times would mean it was .05 of that segment, but its bubble would be kind of small. In a document where there are 20,000 words in a segment, said appearing 20 times would be a much bigger bubble, but it wouldn't mean it was a larger part of that segment (it would be only .001 of it).
    • Click to the next panel Documents and take a look and see which ones you can find that are around the same length. For the next exercise we'll use Frankenstein, Picture of Dorian Grey and Wieland. Though you might want to upload a large amount of documents for a corpus, there will probably be some occasions where you want to zoom in on one or more in particular, and in the below you'll see how. 
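The arithmetic behind the caution about segment size can be sketched like this. The function below is my own illustration of cutting a document into the same number of segments and comparing raw counts with each segment's share; it is an assumption about the general idea, not Voyant's actual code.

```python
def segment_shares(tokens: list[str], term: str, segments: int) -> list[tuple[int, float]]:
    """Split tokens into `segments` roughly equal parts and, for each part,
    return (raw count of term, share of that part made up by term)."""
    n = len(tokens)
    results = []
    for i in range(segments):
        chunk = tokens[i * n // segments:(i + 1) * n // segments]
        raw = chunk.count(term)
        share = raw / len(chunk) if chunk else 0.0
        results.append((raw, share))
    return results


# A 100-word segment where "said" appears 5 times (filler words stand in for the rest).
short_doc = ["said"] * 5 + ["filler"] * 95
# A 20,000-word segment where "said" appears 20 times.
long_doc = ["said"] * 20 + ["filler"] * 19980

print(segment_shares(short_doc, "said", 1))  # raw 5, share about .05
print(segment_shares(long_doc, "said", 1))   # raw 20, share about .001
```

The raw count, which is what the bubble size follows, is four times bigger in the long document, even though the word makes up a far smaller share of that segment.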

Visualizations: Document

  • For the Bubblelines visualization that you've set up with all the works, you're now taking a look at only the three works above. Click on the box marked Documents, and ensure that only Frankenstein, Picture of Dorian Grey and Wieland are selected
  • Now your graph will only contain those documents

You can change other displays to only be for a certain document instead of the whole corpus too. 

  • Under Cirrus, the box that says Scale is for now configured to contain the whole corpus, but you can use the dropdown to select one of the works you want to focus on, in this case The Yellow Wallpaper, and click on it to put a check mark next to it.
  • This will change the graph to only select the most used terms for that document
  • To change a panel from displaying the information for the whole corpus to just a document you can go to the Tools Option panel and select from the Document Tools section, which you'll notice is more limited. But if you want to see only the context for a certain document you can select that one from the dropdown marked Scale.

Changing Stopwords

Stopwords are words that Voyant will count but not include in its analyses because they are so commonly used as to be meaningless in helping you see the unique qualities of the work you are studying. You can alter Voyant's lists of Stopwords by adding or removing stopwords. 

emily keeps appearing as a most frequently used word, however when you look at the trends graph of its frequency you see that it's only used in one work. If you want it to be taken out of the words to analyze, you can add it as a stopword, either for one of your panels or across all of them.

In this case I'll be going to the Terms Radio panel since including that word has made it show up as one of the bold lines across the whole corpus and I want the bold lines to be all words that aren't names.

  • When you hover at the top of the panel, a radio button will appear
  • Next to where it says Stopwords click Edit List
  • At the bottom of the list add emily
  • Click Save, uncheck the box that says apply globally since you don't want to remove it from other analyses in other panels and click confirm
  • It won't necessarily change in the panel but if you click the arrow and select that you want to export the chart as a URL, the Terms Radio graph will now not include emily on its trend lines of most used words 
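What adding a stopword does to a list of most used words can be sketched as follows. The tiny stopword list and the sample sentence are made up for illustration; Voyant's real list is much longer, and this is not its actual code.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (Voyant's real list is far longer).
STOPWORDS = {"the", "and", "i", "they", "a", "of", "was"}


def top_terms(text: str, stopwords: set[str], n: int = 2) -> list[tuple[str, int]]:
    """Count every word, skip the stopwords, and return the n most frequent."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in stopwords)
    return counts.most_common(n)


sample = "Emily saw the castle and Emily heard the castle and the night was dark"

print(top_terms(sample, STOPWORDS))              # emily is among the top terms
print(top_terms(sample, STOPWORDS | {"emily"}))  # emily no longer appears
```

The words are still there in the text; adding emily to the list only keeps it out of the ranking, which leaves room for the next most frequent term to show up.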

Instead of adding a new stopword, you may wish to remove one. Included among stopwords are pronouns, so if you want to get a sampling of words used around men and words used around women by checking the links to the words he and she, you're out of luck until you revise your stopwords to take out these pronouns. 

In this example, we'll be using the links tool, but if there's a different tool you want to use and see what happens when you remove some of the stopwords, you can use a different one. 

  • Click on the line with the dot that appears when you hover over the top of the panel. Click where it says Edit List next to the stopwords, and scroll through the list until you find she and he and delete each of them.
    A box marked Edit Stoplist. In it "she" is highlighted
  • When you click save the dropdown next to Stopwords will have changed to instead read keywords- and then a string of numbers. Uncheck the box that says apply globally since you just want to make the change for this panel and click on Confirm
  • Immediately the Links panel will change, and he is now the largest word by far on the graph. When you hover over he and she you'll see that he is used nearly twice as often. Remember, it's only the use of these pronouns being measured, not when characters are called by their proper names, but with a ratio that large, it's hard not to infer that male characters are taking more action than female ones, since he and she would likely be used before that character took action.
    the word he is humongous on the graph while she is much smaller

You can right click on each and choose Centralize to just see the words used in context around that word. Are there any differences that you notice?
the word he surrounded by the words used in common with it.
The word she with words associated around it
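To make the idea of words used in context around he and she more concrete, here is a minimal sketch of counting collocates within a set number of words on either side of a keyword. The sample sentence, the small stopword set and the function itself are illustrative assumptions and not how Voyant computes its links.

```python
import re
from collections import Counter


def collocates(text: str, keyword: str, context: int = 5,
               stopwords: frozenset[str] = frozenset()) -> Counter:
    """Count the words found within `context` words before or after keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        window = tokens[max(0, i - context):i] + tokens[i + 1:i + 1 + context]
        counts.update(w for w in window if w not in stopwords and w != keyword)
    return counts


sample = "he saw the castle and she heard the wind"
# he and she are deliberately left out of this stopword set so they can be searched.
stops = frozenset({"the", "and"})

print(collocates(sample, "he", context=2, stopwords=stops))
print(collocates(sample, "she", context=2, stopwords=stops))
```

If he and she were still in the stopword set, a tool built this way would have nothing to center on, which is why the pronouns had to be taken out of the list first.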

There are other visualization items that you can use to explore and illustrate your findings across a corpus, though some of them need more investigation and use of statistical knowledge.

Learning Goals

Sometimes you may be working with data that you don't want stored on Voyant's servers. Maybe it's because you want to analyze a series of confidential interviews. Maybe the materials you are using are protected by copyright, and though it is perfectly okay for you to be analyzing them in this way (see court cases here), it is safer not to upload such materials to the internet, however temporarily. Or maybe you know that you'll be doing your analysis work when you're not able to be connected to the internet reliably. In these cases you can download an instance of Voyant that does its processing on your computer rather than on Voyant's webservers, so you can work with Voyant's analysis tools without what you are analyzing leaving your own computer.

Getting Started

This is something you'll need to install on your own computer, since the library computers can't have anything installed on them. Check and make sure that you have enough free space on your hard drive. Unzipped, it's about 600 MB since it contains all of the same analysis tools available on the site, just kept on your computer.
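If you already have Python on your computer, one quick way to check your free space is the sketch below. The 600 MB figure comes from the estimate above; checking the current folder (".") is just an assumption, so point it at whichever drive you plan to unzip to.

```python
import shutil

# Roughly 600 MB, the unzipped size estimated in this guide.
REQUIRED_BYTES = 600 * 1024 * 1024


def has_room(free_bytes: int, required: int = REQUIRED_BYTES) -> bool:
    """Return True when the free space covers the unzipped download."""
    return free_bytes >= required


# "." means the drive that holds the current folder; change it if needed.
free = shutil.disk_usage(".").free
print(f"Free space: {free / 1024 ** 2:.0f} MB; enough room: {has_room(free)}")
```

Your file manager will show you the same number, so this is only a convenience.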

  • Head to the download page and download the latest release of the Voyant server. 
  • When it has finished downloading, right-click on the zip file and extract it. 
  • To open it, you just need to double-click on the .jar file in the extracted Voyant folder. It will open a web browser window containing the Voyant tools application, but you'll be working with the copy on your computer so the data won't leave your computer. 

Using Voyant Server

The interface and panels are the same as described in the first and second modules; the processing just happens on your computer. If you are connected to the internet, you'll still be able to access internet content by pasting in the URL. It is important to note, though, that you won't be able to use the export function to create an interactive embedded view of your data this way, just static images, though you will still be able to use the export function to expand a panel into a new window.