Skip to Main Content

DAsH

Research Guide for DAsH (or digital humanities) resources and tools

What is Tableau?

Tableau is an information visualization program that can create graphs, charts, and even has some mapping capabilities. It has a bit of a learning curve, but you can apply lots of custom looks, utilize different kinds of graphs and graphing functions, and even create a web-embeddable version so that your readers can interact with your graph. The Public version (which we’ll be working with below) is free, but though it has most of the same functions as the non-free version, there are a few drawbacks. For one, when you are using the Public version, it is just that…public. You’ll be saving your graph to your profile on Tableau's website.  So if you have a working version of your graph that you want to remain totally confidential, this might not work for you. Additionally, as you are saving your graph as a project to the website, rather than to your home computer, if you are in a situation where your internet connection is spotty, you may lose changes that you are making, because you won’t be able to save it to their site. If either of those things makes you want to invest in the paid version of it, this tutorial should still work for you since the functions are pretty consistent.

If you need a more in-depth set of instructions about advanced things that can be done with Tableau, please check out the guides available in the O'Reilly database below.

Using Tableau Public

Lesson Goals

Tableau is a data visualization program that can be used to create many different kinds of tables and charts. With this module, we'll explore Tableau's interface and create worksheets containing different common graphs using a small subset of data about Rotten Tomatoes top-ranked movies for each genre.

Getting Started

  • Go to https://public.tableau.com/en-us/s/download and download Tableau Public. After it has finished downloading, install it. 
  • Visit the Sign Up page and create a login for yourself, and make sure that the password is something that you will remember. You must create a login as you will be logging into Tableau every time you save your work for the site when you are using Tableau Public. 
  • Download the below excel file, RottenTomatoesTop100Movies and open it in Excel or in Google Sheets. Save it somewhere you can find it again easily, since you'll be opening it again shortly in Tableau.

  • Take a look at the data to get an idea of the kind of information that you can visualize with it. It has a list of 100 (or sometimes less) movies that are rated as being the top of their genre with each row containing one movie. Left to right it lists:
    • The Genre movies are being ranked in
    • The movie's Rank for that genre
    • The movie's Rating on the Tomatometer (Rotten Tomatoes' system where it aggregates user and critic's reviews)
    • The Title with and without the year 
    • The number of reviews the movie has on Rotten Tomatoes.  

I got this dataset from the dataset and tutorial website Kaggle 

  • For the movies listed in the Rotten Tomatoes Top 100 in this dataset, I've added information I got from The Movie Database by querying their API. More information on querying an API is available in the Using APIs section of the DAsH research guide. This data from TMDB is
    • The movie's Runtime in minutes or how long it is. 
    • The movie's Release Date (Rotten Tomatoes' dataset only mentioned the year, and sometimes that was only its US Release Year).
    • The Production Country or Countries associated with the movie. 

It's important to keep in mind where the information came from and what that means for what is being measured with a graph made with that information. Even at a glance, it does seem like the higher ranked movies tend to be pretty recent films on this spreadsheet, so it's very possible that our graph would look very different from someone trying to look at a similar ranking compiled five or even three years ago, or a ranking compiled by different users than Rotten Tomatoes' user reviewers and the critics that the site looks at reviews from. 

  • There are over 1600 rows on this sheet, so it’s hard to see exactly what the overall findings are. Even if we created a pivot table summarizing some aspects of the data, it wouldn’t be able to tell the story of what the data shows with as much impact as a graph, so that’s why we’ll be using Tableau to create a graph of the findings. 
  • Think about the different kinds of information you could display with this, you could graph
    • What years are popular for what genres? 
    • If different genres have different runtimes? If runtimes are increasing or decreasing for more recent films?
    • What countries are producing movies popular on Rotten Tomatoes? Has that changed over time? Is that different by genre?

Connecting to Your Data in Tableau

  • Open Tableau Public from the start menu. It will open up on a menu asking you what file you want to connect to. The file that you connect to is the the file whose data Tableau will be using to create your worksheets. Chose Microsoft Excel under the options for Connect to a File.
  • In the Explorer Window, navigate to where you have the RottenTomatoesTop100Movies sheet, and select that you want to open it. 
    It will take a few minutes to open up your file and then give you a screen where it shows you the first few rows of your data. This is where you can make modifications as to what kind of data the fields contain, join the data from two different sheets in the Excel file you've connected to or even create a new field that performs some kind of calculation on your data. We'll only get into the first option in this module, but feel free to experiment on your own for others. 
  • In this case, let's change the data type for Year. The data types  Tableau accepts are Dates, Numbers, String (text), or Boolean (true or false).
  • When you look at the Year column it has a little number sign on top of it. That means that the data type for the column is numbers. But we want to graph this data over time, so we want Tableau to understand that what this column contains is a date. Click on the little number icon.
  • From the dropdown menu choose Date and the icon next to Year will switch to a calendar and the numbers in the column will switch to look like dates.
     
  • Let's make sure that the other items have the right designation, the ones with numeric data have the little # and the ones that are countries have a globe next to them indicating that they are geographic data. We can do this by scrolling down in the table to the right marked field, and we see that the fields that contain text such as Genre and Title have the icon next to them indicating they are text. The items that contain number information like Rank, Rating Tomatometer and No. of Reviews have the # sign next to them, indicating they are numbers and all the Production Country fields have the globe icon next to them indicating these fields hold geographic data. 
  • We've made the modifications that we want to make to how Tableau identifies the types of data in our sheet, so let's move on to the worksheet. But remember, if we want to change anything about the actual data being used in the graph, whether that's changing the data type, adding in a calculated field or creating a relationship between different columns in two sheets, one of the places we can edit this is the Data Source button at the bottom of the Tableau interface. 

The Worksheet Interface in Tableau

  • Select the box where it says Sheet 1 at the bottom of the screen and it will send us into a blank Worksheet interface. Worksheets are where to make graphs. At the bottom of the Tableau interface will be tabs that will let us toggle between different worksheets, dashboards or stories, as well as icons to create a new version of each of those kinds of documents. We'll be dealing with worksheets in this tutorial. 
  • To the left of the screen is the Data that exists for our sheet, underneath Tables are the different Dimensions and Measures that make up the data that is on the spreadsheet we've connected. There is also a tab for Analytics where we can add trend lines or other kinds additional reference lines to graphs.
  • This is where we can see which Dimensions or Measures are available to be displayed in the rows or columns of the chart or graph.
    • Dimensions are qualitative data, so mostly words (or strings in Tableau's terminology). They are the categories, classifications or other information that give the measures meaning. Dates are grouped with Dimensions but can be used as Measures as well. Dimensions  are coded with blue boxes and symbols and appear that the top of the data available under Tables. For this sheet they are:
      • Genre
      • Production Country 1-11
      • Release Date
      • Title
      • Title Without Year
      • TMDB ID
      • Year
      • Measure Names -  a field generated by Tableau which can be used for labeling
    • Measures are quantitative data, meaning columns that can be summed, added, subtracted, averaged, etc. Nearly any kind of numeric data is a measure, with the exception of, for example, an ID number. A sum of everyone’s MC ID numbers wouldn’t really tell us anything, so it isn't really measuring anything, so it's a dimension. Dates fall somewhere in between as to whether they tell you something based on their numeric quantities so Tableau will let you treat dates as either a measure or a dimension. More on that next section. In Tableau, Measures are coded with green boxes and symbols. The Measures available to be added to this sheet are at the bottom 
      • No. of Reviews
      • Rank
      • Rating Tomatometer
      • Latitude (generated)
      • Longitude (generated) - These two measures are generated from the country names by Tableau.
      • top_100_movies_by_genres  - a field generated by Tableau that will count the number of rows in the sheet that contain a given measure or dimension. Whenever we want the graph to display how many rows match a certain criteria, this is what we'll add to your worksheet. 
      • Measure Values - a field that adds all measures as a group to the graph.
  • The next column over contains another group of boxes or cards where we can control what information winds up on the graph. 
    • Pages is a function where we can add digital pages to your graph, which could be useful if we wanted to say, see what the graph looked like for different months of a year or when broken out based on different Dimension values (we'll use later when we want to see how the graph looks for movies of different genres). When you can do this you can flip through digital "pages" to see how the graph differs based on when that Dimension changes. We'll be getting to that later in the tutorial.
    • Filters is the next box down. This is what to use to only deal with certain ranges of the data. So if we had say, air quality data for the Bronx, and we wanted to only look at the summer months, we could filter the graph so it only included data for the months of June, July and August. 
    • Marks is the box to use to control what kind of graph we are making (scatter, bar, line or other options) and lets us use color, size or shape to signify different aspects of the data. This is also where we can control what aspects of the data will show up as labels or when people click on a section of the graph. 
  • In the middle of the screen is the canvas where we will see how the graph that we are creating looks. Above that are the Rows and Columns containers where we can drag in the Dimension and Measures that we want to be displayed as the row or column section of the chart. Above that section, there are also icons that will let us swap which bits of data are the row and which are the column, that will change how it is sorted, that will control how the graph is made to fit to the display, and other options.
  • Finally, at the top of this interface is a box marked Show Me.  When we click on it and hover over the picture of the type of graph we want to make, information will appear the the bottom of the window letting us know what kind of information we need to add to the graph to accomplish to that. Whether the data to use is Dimension, Measure or both and how many of each is needed. 

Now that we have some idea of what functions on Tableau tie to what outcomes on the graph, let's make some graphs!

It’s a good idea before you start hauling over measures or dimensions to first come up with a mental image of what you want your graph to look like. Not what you actually expect the data to tell you, but how you want it organized. If it’s a line chart, what measure is horizontal and what is vertical? Is one of your measures numeric and the other non-numeric and so something like a bar chart makes more sense? You can do this in your head or just sketch it out on scrap paper.  Tableau isn't always the most user-friendly so it's better to have some idea of how you want your graph to be organized.

Creating a Simple Bar Graph in Tableau

Let's start with a simple bar graph to see which genres of movies have the most user reviews. I've done a little thumbnail sketch for myself to think about how I want the graph set up and have decided that the genres will be on the lined up horizontally with the amount of reviews that they have in total determining how tall the bar is vertically. 

  • In the column that says Tables at the side of the screen, click on where it says Genre and drag it over to the Columns container at the top of the screen. This will make it so the data that pertains to each genre is held within its own column. A preliminary form of this chart appears, but we haven't told it what measures for Genre that we want to display in the rows so Tableau is just displaying a placeholder version where the different Genre types are broken out into columns.
  • The kind of numeric data that we want for each of these genres in the place of those placeholder Abc is the No. of Reviews. Click on that item in the Tables box and drag it onto the Rows container at the top of the chart.  
  • On my sheet it automatically sets up the graph as a bar chart with the Sum total of the number of reviews for each genre displayed as how high the bar extends vertically. If your displays as something else, you may have to select the Bar options manually from the dropdown on the Marks card, by clicking on the arrow next to Automatic and finding that option.
  • By default, the graph lists the genres in alphabetical order, rather than in order of how many reviews each had. From a quick look, it appears that Drama movies have the most reviews, followed by Action & Adventure while Television and Classics  are at the bottom, but if you wanted to summarize the findings based on exactly how they were ranked, you might want to sort it in numerical order. 
  • Tableau includes a couple of ways that you can change how a graph is sorted, one is an icon at the top of the interface, another is an icon that shows up when you hover over the vertical axis. I'll choose the one that allows me the most customization,  I'll click on Genre in the Columns container and click on Sort on the menu. 
  • That opens a Sort menu. I'll select that I want to Sort by Field from the dropdown and since I want the Genre bars to be sorted in the order of having the most to the least number of reviews.  I'll make sure that No. of Reviews is selected in the dropdown next to  Field Name and the Sort Order that I choose is Descending. Thankfully this window will preview the changes that you are making on the graph below it, so you can see right off if you are getting what you want or not. There are other options you can choose, one of which is Manual so that if you need to you can customize the order.
  • Use the X at the corner of the window to get out of the Sort window. Instead of being ordered in alphabetical order the bars are now sorted into the most to the least number of reviews from left to right, which can give a better idea on which genres' top 100 movies are reviewed a similar amount of times. 
  •  Now you may have noticed that where it says No. of Reviews in the Rows container it actually has SUM written at the front. This is because the graph is displaying the total number of reviews for all the movies on the list for a given genre, but there are other options you can display here. Click on the Sum(No. of Reviews) slug and by hovering where it says Measure (Sum)  you can see the other options for the numeric data it can display. Try out Median and Average to see how it changes the graph. Since in this case most of the lists have the same number of datapoints, it doesn't change it too too much, but if you had a reason to want to look at the average or median measures for a figure rather than the sum total this is where you would change that. After you poke around, switch back to Sum.
  • There's a large variation between the highest and lowest number of reviews per genre, and sometimes when there's such a discrepancy that can mean you should explore further to see if there's any missing data or duplication going on. For each of the bars (or any of the kinds of marks we'll have on later graphs), you can see more information about the data that makes up that representation on the chart by clicking on that bar. Let's start with Television at the end, since it is the lowest. 
  • It gives you at first the same summary information that you can get by hovering over any of the bars, but to see more, you can click on the series of bars at the far left corner of that stub to View Data
  • It will take you to a summary window, and you'll need to click on the Full Data option at the bottom to see all the data that makes up this bar on the graph. You can scroll through this the same as you would have been able to look through the spreadsheet, but one thing that jumps out right away to me is the fact that it only lists 57 rows at the top, despite the fact that the data that I originally downloaded was for the Top 100 films for each genre.
  • Click out of the window. So now we know that one of the reasons that the Television bar has so many less total reviews is because it included a lot less films. How many of the other genres with smaller numbers of total reviews also have smaller numbers of movies in them than 100? Let's click around and see.
  • By following the same process, clicking on the bars for genres with a smaller number of reviews, and choosing the tab to view their full data, we see that Television isn't alone in having less than 100 movies. Sports & Fitness only has 69 rows, Western only has 86, but Classics despite its smaller number of reviews does in fact have 100 movies on its list. How might we change our graph to reflect the information that not all genres have 100 movies listed for them? In the next section, we'll use a filter to narrow down what we show in our graph, but first let's rename this sheet we've made
  • Double click at the bottom of the page where it says Sheet 1  and change the text to say Genre&Reviews-All
  • The title at the top of the graph will automatically change to match the new name you've given this sheet, but let's change it to something clearer. 
  • Double click at where it says Genre&Reviews-All at the top of the graph and it will open the Edit Title  window. Automatically it is populated with only something that says <Sheet Name> 
  • I'm going to title it Number of Rotten Tomatoes (RT) Reviews for Top RT Movies by Genre and select Apply. The title will change. To note, this is the window that you can use to choose the font, color and size that the graph of your title appears in, but in my case, I'm not going to play with that.

Filtering Your Graph in Tableau

Sometimes you may want to exclude parts of your data from a project. Maybe you're creating a graph where you only want to look at a certain portion of your data to really focus in on it. Maybe some of your data is incomplete and you only want to include the stuff that meets certain criteria so that you're sure you're making fair comparisons. In either scenario, Filters can help you create graphs that only looks at portions of your data. 

  • We're going to create another version of our graph showing the total number of reviews per genre and filter it, but we don't need to complete all the steps to create this graph all over again, we can simply Duplicate it, which will just fully recreate this sheet in a new tab so we can customize it in a new way.
  • Right-click on the tab for this worksheet -  Genre&Reviews-All and choose Duplicate. This will be a process that you will repeat quite a few times throughout these tutorials since it's a really neat way to easily try out different configurations on a graph without either erasing the one you've already made, or having to remake it from scratch.
  • In its tab, rename this sheet Genre&Reviews-Filtered so you can tell it apart from the earlier one. 
  • Since Filters are a way to either make sure to include only certain sections of your data, or exclude certain parts of your data, this is where we decide what belongs on our graph. If we decide that Television films in our graph about movie just don't fit, we can just click on the bar for Television and in the middle of the pop-up window that appears, choose Exclude. 
  • When we do this, two things will happen: one, Television will no longer be one of the genres included in the graph and two, in the Filters box on our chart, Genres will appear. Even though you didn't explicitly create a Filter, by choosing to exclude, you were actually using that tool. In Tableau, there are often multiple ways to accomplish the same thing. 
  • Click on the Genre bar in the Filters box to open the Filter window for this field. It opens on a list containing all the options in the Genre field. Note that the Exclude box is checked, and none of the items on the list are selected, until you scroll down to Television, and you'll see that box is checked and there is a line drawn through the name. With this menu if you were looking to only include or exclude certain Attributes by name you'd be able to do so by scrolling down this list and checking their boxes . A summary of what's been selected is at the bottom of the window. 
  • Next, let's make sure that we're only showing the Genres on this chart that have at least 100 movies in their category so we're comparing apples to apples. If there aren't 100 movies' reviews being totaled up for each of these genres, how do we know if the larger numbers for some of these genres is based on whether they have 100 movies on their list, or if that genre is truly reviewed more often on the site than other genres. One way we could exclude all genres with less than 100 movies, is go back to our data and make a note of all the genres that have less than 100 movies in them, or we could do something much simpler and more foolproof, create a filter to only show a Genre on the chart if it has 100 rows in it. 
  • Select the tab that says Condition in the Filter window.  Choose the radio button next to  By Field:  which will let you only choose certain fields. In the dropdown below it, select the field top_100_movies_by_genres, which is a field generated by Tableau that will counts the number of rows that fall into a particular category. Leave Count and =  selected in their dropdowns and type in 100  as the Count that you'd like. To translate this set of dropdowns into more plain language, you're creating a filter that will only include Genres on the condition that there are 100 rows of it. Select Apply and you'll see the graph change to drop off several of the shorter bars, which must have been for Genres with under 100 movies in them. 
     
  • There are different conditions you can add depending on the field you want it to filter based on, and it doesn't even need to be a measure that you have added to display on your graph.
  • If we wanted to only see the number of reviews for Genres where the minimum on the Tomatometer was less than 80%, you can change the field to be Rating Tomatometer, choose Minimum on the dropdown, select < from the dropdown next to it and type in .8 and you'll see the graph change to only display the bars for Action, Science Fiction, Horror, Western and Sports & Fitness. So you'll know that these are genres that the equivalent of a B- is good enough for Rotten Tomatoes to include it among the Top 100 for the genre. Interestingly this includes both one of the genres with the highest number of reviews, Action & Adventure and one of the ones from the lower side of the spectrum for reviews, Sports & Fitness, so this can't simply be chalked up to being either a genre that doesn't get a lot of critical attention or one where only a small number of reviews are recorded. 
  • Let's switch this chart back to previous configuration where it was showing only genres with 100 rows. Follow the configuration above or just use Ctrl-Z for Undo. Tableau lets you move back extensively using Ctrl-Z so don't worry, if you do something that it turns out will mess up your graph you can go back with a minimum of headache. Retitle this sheet to fit the filter you've put on it: Number of Rotten Tomatoes (RT) Reviews for Top 100 RT Movies by Genre

Please note, filters will only be on the chart you put them on and don't transfer over to any other charts that you make from scratch. You can choose to duplicate a worksheet if you want to carry over your new filters to another worksheet.

Adding More Information to a Graph and Customizing it

in the last section we made a graph comparing top genre films in terms of how many reviews were up on Rotten Tomatoes for each genre and saw how to use filters. 

Now let's create a chart that compares two measures, runtime and number of reviews. We'll also break this out by genre, but we're looking to see if it seems like these two things are related. A hypothesis we might have about these two measures is that movie genres with longer average run times might get less reviews, since they are more of a time commitment to watch, but lets see if the numbers bear that out with the hypothesis. 

  • We're starting from scratch this time so click on Worksheet at the top and choose New Worksheet
  • We want Runtime (in minutes) to be the data that's displayed along the bottom of the chart, so drag that item into the Columns container, and we want No. of Reviews  to be displayed along the vertical axis of the chart so click on that menu and drag it into Rows
  • Let's break those lines out by Genre, and use Color as the way that different Genres are marked on our chart, by clicking on Genre and dragging it on top of Color on the Marks card. Whenever you want a given measure or attribute to be differentiated in a certain way in how it is symbolized on your graph, you'll be dragging and dropping it to the Marks card over the icon for whatever aspect of the graph you want to use to display it, Color or Size or Shape. Or you can choose to add the field's information as a Label or for display on the Tooltip you see when you hover over a mark on the graph. 
  • Speaking of symbology, by default, Tableau gave me a scatterplot where each genre is represented by a different shape. In this instance, I'd rather see this as a bar graph, so I'll click on the dropdown on the Marks card and choose Bar
  • Also, by default, Tableau chose to give us the Sum of each measure, which gives us kind of a nonsense chart, since the sum total of every movie's review and every movie's runtime just gives us a really high measure for both of those items, and since I know that there are some movies that have really long runtimes and some that have really high amounts of reviews from looking at my data, I know a graph with the sum total of each  measure isn't going to tell me a lot about how much the average movie in a genre has been reviewed or how long the average movie is, so I'll want to apply the Average calculation for each measure. 
  • Click on each of these measures in the Columns and Rows holders respectively and switch their Measure  to be Average instead of Sum. The numbers on the axes and the length of the the bars shift to reflect this new dimension. However, the bars are still clustered on one side of the chart which can make it kind of difficult to see where each genre is located since they overlap. Let's take a look at adjusting these axes.
  • Double-click on the bottom axis and it will open the Edit Axis window. This is where you can do things like change the name of the axis, determine the range that it runs and where the tick marks are to tell the viewer what they are looking at.
  • One of the very first things to clue us into why the bottom axis looks so crowded, is that the box next to Include Zero is checked. This means that where the axis for the runtime starts is 0. Now in some cases you might want your graph to start at 0 if the data you are covering occasionally would include a 0 but in this case, since what is being graphed is the average runtime of a movie...and a movie cannot have a negative runtime, 0 would not be represented. So uncheck this box, and you'll see the graph shift. 
  • The axis not starts with 95, and it's a little easier to see what the differences between each genre is, though there is still some overlap. Just to get the numbers a bit more orderly though, let's change the numbers for the start and end of this graph to give a small buffer.  Currently the Range is set to Automatic. Click on the button next to Fixed, and pick 89 as the Fixed Start and 130  as the end. 
  • It's a little easier to see how the runtimes are distributed, but let's make the graph a bit wider. Currently, it's only taking up part of the screen and that's making the bars overlap quite a lot. At the top of the interface is a little dropdown that says Standard, click on it and choose Entire View and then the graph will stretch to take up the entire canvas and it's easier to see how the different runtimes and number of reviews are distributed. 
  • Since what is being measured at the bottom of the graph is the runtime in minutes, any reader will have to do the mental calculations of whether these measurements are an hour and a half or two hours since we can't change the way the numbers are formatted to display in hours and minutes. But what we can do is Add a Reference Line
  • Reference Lines allow you to put a line jutting out from one of your axes indicating a certain value. In this case we'll be adding it to show when an hour and a half and when two hours falls on the graph, but it can have all kinds of uses. Maybe you're looking at recorded temperatures for a given month and want to add in indicators of where the high, median and minimum fell in a different decade. Maybe you are making a series of graphs to see how men vs. women are represented in the artists that different museums have had on display over time and you'd like to put a reference line at 50% across all the graphs to better be able to see how close each have come to an even divide.
  • In this case you'll be adding it as a reference for people to use to see how close these times are to a more general understanding of how long a movie is. Right-click on the bottom axis and choose Add Reference Line from the menu. 
  • Leave selected that you want to make a Line from the menu at top. In the dropdown next to Value choose Create a New Parameter. Create Parameter window will pop up. For Name, write in An Hour and A Half, since that's the text you'll want to appear next to the reference line that you are adding. For Current Value, write in 90 since that's the number of minutes that make up an hour and a half. 
  • Once you select OK, you'll see a gray line appear on your graph. Repeat the process and create a new parameter named 2 Hours that is set to 120. 
  • Let's name this graph and change its title to match what it is now showing us. The name I'll choose for the tab is AvgTime&ReviewByGenre and at the top I'll title it Average Runtime and Number of Reviews by Genre.

From a look at this graph, it doesn't look like my hypothesis was verified. There are higher and lower bars (indicating a higher and lower number of average number of reviews) dotted along the horizontal axis that indicates how long these movies are on average. After all three of the genres with the highest numbers of reviews are all clustered near or past the two hour mark. It doesn't really indicate an inverse relationship (where longer runtimes means a larger number of reviews), since there are numerous places where movies with similar average runtimes have very different numbers of average reviews. Horror movies and Kids & Family films have an average runtime that differs by less than 1 minute, but their average number of reviews differ by about 20%. But an average can be dragged up or down by outliers, so let's take a more in depth look at how exactly those runtimes are distributed within a genre with a Box and Whisker Plot. 

Creating a Box and Whisker Plot

The worksheets we've created so far have looked as a genre's measure as a whole, whether the measure being graphed was its sum or the average for all movies in a genre. They haven't displayed individual movies within those genre's lists' measures, so while this last graph showed averages, we don't know if there are outlier films with very short or very long runtimes that are dragging that average up or down. A Box and Whisker Plot is one of the possible graphs you can use to have a data point visualized for each of the individual items that make up your dataset (in this case it will be a movie, but it could just as easily be a person, a temperature reading, a book, a year's population data for a city, whatever individual row of information you have in a given category).  A Box and Whisker Plot is good for when you want to see what the distribution is of a different measure within a certain category, in this instance what the runtimes for movies are in a given genre list. Are they clustered together around the same median point, or are they spread out more evenly? We'll also be showing another display option you can use on your Marks card to add more information on your graph.

  • Start by opening a new worksheet by going to Worksheet from the menu up top and choosing New Worksheet (you can also do this with the icon at the bottom that looks like a worksheet with a little plus next to it, or by using Ctrl+M)
  • Click on Genre and, like in our first graph, since we want this to be the measure that appears along the horizontal axis, drag it into the Columns container
  • Click on Runtime and drag it into the Rows container, since we want this the measure to be displayed on the vertical axis
  • Right now, it creates a bar graph, since it's displaying the runtime as an aggregate sum for each genre. We want each individual title's runtime to be displayed, so let's click where it says Title and drag it onto Detail on the Marks card, so the bars on this graph will be made up of individual segments with the details for each Title
  • Each of these segments is of different sizes depending on the runtime of the movie and some are larger than others. You can hover over any of the segments and it will tell you the movie's title and runtime. The movies are in alphabetical order rather than in order of how long they are so some differences are already pretty visible with different thicknesses appearing next to each other as shorter and longer movies are nested on top of each other. 
  • Let's convert this to a Box and Whisker Plot by going to the Show Me menu up at the top of the screen and clicking to open it. We haven't explored this menu that much, but it's a way that you can choose different kinds of graphs to display your data. The option you are currently using has a red box around it. Any option that is highlighted is one that will work with the data that you have already added to your worksheet. Any that are greyed out are ones that doesn't work with your current data you've added but if you hover it, it will tell you how many of each data type you'd need to use it. We'll be selecting box-and-whisker-plots, which is the second from the bottom on the right hand side of the options.
     
     
  • When you click on it you'll see that the graph instantly changes to use the box-and-whisker mechanism rather than the bar graphs.
  • We'll be changing some aspects of this graph to display more information, but the basic mechanics are: each dot represents a movie, and how high or low it is placed depends on its runtime.
    • The box (the two-tone grey rectangle) of each plot represents the movies that occur around the median or  the IQR (Inter-Quartile Range) 
    • The darker grey section encompasses the 25% of the movies that are shorter than the median runtime, and the lighter grey has 25% of the movies that are longer than the median runtime.  
    • The whiskers (the long black vertical lines that extend from the box and end in a horizontal line) mark all the movies that have runtimes that are within 1.5 times the span of the box.  This goes both higher or lower than the outer edges of the box 
    • Any dots that occur below or above the marks are outliers. 
  • For example, for Kids & Family the median runtime was 100 minutes, so that's the line in the middle of the box. 
    • 25% of the movies with longer runtimes than 100 minutes had a runtimes between 100 minutes and 105.5 minutes so the upper hinge of the box stopped there.
    • 25% of movies with shorter runtimes than the median had runtimes under 100 minutes but at least 90 minutes, so the lower hinge of the box stopped at 90 minutes.
    • The box has 50% of all the movies in it, and the difference between the shortest and longest of them is 15.5 minutes. 1.5 times 15.5 is 23.25 so the upper whisker stops at the movie closest to 128.75 (105.5 + 23.25) minutes at the top (in this case Hugo at 126 minutes) and  the lower whisker stops at the movie closest to 66.75 ( 90 - 23.25) minutes at the bottom (in this case Bambi which is 70 minutes).
  • What that means to you non-math folks (like me)  is movies where that box is shorter, have a denser distribution of runtimes where most runtimes are near the median. They also have smaller boxes and smaller whiskers. Look at the plot for Animation to see an example of this. It has a lower median runtime than nearly any other genre, and its runtimes are so tightly distributed that its upper whisker is below many of the genre's boxes. Meaning the movies on the longer than usual side for runtime for Animation are lower than the high end of the usual range for most of the other genres. Animation has a smaller box where both the top and bottom seem nearly equally distributed, which seems to indicate a consistent duration for films on the Animation top 100 list. 
  • On the other side, of it, you can take a look at the movies for Classics. In this one the higher end (light grey) of the box  containing the usual range for this genre extends much farther than the lower (dark grey) end, indicating that 25% of the remaining movies that are above the median runtime for Classics of 104 minutes have a much wider variation (up to 125 minutes!) before they hit 25% of the movies over 104 minutes, than ones that have a runtime under 104 minutes.
  • Right now this graph is telling us about how runtimes are distributed within genre, but not a lot else. Let's add more information to it. One thing we haven't dealt with yet is how the movies are ranked. Like are highly ranked Horror movies, ones that are on the shorter side of runtimes or are they the ones on the longer side? Or does runtime not matter at all? We don't know from looking at this graph, so let's add symbolism to show what the rank is for each of the movies on this graph is. 
  • Click on Rank and drag it on top of Size on the Marks card, and you'll see each of the dots become different sizes, but they don't seem to be dramatically different which can make it hard to tell exactly what a given range of dots' rank is. Also when you look over at the legend on the left-hand side of the chart you'll see something that should give you pause, which is movies that have 1 as their rank are the smallest dots on the chart, while the ones that are ranked as 100 (i.e. nearly didn't make the chart) are the largest dot. That is a system that would confuse the casual viewer of the chart who'd probably think that a larger dot would correspond with a larger importance to Rotten Tomatoes's reviewers. We're going to need to do some editing to fix this. 
  • Select the arrow on the side of the legend marked SUM(Rank) and choose Edit Sizes. In the Edit Sizes window that appears, switch the dropdown for Sizes vary:  to select By range and choose the check box for Reversed so that it's actually the dots on the higher side of the spectrum that are the smallest.  Next, drag the right arrow on the range in Mark size range  bar over so that in the Sample legend: on the side the dot for 1 becomes markedly larger than the one for 100. Choose Apply and you'll see the graph shift. To be noted, it is the center of each of these dots that indicates what its runtime is. 
  • It can be hard to see the individual differences between the dots, especially now that they are bigger. Click on Color on the Marks card and on the Border dropdown choose the black color swatch, which will put a black border around each of the dots. 
  • Just to make this whole thing easier to read, go to the dropdown at the top of the screen that says Standard and switch it to say Entire View so that the graph takes up the entire work canvas. 
  • Looking at the graph with this new information added, we can now see if there are any trends in the runtimes of the the higher and lower ranked movies within the genre. For instance, in Horror where there is a very tight distribution around the median, there are higher ranked movies on either side of it, though medium ranked movies right on it, while in Musical and Performing, with a larger range of movies in its middle 50%, a lot of the movies near the median are the higher-ranked movies. 
  • It's also interesting to see how rankings fare amongst the outliers. For instance in Action & Adventure, the outliers are all highly ranked and all on the longer side of the runtime scale, whereas, for Kids & Family, the two outliers on the shorter runtime side are both highly ranked, but the ones on the higher side of the runtime scale are all lower-ranked. 
  • On our last graph, we were also looking at the No. of Reviews associated with a title to see if there was any kind of relationship between the number of reviews a movie had on the site and its runtime, and while we were dealing with the average of both there didn't seem to be a relationship, but at the risk of making our graph too busy, let's see what adding No. of Reviews to our Marks card yields. Let's add it to Color. 
  • Click on No. of Reviews and drag it on top of Color on the Marks card and you'll see the colors of the dots shift. A legend appears on the side letting us know that dots that are a darker blue have a higher total number of reviews and dots that are a lighter blue have a lower total number of reviews, and I'm afraid that this seems to be the last straw for my hypothesis of shorter runtime = more reviews, at least in some of the genres. The plot for for Science Fiction & Fantasy seems to be especially dismissive of that hypothesis, it's actually the movies on the higher section of the runtime that have the higher number of reviews. The new display also seems to make clear that some genres (even when drilled down to the individual movie level, rather than as a genre-wide average or sum) have much fewer reviews  on Rotten Tomatoes than others. 
  • This graph has the same issue as the previous one however, in that the runtime is listed in minutes, making the user do the calculation themselves on how many hours it is. Fortunately the parameters for  Reference Lines that we created on the last graph are ones we can easily add to this one. Right click on the axis for Runtime (in minutes) and choose Add Reference Line. Add the reference lines you've already created for An Hour and a Half and Two Hours and since this range is a lot wider than the one on the previous graph, add new parameters for An Hour at 60 and one for Two and a Half Hours at 150
  • Adding these parameters makes it a bit easier to eyeball that most genres have median runtimes somewhere between an hour and a half and two hours, with the exception of Action & Adventure and Drama. It also makes it easier to notice that only one genre, Classics has a whisker that extends below the hour mark, and only a few have more than one film that falls below the hour mark.  Think on what else you notice about different genres from this graph. Next, we'll be creating a scatter plot that lets us more directly compare how two of these measures, runtimes and reviews, relate. 
  • Before we create a new tab, let's rename this one at the bottom to BoxAndWhiskerRuntime and change its title at the top to be Distribution of Runtimes by genre, since that's the primary measure that it is showing. It's good practice to make sure that all your tabs have informative names and titles, so as you click through a workbook, you aren't faced with a sea of tabs at the bottom that just says Sheet1, Sheet4 and leaves you not sure what sheet is where. 

Creating a Scatter Plot

A scatter plot is a graph that shows where an individual datapoint appears in regard to two different measures. In this case, just like on the graph we just made, we'll be making the movies (or more specifically the Title) our datapoints, but we'll make the two axes the Runtime and the No. Of Reviews. So for each movie, the higher it is on the vertical axis, the higher the number of reviews it has, and the further it is to the right on the X axis, the longer runtime it has. If we see any clustering or any clear trend, then that will tell us once and for all whether there seems to be any relationship between these two things. 

  • Start by opening a new worksheet by going to Worksheet from the menu up top and choosing New Worksheet.
  • Since we want the measure across the X axis to be Runtime (in minutes), click on it and drag it into the Columns container in the top of the sheet. It says Sum and is aggregating all of the movies, but don't worry, we'll deal with that in a few steps. 
  • No. of Reviews is the measure that we want to have as our Y axis, so click and drag that into the Rows container at the top of the sheet. . This creates a weird situation where only one dot appears in the top right of the sheet, but we're about to fix that. 
  • Just like with the box-and-whisker plot, in order to get a datapoint for each individual movie, we'll click on Title and drag it on top of the Details option on the Marks card. You'll see a cluster at the bottom left of the sheet, but the runtimes look mighty strange, stretching to over 800 minutes, which we hadn't seen on the other sheet at all. Click on the furthermost circle, Seven Samurai, click to View Data, choose Full Data  and you'll see why.... the movie is on four different Genre lists, so the number for its reviews and its runtime has been multiplied by 4. 
  • Let's scroll through this data to see what fields we can use differentiate between different instances of this movie appearing in our data. The only unique fields for each row about this movie (since it's literally the same movie it has the same Id, Production Country, Release date, Rating Tomatometer and so on) are either Rank or Genre. Close the View Data window. 
  • The primary difference I'll choose is the one we've been including on all our sheets so far, Genre. This is the surefire way to make sure that there's a unique datapoint for the movie for each time that it appears on the chart, because, well, it's only going to be on each Top 100 of a genre list one time. Rank might also work but it isn't impossible that a very well-liked movie that fits into multiple genres could be say, #5 on more than one genre's list. 
  • Click on Genre and drag it onto the Colors box on the Marks  card to tell it to make each of these dots as a different color based on the Genre that it falls into. You'll see that the graph shifts dramatically, with the highest concentration of movies being somewhere between 80 and 120 minutes, and being between 40 and 70 reviews. But it is very densely packed, so it can be hard to see what exactly is going on. 
  • First we'll try to help this by changing the viewpoint of the graph so it takes up the whole screen. Click on the Standard dropdown at the top and choose Entire View to expand the portion of the screen that the graph takes up. This helps somewhat, but if we scroll over to find our test case from last time, Seven Samurai (located just under 210 minutes, and a bit under 100 reviews), we'll see that its kind of a muddy blue-ish gold color. That's because since it's located on several lists, the colors signifying the genres whose lists Seven Samurai is on are just set up on top of each other, oops. 
  • From our look at that movie we know that the other field we can almost certainly count on to be unique about each row for a film is its Rank,  so click on Rank and like we did on the last graph, drag that on top of the Size box on the Marks card, and the graph shifts again. But in the legend on the side, it's been set up to the same default as before where 1 is the smallest dot and 100 is the largest, so clearly we'll want to reverse that. 
  • Click on the arrow on the SUM (Rank) legend and choose Edit Sizes. Click on the dropdown next to Sizes vary: and choose By range. Next, click on the box that says Reversed and choose Apply and the graph will change to reflect this. 
  • Let's also add the reference lines for different runtimes that we made to better help us what different time ranges we are talking about here. Right-click on the bottom axis and choose Add Reference Line, to add the reference lines for An Hour, An Hour and A Half, Two and a Half Hours and Two Hours. 
  • By looking at our test case again, Seven Samurai, we can now see that it's an interlocking series of rings of different sizes that looks kind of like a target. In fact, there are a bunch of items that look like that throughout the graph. By hovering on them we can see that a lot of them are in a similar situation to Seven Samurai where they're representing one movie that occupies a very different spot on multiple lists, but some are ranked so close to the same on their different lists that it can be hard to see what genres make up the circle. Similarly, some are just different movies that have a similar enough runtime and number of reviews, and a lower rank, like the mess near 70 minutes and around 50 reviews that comprises multiple movies.
  • However, before we figure out another configuration let's see what we can tell about our data from this graph. Conversely to what I was beginning to think about there not really being any relationship between a movies runtime and its number of reviews, I believe I can notice a few consistencies. 
    • The densest configuration of runtimes is that between an hour and a half and two hours, and there are movies with runtimes falling within this all up and down the possible numbers of reviews
    • When runtimes movies go much past two hours, the number of reviews clusters at the top and bottom of the scale.
    • Movies also seem to be more likely to have fewer than 100 reviews when they go significantly below around 80 minutes and to a lesser extent, past two and a half hours. 
    • What have you noticed from this graph?
  • Let's move on to the next graph, but first, let's name the graph in its tab and change its title. Let's name the tab ScatterRuntimeReviews and make the title Runtime and Number of Reviews by Genre

As neat as it looks to have all our genres on the same graph (seriously, it looks like some 1960s shower curtain!), it isn't really giving us as clear of a picture as we need to make much more observations than that, so we have a few options we can exercise from here. One is creating a filtered version of this graph. The other is to use the Pages function so that we can separate the graphs out by genre and look at them one at a time.

Making a Filtered Scatterplot

Let's use the duplicate function to keep this original graph untouched and make one that only includes the data for the movies on the Horror, Romance and Science Fiction & Fantasy lists. Maybe this is for a project looking at these popular genres that aren't always taken seriously. Maybe I'm just curious because I would think that these three wouldn't have a lot of overlap and I want to see if that's true.

  • Right-click on the tab for this sheet ScatterRuntimeReviews and choose Duplicate. 
  • Rename the new sheet that appears as a result ScatterRomHorrScifi, and for the title at the top change it to Runtime and Number of Reviews for Horror, Science Fiction and Romance
  • Click on Genre on the Tables section at the side and drag it onto the Filters card
  • In the Filter window that opens, choose to select None, and then check the boxes for Horror, Romance and Science Fiction & Fantasy. Click Apply and you'll see the graph change to only show those options.
  • Take a look at the results to see what more you can notice now that it's been narrowed down to only these three genres. For one, it does look as though there is a little overlap.
    • There are movies on both Science Fiction & Fantasy and Romance (Groundhog Day,  and Shape of Water for two) lists, and a movies on both Horror and Science Fiction & Fantasy lists (Aliens and Bride of Frankenstein both being examples of that).
    • But I don't think that I found any overlap of the Romance and Horror lists. 
  • In terms of runtime it looks like Horror films are mostly concentrated on the lower side of the scale, with only one going past the two and a half hour mark, while both Romance and especially Science Fiction & Fantasy still have relatively numerous and highly ranked movies past the two hour mark and in the case of Science Fiction & Fantasy even a handful of highly reviewed and ranked movies past 2 1/2 hours.
  • Speaking of those reviews, a glance shows that there are more higher ranked Romance movies in the lower reviewed section, and to a lesser extent, Horror as well, whereas for Science Fiction a lot of the larger dots indicating a higher rank are in the top half of the graph, indicating they have a higher number of reviews. Let's do another filter to see if this is accurate by filtering this graph further to only show the top 25 movies from each list. 
  • Click on Rank and drag it onto the Filter card. You'll see that the window that appears looks different, since in this case you're filtering based on something that is a measure (or quantitative data) rather than a dimension (or qualitative data). Click where it says All Values and choose next. 
  • Leave Range of Values highlighted at the top, since you are going to tell it to look at only a certain range. In the two boxes 1 is in the left hand box, and 100 in the right. Change 100 to instead say 25, and select Apply. The graph will change again to only show that smaller amount of information. Additionally, the size configuration also changes, to only have the sizes go from 1-25 instead of from 1-100. 
  • Actually in this one that size for 25 is so small that I'll actually go into Edit Size to tweak it upwards. Feel free to do the same with the process you used to edit the size range originally. 
  • Now what can we see from this revised graph that only shows the top quarter of each of these lists?
    • Well, the surmise that top-ranked Science Fiction & Fantasy movies are also the highly reviewed ones seems to be borne out (If you'll look at the names of the movies by hovering over the dot, you'll notice that nearly all are comic-book movies, or sequels to legacy sci-fi movies) Less than a quarter of the top 25 sci-fi movies appear to be in the lower third of the graph, indicating a small number of reviews. 
    • For Horror, the opposite seems to be true, there are only 2 movies in the top third of the graph, and only one for RomanceRomance especially seems to be mostly in the lower third of the graph in terms of reviews. 
  • Let's change the title of our graph to reflect what's actually in it,  Runtime and Number of Reviews for the Top 25 Horror, Science Fiction and Romance, and move on to our next exercise. 

While Filters can be good to let you look at a smaller sections of your data, that might not be what you want to do in all situations. Instead of comparing genres directly to each other, with all the overlap that can create, maybe we just want to look at each genre in their entirety one at a time. Now, we could create a separate graph just for each genre, but that would be very tedious. Alternately, we can use the Pages function as we will in the next section to do this in just a few clicks. 

Using the Pages Function

Pages lets you create multiple views within the same worksheet, that you can move through using left and right arrows, just like if you were looking through pages in a book. We are going to set this up so that we can look at each genre on our scatterplot graph one at a time. Let's start from previous scatterplot that contains all the genres.

  • Right-click on ScatterRuntimeReviews and choose Duplicate. Immediately change its title in the tab to add - Pages to the end so you'll be able to tell the difference when looking through tabs.
  • We'll want to be setting up the Pages to each contain a different Genre's data so click on Genre in the table section and drag it over to the Pages card. 
  • Immediately, most of the circles for the movies disappear, except for those for Action & Adventure . A little box marked Genre with right and left arrows next to a dropdown appears the the bottom right of the screen. Click on the right arrow, and the graph will shift to instead display the next option for Genre, Animation. This box also has a little slider that you can use to move through the different genres manually, or you can use the play button just below the slider for it to play through all of the genres with a pause at each.
  • When you are only comparing a genre to itself, what interesting things can you see about them 
    • I'm still finding it interesting that a significant segment of the Horror list seems to not have many reviews, and large segment of it has below a two hour runtime. 
    • Western also resides mostly within the lower-reviewed section of the graph but its runtimes seem to be shifted over a half hour longer than Horror.
    • For several of the genres (Mystery, Drama, Science Fiction) there seems to be a cluster on the low-end of the review spectrum, and another cluster at top but with comparatively few movies in that middle 200-300 reviews range, while in other genres the movies are either lower reviewed in general, or the amount of reviews a movie got is distributed more evenly like Animation
    • What do you notice?

The Pages function can be used to sort things by a particular dimension, but also you can use it with a filter on it to actually only capture a sub-set of a given dimension. 

Adding a Filter to Pages

Remember how we noticed that some of the datapoints for movies looked like little bullseyes because of how many lists the movies were on? Let's create a pages function to only show the movies that appear on a certain number of lists and page through them. I'm curious to see which movies had these cross-genre appeal, and what genres they were in. 

  • Let's continue starting from the graph where we've already set up with the colors and sizes how we like them, and go back over to ScatterRuntimeReviews, right-click on it and choose Duplicate
  • Click on the tab of this worksheet and change its name to MultipleListMovies so that we don't lose track of what each of these tabs contains. 
  • Since we want to create a different Page for each of the movies that's on multiple lists, click on Title on the Tables list on the side and drag it onto the Pages card.
  • Nearly all dots will disappear because it's just literally created a page for every single movie. Click on the Title box in the Pages card and click on the tab that says Condition. Remember, this  Condition option is what you used before to only look at a Genre if it met the condition of having 100 rows in their list? So this is where you'll use to only include a Title if it meets the condition of being on 4 or more Genre lists. 
  • Click on the dot next to By field and choose the dropdown option for Genre because that's what you want it to look at, Count because you want it to count how many Genres a given Title appears on the list for. Select >= from the dropdown, and type in 4, since you only want it to include in a Title if there are at least 4 Genres it's on the list for. 
  • Select Apply.  A box marked Title with left and right arrows for you to use to maneuver through the Pages that Tableau has created will appear on the bottom right hand margin. There will be a slider to let you scroll through each of the titles that match this condition and a dropdown to let you look through the list. In mine, the first title that appears is 101 Dalmations, and the graph now only contains one interlocked circle. Select Okay.
  • We've solved the problem of having a lot of different interlocking circles on one map, however when we move through the Pages some of the genres that a title can fall into are still hard to determine as its rank might be similar on each list it is on.
  • Let's think about what the information that we want for each of the movies is. When creating our previous version of the graph, we noted that the only way we could differentiate between different instances of the same movie was Genre and Rank. Currently our graph has one axis set up to be the Runtime which of course...is going to be the same for each title regardless of the list its on, so let's replace that axis with Rank
  • Click on SUM(Runtime(in minutes)) in the Columns container and choose that you want to remove it. 
  • Replace it with Rank by clicking on that measure and dragging it into the Columns container
  • Next, since Genre is the other thing that will differentiate between each instance of a movie on its 4 or more lists, remove SUM(no. of Reviews) from the Rows container and replace it with Genre
  • Personally, I think it makes more sense for Genre to continue to be on the bottom as we've done on previous, so I'll swap the columns and rows with the icon at the top just to the left of the ones that sort axes.
  • I'll also switch it back to Circle using the dropdown on the Marks card because for me, Tableau switched to a bar graph for some reason, and since I think that provides a misleading picture, as a longer bar is actually going to be associated with a lower ranked film. 
  • The small dots look a little small for me, so I'll choose Edit Sizes and tweak the button on the Smallest end of the scale so that it makes the ones that are registered even as far down as 100 easily visible. 
  • Once you've made these alterations use the arrows on the Title box to flip through the movies on 4 or more lists. What do you notice? 
  • I've taken note of the fact that the graph for some movies consists of a lot of large dots towards the bottom of the screen indicating that it is ranked highly on all of the lists that it is on, whereas for other movies like Groundhog Day, it is ranked from 7 all the way to 62, depending on what list it is on. 
  • Take note and be constantly thinking about if your graph is actually displaying the information in an easily understandable manner. I think it makes sense to have both the size of the dot representing the movie, and its placement on the graph represent what rank it is on that particular list, but maybe you decide that for your reader, that would be too confusing. 
  • Change the title of the graph at the top to reflect what it now shows which is Ranks for Movies on 4 or More Lists

Saving Your Work

Now it has come time to save your work.

  • Go to File and choose Save to Tableau Public as
  • A pop-up window will appear asking you to login to the the account that you set up at the beginning of this tutorial. Log in to the account
  • A window will pop up giving you the caveat that you should make sure that this graph doesn't contain private or confidential information as it's going to be on the Tableau Public website and asking you what you want to name it. For this data the bit about it not being private doesn't matter, but note that if you are dealing with any private data, you'll want to either use a different program, or spring for one of the other Tableau options that is not free, like Desktop or Server. 
  • Give the workbook a name, I'll name it RottenTomatoesTopMovies-Lesson1 and choose Save.
  • It will process the request and then when it is done, it will open the workbook you've made in a new tab of whatever web browser you have open. Whenever you save your work a new version of this window will appear with whatever changes you have made. 
  • To note, it will only display the actual worksheet that you have open in your web browser so if you want to see one of the different ones in the Tableau Public websites, you'll need to click into that worksheet and choose Save to Tableau Public. This seems to be a more recent change, so though it is cumbersome, maybe it is just a temporary bug. 
  • If you want to use an image of one of the graphs that you've made in a paper or slide presentation, you can either take a screenshot or use the Download  option, which is the third icon from the right of an arrow in a box on the website, and choose to download as an image or PDF or Powerpoint, whatever format will help you. Note that all of these things are going to be static, however, and not be interactive. 

Here's the link to my workbook that you can go to and download my workbook I made in this tutorial if you got tripped up at any point and want to see how I made it work. In fact, a nice thing about Tableau is that often you can download existing graphs that you like and in a lot of cases see how people whose graphs you'd like to emulate, set up their graphs. Check out their gallery for examples of the work that some people have done (of course, keep in mind that some of them might be using Tableau Desktop or one of the other paid options, so you might not be able to duplicate every neat graph that you find).

Click to the next tab to work more with Tableau Public, but this next lesson will be focused on graphing data where time is one of the dimensions.

Lesson Goals

In the previous tab of the tutorial, we made some basic graphs with the statistics of data about Rotten Tomatoes' top-ranked movies in different genre and we stuck to looking at those movies' rank, runtime and number of reviews. Now we'll explore different ways to look at the release dates of these films, using a variety of different methods and graphs. 

Dates are a little strange in Tableau as they can be treated in more than one way, either as according to its values(quantitatively) or its parts (qualitatively). Think of it as the difference between wanting the graph to consider the year 2000 as a numeric measure that occurs between the year 1999 and the year 2001 or as a category that a movie can fall into, similar to its genre or production country. 

Getting Started

  • If you have skipped to this tab and didn't do the first section of the exercise, follow the below instructions.
    • Go to https://public.tableau.com/en-us/s/download and download Tableau Public. After it has finished downloading, install it. 
    • Visit the Sign Up page and create a login for yourself, and make sure that the password is something that you will remember. You must create a login as you will be logging into Tableau every time you save your work for the site when you are using Tableau Public. 
    • Download the below excel file, RottenTomatoesTop100Movies and open it in Excel or in Google Sheets. Save it somewhere you can find it again easily, since you'll be opening it again shortly in Tableau.
    • Go back to the previous tab and read the Getting Started section regarding what this excel file contains and the Worksheet Interface section in the Different Kinds of Charts lesson, then come back here

Connecting to Your Data in Tableau

  • Open Tableau Public from the start menu. It will open up on a menu asking you what file you want to connect to. The file that you connect to is the the file whose data Tableau will be using to create your worksheets. Chose Microsoft Excel under the options for Connect to a File.
  • In the Explorer Window, navigate to where you have the RottenTomatoesTop100Movies sheet, and select that you want to open it. 
  • Take a look at the information underneath Fields and in the preview section to make sure it makes sense for what you expect, that it's the correct data type. In this case, it should be all set. 
  • Click on the Sheet 1 tab at the bottom to open a new sheet in Tableau

Graphing Dates as Values or Parts

To see what it means that in Tableau, all dates can be graphed based on what they signify as a numerical value and what each of their parts signifies as a category, let's build a graph or two. 

  • We'll start with one where the release dates of the movies on our list as the horizontal axis (bottom of the graph) and the number of movies on lists that are from that year as the the vertical axis (side of the graph). As we established in the last section, since we want Release Date to be the horizontal axis, we'll drag that into the Columns section. The measure corresponds with the number of rows is top_100_movies_by_genre, so drag that into the Rows section.
  • A bar graph will appear, and it stretches past what we can see without using the bar at the bottom to scroll, so go to the top of the work sheet, click on the dropdown that says Standard and switch it to Entire View. The graph will contract so that the entire date range is contained in the canvas without the need to scroll.
  • From this diagram, it looks like the amount of movies from a given year on the various best of genre lists started climbing pretty precipitously upward beginning in around 1999. Let's use this pretty basic graph set up to see what selecting different Date options 
  • At the top of the graph YEAR (Release Date) is colored blue, meaning it is a dimension that deals with a date as its component parts as categories, not as its as values. To get an idea what that means, click on the Sort icon (the pyramid with the downward facing arrow in the middle of the icons at the top of the sheet) and see the change that happens on the graph.
  • Instead of the years on the axis being organized in alphabetical order, the bottom row is now organized from least amount of movies in a year to most, similar to when you used this sort function when the column field contained Genre, a text field.  
  • To see how this would be different if Year was treated as a value, click on the line that says Year in the Columns container and see all the different options for how to display it. Note it appears almost like there are duplicates since Year, Quarter, Month and Day are in both lists, but at a closer look you'll see that in the bottom section, the measures have a year attached to them, indicated they're tied to numeric value of the date, and in the top section, they are not, indicating it's the part of the date that is being measured, so just May, any year's May as opposed to May 2015.
  • But to see more specifically what this means, click on Year in the second set of options, to see what happens to this graph when you switch it from visualizing the Year of a Release Date as a value instead of a part.
  • The graph switches back to a line graph that shows the years in chronological order, and the option to Sort is now greyed out.
  • There isn't a good or bad way to decide to show a date, it all matters on what it is that you are trying to show. In this last case, if you'd wanted to be able to state for your paper or presentation what the top 10 years were for appearing on best of movie genre lists, you'd want that first graph that we made where the year was treated as qualitative data, so you could just quickly sort it and pull that information. If you wanted to be able to display how much more frequently movies from the last twenty years appear on these lists and what fluctuations there were over time previous to that, then this second graph is what you'd like to show.
  • But let's drill down to a more granular level to get a clearer idea of what this distinction between values and parts mean. First, as we've been doing, let's first rename and title this tab. Let's rename the tab by double-clicking where it says Sheet1 and changing it to Movies Per Year, and then doubleclicking on that when it appears at the top of the sheet and changing it in the Edit Title window to Number of Top 100 Movies Per Year
  • Now, let's just duplicate this worksheet since it's going to be easier than setting it back up from scratch, right-click on MoviesPerYear and choose Duplicate. Immediately rename this to MoviesPerQuarter in the tab and Number of Top 100 Movies Per Quarter at the top.
  • A shortcut that you can use when dealing with dates as a value to move down a level, from year, to quarter, to month, to week, is to click on the little plus sign either on the left side of the axis or on the YEAR(Release Date) box in the Columns container. Click on this, and you'll see the name in the box change to QUARTER and the graph get a lot spikier. When you hover over each of the points, you'll see that this is because it's not just summing up the number of movies in each period just based on a whole year any more but dividing it into which quarter of the year, i.e. 1988 Q2 which you'll see is as high as anything on the chart until 2008.
  • Remember though, this is the quarter being treated as a value, so it's being shown chronologically. If you wanted to know whether overall movies released in Q2 or Q3 had more of a lasting legacy than say, movies released in Q1, you'd want to treat quarter as a part of a year...any year. 
  • Click on Quarter(Release Date) in the Columns container. Note that currently the second option for Quarter is selected, click instead to select the one in the top section that doesn't have a year next to it., that will switch the color of it in the Columns container from green to blue.
  • The graph now swaps so that it only has 4 options at the bottom total instead of 4 (or less) points per year, Q1, Q2, Q3, Q4. It looks like it is Q2 and Q4 that contain the most movies, but let's take this down further, after all, both Memorial Day and Fourth of July Weekend are huge days for movie releases and they are in different quarters, so maybe I'm splitting up summer blockbuster season all wrong by looking at this with quarters. 
  • Let's duplicate our first graph again and look at our data per month. Right-click on the tab for MoviesPerYear and choose Duplicate. Immediately change the title of the tab to MoviesPerMonth
  • Click twice on the + sign on YEAR(Release Date) in the Columns container to see what that graph looks like. The lines get a lot spikier and contain a lot of peaks since now it's got (around) 12 separate datapoints each year. Let's switch it to Month as a part rather than a value to see which months are the most popular release dates for these top-ranked films regardless of the year they were released in. 
  • Click on the MONTH(Release Date) line in the Columns container and click on Month in the parts section, the one on top section that doesn't have a year next to it. The color will switch to blue in the Columns container and the graph will switch to only show 12 different sections that the graph can fall into, matching with each month. Switch the title to Number of Top 100 Movies Per Month
  • It does seem like the highest numbers for Q2 come from June and the highest numbers from Q4 on the last graph come from October. That makes me wonder if a lot of the October dates are from movies with a Genre of Horror and maybe the June ones are from Action & Adventure movies? Let's see!
  • Click on the dropdown on the Marks card and switch the graph type to Bar.
  • Click on Genre in the Tables section and drag it onto the Color box on the Marks card and now the bars are multi-colored with each different section representing a genre. At a glance, it does seem that Horror movies are indeed the genre with the highest number in October, but that Kids & Family, Suspense and Action & Adventure films on these lists also have that release month very often. It does however look like my hypothesis that the peak in June was fueled by Action & Adventure movies was wrong as Animation and Kids & Family seem to make up a much larger portion of the bar. 
  • You can check which genres have the highest quantity for a month by clicking on that month in the axis and choosing the Sort option to have it sorted so that the highest quantities are at the top or the bottom. I'll choose bottom. 
  • Note though that this will change how ALL of the bars in the graph are sorted, so you are probably only going to want to do this if there's a particular reason you want that month to sort as a template. By default the order that the dimensions appear on a bar legend and graph are alphabetical, but you can use the sort option, or drag and drop them manually on the Legends card if you have a custom order in mind.  
  • By sorting,  I'll see that contrary to what I would have thought for February, it's movies from the Comedy and Science Fiction & Fantasy that are the most represented from these lists with February release dates, not Romance as I would have guessed. Maybe people fondly remember Comedies they saw on Valentine's Day I guess?

Using Dates in a Scatter Plot

In the last lesson, I found scatterplots very helpful in seeing how two different measures were related. As you may have noticed when you were clicking on different data points in the scatter plot from the last time, a lot of the most reviewed films were also the most recent ones, but let's check out that hypothesis here. 

  • We'll be starting from scratch here. Click on Worksheet on the menu and choose Make New Worksheet. 
  • Let's make the bottom axis the Release Date and drag it into the Columns container. By default it assumes that you want to use it as a part rather than a value, but in this case we want to use it as a numeric value as a scatter plot requires 2 measures, so click on the dropdown in the YEAR(Release Date) line within the  and choose the second instance of Year from the dropdown list, and the color will switch to green, indicating it is a measure
  • Next click on No. of Reviews and drag it onto the Rows container, and you'll see it switches to a bar graph. Just for a moment before we convert this to a scatter plot, let's click back and forth between this and the MoviesPerYear tab since it did look pretty similar with a large spike later on in the data. I'm noting that in this new sheet, where what's being measured is the total number of reviews for all movies in a year, rather than all the movies on any of the lists that came out in a year, the difference is even more pronounced with it hitting a new height in 2000 that it had scarcely come even close to previously, contrast that to the measures for the count of movies per year which at least had a few years with a large # of movies on the list  early in the dataset. 
  • Okay, let's see what this looks like when we have a datapoint for each movie's entry on a list. Click on Title and drag it onto the Detail box on the Marks card and switch the dropdown to Shape and watch it shift into a series of circles. 
  • Next, let's make sure that there's an individual datapoint for each time a movie appears on each list, by clicking on Genre and dragging it onto Color on the Marks card
  • Finally, click on Rank and drag it onto the Size box on the Marks card, so the size of the circle will be determined by how it is on the list for a genre. 
  • Click on Edit Sizes. Choose that you want the Sizes to vary by range and check the box to reverse it so it's the highest ranked movies that are represented with the biggest sized circles, and let's take a look at the graph that we've made.
  • It looks like there are some films that are more strongly represented later in the graph (Action & AdventureDocumentary, Sports & Fitness, Television) and some that appear more frequently earlier in the graph (Classics, not surprisingly, but also a good chunk of Horror and Drama as well.) It seems like the rise of more movies being more highly reviewed begins in the late 1990s which makes me wonder when Rotten Tomatoes started. Let's add a reference line so that viewers of this graph will have the answer to that question. 
  • On Rotten Tomatoes' site, they say that the start of the site was August 12, 1998, so that's what we will go with. Right-click on the bottom axis, and choose Add Reference Line. To be noted, this is something you will only be able to do if dates are being used as values rather than parts.
  • Choose Create a new parameter from the dropdown next to Value. In the Create Parameter window that opens. Name it Rotten Tomatoes Goes Live, and give it the value of 8/12/1998 12:00:00 AM
  • Select OK and see that there does seem to be a rather large increase in movies in all time ranking lists after 1998
  • Let's name this tab and sheet, because its set up seems like it would be useful in other graphs, I'm going to name the tab YearAndNoOfReviews-Scatter and change the title at the top to Top-ranked Movies by Year and Number of Reviews
  • Tableau's duplicate function can allow you to recreate a graph whose format you like, so you can add tweaks and see how they affect what you can see about your data, without giving up your previous graph. Let's try that right now to see what this graph looks like when the factor other than date be being looked at for a movie is its Tomatometer rating rather than the number of reviews.
  • Right-click on YearAndNoOfReviews-Scatter  and choose Duplicate. We'll be looking at movies' Rotten Tomato ratings rather than their number of reviews, so let's immediately change the tab to say YearAndTomatometer-Scatter and the name of the graph at the top to be Top-ranked Movies by Year and Tomatometer %
  • Click on where SUM(No. Of Reviews) is in the Rows container and choose Remove from the dropdown menu. 
  • Click on Rating Tomatometer and drag it to the Rows container and see how it changes the graph. Now, items are very bunched up at the right side of the graph (more recent) as well as at the top (higher Tomatometer rating).  Weirdly, we do see a percentage of lower tomatometer ratings later in the graph, especially in Sports & Fitness, Televison and Western films (which interestingly, if you'll recall from last lesson, are all genres with less than 100 films on their lists, so maybe the compilers of those lists weren't checking Tomatometers when they made those lists). 
  • Let's format both axes to make what is being shown clearer though. For one, the Tomatometer is a rating that operates as a percentage, you hear about a movie getting 100% on Rotten Tomatoes, not being a 1.0 so let's change that axis to be formatted as a percentage. Right-click on the Rating Tomatometer axis and choose Format. Then in the Numbers dropdown, select Percentage, and choose that you want there to be 0 Decimal places
  • Let's also give this graph more room by changing the bottom axis to show a narrower range. If you double-click on Year of Release Date at the bottom, you'll see that it's displaying from 9/9/1913 to 3/21/2027. Now...our data only goes up until 2021, so let's change the axes to reflect that, giving it a Fixed start of 06/01/1918, and a Fixed end it at 12/31/2022 to give it about a year and a half on either side of the data.
  • The Year of Release Date axis now shows a range that's more reflective of the kind of data that you've fed it, but it's still a little crowded. If we better want to see how different genres are distributed over time, we can use Pages. 
  • Let's keep this sheet untouched though in case we did want to refer back to it. Right-click on YearAndTomatometer - Scatter and choose Duplicate
  • To make this tab distinct from the one you duplicated it from, add Pages to the front of the tab name and add By Genre to the title at the top of the graph.
  • If you did the last lesson you're already aware, but adding pages is very easy. Simply click on the dimension that you want the graph to be split by, in this case Genre  and drag it into the box marked Pages
  • When you do this, the graph will immediately switch to only display the Action & Adventure entries in the list (since that title is the first, alphabetically). If you want to page through and see what this graph looks like for each of the different Genres, you can use the box with the arrows with it to the right of the chart.
  • Page through and see what stories this new version of the chart tells you. It does look like for some genres, there is a pretty significant increase in movies listed after Rotten Tomatoes went live, and in some cases that increase came with movies on lists with a lower Rotten Tomatoes ratings than the movies that had been rated retroactively after the site went live. For movies with the genre of Western this seems particularly pronounced. What other trends do you see when looking at genres individually instead of the movies as a whole?

Creating Bins to Capture a Range of Data

If you have numeric data that can be more useful to look at as a range, you can separate it into bins. In this case we will use it to look at our Release Date data in terms of decades like the 1920s, 1930s etc. You might someday have say, data that wants to be grouped in other numeric ways like if you want to look at population data by income bracket, or air quality data by the grades that different measures fall into on the index, and might decide to put your data into bins. Like the name says, it's kind of like you're making a bin or bucket and telling Tableau how to choose what data goes into that bucket. This is a way that you're creating another Dimension that you can use with your worksheets.

  •  Click on Release Date in the Tables section and choose Duplicate. You'll be making a separate version of this measure so you can play with it without screwing up the graphs you've already made. 
  • Select the newly created Release Date (copy) and choose Create and Calculated Field. Calculated Field is a new field that you can create that will be drawn from a calculation applied to some other field that you have. One example might be if on our table we had director information for each movie that contained that director's birthdate. If we wanted to add the director's age at a film's release to a worksheet, but had neglected to record that data in a field on our original table, we could create a Calculated Field  that subtracted the director's birthdate from the film's release date to add information to a field containing the director's age. We won't be dealing with Calculated Fields much in this lesson, but it's a good tool to know you have if you go to graph your data and realize that you haven't created a field you'd like to be able to graph.
  • In the menu to the right in the Calculated Field window, type in YEAR to search for that calculation and you'll see that that is a calculation that will take the year of a given date and make it an integer
  • In the section to the left, Type YEAR( to the left of the name of the field and then put a closed bracket ) on the other side of it. It will tell you that the calculation is valid, change the name where it says Calculation 1 to instead say Decades and choose OK 
  • In the new field that you created, Decades, choose Create and Bins
  • In the Edit bins window that pops up, choose that you want the size of the bins to be 10 so that movies from 1920-1929 will all be in one bin together, ditto for 1930-1939 etc. 
  • Now you'll see there's a new measure in the tables marked Decades (bin) 
  • Create a new worksheet, using the Worksheet item on the menu bar to select New Worksheet
  • Let's give this new sheet a configuration similar to the one we created earlier when we looked at movies per month. Click on Decades (bin) and drag it into the Columns container, since you want its axis to be at the bottom
  • Next, let's say that what we are most interested in is the number of movies for each year, like our earlier graph. Click on top_100_movies_by_genres (Count) and drag it into the Rows container 
  • By default, it has created a bar chart for me, If for some reason, you got something different, click on the dropdown in the Marks card and go from Automatic to Bar. 
  • Speaking of the Marks card, while this graph does show us a bit about which decades have the highest number of movies on their lists, let's add a bit more information, and segment it by Genre. Click on Genre under Tables and drag it onto Color on the Marks card.
  • If we wanted to find out what genres are associated with what decades on Rotten Tomatoes's lists, this might be a better possibility than the previous graph. Here we can see that not surprisingly, Classics dominates 1930-1950's contributions to Rotten Tomatoes lists, and is pretty strong in 1920 and 1960. Looking at those earlier years too it seems like Romance rose from 1930-1950,  and in 1950-1970 Mystery & Suspense made up another significant segment.
  • But the huge jump on overall titles on Rotten Tomatoes lists between 1990s and 2000s, followed by another huge jump between 2000s and 2010s makes it hard to tell exactly which movies' genres are responsible for those jumps, although more than half of all the movies on the Action & Adventure list are from 2010s or 2020s. But like with the other graph we can choose to sort the data by descending amount by genre by clicking on any year at the bottom of the axis and then on the third button over to Sort by Genre, Descending. 
  • Doing this we can see that actually for 2000s  the highest percentage is Special Interest, which if you look at what's on that list seems to be a real mixed bag, maybe more of a miscellaneous category. By doing that for each year in turn you can see which genre lists a decade made its biggest contributions to. 
  • Let's name this tab and title the worksheet. I'll call the tab MoviesPerDecade, and make the title Number of Top 100 Movies by Decade.

Saving Your Work

Now it has come time to save your work.

  • Go to File and choose Save to Tableau Public as
  • A pop-up window will appear asking you to login to the the account that you set up at the beginning of this tutorial. Log in to the account
  • A window will pop up giving you the caveat that you should make sure that this graph doesn't contain private or confidential information as it's going to be on the Tableau Public website and asking you what you want to name it. For this data the bit about it not being private doesn't matter, but note that if you are dealing with any private data, you'll want to either use a different program, or spring for one of the other Tableau options that is not free, like Desktop or Server. 
  • Give the workbook a name, I'll name it RottenTomatoesTopMovies-Lesson2 and choose Save.
  • It will process the request and then when it is done, it will open the workbook you've made in a new tab of whatever web browser you have open. Whenever you save your work a new version of this window will appear with whatever changes you have made. 
  • To note, it will only display the actual worksheet that you have open so if you want to see one of the different ones in the Tableau Public websites, you'll need to click into that worksheet and choose Save to Tableau Public. This seems to be a more recent change, so though it is cumbersome, maybe it is just a temporary bug. 
  • If you want to use an image of one of the graphs that you've made, you can either take a screenshot or use the Download  option, which is the third icon from the right of an arrow in a box, and choose to download as an image or PDF or Powerpoint, whatever format will help you. Note that all of these things are going to be static, however, and not be interactive. 

Here's the link to my workbook that you can go to and download my workbook from if you got tripped up at any point and want to see how I made it work. In fact, a nice thing about Tableau is that often you can download existing graphs that you like and in a lot of cases see how people whose graphs you'd like to emulate, set up their graphs. Check out their gallery for examples of the work that some people have done (of course, keep in mind that some of them might be using Tableau Desktop or one of the other paid options, so you might not be able to duplicate every neat graph that you find).

Click to the next tab to work more with Tableau Public, the next lesson will be focused on using Tableau to create maps.

Lesson Goals

For this module, we’ll only be working with country level data, but if you have state, city or even point data with latitude and longitude information, you can use Tableau to map that as well. It’s important to note that Tableau has more limited layering capabilities than say an ArcGIS or QGIS. Some movies in my data have 11 Production Countries associated with them, but I’ll only be able to meaningfully add the first one or maybe two to the map. 

Getting Started

  • If you have skipped to this tab and didn't do the first two sections of the exercise, follow the below instructions.
    • Go to https://public.tableau.com/en-us/s/download and download Tableau Public. After it has finished downloading, install it. 
    • Visit the Sign Up page and create a login for yourself, and make sure that the password is something that you will remember. You must create a login as you will be logging into Tableau every time you save your work for the site when you are using Tableau Public. 
    • Download the below excel file, RottenTomatoesTop100Movies and open it in Excel or in Google Sheets. Save it somewhere you can find it again easily, since you'll be opening it again shortly in Tableau.
    • If you have skipped into this section and didn't do the exercises or read the last section, please go back and read the Getting Started section regarding what this excel file contains and the Worksheet Interface section in the Different Kinds of Charts lesson, then come back here

Connecting to Your Data in Tableau

  • Open Tableau Public from the start menu. It will open up on a menu asking you what file you want to connect to. The file that you connect to is the the file whose data Tableau will be using to create your worksheets. Chose Microsoft Excel under the options for Connect to a File.
  • In the Explorer Window, navigate to where you have the RottenTomatoesTop100Movies sheet, and select that you want to open it. 
  • Take a look at the different field options to make sure all of them look like they are the correct data type. For this exercise, that should already be the case. 
  • Click on the Sheet 1 tab at the bottom to open a new sheet in Tableau

Creating a Basic Map in Tableau

When creating previous worksheets you may have noticed some of the items in the Table that we didn't really deal with were the ones that had a globe looking icon next to them. Production Country 1-11, Latitude (generated) and Longitude (generated). These were recognized by Tableau as Geographic data, so that's the reason for the icon. Let's create a simple map with this data. 

  • Click on Production Country 1 and drag it onto the main field of the worksheet. This automatically generates a map where all of the countries with at least one movie listed as Production Country 1 are marked with a dot and the Columns and Rows containers are filled with Longitude (generated) and Latitude (generated).
  • At the bottom of the map, there is a small sign indicating that there are 3 unknown. Click on this to see more
  • It gives you the option to either Edit Locations and correct the locations, Filter Data, which will just exclude these values from the view and any calculations you might want to do or Show data at a default position, which will let you pick a point somewhere else on the map where these null values can go. 
  • Options like this in Tableau will show you where your data may be incomplete, so if this came up when you were graphing geographic data you thought was complete, you might go back to your spreadsheet,  correct it, and open a new workbook after you've edited that data. For now, let's select Edit Locations...
  • In the Edit Locations window, it will show you all the matches that Tableau has made. You can scroll through if you want to make sure there are no issues. At the top of the chart is the unmatched entries in Production Country 1:  [], Null and Soviet Union.  Now, were this a real project, for the Soviet Union films, I'd do research to find out where in the Soviet Union the film was produced, as perhaps it was in some country that is now a separate entity on this map. Then I'd make a judgment call about which country to list it under, probably by editing the films with this designation on the sheet itself. For now though, I'll look up the coordinates for Moscow and select Enter a Latitude and Longitude and type in  55.7537147 for Latitude and  37.6176958 for Longitude
  • For [] I'll select Enter a Latitude and Longitude from the dropdown list and choose 0,0 as that will put it in the middle of the ocean. 
  • Right now, this graph is just telling us which countries have had a film produced in them that has been on the lists at some point, and which have not. This is kind of interesting, but we don't have an idea of scale, like how many for each country?
  • Let's add another measure to this chart by clicking on top_100_movies_by_genre(Count) and dragging it onto the Color card. This changes the graph so that the entirety of a country is filled in, but we see that almost every country is the same color, except for the United States. But hovering we do see there are some differences, The US has the highest number of movies with 963, but though the UK has the next highest with  172 it is virtually the same color as say, Norway above it which only has one. We need to edit how the colors are distributed on this chart. 
  • Click in the corner for the legend for CNT(top_100_movies) and choose Edit Colors. Right now it's an automatic blue spectrum, but let's see what happens when we choose two diverging colors so that a country's distance from another on the spectrum is more apparent, and chose Temperature DivergingClick on the check next to Reversed. You can try a bunch of different options and click Apply without exiting the Edit Colors if you want to preview how the options will look on the map. With this configuration there seems to some detectable difference between countries that have only a few movies and some that have a higher amount (though still not as high of a representation as the US), although Australia (with 25) and India (with 5) appear to be virtually the same color, so taking this across all genres and years doesn't yield the most informative chart. 
  • Let's see if subtler differences emerge using other measures, but first let's rename this graph and tab so that when we create more it's still findable. Double-click on the bottom tab where it says Sheet1 and let's call it MoviesPerCountry. Change the title at the top by double clicking to open the Edit Title window, and change it to Number of Top 100 Movies Per Country.
  • To use this graph as a basis for others is pretty easy, and it means we get to skip ahead a few steps. Right click on MoviesPerCountry and choose Duplicate, and it will create a new worksheet with all the same setup. 
  • Let's look at No. of Reviews for this map. Click on No. of Reviews  in the Tables and drag it onto colors. Predictably, we run into the same problem, since a lot of movies means a lot of reviews to total up, but that's because it's going with the default of SUM. Let's choose Average  instead.
  • Click on SUM (No. Of Reviews) in the Marks Card, and choose Average from the Measure menu.
      
  • The map shifts dramatically, in part because, the US having so many movies, actually probably has quite a few that users didn't review as much. China and Australia's numbers also go way up. 
  • Change the name at the top of the chart and the tab to reflect what this one actually shows, so you're not confused if you stumble across it later to AvgReviewsPerCountryMovie and Average Reviews Per Country's Movies in Top 100 Lists

Adding a Marks Layer to the Map

Now, you may have noticed that there are a lot more production countries options listed in the Tables section than just the one currently displayed. However, Tableau can only show a limited number of layers atop each other at a time, so if you wanted a map that truly counted up every time a country appeared listed for a movie as Production Country 1 through 10, you'd have to organize your data differently. However, let me show you how you can add in a layer to at least display 1 of the other other fields atop the map. This can be useful if you say...want to show subway stations as a dot over a census tract, with the census tract itself displaying some other kind of information, like say, median income or population density. 

  • Let's start by duplicating that first map we made, MoviesPerCountry by right-clicking on it and choosing Duplicate
  • On this new map, click on Production Country 2 and drag it towards the map. A pop up will appear that says Add Marks Layer. Pull Production Country 2 onto that item
  • The Marks card now has two options on it, Production Country 2 and Production Country 1. There is a little circle symbol next to Production Country 2 indicating that it is a circle display and a map symbol next to Production Country 1 indicating its display is a map. You can move between editing these two displays by clicking between them. 
  • Now our map has a bunch of grey dots on it, lighting up over every country that has been listed as Production Country 2. But also, note a little box letting us know that there are 2 unknown locations has appeared in the corner. Click on it, and choose that you want to Edit Locations.
  • The unknown locations are Null and Soviet Union. Let's do the same thing as before with the same caveat. I'll be changing the matching location for Soviet Union to be 55.7537147 for Latitude and  37.6176958 for Longitude just to show how to do this. 
  • Next, let's add a little bit of extra information to this by making these dots have differing sizes based on how many movies a country was Production Country 2 for, by clicking on top_100_movies_by_genre (count) and dragging it atop of Size on the Marks card for Production Country 2
  • This does make it so some of the dots are too small to see, so let's edit the sizes by clicking on the legend for CNT(Top 100 movies) and choose Edit Sizes
  • Select that you want the sizes to vary by range, and move the arrow over on the left hand or Smallest size until the dot next to one on the Sample legend looks big enough to see. Choose Apply to make sure that did enough on your map for you to see any country's dot
  • That default grey color on the dots is so hard to see. Click on Color on the Marks for the Production Country 2 layer and pick a nice light blue that you know will show up against the oranges and greens, and then choose that you want there to be a black border.
  • Voila, you've represented which countries have been either Production Country 1 or for any of the movies on these lists and given a rough idea or about how often that occurs. 
  • If you want to play around, you can add Production Country 3 as a layer as well, but when I've done so and changed the layer type to Shape or Square instead of Circle, it makes the marks on each country a bit too crowded for this to stay an intelligible graph. So if you're looking to graph a lot of layers of geographic data that will overlap each other, you're better off with ArcGIS or another program primarily intended for mapping.
  • Let's name the tab this sheet is in and change its title at the top. Add 1&2 to the end of the tab name and change the title at the top to read Number of Top 100 Movies Per Country 1 & 2

Creating a Set

The map we've created shows quite an American-centric presence on the lists, but let's see if that has changed over time. We could use the pages function to create a page for each year, but instead let's just divide the display up into two categories, movies released before the launch of Rotten Tomatoes, and movies released afterwards.

You can use the Sets function to divide the your data up based on whether or not it fits a certain criteria. This can mean that you select a group of dimensions (i.e. if I wanted to create a set that just had Horror, Science Fiction & Fantasy, and Mystery & Suspense genre films in it). This can also mean you choose something more conditional, like if I wanted to build a set that just had the top 10 movies of each list. Once you have created a set, it is binary, a given row of your data is in the set or it is not. So it's similar to creating a Filter except for both the data that fits the criteria, and the data that does not are available to be displayed on the same graph.

In our case we'll be creating a set using Release Date but there are lots of possible applications. Maybe you're looking at the career outputs of different directors and you want to divide them into directors who have ever made a horror movie and ones who have not to see if their movies have higher or lower budgets on average? Maybe you're looking at automobile accident rates in different counties and you want to split the data into counties that have devoted at least a certain percentage of their budget to road repair, and ones that have not. Sets can be used for all of these.

  • Right-click on the map you just created and choose Duplicate
  • Now in the Tables section of the interface, go to Title click and choose Create and Set
  • In the Create Set window that opens, change the title to Before Rotten Tomatoes and select the tab for Conditional, because you only want to include the rows within this set on the condition that their release date was prior to 8/12/1998 as that's what we found was the launch date of Rotten Tomatoes. 
  • In the Conditional tab, select the button for By field, and in the field that you want to look for conditional matches in, choose Release Date. In the dropdown next to it, select Maximum, because you will be setting a maximum limit for how late the Release Date can be before it won't be included in the set. 
  • Choose < from the dropdown that by default contains = and type in 8/12/1998 since the maximum number you'll accept for a release date is anything less than 8/12/1998
  • Select OK, and you'll see that in the Tables option on the side, there is now a new option to be included in the chart with an icon of interlocking circles that says Before Rotten Tomatoes. 
  • Click on Before Rotten Tomatoes and drag it into the Columns container, and the graph will change. Two maps will appear alongside each other, one that has In written at the top and one that has Out written at the top. The map that is under the heading of In has the map that corresponds with all of the movies in the data with a Release Date before Rotten Tomatoes launched, and the one under the heading of Out has the map that corresponds with all the movies with a release date after Rotten Tomatoes launched. 
  • You can verify this worked by clicking on any of the country outlines or dots and scrolling through the Full Data tab to note that indeed any movie on the In side of the map has a release date before 8/12/1998, and any movie on the Out side has a release date that is after. It does look like there are a lot more countries represented on the other side of the graph, but we should keep in mind that doesn't mean there were more movies released from those countries after 1998. If you've been reading since the last tutorial, you'll recall there was a huge jump in movies on the lists that were from 2000 or later as users began contributing reviews for a lot of movies that had been released recently for them. In general, as you use data visualization programs you should always be clarifying to yourself what your data is, what isn't shown in the data, and what questions your graph is answering, and what questions it might not be. 
  • Let's change the title and some of the labels so that this is clearer. Change the tab for this graph to be named MoviesPerCountry1&2PrePostRT. Right click on the part of the graph that says In and choose Edit Alias. The Alias is what a different field or dimension will appear as on your chart. Editing it will not actually change your data, but it can make your graph more intelligible to your reader who might not understand what some of your data labels mean at first glance. In the Edit Alias window, name it Before Rotten Tomatoes Launch.
  • Select OK and you'll see that the label has changed. Do the same thing for Out and name it After Rotten Tomatoes Launch
  • To be noted, we've included both items within the set and items outside of the set on the graph, but if you want to get rid of one, you can always click on its heading at the top and choose Exclude. For instance, maybe if I wanted to only look at the movies that are included in my set, i.e. the ones that came out before Rotten Tomatoes existed because maybe I want to see what those movies might have in common that made them more frequently revisited after Rotten Tomatoes launched, I can just click on After Rotten Tomatoes Launch at the top of the graph and choose Exclude
  • You can use Sets as an additional categorization of your data alongside of normal Filters or Pages functions. For instance, in order for us to see how international movie lists were before and after Rotten Tomatoes's launch by genre, all we'll have to do is click on Genre and drag it onto the Pages card, and flip through to see how the information looks broken out in that fashion (for more information on Pages, see the last two sections of the tutorial).

Saving Your Work

Now it has come time to save your work.

  • Go to File and choose Save to Tableau Public as
  • A pop-up window will appear asking you to login to the the account that you set up at the beginning of this tutorial. Log in to the account
  • A window will pop up giving you the caveat that you should make sure that this graph doesn't contain private or confidential information as it's going to be on the Tableau Public website and asking you what you want to name it. For this data the bit about it not being private doesn't matter, but note that if you are dealing with any private data, you'll want to either use a different program, or spring for one of the other Tableau options that is not free, like Desktop or Server. 
  • Give the workbook a name, I'll name it RottenTomatoesTopMovies-Lesson3 and choose Save.
  • It will process the request and then when it is done, it will open the workbook you've made in a new tab of whatever web browser you have open. Whenever you save your work a new version of this window will appear with whatever changes you have made. 
  • To note, it will only display the actual worksheet that you have open so if you want to see one of the different ones in the Tableau Public websites, you'll need to click into that worksheet and choose Save to Tableau Public. This seems to be a more recent change, so though it is cumbersome, maybe it is just a temporary bug. 
  • If you want to use an image of one of the graphs that you've made, you can either take a screenshot or use the Download  option, which is the third icon from the right of an arrow in a box, and choose to download as an image or PDF or Powerpoint, whatever format will help you. Note that all of these things are going to be static, however, and not be interactive. 

Here's the link to my workbook that you can go to and download my workbook from if you got tripped up at any point and want to see how I made it work. In fact, a nice thing about Tableau is that often you can download existing graphs that you like and in a lot of cases see how people whose graphs you'd like to emulate, set up their graphs. Check out their gallery for examples of the work that some people have done (of course, keep in mind that some of them might be using Tableau Desktop or one of the other paid options, so you might not be able to duplicate every neat graph that you find).

Mapping with Tableau doesn't offer as many options as say, ArcGIS or QGIS, but if you have basic level geographic data that you want to visualize in a straightforward way, it is a pretty good option that let's you do a lot of customization of symbolic elements, even if it doesn't let you get as sophisticated on the geographic aspects.