Lesson Goals
Tableau is a data visualization program that can be used to create many different kinds of tables and charts. With this module, we'll explore Tableau's interface and create worksheets containing different common graphs using a small subset of data about Rotten Tomatoes top-ranked movies for each genre.
Getting Started
- Go to https://public.tableau.com/en-us/s/download and download Tableau Public. After it has finished downloading, install it.
- Visit the Sign Up page and create a login for yourself, and make sure that the password is something that you will remember. You must create a login as you will be logging into Tableau every time you save your work for the site when you are using Tableau Public.
- Download the below excel file, RottenTomatoesTop100Movies and open it in Excel or in Google Sheets. Save it somewhere you can find it again easily, since you'll be opening it again shortly in Tableau.
- Take a look at the data to get an idea of the kind of information that you can visualize with it. It has a list of 100 (or sometimes less) movies that are rated as being the top of their genre with each row containing one movie. Left to right it lists:
- The Genre movies are being ranked in
- The movie's Rank for that genre
- The movie's Rating on the Tomatometer (Rotten Tomatoes' system where it aggregates user and critic's reviews)
- The Title with and without the year
- The number of reviews the movie has on Rotten Tomatoes.
I got this dataset from the dataset and tutorial website Kaggle
- For the movies listed in the Rotten Tomatoes Top 100 in this dataset, I've added information I got from The Movie Database by querying their API. More information on querying an API is available in the Using APIs section of the DAsH research guide. This data from TMDB is
- The movie's Runtime in minutes or how long it is.
- The movie's Release Date (Rotten Tomatoes' dataset only mentioned the year, and sometimes that was only its US Release Year).
- The Production Country or Countries associated with the movie.
It's important to keep in mind where the information came from and what that means for what is being measured with a graph made with that information. Even at a glance, it does seem like the higher ranked movies tend to be pretty recent films on this spreadsheet, so it's very possible that our graph would look very different from someone trying to look at a similar ranking compiled five or even three years ago, or a ranking compiled by different users than Rotten Tomatoes' user reviewers and the critics that the site looks at reviews from.
- There are over 1600 rows on this sheet, so it’s hard to see exactly what the overall findings are. Even if we created a pivot table summarizing some aspects of the data, it wouldn’t be able to tell the story of what the data shows with as much impact as a graph, so that’s why we’ll be using Tableau to create a graph of the findings.
- Think about the different kinds of information you could display with this, you could graph
- What years are popular for what genres?
- If different genres have different runtimes? If runtimes are increasing or decreasing for more recent films?
- What countries are producing movies popular on Rotten Tomatoes? Has that changed over time? Is that different by genre?
Connecting to Your Data in Tableau
- Open Tableau Public from the start menu. It will open up on a menu asking you what file you want to connect to. The file that you connect to is the the file whose data Tableau will be using to create your worksheets. Chose Microsoft Excel under the options for Connect to a File.
- In the Explorer Window, navigate to where you have the RottenTomatoesTop100Movies sheet, and select that you want to open it.
It will take a few minutes to open up your file and then give you a screen where it shows you the first few rows of your data. This is where you can make modifications as to what kind of data the fields contain, join the data from two different sheets in the Excel file you've connected to or even create a new field that performs some kind of calculation on your data. We'll only get into the first option in this module, but feel free to experiment on your own for others.
- In this case, let's change the data type for Year. The data types Tableau accepts are Dates, Numbers, String (text), or Boolean (true or false).
- When you look at the Year column it has a little number sign on top of it. That means that the data type for the column is numbers. But we want to graph this data over time, so we want Tableau to understand that what this column contains is a date. Click on the little number icon.
- From the dropdown menu choose Date and the icon next to Year will switch to a calendar and the numbers in the column will switch to look like dates.
- Let's make sure that the other items have the right designation, the ones with numeric data have the little # and the ones that are countries have a globe next to them indicating that they are geographic data. We can do this by scrolling down in the table to the right marked field, and we see that the fields that contain text such as Genre and Title have the icon next to them indicating they are text. The items that contain number information like Rank, Rating Tomatometer and No. of Reviews have the # sign next to them, indicating they are numbers and all the Production Country fields have the globe icon next to them indicating these fields hold geographic data.
- We've made the modifications that we want to make to how Tableau identifies the types of data in our sheet, so let's move on to the worksheet. But remember, if we want to change anything about the actual data being used in the graph, whether that's changing the data type, adding in a calculated field or creating a relationship between different columns in two sheets, one of the places we can edit this is the Data Source button at the bottom of the Tableau interface.
The Worksheet Interface in Tableau
- Select the box where it says Sheet 1 at the bottom of the screen and it will send us into a blank Worksheet interface. Worksheets are where to make graphs. At the bottom of the Tableau interface will be tabs that will let us toggle between different worksheets, dashboards or stories, as well as icons to create a new version of each of those kinds of documents. We'll be dealing with worksheets in this tutorial.
- To the left of the screen is the Data that exists for our sheet, underneath Tables are the different Dimensions and Measures that make up the data that is on the spreadsheet we've connected. There is also a tab for Analytics where we can add trend lines or other kinds additional reference lines to graphs.
- This is where we can see which Dimensions or Measures are available to be displayed in the rows or columns of the chart or graph.
- Dimensions are qualitative data, so mostly words (or strings in Tableau's terminology). They are the categories, classifications or other information that give the measures meaning. Dates are grouped with Dimensions but can be used as Measures as well. Dimensions are coded with blue boxes and symbols and appear that the top of the data available under Tables. For this sheet they are:
- Genre
- Production Country 1-11
- Release Date
- Title
- Title Without Year
- TMDB ID
- Year
- Measure Names - a field generated by Tableau which can be used for labeling
- Measures are quantitative data, meaning columns that can be summed, added, subtracted, averaged, etc. Nearly any kind of numeric data is a measure, with the exception of, for example, an ID number. A sum of everyone’s MC ID numbers wouldn’t really tell us anything, so it isn't really measuring anything, so it's a dimension. Dates fall somewhere in between as to whether they tell you something based on their numeric quantities so Tableau will let you treat dates as either a measure or a dimension. More on that next section. In Tableau, Measures are coded with green boxes and symbols. The Measures available to be added to this sheet are at the bottom
- No. of Reviews
- Rank
- Rating Tomatometer
- Latitude (generated)
- Longitude (generated) - These two measures are generated from the country names by Tableau.
- top_100_movies_by_genres - a field generated by Tableau that will count the number of rows in the sheet that contain a given measure or dimension. Whenever we want the graph to display how many rows match a certain criteria, this is what we'll add to your worksheet.
- Measure Values - a field that adds all measures as a group to the graph.
- The next column over contains another group of boxes or cards where we can control what information winds up on the graph.
- Pages is a function where we can add digital pages to your graph, which could be useful if we wanted to say, see what the graph looked like for different months of a year or when broken out based on different Dimension values (we'll use later when we want to see how the graph looks for movies of different genres). When you can do this you can flip through digital "pages" to see how the graph differs based on when that Dimension changes. We'll be getting to that later in the tutorial.
- Filters is the next box down. This is what to use to only deal with certain ranges of the data. So if we had say, air quality data for the Bronx, and we wanted to only look at the summer months, we could filter the graph so it only included data for the months of June, July and August.
- Marks is the box to use to control what kind of graph we are making (scatter, bar, line or other options) and lets us use color, size or shape to signify different aspects of the data. This is also where we can control what aspects of the data will show up as labels or when people click on a section of the graph.
- In the middle of the screen is the canvas where we will see how the graph that we are creating looks. Above that are the Rows and Columns containers where we can drag in the Dimension and Measures that we want to be displayed as the row or column section of the chart. Above that section, there are also icons that will let us swap which bits of data are the row and which are the column, that will change how it is sorted, that will control how the graph is made to fit to the display, and other options.
- Finally, at the top of this interface is a box marked Show Me. When we click on it and hover over the picture of the type of graph we want to make, information will appear the the bottom of the window letting us know what kind of information we need to add to the graph to accomplish to that. Whether the data to use is Dimension, Measure or both and how many of each is needed.
Now that we have some idea of what functions on Tableau tie to what outcomes on the graph, let's make some graphs!
It’s a good idea before you start hauling over measures or dimensions to first come up with a mental image of what you want your graph to look like. Not what you actually expect the data to tell you, but how you want it organized. If it’s a line chart, what measure is horizontal and what is vertical? Is one of your measures numeric and the other non-numeric and so something like a bar chart makes more sense? You can do this in your head or just sketch it out on scrap paper. Tableau isn't always the most user-friendly so it's better to have some idea of how you want your graph to be organized.
Creating a Simple Bar Graph in Tableau
Let's start with a simple bar graph to see which genres of movies have the most user reviews. I've done a little thumbnail sketch for myself to think about how I want the graph set up and have decided that the genres will be on the lined up horizontally with the amount of reviews that they have in total determining how tall the bar is vertically.
- In the column that says Tables at the side of the screen, click on where it says Genre and drag it over to the Columns container at the top of the screen. This will make it so the data that pertains to each genre is held within its own column. A preliminary form of this chart appears, but we haven't told it what measures for Genre that we want to display in the rows so Tableau is just displaying a placeholder version where the different Genre types are broken out into columns.
- The kind of numeric data that we want for each of these genres in the place of those placeholder Abc is the No. of Reviews. Click on that item in the Tables box and drag it onto the Rows container at the top of the chart.
- On my sheet it automatically sets up the graph as a bar chart with the Sum total of the number of reviews for each genre displayed as how high the bar extends vertically. If your displays as something else, you may have to select the Bar options manually from the dropdown on the Marks card, by clicking on the arrow next to Automatic and finding that option.
- By default, the graph lists the genres in alphabetical order, rather than in order of how many reviews each had. From a quick look, it appears that Drama movies have the most reviews, followed by Action & Adventure while Television and Classics are at the bottom, but if you wanted to summarize the findings based on exactly how they were ranked, you might want to sort it in numerical order.
- Tableau includes a couple of ways that you can change how a graph is sorted, one is an icon at the top of the interface, another is an icon that shows up when you hover over the vertical axis. I'll choose the one that allows me the most customization, I'll click on Genre in the Columns container and click on Sort on the menu.
- That opens a Sort menu. I'll select that I want to Sort by Field from the dropdown and since I want the Genre bars to be sorted in the order of having the most to the least number of reviews. I'll make sure that No. of Reviews is selected in the dropdown next to Field Name and the Sort Order that I choose is Descending. Thankfully this window will preview the changes that you are making on the graph below it, so you can see right off if you are getting what you want or not. There are other options you can choose, one of which is Manual so that if you need to you can customize the order.
- Use the X at the corner of the window to get out of the Sort window. Instead of being ordered in alphabetical order the bars are now sorted into the most to the least number of reviews from left to right, which can give a better idea on which genres' top 100 movies are reviewed a similar amount of times.
- Now you may have noticed that where it says No. of Reviews in the Rows container it actually has SUM written at the front. This is because the graph is displaying the total number of reviews for all the movies on the list for a given genre, but there are other options you can display here. Click on the Sum(No. of Reviews) slug and by hovering where it says Measure (Sum) you can see the other options for the numeric data it can display. Try out Median and Average to see how it changes the graph. Since in this case most of the lists have the same number of datapoints, it doesn't change it too too much, but if you had a reason to want to look at the average or median measures for a figure rather than the sum total this is where you would change that. After you poke around, switch back to Sum.
- There's a large variation between the highest and lowest number of reviews per genre, and sometimes when there's such a discrepancy that can mean you should explore further to see if there's any missing data or duplication going on. For each of the bars (or any of the kinds of marks we'll have on later graphs), you can see more information about the data that makes up that representation on the chart by clicking on that bar. Let's start with Television at the end, since it is the lowest.
- It gives you at first the same summary information that you can get by hovering over any of the bars, but to see more, you can click on the series of bars at the far left corner of that stub to View Data
- It will take you to a summary window, and you'll need to click on the Full Data option at the bottom to see all the data that makes up this bar on the graph. You can scroll through this the same as you would have been able to look through the spreadsheet, but one thing that jumps out right away to me is the fact that it only lists 57 rows at the top, despite the fact that the data that I originally downloaded was for the Top 100 films for each genre.
- Click out of the window. So now we know that one of the reasons that the Television bar has so many less total reviews is because it included a lot less films. How many of the other genres with smaller numbers of total reviews also have smaller numbers of movies in them than 100? Let's click around and see.
- By following the same process, clicking on the bars for genres with a smaller number of reviews, and choosing the tab to view their full data, we see that Television isn't alone in having less than 100 movies. Sports & Fitness only has 69 rows, Western only has 86, but Classics despite its smaller number of reviews does in fact have 100 movies on its list. How might we change our graph to reflect the information that not all genres have 100 movies listed for them? In the next section, we'll use a filter to narrow down what we show in our graph, but first let's rename this sheet we've made
- Double click at the bottom of the page where it says Sheet 1 and change the text to say Genre&Reviews-All
- The title at the top of the graph will automatically change to match the new name you've given this sheet, but let's change it to something clearer.
- Double click at where it says Genre&Reviews-All at the top of the graph and it will open the Edit Title window. Automatically it is populated with only something that says <Sheet Name>
- I'm going to title it Number of Rotten Tomatoes (RT) Reviews for Top RT Movies by Genre and select Apply. The title will change. To note, this is the window that you can use to choose the font, color and size that the graph of your title appears in, but in my case, I'm not going to play with that.
Filtering Your Graph in Tableau
Sometimes you may want to exclude parts of your data from a project. Maybe you're creating a graph where you only want to look at a certain portion of your data to really focus in on it. Maybe some of your data is incomplete and you only want to include the stuff that meets certain criteria so that you're sure you're making fair comparisons. In either scenario, Filters can help you create graphs that only looks at portions of your data.
- We're going to create another version of our graph showing the total number of reviews per genre and filter it, but we don't need to complete all the steps to create this graph all over again, we can simply Duplicate it, which will just fully recreate this sheet in a new tab so we can customize it in a new way.
- Right-click on the tab for this worksheet - Genre&Reviews-All and choose Duplicate. This will be a process that you will repeat quite a few times throughout these tutorials since it's a really neat way to easily try out different configurations on a graph without either erasing the one you've already made, or having to remake it from scratch.
- In its tab, rename this sheet Genre&Reviews-Filtered so you can tell it apart from the earlier one.
- Since Filters are a way to either make sure to include only certain sections of your data, or exclude certain parts of your data, this is where we decide what belongs on our graph. If we decide that Television films in our graph about movie just don't fit, we can just click on the bar for Television and in the middle of the pop-up window that appears, choose Exclude.
- When we do this, two things will happen: one, Television will no longer be one of the genres included in the graph and two, in the Filters box on our chart, Genres will appear. Even though you didn't explicitly create a Filter, by choosing to exclude, you were actually using that tool. In Tableau, there are often multiple ways to accomplish the same thing.
- Click on the Genre bar in the Filters box to open the Filter window for this field. It opens on a list containing all the options in the Genre field. Note that the Exclude box is checked, and none of the items on the list are selected, until you scroll down to Television, and you'll see that box is checked and there is a line drawn through the name. With this menu if you were looking to only include or exclude certain Attributes by name you'd be able to do so by scrolling down this list and checking their boxes . A summary of what's been selected is at the bottom of the window.
- Next, let's make sure that we're only showing the Genres on this chart that have at least 100 movies in their category so we're comparing apples to apples. If there aren't 100 movies' reviews being totaled up for each of these genres, how do we know if the larger numbers for some of these genres is based on whether they have 100 movies on their list, or if that genre is truly reviewed more often on the site than other genres. One way we could exclude all genres with less than 100 movies, is go back to our data and make a note of all the genres that have less than 100 movies in them, or we could do something much simpler and more foolproof, create a filter to only show a Genre on the chart if it has 100 rows in it.
- Select the tab that says Condition in the Filter window. Choose the radio button next to By Field: which will let you only choose certain fields. In the dropdown below it, select the field top_100_movies_by_genres, which is a field generated by Tableau that will counts the number of rows that fall into a particular category. Leave Count and = selected in their dropdowns and type in 100 as the Count that you'd like. To translate this set of dropdowns into more plain language, you're creating a filter that will only include Genres on the condition that there are 100 rows of it. Select Apply and you'll see the graph change to drop off several of the shorter bars, which must have been for Genres with under 100 movies in them.
- There are different conditions you can add depending on the field you want it to filter based on, and it doesn't even need to be a measure that you have added to display on your graph.
- If we wanted to only see the number of reviews for Genres where the minimum on the Tomatometer was less than 80%, you can change the field to be Rating Tomatometer, choose Minimum on the dropdown, select < from the dropdown next to it and type in .8 and you'll see the graph change to only display the bars for Action, Science Fiction, Horror, Western and Sports & Fitness. So you'll know that these are genres that the equivalent of a B- is good enough for Rotten Tomatoes to include it among the Top 100 for the genre. Interestingly this includes both one of the genres with the highest number of reviews, Action & Adventure and one of the ones from the lower side of the spectrum for reviews, Sports & Fitness, so this can't simply be chalked up to being either a genre that doesn't get a lot of critical attention or one where only a small number of reviews are recorded.
- Let's switch this chart back to previous configuration where it was showing only genres with 100 rows. Follow the configuration above or just use Ctrl-Z for Undo. Tableau lets you move back extensively using Ctrl-Z so don't worry, if you do something that it turns out will mess up your graph you can go back with a minimum of headache. Retitle this sheet to fit the filter you've put on it: Number of Rotten Tomatoes (RT) Reviews for Top 100 RT Movies by Genre
Please note, filters will only be on the chart you put them on and don't transfer over to any other charts that you make from scratch. You can choose to duplicate a worksheet if you want to carry over your new filters to another worksheet.
Adding More Information to a Graph and Customizing it
in the last section we made a graph comparing top genre films in terms of how many reviews were up on Rotten Tomatoes for each genre and saw how to use filters.
Now let's create a chart that compares two measures, runtime and number of reviews. We'll also break this out by genre, but we're looking to see if it seems like these two things are related. A hypothesis we might have about these two measures is that movie genres with longer average run times might get less reviews, since they are more of a time commitment to watch, but lets see if the numbers bear that out with the hypothesis.
- We're starting from scratch this time so click on Worksheet at the top and choose New Worksheet.
- We want Runtime (in minutes) to be the data that's displayed along the bottom of the chart, so drag that item into the Columns container, and we want No. of Reviews to be displayed along the vertical axis of the chart so click on that menu and drag it into Rows.
- Let's break those lines out by Genre, and use Color as the way that different Genres are marked on our chart, by clicking on Genre and dragging it on top of Color on the Marks card. Whenever you want a given measure or attribute to be differentiated in a certain way in how it is symbolized on your graph, you'll be dragging and dropping it to the Marks card over the icon for whatever aspect of the graph you want to use to display it, Color or Size or Shape. Or you can choose to add the field's information as a Label or for display on the Tooltip you see when you hover over a mark on the graph.
- Speaking of symbology, by default, Tableau gave me a scatterplot where each genre is represented by a different shape. In this instance, I'd rather see this as a bar graph, so I'll click on the dropdown on the Marks card and choose Bar.
- Also, by default, Tableau chose to give us the Sum of each measure, which gives us kind of a nonsense chart, since the sum total of every movie's review and every movie's runtime just gives us a really high measure for both of those items, and since I know that there are some movies that have really long runtimes and some that have really high amounts of reviews from looking at my data, I know a graph with the sum total of each measure isn't going to tell me a lot about how much the average movie in a genre has been reviewed or how long the average movie is, so I'll want to apply the Average calculation for each measure.
- Click on each of these measures in the Columns and Rows holders respectively and switch their Measure to be Average instead of Sum. The numbers on the axes and the length of the the bars shift to reflect this new dimension. However, the bars are still clustered on one side of the chart which can make it kind of difficult to see where each genre is located since they overlap. Let's take a look at adjusting these axes.
- Double-click on the bottom axis and it will open the Edit Axis window. This is where you can do things like change the name of the axis, determine the range that it runs and where the tick marks are to tell the viewer what they are looking at.
- One of the very first things to clue us into why the bottom axis looks so crowded, is that the box next to Include Zero is checked. This means that where the axis for the runtime starts is 0. Now in some cases you might want your graph to start at 0 if the data you are covering occasionally would include a 0 but in this case, since what is being graphed is the average runtime of a movie...and a movie cannot have a negative runtime, 0 would not be represented. So uncheck this box, and you'll see the graph shift.
- The axis not starts with 95, and it's a little easier to see what the differences between each genre is, though there is still some overlap. Just to get the numbers a bit more orderly though, let's change the numbers for the start and end of this graph to give a small buffer. Currently the Range is set to Automatic. Click on the button next to Fixed, and pick 89 as the Fixed Start and 130 as the end.
- It's a little easier to see how the runtimes are distributed, but let's make the graph a bit wider. Currently, it's only taking up part of the screen and that's making the bars overlap quite a lot. At the top of the interface is a little dropdown that says Standard, click on it and choose Entire View and then the graph will stretch to take up the entire canvas and it's easier to see how the different runtimes and number of reviews are distributed.
- Since what is being measured at the bottom of the graph is the runtime in minutes, any reader will have to do the mental calculations of whether these measurements are an hour and a half or two hours since we can't change the way the numbers are formatted to display in hours and minutes. But what we can do is Add a Reference Line.
- Reference Lines allow you to put a line jutting out from one of your axes indicating a certain value. In this case we'll be adding it to show when an hour and a half and when two hours falls on the graph, but it can have all kinds of uses. Maybe you're looking at recorded temperatures for a given month and want to add in indicators of where the high, median and minimum fell in a different decade. Maybe you are making a series of graphs to see how men vs. women are represented in the artists that different museums have had on display over time and you'd like to put a reference line at 50% across all the graphs to better be able to see how close each have come to an even divide.
- In this case you'll be adding it as a reference for people to use to see how close these times are to a more general understanding of how long a movie is. Right-click on the bottom axis and choose Add Reference Line from the menu.
- Leave selected that you want to make a Line from the menu at top. In the dropdown next to Value choose Create a New Parameter. A Create Parameter window will pop up. For Name, write in An Hour and A Half, since that's the text you'll want to appear next to the reference line that you are adding. For Current Value, write in 90 since that's the number of minutes that make up an hour and a half.
- Once you select OK, you'll see a gray line appear on your graph. Repeat the process and create a new parameter named 2 Hours that is set to 120.
- Let's name this graph and change its title to match what it is now showing us. The name I'll choose for the tab is AvgTime&ReviewByGenre and at the top I'll title it Average Runtime and Number of Reviews by Genre.
From a look at this graph, it doesn't look like my hypothesis was verified. There are higher and lower bars (indicating a higher and lower number of average number of reviews) dotted along the horizontal axis that indicates how long these movies are on average. After all three of the genres with the highest numbers of reviews are all clustered near or past the two hour mark. It doesn't really indicate an inverse relationship (where longer runtimes means a larger number of reviews), since there are numerous places where movies with similar average runtimes have very different numbers of average reviews. Horror movies and Kids & Family films have an average runtime that differs by less than 1 minute, but their average number of reviews differ by about 20%. But an average can be dragged up or down by outliers, so let's take a more in depth look at how exactly those runtimes are distributed within a genre with a Box and Whisker Plot.
Creating a Box and Whisker Plot
The worksheets we've created so far have looked as a genre's measure as a whole, whether the measure being graphed was its sum or the average for all movies in a genre. They haven't displayed individual movies within those genre's lists' measures, so while this last graph showed averages, we don't know if there are outlier films with very short or very long runtimes that are dragging that average up or down. A Box and Whisker Plot is one of the possible graphs you can use to have a data point visualized for each of the individual items that make up your dataset (in this case it will be a movie, but it could just as easily be a person, a temperature reading, a book, a year's population data for a city, whatever individual row of information you have in a given category). A Box and Whisker Plot is good for when you want to see what the distribution is of a different measure within a certain category, in this instance what the runtimes for movies are in a given genre list. Are they clustered together around the same median point, or are they spread out more evenly? We'll also be showing another display option you can use on your Marks card to add more information on your graph.
- Start by opening a new worksheet by going to Worksheet from the menu up top and choosing New Worksheet (you can also do this with the icon at the bottom that looks like a worksheet with a little plus next to it, or by using Ctrl+M)
- Click on Genre and, like in our first graph, since we want this to be the measure that appears along the horizontal axis, drag it into the Columns container
- Click on Runtime and drag it into the Rows container, since we want this the measure to be displayed on the vertical axis
- Right now, it creates a bar graph, since it's displaying the runtime as an aggregate sum for each genre. We want each individual title's runtime to be displayed, so let's click where it says Title and drag it onto Detail on the Marks card, so the bars on this graph will be made up of individual segments with the details for each Title.
- Each of these segments is of different sizes depending on the runtime of the movie and some are larger than others. You can hover over any of the segments and it will tell you the movie's title and runtime. The movies are in alphabetical order rather than in order of how long they are so some differences are already pretty visible with different thicknesses appearing next to each other as shorter and longer movies are nested on top of each other.
- Let's convert this to a Box and Whisker Plot by going to the Show Me menu up at the top of the screen and clicking to open it. We haven't explored this menu that much, but it's a way that you can choose different kinds of graphs to display your data. The option you are currently using has a red box around it. Any option that is highlighted is one that will work with the data that you have already added to your worksheet. Any that are greyed out are ones that doesn't work with your current data you've added but if you hover it, it will tell you how many of each data type you'd need to use it. We'll be selecting box-and-whisker-plots, which is the second from the bottom on the right hand side of the options.
- When you click on it you'll see that the graph instantly changes to use the box-and-whisker mechanism rather than the bar graphs.
- We'll be changing some aspects of this graph to display more information, but the basic mechanics are: each dot represents a movie, and how high or low it is placed depends on its runtime.
- The box (the two-tone grey rectangle) of each plot represents the movies that occur around the median or the IQR (Inter-Quartile Range)
- The darker grey section encompasses the 25% of the movies that are shorter than the median runtime, and the lighter grey has 25% of the movies that are longer than the median runtime.
- The whiskers (the long black vertical lines that extend from the box and end in a horizontal line) mark all the movies that have runtimes that are within 1.5 times the span of the box. This goes both higher or lower than the outer edges of the box
- Any dots that occur below or above the marks are outliers.
- For example, for Kids & Family the median runtime was 100 minutes, so that's the line in the middle of the box.
- 25% of the movies with longer runtimes than 100 minutes had a runtimes between 100 minutes and 105.5 minutes so the upper hinge of the box stopped there.
- 25% of movies with shorter runtimes than the median had runtimes under 100 minutes but at least 90 minutes, so the lower hinge of the box stopped at 90 minutes.
- The box has 50% of all the movies in it, and the difference between the shortest and longest of them is 15.5 minutes. 1.5 times 15.5 is 23.25 so the upper whisker stops at the movie closest to 128.75 (105.5 + 23.25) minutes at the top (in this case Hugo at 126 minutes) and the lower whisker stops at the movie closest to 66.75 ( 90 - 23.25) minutes at the bottom (in this case Bambi which is 70 minutes).
- What that means to you non-math folks (like me) is movies where that box is shorter, have a denser distribution of runtimes where most runtimes are near the median. They also have smaller boxes and smaller whiskers. Look at the plot for Animation to see an example of this. It has a lower median runtime than nearly any other genre, and its runtimes are so tightly distributed that its upper whisker is below many of the genre's boxes. Meaning the movies on the longer than usual side for runtime for Animation are lower than the high end of the usual range for most of the other genres. Animation has a smaller box where both the top and bottom seem nearly equally distributed, which seems to indicate a consistent duration for films on the Animation top 100 list.
- On the other side, of it, you can take a look at the movies for Classics. In this one the higher end (light grey) of the box containing the usual range for this genre extends much farther than the lower (dark grey) end, indicating that 25% of the remaining movies that are above the median runtime for Classics of 104 minutes have a much wider variation (up to 125 minutes!) before they hit 25% of the movies over 104 minutes, than ones that have a runtime under 104 minutes.
- Right now this graph is telling us about how runtimes are distributed within genre, but not a lot else. Let's add more information to it. One thing we haven't dealt with yet is how the movies are ranked. Like are highly ranked Horror movies, ones that are on the shorter side of runtimes or are they the ones on the longer side? Or does runtime not matter at all? We don't know from looking at this graph, so let's add symbolism to show what the rank is for each of the movies on this graph is.
- Click on Rank and drag it on top of Size on the Marks card, and you'll see each of the dots become different sizes, but they don't seem to be dramatically different which can make it hard to tell exactly what a given range of dots' rank is. Also when you look over at the legend on the left-hand side of the chart you'll see something that should give you pause, which is movies that have 1 as their rank are the smallest dots on the chart, while the ones that are ranked as 100 (i.e. nearly didn't make the chart) are the largest dot. That is a system that would confuse the casual viewer of the chart who'd probably think that a larger dot would correspond with a larger importance to Rotten Tomatoes's reviewers. We're going to need to do some editing to fix this.
- Select the arrow on the side of the legend marked SUM(Rank) and choose Edit Sizes. In the Edit Sizes window that appears, switch the dropdown for Sizes vary: to select By range and choose the check box for Reversed so that it's actually the dots on the higher side of the spectrum that are the smallest. Next, drag the right arrow on the range in Mark size range bar over so that in the Sample legend: on the side the dot for 1 becomes markedly larger than the one for 100. Choose Apply and you'll see the graph shift. To be noted, it is the center of each of these dots that indicates what its runtime is.
- It can be hard to see the individual differences between the dots, especially now that they are bigger. Click on Color on the Marks card and on the Border dropdown choose the black color swatch, which will put a black border around each of the dots.
- Just to make this whole thing easier to read, go to the dropdown at the top of the screen that says Standard and switch it to say Entire View so that the graph takes up the entire work canvas.
- Looking at the graph with this new information added, we can now see if there are any trends in the runtimes of the the higher and lower ranked movies within the genre. For instance, in Horror where there is a very tight distribution around the median, there are higher ranked movies on either side of it, though medium ranked movies right on it, while in Musical and Performing, with a larger range of movies in its middle 50%, a lot of the movies near the median are the higher-ranked movies.
- It's also interesting to see how rankings fare amongst the outliers. For instance in Action & Adventure, the outliers are all highly ranked and all on the longer side of the runtime scale, whereas, for Kids & Family, the two outliers on the shorter runtime side are both highly ranked, but the ones on the higher side of the runtime scale are all lower-ranked.
- On our last graph, we were also looking at the No. of Reviews associated with a title to see if there was any kind of relationship between the number of reviews a movie had on the site and its runtime, and while we were dealing with the average of both there didn't seem to be a relationship, but at the risk of making our graph too busy, let's see what adding No. of Reviews to our Marks card yields. Let's add it to Color.
- Click on No. of Reviews and drag it on top of Color on the Marks card and you'll see the colors of the dots shift. A legend appears on the side letting us know that dots that are a darker blue have a higher total number of reviews and dots that are a lighter blue have a lower total number of reviews, and I'm afraid that this seems to be the last straw for my hypothesis of shorter runtime = more reviews, at least in some of the genres. The plot for for Science Fiction & Fantasy seems to be especially dismissive of that hypothesis, it's actually the movies on the higher section of the runtime that have the higher number of reviews. The new display also seems to make clear that some genres (even when drilled down to the individual movie level, rather than as a genre-wide average or sum) have much fewer reviews on Rotten Tomatoes than others.
- This graph has the same issue as the previous one however, in that the runtime is listed in minutes, making the user do the calculation themselves on how many hours it is. Fortunately the parameters for Reference Lines that we created on the last graph are ones we can easily add to this one. Right click on the axis for Runtime (in minutes) and choose Add Reference Line. Add the reference lines you've already created for An Hour and a Half and Two Hours and since this range is a lot wider than the one on the previous graph, add new parameters for An Hour at 60 and one for Two and a Half Hours at 150.
- Adding these parameters makes it a bit easier to eyeball that most genres have median runtimes somewhere between an hour and a half and two hours, with the exception of Action & Adventure and Drama. It also makes it easier to notice that only one genre, Classics has a whisker that extends below the hour mark, and only a few have more than one film that falls below the hour mark. Think on what else you notice about different genres from this graph. Next, we'll be creating a scatter plot that lets us more directly compare how two of these measures, runtimes and reviews, relate.
- Before we create a new tab, let's rename this one at the bottom to BoxAndWhiskerRuntime and change its title at the top to be Distribution of Runtimes by genre, since that's the primary measure that it is showing. It's good practice to make sure that all your tabs have informative names and titles, so as you click through a workbook, you aren't faced with a sea of tabs at the bottom that just says Sheet1, Sheet4 and leaves you not sure what sheet is where.
Creating a Scatter Plot
A scatter plot is a graph that shows where an individual datapoint appears in regard to two different measures. In this case, just like on the graph we just made, we'll be making the movies (or more specifically the Title) our datapoints, but we'll make the two axes the Runtime and the No. Of Reviews. So for each movie, the higher it is on the vertical axis, the higher the number of reviews it has, and the further it is to the right on the X axis, the longer runtime it has. If we see any clustering or any clear trend, then that will tell us once and for all whether there seems to be any relationship between these two things.
- Start by opening a new worksheet by going to Worksheet from the menu up top and choosing New Worksheet.
- Since we want the measure across the X axis to be Runtime (in minutes), click on it and drag it into the Columns container in the top of the sheet. It says Sum and is aggregating all of the movies, but don't worry, we'll deal with that in a few steps.
- No. of Reviews is the measure that we want to have as our Y axis, so click and drag that into the Rows container at the top of the sheet. . This creates a weird situation where only one dot appears in the top right of the sheet, but we're about to fix that.
- Just like with the box-and-whisker plot, in order to get a datapoint for each individual movie, we'll click on Title and drag it on top of the Details option on the Marks card. You'll see a cluster at the bottom left of the sheet, but the runtimes look mighty strange, stretching to over 800 minutes, which we hadn't seen on the other sheet at all. Click on the furthermost circle, Seven Samurai, click to View Data, choose Full Data and you'll see why.... the movie is on four different Genre lists, so the number for its reviews and its runtime has been multiplied by 4.
- Let's scroll through this data to see what fields we can use differentiate between different instances of this movie appearing in our data. The only unique fields for each row about this movie (since it's literally the same movie it has the same Id, Production Country, Release date, Rating Tomatometer and so on) are either Rank or Genre. Close the View Data window.
- The primary difference I'll choose is the one we've been including on all our sheets so far, Genre. This is the surefire way to make sure that there's a unique datapoint for the movie for each time that it appears on the chart, because, well, it's only going to be on each Top 100 of a genre list one time. Rank might also work but it isn't impossible that a very well-liked movie that fits into multiple genres could be say, #5 on more than one genre's list.
- Click on Genre and drag it onto the Colors box on the Marks card to tell it to make each of these dots as a different color based on the Genre that it falls into. You'll see that the graph shifts dramatically, with the highest concentration of movies being somewhere between 80 and 120 minutes, and being between 40 and 70 reviews. But it is very densely packed, so it can be hard to see what exactly is going on.
- First we'll try to help this by changing the viewpoint of the graph so it takes up the whole screen. Click on the Standard dropdown at the top and choose Entire View to expand the portion of the screen that the graph takes up. This helps somewhat, but if we scroll over to find our test case from last time, Seven Samurai (located just under 210 minutes, and a bit under 100 reviews), we'll see that its kind of a muddy blue-ish gold color. That's because since it's located on several lists, the colors signifying the genres whose lists Seven Samurai is on are just set up on top of each other, oops.
- From our look at that movie we know that the other field we can almost certainly count on to be unique about each row for a film is its Rank, so click on Rank and like we did on the last graph, drag that on top of the Size box on the Marks card, and the graph shifts again. But in the legend on the side, it's been set up to the same default as before where 1 is the smallest dot and 100 is the largest, so clearly we'll want to reverse that.
- Click on the arrow on the SUM (Rank) legend and choose Edit Sizes. Click on the dropdown next to Sizes vary: and choose By range. Next, click on the box that says Reversed and choose Apply and the graph will change to reflect this.
- Let's also add the reference lines for different runtimes that we made to better help us what different time ranges we are talking about here. Right-click on the bottom axis and choose Add Reference Line, to add the reference lines for An Hour, An Hour and A Half, Two and a Half Hours and Two Hours.
- By looking at our test case again, Seven Samurai, we can now see that it's an interlocking series of rings of different sizes that looks kind of like a target. In fact, there are a bunch of items that look like that throughout the graph. By hovering on them we can see that a lot of them are in a similar situation to Seven Samurai where they're representing one movie that occupies a very different spot on multiple lists, but some are ranked so close to the same on their different lists that it can be hard to see what genres make up the circle. Similarly, some are just different movies that have a similar enough runtime and number of reviews, and a lower rank, like the mess near 70 minutes and around 50 reviews that comprises multiple movies.
- However, before we figure out another configuration let's see what we can tell about our data from this graph. Conversely to what I was beginning to think about there not really being any relationship between a movies runtime and its number of reviews, I believe I can notice a few consistencies.
- The densest configuration of runtimes is that between an hour and a half and two hours, and there are movies with runtimes falling within this all up and down the possible numbers of reviews
- When runtimes movies go much past two hours, the number of reviews clusters at the top and bottom of the scale.
- Movies also seem to be more likely to have fewer than 100 reviews when they go significantly below around 80 minutes and to a lesser extent, past two and a half hours.
- What have you noticed from this graph?
- Let's move on to the next graph, but first, let's name the graph in its tab and change its title. Let's name the tab ScatterRuntimeReviews and make the title Runtime and Number of Reviews by Genre
As neat as it looks to have all our genres on the same graph (seriously, it looks like some 1960s shower curtain!), it isn't really giving us as clear of a picture as we need to make much more observations than that, so we have a few options we can exercise from here. One is creating a filtered version of this graph. The other is to use the Pages function so that we can separate the graphs out by genre and look at them one at a time.
Making a Filtered Scatterplot
Let's use the duplicate function to keep this original graph untouched and make one that only includes the data for the movies on the Horror, Romance and Science Fiction & Fantasy lists. Maybe this is for a project looking at these popular genres that aren't always taken seriously. Maybe I'm just curious because I would think that these three wouldn't have a lot of overlap and I want to see if that's true.
- Right-click on the tab for this sheet ScatterRuntimeReviews and choose Duplicate.
- Rename the new sheet that appears as a result ScatterRomHorrScifi, and for the title at the top change it to Runtime and Number of Reviews for Horror, Science Fiction and Romance
- Click on Genre on the Tables section at the side and drag it onto the Filters card
- In the Filter window that opens, choose to select None, and then check the boxes for Horror, Romance and Science Fiction & Fantasy. Click Apply and you'll see the graph change to only show those options.
- Take a look at the results to see what more you can notice now that it's been narrowed down to only these three genres. For one, it does look as though there is a little overlap.
- There are movies on both Science Fiction & Fantasy and Romance (Groundhog Day, and Shape of Water for two) lists, and a movies on both Horror and Science Fiction & Fantasy lists (Aliens and Bride of Frankenstein both being examples of that).
- But I don't think that I found any overlap of the Romance and Horror lists.
- In terms of runtime it looks like Horror films are mostly concentrated on the lower side of the scale, with only one going past the two and a half hour mark, while both Romance and especially Science Fiction & Fantasy still have relatively numerous and highly ranked movies past the two hour mark and in the case of Science Fiction & Fantasy even a handful of highly reviewed and ranked movies past 2 1/2 hours.
- Speaking of those reviews, a glance shows that there are more higher ranked Romance movies in the lower reviewed section, and to a lesser extent, Horror as well, whereas for Science Fiction a lot of the larger dots indicating a higher rank are in the top half of the graph, indicating they have a higher number of reviews. Let's do another filter to see if this is accurate by filtering this graph further to only show the top 25 movies from each list.
- Click on Rank and drag it onto the Filter card. You'll see that the window that appears looks different, since in this case you're filtering based on something that is a measure (or quantitative data) rather than a dimension (or qualitative data). Click where it says All Values and choose next.
- Leave Range of Values highlighted at the top, since you are going to tell it to look at only a certain range. In the two boxes 1 is in the left hand box, and 100 in the right. Change 100 to instead say 25, and select Apply. The graph will change again to only show that smaller amount of information. Additionally, the size configuration also changes, to only have the sizes go from 1-25 instead of from 1-100.
- Actually in this one that size for 25 is so small that I'll actually go into Edit Size to tweak it upwards. Feel free to do the same with the process you used to edit the size range originally.
- Now what can we see from this revised graph that only shows the top quarter of each of these lists?
- Well, the surmise that top-ranked Science Fiction & Fantasy movies are also the highly reviewed ones seems to be borne out (If you'll look at the names of the movies by hovering over the dot, you'll notice that nearly all are comic-book movies, or sequels to legacy sci-fi movies) Less than a quarter of the top 25 sci-fi movies appear to be in the lower third of the graph, indicating a small number of reviews.
- For Horror, the opposite seems to be true, there are only 2 movies in the top third of the graph, and only one for Romance. Romance especially seems to be mostly in the lower third of the graph in terms of reviews.
- Let's change the title of our graph to reflect what's actually in it, Runtime and Number of Reviews for the Top 25 Horror, Science Fiction and Romance, and move on to our next exercise.
While Filters can be good to let you look at a smaller sections of your data, that might not be what you want to do in all situations. Instead of comparing genres directly to each other, with all the overlap that can create, maybe we just want to look at each genre in their entirety one at a time. Now, we could create a separate graph just for each genre, but that would be very tedious. Alternately, we can use the Pages function as we will in the next section to do this in just a few clicks.
Using the Pages Function
Pages lets you create multiple views within the same worksheet, that you can move through using left and right arrows, just like if you were looking through pages in a book. We are going to set this up so that we can look at each genre on our scatterplot graph one at a time. Let's start from previous scatterplot that contains all the genres.
- Right-click on ScatterRuntimeReviews and choose Duplicate. Immediately change its title in the tab to add - Pages to the end so you'll be able to tell the difference when looking through tabs.
- We'll want to be setting up the Pages to each contain a different Genre's data so click on Genre in the table section and drag it over to the Pages card.
- Immediately, most of the circles for the movies disappear, except for those for Action & Adventure . A little box marked Genre with right and left arrows next to a dropdown appears the the bottom right of the screen. Click on the right arrow, and the graph will shift to instead display the next option for Genre, Animation. This box also has a little slider that you can use to move through the different genres manually, or you can use the play button just below the slider for it to play through all of the genres with a pause at each.
- When you are only comparing a genre to itself, what interesting things can you see about them
- I'm still finding it interesting that a significant segment of the Horror list seems to not have many reviews, and large segment of it has below a two hour runtime.
- Western also resides mostly within the lower-reviewed section of the graph but its runtimes seem to be shifted over a half hour longer than Horror.
- For several of the genres (Mystery, Drama, Science Fiction) there seems to be a cluster on the low-end of the review spectrum, and another cluster at top but with comparatively few movies in that middle 200-300 reviews range, while in other genres the movies are either lower reviewed in general, or the amount of reviews a movie got is distributed more evenly like Animation.
- What do you notice?
The Pages function can be used to sort things by a particular dimension, but also you can use it with a filter on it to actually only capture a sub-set of a given dimension.
Adding a Filter to Pages
Remember how we noticed that some of the datapoints for movies looked like little bullseyes because of how many lists the movies were on? Let's create a pages function to only show the movies that appear on a certain number of lists and page through them. I'm curious to see which movies had these cross-genre appeal, and what genres they were in.
- Let's continue starting from the graph where we've already set up with the colors and sizes how we like them, and go back over to ScatterRuntimeReviews, right-click on it and choose Duplicate
- Click on the tab of this worksheet and change its name to MultipleListMovies so that we don't lose track of what each of these tabs contains.
- Since we want to create a different Page for each of the movies that's on multiple lists, click on Title on the Tables list on the side and drag it onto the Pages card.
- Nearly all dots will disappear because it's just literally created a page for every single movie. Click on the Title box in the Pages card and click on the tab that says Condition. Remember, this Condition option is what you used before to only look at a Genre if it met the condition of having 100 rows in their list? So this is where you'll use to only include a Title if it meets the condition of being on 4 or more Genre lists.
- Click on the dot next to By field and choose the dropdown option for Genre because that's what you want it to look at, Count because you want it to count how many Genres a given Title appears on the list for. Select >= from the dropdown, and type in 4, since you only want it to include in a Title if there are at least 4 Genres it's on the list for.
- Select Apply. A box marked Title with left and right arrows for you to use to maneuver through the Pages that Tableau has created will appear on the bottom right hand margin. There will be a slider to let you scroll through each of the titles that match this condition and a dropdown to let you look through the list. In mine, the first title that appears is 101 Dalmations, and the graph now only contains one interlocked circle. Select Okay.
- We've solved the problem of having a lot of different interlocking circles on one map, however when we move through the Pages some of the genres that a title can fall into are still hard to determine as its rank might be similar on each list it is on.
- Let's think about what the information that we want for each of the movies is. When creating our previous version of the graph, we noted that the only way we could differentiate between different instances of the same movie was Genre and Rank. Currently our graph has one axis set up to be the Runtime which of course...is going to be the same for each title regardless of the list its on, so let's replace that axis with Rank.
- Click on SUM(Runtime(in minutes)) in the Columns container and choose that you want to remove it.
- Replace it with Rank by clicking on that measure and dragging it into the Columns container
- Next, since Genre is the other thing that will differentiate between each instance of a movie on its 4 or more lists, remove SUM(no. of Reviews) from the Rows container and replace it with Genre
- Personally, I think it makes more sense for Genre to continue to be on the bottom as we've done on previous, so I'll swap the columns and rows with the icon at the top just to the left of the ones that sort axes.
- I'll also switch it back to Circle using the dropdown on the Marks card because for me, Tableau switched to a bar graph for some reason, and since I think that provides a misleading picture, as a longer bar is actually going to be associated with a lower ranked film.
- The small dots look a little small for me, so I'll choose Edit Sizes and tweak the button on the Smallest end of the scale so that it makes the ones that are registered even as far down as 100 easily visible.
- Once you've made these alterations use the arrows on the Title box to flip through the movies on 4 or more lists. What do you notice?
- I've taken note of the fact that the graph for some movies consists of a lot of large dots towards the bottom of the screen indicating that it is ranked highly on all of the lists that it is on, whereas for other movies like Groundhog Day, it is ranked from 7 all the way to 62, depending on what list it is on.
- Take note and be constantly thinking about if your graph is actually displaying the information in an easily understandable manner. I think it makes sense to have both the size of the dot representing the movie, and its placement on the graph represent what rank it is on that particular list, but maybe you decide that for your reader, that would be too confusing.
- Change the title of the graph at the top to reflect what it now shows which is Ranks for Movies on 4 or More Lists
Saving Your Work
Now it has come time to save your work.
- Go to File and choose Save to Tableau Public as
- A pop-up window will appear asking you to login to the the account that you set up at the beginning of this tutorial. Log in to the account
- A window will pop up giving you the caveat that you should make sure that this graph doesn't contain private or confidential information as it's going to be on the Tableau Public website and asking you what you want to name it. For this data the bit about it not being private doesn't matter, but note that if you are dealing with any private data, you'll want to either use a different program, or spring for one of the other Tableau options that is not free, like Desktop or Server.
- Give the workbook a name, I'll name it RottenTomatoesTopMovies-Lesson1 and choose Save.
- It will process the request and then when it is done, it will open the workbook you've made in a new tab of whatever web browser you have open. Whenever you save your work a new version of this window will appear with whatever changes you have made.
- To note, it will only display the actual worksheet that you have open in your web browser so if you want to see one of the different ones in the Tableau Public websites, you'll need to click into that worksheet and choose Save to Tableau Public. This seems to be a more recent change, so though it is cumbersome, maybe it is just a temporary bug.
- If you want to use an image of one of the graphs that you've made in a paper or slide presentation, you can either take a screenshot or use the Download option, which is the third icon from the right of an arrow in a box on the website, and choose to download as an image or PDF or Powerpoint, whatever format will help you. Note that all of these things are going to be static, however, and not be interactive.
Here's the link to my workbook that you can go to and download my workbook I made in this tutorial if you got tripped up at any point and want to see how I made it work. In fact, a nice thing about Tableau is that often you can download existing graphs that you like and in a lot of cases see how people whose graphs you'd like to emulate, set up their graphs. Check out their gallery for examples of the work that some people have done (of course, keep in mind that some of them might be using Tableau Desktop or one of the other paid options, so you might not be able to duplicate every neat graph that you find).
Click to the next tab to work more with Tableau Public, but this next lesson will be focused on graphing data where time is one of the dimensions.