Skip to Main Content

DAsH

Research Guide for DAsH (or digital humanities) resources and tools

Cleaning and Organizing Data In Sheets

If you are working on a DAsH project that has involved you grabbing more data than you know what to do with, Google Sheets has a lot of tips and tricks that can help you make sense of that data. You have access to Google Sheets by virtue of your Google Drive at the college so it is free, but it should be noted that the information in it does live in Google's server, so if you have confidential information that you want to keep only on your computer, this isn't the way to go. Additionally, while in their help section Google says say they support up to 5 million cells in Sheets, in reality, if you will be using things like formulas and conditional formatting (as we will be doing below) it will get slow at a lot lower of a cell count than that.  However, I have chosen this program to use to talk about helpful formulas and formatting tricks since it's free and relatively easy, and a lot of the formulas being used have some equivalent in Excel or OpenOffice. 

Using Google Sheets To Clean and Organize Your Data

Learning Goals

Sometimes the data that you will want to use with your project will be messy. In the case of the project that I will be using as an example for these tutorials, the data is messy because it was web-scraped from a site where the information I wanted wasn't set up in a nice neat table.  In other cases, you'll be trying to put together data that came from different sources, data that may contain duplicates or data that was collected without concern for consistency in what labels were used for different events or classifications.

In order to take that raw data and turn it into something that can be graphed or analyzed in other ways, your data has to be organized. Figures in a given column all have to mean the same thing. Choices have to be made about which rows are duplicates and should be deleted from the sheet. Extra formatting characters that just confuse things need to be gotten rid of before they confuse your analysis programs or your readers.

In this lesson, we'll use Search and Replace to get rid of extra characters cluttering up our data and Conditional Formatting to make patterns in our data visible with colors. We'll also talk about how Sort and Filter differ, and how each can be used in different ways to help you see what you have in your sheet. 

Getting Started

  • Go to this linkIf you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson1_Start. Choose Make a Copy. 
  • This will open up the Google Sheets document I have created. I have some minimal information in the header row, since I don't quite know yet exactly how many columns will be needed, but it will still be helpful to have the header at the top of my screen no matter how many rows are scrolled through, so I'll want to freeze it.
  • Select the whole top row by clicking on the 1 on the left side of the spreadsheet. Go to the View menu, choose Freeze  and select 1 row. Now as we scroll down through the sheet the header row will remain on top. 
  • Each row contains the information from the page for an issue released in 2001 by Marvel Comics from the fan -created Marvel Wikia site obtained using a web-scraping script.  More information on web-scraping is available on this other tutorial on the DAsH site.
  • Some cleaning has already been done to this data in a different fashion to try to extract only the link to the image, the volume and title information, and the names of the writers and pencilers (artists). 
    The page on the Marvel Wikia site with indications of the above information that was extracted.
  • Most of the cells are formatted like they are part of a web address with underscores instead of spaces and slashes to separate parts of an address from each other.
    /wiki/Category:Week_45 _2000
    /wiki/Category:Marvel_Staff/Writers /wiki/Andy_Lanning
    /wiki/Category:Marvel_Staff/Writers /wiki/Grant_Morrison
    That's because the information in these cells was part of web addresses on the original site, but we want to make the content in these cells more legible to both us humans who are working with the data and any graphing or analysis programs we want to send this data to. Time to get cleaning our data!

Using Search and Replace to Batch-Edit Your Sheet

You may be familiar with using Search to look for keywords in a document, but Sheets (along with most other editing softwares) will let you not only look for a word or character, but will let you replace it with something else. Throughout this sheet, there are lots of formatting peculiarities due to the information originally being web addresses that are consistent both in how they appear and that they don't serve any useful function, so they are ideal to take out with a Find and Replace. 

  • Before you do this kind of batch-edit, it's important to make sure that there is a raw version of your data somewhere that you haven't touched. In this case, you're working from a sheet that exists on my Google Drive, so it doesn't matter, if you need to backtrack past what a simple Undo command can accomplish, you can just use the link to make another copy of the sheet. This is still a principle you should keep in mind for your own projects though. Keep a copy of your untouched data.
  • Column A contains the Image URL that you'll want to remain functional, so you'll want to highlight all the other columns that contain data, in this case that's B-AK . I figured out that AK was the last column by moving across the cells in the header and using CTRL and a down arrow to move to the last cell with data in a column (It's CTRL+Option+Down on a mac) and that column was the last one that contained data. 
  • With all of those columns highlighted, we'll go to the Edit option on the menu and choose Find and Replace. You can also do this with CTRL H
  • In the Find box put in _  and for Replace with just put in a space. Make sure that the Search option has Specific Range chosen from the dropdown and Sheet1!B:AK as the range. Then choose Replace All. Now the cells look a bit better formatted.
  • The next bit of weird formatting that needs to be handled is the /wiki/ that is at the beginning of most of the entries because they were web addresses. Just make sure all of the same range is highlighted, go to Find and Replace and ask it to Find /wiki/ and leave the Replace with field blank. 
  • Lastly, in scrolling through, I noticed that some cells also had  http://marvel.wikia.com at the beginning of the link. This is also extra information that we want to be rid of, so let's do the same trick of using Find and Replace to Find  http://marvel.wikia.com and Replace with a blank. 
  • Scrolling through, we'll observe that the names of the volumes and the writers and pencilers are now more easily readable.

Using Conditional Formatting To Call Attention to Parts of Your Data 

Now that we've taken out parts of the text in this sheet that we do not need, there are other parts of it that we want to call attention to. We can use conditional formatting to do so. This is a feature on a Sheets as well as other spreadsheet programs that will format cells that have certain kind of content in them differently from others that don't have duplicates. We'll use it in this instance to make cells that contain duplicates of other cells in their column in one case (and row in another) a yellow color so we can decide what to do with the duplicates. We'll also use conditional formatting to make cells with Writer or Penciler in them different colors which will make it easier for me to do the next steps of getting the writers in one set of columns, and pencilers in another. 

  • Start by highlighting the entire sheet by clicking on the button in the top left corner where the rows and columns meet.
  • Go to Format and choose Conditional Formatting. This will open a new window on the right-hand side of the sheet. By default, it is going with formatting the cells if they are not empty.
  • Click on the dropdown below Format cells if and take a look at the options. You can highlight cells that contain, start, or end with certain letters or words, ones that are a certain date or are before or after a certain date, or ones that are above, below or exactly a certain numerical value. While in this case we are working with text primarily, if you are working with numerical values, and need a way to visually see where numbers in a column are above or below a certain threshold at a glance, this can be an invaluable tool. 
  • In our case, we are just looking to highlight cells that contain a certain value which is the word Writers. The way that the site I got the information from set up, the link Category:Marvel Staff/Writers will appear right before whoever is credited as writer. As you may have noticed, there aren't a consistent number of columns for each row which means I'll have to do some housecleaning and moving around of items. That will be made a lot easier if every square that has the link to Writers is highlighted green so that I know that the cell after that has the name of the writer(s) for the issue. 
  • From the dropdown, choose Text Contains and in the box below it, put in Writers. Leave the default color of green as the color you'd like it highlighted. There will be a progress bar at the top as it works to do this, and when it finishes, all of the boxes that have Category:Marvel Staff/Writers in them will have a background color of green. Select Done.
  • The window for conditional formatting will change to just show you the one rule that you have created, below that will be an option to Add Another Rule. Select it. 
  • In the dropdown for when to format cells, choose Text Contains and in the box below it, type Pencilers. Then change the formatting style in the dropdown to select the pink color. After it's done processing the sheet will have all cells with Pencilers in them highlighted pink, thus letting us know that the square after that pink cell contains the name of the artist(s) for that issue. Choose Done.
  • Next up, let's create a conditional formatting rule that will highlight when there are duplicates within a column. This is a request I've gotten previously from students as there can be situations, like say, automated data collecting where there may be multiple entries on the same date or at the same locations where a decision might have to be made on which row to use in the final data. 
  • In this case, the example I'll use is column A where I have the image links for the comics pages stored. I'd noticed while looking at my data that sometimes the same image link appeared more than once. Since I don't know whether the link being used more than once is an indication that the same issue somehow had two pages assigned to it, or if it were some kind of separate issue that reused a cover image and that's why the URL appeared twice, I want to take a closer look at those lines. So I don't have to just scroll through squinting at different lines, I can use conditional formatting to highlight where there are duplicates within the column. 
  • Click to highlight just column A, then in the Conditional Formatting window, choose Add Another Rule.
  • From the dropdown choose Custom Formula, and in the field, paste in =countif(A:A,A1)>1 We'll get into the COUNTIF formula in a later module of this tutorial, but basically what this formula is saying is for each cell in the column covered by the conditional formatting, if a search of the column shows that there is at least one other cell with the same content, that it will apply conditional formatting to that cell. 
  • Make the color you want to change it to yellow, and choose done. 
  • Just the way that we had the formula look for duplicates within a column, now we can do the same thing across a row. Another thing I noticed while looking at my data was that sometimes the same writer or penciler might be credited multiple times on an issue. To keep my data consistent I want to get rid of these duplicates, which will be a lot easier if they are highlighted. 
  • Since this is occurring across the whole sheet, highlight the whole thing again with the button where the rows and columns meet. In the Conditional Formatting window, choose to Add Another Rule. Select Custom Formula from the dropdown and paste in =countif(1:1,A1)>1 We'll be getting to COUNTIF later, but basically,  the same way that the previous formula for columns searched that column for instances where one cell matched another in a column, this one will search through a row, starting with A1 and if the values in a cell are the same as one in another cell in that row, it will apply conditional formatting to those cells. Choose yellow as the color you want it to apply. Choose Done. 
  • A couple of important things to note
    • This formatting (and any conditional formatting across a whole sheet) is going to slow down any movement or calculations you have to do on the sheet, so if it's something you have to apply in order to do some edits, it's best that you remove the formatting once you are done.  You can remove a rule by hovering over it in the Conditional Format rules window and clicking the trash can next to it.
    • Formatting rules are applied in the order that they appear on the list. So in this case, the rule that highlights cells with Writers in them green is at the top of the list, so in the instances that there are multiple cells within a row that have Writers in them, they'll appear as green, even though there is the rule below it that says that cells with the same value within a row should appear yellow. A rule on top of another rule in the list wins out when both would apply to the same cell. Keep this in mind as you are applying conditional formatting, if multiple formattings could apply to the same cell, put the one would you like Sheets to prioritize on top of the other ones on the list. If you want to change the order the rules are applied, click on the three dots to the left of the rule, and drag it to where you would like it to be placed on the list. 
      The Conditional Format rules list with features described.

Now that we've applied Conditional Formatting to make it easier to see the cells of interest to us as we edit, let's use the Sort and Filter tools to organize our sheet in a different order or to only show rows with certain values in them. 

Using Sort and Filter to Reorder your Rows

Sometimes you need to change how the rows in your sheet are ordered to edit it more quickly or make things about your data more clear. A lot of you will have probably used Sort at some point to put items in your spreadsheet into alphabetical or numerical order, but just to quickly run down how, let's say that we wanted order the sheet by column A in alphabetical order so we'd know that any duplicate cells in that column will be next to each other. 

  • Highlight everything on the sheet. 
  • Go to Data and Sort Range. In the window that pops up, check the box Data has header row so that it doesn't pull the header at the top and put it in alphabetical order. From the dropdown for Sort By choose Image URL(the first column) and choose A to Z
  • The sheet will reorder to put itself in order by the Image URL column.
  • Now, say that I've noticed that there are some rows in here that don't have writer or artist information for their issue. Because they have less information, they have cells in a smaller number of columns overall than other rows. So after about 8 columns in (Column H) a row without creator information will be blank. If I want to sort all the rows with blanks in column H next to each other to make them easier to delete or move to another sheet of issues I'll have to look up information on, I can sort by that column instead. 
  • With the whole sheet still highlighted, choose Data from the menu and choose Sort Range, and make sure that the box Data has header row is checked. This time choose that you want it to Sort By Column H and select that you want it sorted A to Z. 
  • Scroll to the bottom of the table and you'll see that the rows with blanks in that column are grouped together at the bottom for you to inspect and delete or to move to another sheet. You don't actually have to do this process of deleting yourself, the next lesson will start off with a sheet that I've cleaned using these tools.

Sorting changes the order that the rows are in on your sheet. But if you just want to focus in on rows with only certain values in a given column or columns without changing the order that rows are actually in on the sheet, you can use filters. 

  • To use a filter, first highlight just the top row of the sheet. Then select the button on the menu bar that looks like a little funnel. It will take a second, but then each of those cells in the header column will have an arrow in it. That means you can choose to only include rows with certain values in that column.
  • For example, in the Month column, if I wanted to only see the issues that came out in December for some reason, I could click on the arrow for that column, in Filter by Values section of that menu leave only December with a check next to i.
  • It's important to note filtering doesn't actually change how things are organized within the sheet. It is just presenting you with a filtered view of the sheet. It's still recommended to use sort to change how the sheet is ordered if you want to do things like move items from one column to another to avoid any unpleasant surprises in rows you might be highlighting but are hidden when you use a filtered view. 
  • In addition to values, you can filter by color. Like in the Image URL column, if we wanted to only see the cells that were colored yellow because they were duplicates, we would go to the top of that column, click on the filter arrow and choose Filter by Color, and then click on the radio button next to yellow.
  • You can also select Sort by color on this window too. 
  • Once you've filtered by color to only see the items that are formatted with yellow background color due to being duplicates, it can help you make decisions about which rows to keep and which not to because your view is narrowed down to just those items. In this case, it looks like one of the issues that had the same Image URL as another came out a different month, and the other issue with a duplicate Image URL has pretty much all the same information, so I'll decide to keep both entries in the first case, and only one of them in the second case. 

With this module, we've looked into how you can use Find and Replace to make batch-editing changes to clean up your data, and Conditional Formatting, Sort and Filter to better see what other places in your spreadsheet might need more attention. If you had any issues throughout the process and want to see a copy of what I did, go to this link and select that you want to make a copy of the spreadsheet Marvel 2001_Lesson1_End. Once you've made a copy, you'll be able to see what I did on my version of the sheet.  

Learning Goals

In the previous module, we've gone over some basic tools to help you clean up your data that's already there. With this section, we'll work on what you do when you want to make new columns with data based on information you already have on your sheet. Sometimes you got the data in whatever format you could and now you need to make it consistent for whatever analysis or graphing program you are using. We'll work on how to use formulas like RIGHT and LEFT in order to derive a set amount of text from the beginning or end of a cell and place that text into a new cell. We'll see how REGEXEXTRACT can be used in situations where the amount or placement of the text you want to select within a certain cell is inconsistent, in order to take a portion of text from within one cell and place it within another. We'll also work with CONCATENATE to join the values of cells together.  

Getting Started

  • Go to this linkIf you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson2_Start. Choose Make a Copy. 
  • Scroll through the sheet after you have opened it. This is the mostly the same information used in the previous lesson, but more organized. Between the previous lesson and this one I used the changes we made to clean up the sheet more.
  • I used the conditional formatting we discussed in the last module to pinpoint where duplicates were down column A and across each row so I could get rid of them when applicable. Using conditional formatting to highlight where the cells were that marked the next cell as a writer or a penciler allowed me to separate those names into different columns, the ones you see for Writer1, Writer2 and so on and Penciler1, Penciler2 and so on.  I used the Sort and Filter options to take all the issues that didn't have either a writer or penciler credited and move them into another sheet in the workbook, the one called NeedInfo. I renamed the main sheet IssueInfo to differentiate between the two sheets. 
  • There are also now two additional sheets ExampleForLeft and ExamplesForRegex. Don't worry about these yet, they are just sandboxes for you to be able to practice some of the formulas you'll be learning later in the lesson to see other situations they may be useful for outside of the data for this project. 

Using RIGHT and LEFT to Pull Characters from Start or End of Cell Text

Sometimes because of how you got your data it has the information you need but it isn't formatted 100% correctly. If it's only a few cells, it's not hard to correct it by hand, but if it's whole columns where each cell is clumsily formatted in exactly the same way, you may be able to use some of Sheets' built-in formulas to pull out the information you want. The formula RIGHT will let you pull letters or numbers from the end of the text in a cell (the right-hand side) and LEFT will let you pull letters or numbers from the beginning of text in a cell (the left-hand side). Let's see how that will work on the data on this sheet. 

  • Take a look at Column C - Year1. It has the year that these comics came out in, but it's formatted as Category:2001 because it came from the category landing page for all of the 2001 comics. We can see that and know what it actually means, but if this were a sheet with all the data and we wanted to make a graph that showed changes over time, a graphing program wouldn't look at Category:2001 and recognize that cell is actually talking about the year 2001. 
  • On my full data for this sheet there are lots of different values in that column from Category:2001 all the way to Category:2018 and all the years in between. Simply copying down the last four numbers at the end of the Year1 column cell by cell into a new column would take too long and be too prone to error, but by using RIGHT we'll be grabbing information that we know is correct. 
  • Right-click on the top of Column C and choose Insert 1 Right to create a new column. Name this column Year2
  • In the first cell below the header, start typing =RIGHT( = starts the formula, RIGHT is the name of the formula you want to use, and ( is the start of the parentheses where you'll place the information on what you want the formula to work with, and whatever parameters you have to give it on how to do that.
  • As you start to type that, a little question mark will appear either next to the cell or in the text field showing what you are typing at the top of the page, click on the and it will turn on Formula Help which will give you information on what the formula does, how to format it, what information it needs from you, and an example.
  • In this case the formula wants the string that we want to copy characters from the end of, and the amount of characters that we want it to copy.
  • Click on the cell to the left of the one we are writing the formula in, C1, and the formula automatically fills in that cell number to make the formula  =RIGHT(C1  It also highlights the cell that we have clicked. It even starts to preview what the results will look like by showing you the character the farthest over to the right in a popup window.
  •  But you don't just want one character, you want the 4 it takes to fill out the year. Following the formatting we got from the formula help, let's put in a comma and then the number of characters we want, 4 and close the parenthesis. Now that we've asked it for the last 4 characters, the preview window shows them - 2001. The final formula will be =RIGHT(C2,4)
  • Press Enter and you'll see that the content as it appears in the cell when you are looking on the sheet says 2001, but when you move the cursor onto it, the larger text/formula line at the top has that formula you wrote  =RIGHT(C2,4)
  • Copy the formula by right-clicking on the cell and choosing Copy - Do not just highlight the text in the box up top and copy from there. 
  • Move to the next cell down and paste in the formula. The cell will also display 2001 but when you look at the text box at the top it says the cell contains =RIGHT(C3,4)
    Sheets automatically alters the formula so that it applies to the cell to the immediate left of the cell you put your formula in (C3), rather than the cell you originally wrote the formula to look at (C2). 
    Sheets systematically does this so it's easier to take one formula and apply it down a whole column. It assumes that if you wanted to use the formula to draw from the C column in one row, that that's what you want it to do for the next row down too. Most of the time this is correct and will help you, but in case a formula that you've pasted down a column isn't working the way you expect it to and you're not sure why, take a look at a contents for one of the cells in the formula bar up at the top (this will make the cell(s) that the formula is referencing be highlighted)  and make sure that  the formula referencing the parts of the sheet that you want it to. 
  • Now that you've proved that the formula does what you want, you can copy and paste it down the whole column. Copy cell D2 where you first put in the formula. Highlight everything from D4 down to the end of the sheet, D541 and then paste. The whole column should now be filled with the last 4 characters from the content in column C2001
  • It's important to note that as you can see by clicking on any cell in column D, what the spreadsheet thinks is in there isn't 2001, but the actual formula, so if you want to copy and paste cells from this column, or refer to them elsewhere on the sheet, you'll need to create a plain text version. 
  • Right click on the top of column D and choose Insert 1 right. Copy the contents of column D. Right-click on the new column that you created and choose Paste Special  - Paste  Values Only which you can also access with CTRL - SHIFT - V. 
  • Name the new column Year2Text and as you move the cursor through it, you'll see that what's in the cells is just 2001 and not the formula. 
  • Right-click on Column D and choose Hide Column so you can avoid confusion about which version of the year column you can work with as if it were text. 
  • It's advisable to take the step of making a new column with whatever your formula results are after your data is finalized enough. After all, if I went back into the C column that the formula is relying upon and made a bunch of changes, the results in column D would change but not in column E

At other times, the characters that you want to grab from a cell could be at the beginning of the text in that cell. This is where LEFT comes in. There wasn't a good place with this data to use it, so I created a new sheet within this project for you to practice that formula with. 

  • Click on the tab for the sheet ExampleForLeft at the bottom of the chart. There's a table on this page that has Interviewee at the top of one column and a blank column with the heading of Rank. Let's say this is a scenario where a bunch of service members were interviewed and while their rank was recorded as part of their name, no one originally made a separate column containing their rank. Now later on in the project, we want to have that title recorded as a separate column so we can see what the demographics in our interview subjects according to rank.
  • Since the ranks are all recorded as their 3 letter abbreviations at the start of Column A, we can do this with LEFT since the data is always in the same position within the cells, and requires the same number of characters. 
  • Go to the first cell in the Rank column and start typing =LEFT(
  • Because we turned on Formula Help earlier in this lesson, a pop up will appear explaining what the formula does, and what parameters it needs. It is very similar to RIGHT, it just wants you to provide the string you want it to return characters from...in this case the cell to the left of this one A2, and the number of characters you want which is 3
  • For the first part, click on the cell to the left of it to let it know that's the string you want characters from. You'll see that the preview returns the character P since that's the first character on the left. Next, put in a comma, and type 3 and close the parenthesis. Your formula will look like this: =LEFT(A2,3) and the preview will confirm that you are getting the information you want, which is the first three characters on the left  from the cell you indicated. 
  • Just like with RIGHT, this is a formula that once you have used it once, and verified that it works, you can paste it down the rest of a column. Click to highlight the cell you just put the formula in, copy it, and then highlight cells B3-B8, then paste. The first 3 characters from the indicated cells will appear in the column. Just like with RIGHT, if you move the cursor down through them, you'll see that what it actually contains is the formula.  So if you did need to use these results as text, you'd need to copy them, and paste them elsewhere as values only. 

Using CONCATENATE to Attach Text from One Cell to Text from Another Cell

Maybe instead of data that you want to separate from one column into two, you could have data in two separate columns that you want to make into one column. A first and last name or a street address, city, state and zip code, are some examples that come to mind of pieces of data that might start out being recorded separately, but that you might want to pull together. In this case of this data,  the year and month of publication are being stored in separate columns. That is fine when we look at it, we can mentally put the two together and know what it's saying, but if we wanted to put this data into a graphing program, it wouldn't associate the two pieces of information together unless they were in the same column. Fortunately, we can use CONCATENATE to join them together. 

  • First let's create a column to the right of our Month column (it's column F now on my sheet) by right clicking on it and choosing Insert 1 right.
  • Name the column Month and Year Formula
  • In the first cell of that column below the header, type =CONCATENATE The program will realize what you are doing and give you the option to select that formula from a list. 
  • Click on the ? next to the cell to see information on how to format the formula as we did previously to access Formula Help. 
  • For this one, what the formula wants is for you to put in the string (text or cell containing text) that you want to appear first in this new sequence, and then enter in any other strings or cells in the order you want them to appear in for the final output of the formula.
  • In our case, we want to take the Month in the Month column and the year in the Year2Text column and put them together formatted like December 2001. So we'd start off by clicking on the cell in the Month column in the same row as our formula so it will look like
    =CONCATENATE(F2, 
    A preview will come up showing that it is reading what is in the cell, in this case, December. 
  • Next, add a comma to separate this cell from the next string or cell we want to add. After the comma, click on the second string that we want to have in this new cell which the text in the Year2Text column, E2. Your formula will now look like this =CONCATENATE(F2,E2
  • But uh-oh, when the preview loads you'll see that it ran the two words together. That's because CONCATENATE takes what you say utterly literally as computers love to do. You didn't tell it to put a space in between the contents the two cells, so it didn't put in a space. Fortunately, this formula will accept both cell numbers and strings, you just have to put any other strings you want it to add into your new sequence within quotation marks.
  • Between F2 and E2 in your formula add in " ", separating that new addition and the two cell numbers with a comma. The preview at the top will now change to display a space between the month and the year. Close the parenthesis. Your formula should now be =CONCATENATE(F2," ",E2)
  • After closing the parenthesis and pressing enter, the cell should now read December 2001 Sometimes, Sheets will ask you if you want to autofill the rest of the column. But if it does not, all you need to do is copy the successful formula you made at the top of the columnhighlight all the cells in that column where you want that formula to be(on my sheet it is down to row 541) and paste, and it will paste the formula into each of those columns.  Sheets will automatically adjust the formula so it is referencing the correct cells for that row. 
  • Just like with RIGHT, if you move the cursor down through these cells in column G, you'll see that what the cells actually contain is the formula.  So if you did need to use these results as text (like if you were say going to graph these results or try to paste them in as a table into another sheet), you'd need to copy the column, and paste it elsewhere as values only. 
  • I'll do that now by creating another column to the right of column G, copying column G, then right-clicking on column H and choosing Paste Special and then Paste Values Only. Then I'll change the header for my new column to say Month and Year Text
  • Lastly, I'll right click at the top of column G and choose Hide Column. This way when I want to reference where I have the month and year columns combined, I'm sure that I'm referencing the column that has the actual text in it.  

Using REGEXEXTRACT to Pull Only Certain Data from Another Column

With the previous two modules, we've seen how to get a specific amount of characters from the start or end of text in one column and push it into a new column as well as how to take the contents of two columns and put them together. However, there may be some circumstances where you want to take data from a column and put it into a new one but one or both of these complications means you can't just use RIGHT  or LEFT

  • If it isn't consistent how many characters are in the data that you want to grab from a column. Like, if you want to grab street numbers from a column of street addresses but there might be 1,2,3 or 4 digits involved in that street number. 
  • If the data you want isn't going to be at the beginning or end of a string in a cell and might not even be located in the same place within that string. Like if you have a column of dates that you want to grab the month from, but some are formatted like April 25, 2012  and some are formatted like 22 Dec 2012, you wouldn't be able to use a location-based formula like RIGHT or LEFT to get that month since where the month is in the text of the cell varies. 

This is where Regular Expressions (or regex) can come in.  I get more into Regular Expressions in another tutorial on the DAsH research guide, but basically it's a method supported by many programs and programming languages that allows you to not only search for certain exact key words or phrases but also for certain patterns of characters within a text. So in the case of that month example, if I knew that all the months in the column that I want to be able to pull the month from are the only places in their cell that  use letters instead of numbers, I could use the REGEXEXTRACT to deliver only the group of letters from within that column. 

A good quick reference sheet for Regular Expressions has been created by MIT and it will be the one I use throughout this tutorial. We'll keep it relatively simple here, limiting our exercise to just a few of regex's possibilities, but truly, there are a lot of things you can use Regular Expressions for. 

So on this sheet Column  B -  Volume Information has the information about what volume of a comics title the issue is in. The thing is, it’s got the volume number attached, so a sample cell says Amazing Spider-Man Vol 2, but the only information I care about in that cell is the title Amazing Spider-Man . That Vol 2 will trip me up if I want to just see how many times that title appears across all its volumes or if I want to create a program that searches for that title name on a website to give me more information about it. 

I could use the same Find and Replace feature from the last lesson if the thing I wanted to get rid of was only ever Vol 2, but the problem with that it isn’t just Vol 2 that is at the end of the string, some have Vol 1, some Vol 3. While I guess we could perform Find and Replace for each of those iterations until we’d covered all of them, that’s a lot of extra work, so its fortunate that we can use a regular expression to do some of the heavy lifting for us. 

  • Scroll down through the Volume Information column to see what you can notice about what the characteristics are about the text you want, and what text surrounds the wanted text. Either of those characteristics can be the key to writing a regex that will match with what you want it to grab for you. With a closer look, I can see that there are a series of words, and sometimes numbers (i.e. the very first cell  100 Greatest Marvels of All Time Vol 1) that are at the beginning of the text in the cell, then the word Vol and then a number.
  • This means I'll want to write a regular expression that will match with any number of words and numbers, as long as they occur before Vol and a number. Once there is an expression that matches that pattern, I can use REGEXEXTRACT to extract (in this case, this will basically mean copy) that portion of the text in the cell where my formula is. 
  • Let's take a look at the MIT cheatsheet to see what we need to 
    • (a) symbolize any kind of text
    • (b) symbolize the scope of the section of the text we want to extract
    • (c) symbolize the digit after Vol that we know will be there if it's listing the volume number of that title. We could just write the regular expression to pull only the text before Vol regardless of what's after Vol, but on the off chance there's a comic about a, say a volleyball team, or an adaptation of the movie "Turn Up the Volume" let's be safe and ask it to pull the text before Vol and any number. 
  • From this sheet, we see that
    • . is a symbol matches with any kind of character (letter, number, space, punctuation, etc).
    • We can add * to any character or symbol to say that the regular expression should match with snag anything that is 0 or more repetitions of that character or symbol.
    • That parentheses brackets are the way to specify the scope of what we are looking for as our matched text, in our case, what we want to ask REGEXEXTRACT to actually extract for us, and display in the cell we have the formula in. 
    • \d is the symbol we can use to match with a digit, i.e. 0123456789

Using what we found with the cheatsheet, we can build a regular expression that will look for any kind of text that occurs in a cell before Vol and a digit. 

  • Let's create a column for our new formula to go in. Right-click on Volume Information at the top, and choose Insert 1 right. Name this new Column Title Information Formula
  • In the first cell below the header (C2), start typing =REGEX and you'll see that it gives you a couple of different options:
    • REGEXMATCH which will just output true or false in the cell based on if the regular expression you give it matches the cell you indicate
    • REGEXEXTRACT which is what we’ll be using to copy the information the regular expression is a match for from the indicated cell and paste that information into the cell the formula is in 
    • REGEXREPLACE which a way that you can tell it to find a group of text using regex and replace that text with something else
  • Select REGEXEXTRACT and it will start to populate the cell with the formula. Click on the little question mark at the top and it will display the help information for this formula, along with an example.
  • Basically, what this is telling you is first in this formula, you'd put in the text you'd like to extract something from with a regular expression, and next you'd add in the regular expression you are looking for a match for. The latter is supposed to be within quotation marks. 
  • Click on the cell to the left of where you have the formula, the first cell in the Volume Information column since that's the input text we want to search through with the regular expression.  Add a comma to move us on to the next section
    =REGEXEXTRACT(B2,
  • Start with a set of quotation marks, and then in between them start a parenthesis. Within the parenthesis is where we will put in the regular expressions stand-in for the portion of the text we want it to extract
    =REGEXEXTRACT(B2,"()"
  • In between the parenthesis put in first a so that it will match with any one character. The preview will now display a 1 since that's the first character of the text in the cell we've told it to look in.
  • Add a * next to the and you'll see that the preview now shows the contents of the entire cell, because by adding * as a quantifier to your . symbol which stands in for any character, the regex is grabbing all the characters because it's matching any character 0 or more times. 
    =REGEXEXTRACT(B2,"(.*)"
  • But if we just wanted to match the entirety of the cell, we could just copy and paste the quantity of the cell. Next go outside of the scope of the parenthesis to specify where you want the regular expression to stop matching. Still within the quotation marks add in Vol , a space and \d. The \d  is a stand-in for the number that comes after Vol. You'll see that the preview now just shows the portion of the cell text in B2 that comes before Vol 1
    =REGEXEXTRACT(B2,"(.*) Vol \d"
  • The preview now shows that it's delivering what we want, so let's close the parenthesis for the formula and press enter to complete it. Now the Title Information Formula column in the first cell shows the same information as Volume Information, just minus the Vol 1 at the end. 
  • It may give you an option to just auto-fill the formula down every cell in the column. If so, go with it. If not, all you need to do is copy the formula that you put in C2 by copying the cell. Then highlight every cell in the column that you want the formula's results to appear in and paste. 
  • Scroll through once it's done to make sure it always gave you the portion of the cells that you wanted it to. In this case, that's a yes. In other cases, you might either take a closer look at the formula to figure out how to tweak the regular expression to work on whatever exceptions exist in the cell it didn't work for, or if the cells it fails for are few enough, you might just correct the ones that are wrong by hand. 
  • Just like with RIGHT, if you move the cursor down through these cells in Column C, you'll see that what it actually contains is the formula.  So if you did need to use these results as text, like if you wanted to copy them into another spreadsheet, you'd need to copy them, and paste them elsewhere as values only. 
  • I'll do that by creating another column to the right of column C, copying column C, then right-clicking on column D and choosing Paste Special and then Paste Values Only. Then I'll change the header to say Title Information Text
  • Lastly, I'll right click at the top of column C and choose Hide Column. This way when I want to reference where I have the title information, I'm sure that I'm referencing the column that has the actual text in it.  

To get a little more practice with regular expressions, let's click over to the sheet ExamplesForRegex. I've put in a couple of different tables that we can use to practice a few different scenarios for these. 

In the first table,  there are bunch of different addresses under where it says Street Address and a blank column next to it marked State Abbreviation. Let's say we wanted to quantify how many of these addresses were in each state, but when we collected the addresses, we didn't make a separate column for the abbreviation for the state. We can't use RIGHT or LEFT to get to the state since they're within the middle of the text, and they aren't consistently the same length from either end. 

  • Upon closer inspection, we'll see that state abbreviations are always 2 capital letters next to each other and it looks like that's the only place in each street address where that pattern of letters occurs, so that means if we write a regular expression that matches with two capital letters next to each other, it will match only with the state abbreviations. 
  • Let's go back to the MIT cheatsheet and see how we can symbolize two capital letters together. In the section on character classes we see that if we put characters within brackets, we are telling it to look for those specific characters or ranges. So if I want to have it just look for capital letters I can use the character class [A-Z].
  • To look for a capital letter more than once in a sequence, I can use the { } mentioned in the Quantifiers, to specify that I want exactly 2 repetitions of that range, so [A-Z]{2}
  •  So, let's put this all together in column of the table in ExamplesForRegex 
  • We'll start out with =REGEXEXTRACT( then click into the column to the left since that's the text we want it to be searching for our regular expression, and then add a comma. It'll now read =REGEXEXTRACT(A2,
  • Let's start off with the double quotation marks, since our regular expression will need to go in quotations, and then the parentheses where we'll be placing the regular expression we'll be using to search for a match 
    =REGEXEXTRACT(A2,"()"
  • Within those parentheses, let's first add the character class for the range that we want it to look for, capital letters. Character classes go in brackets, so it will look like this =REGEXEXTRACT(A2,"([A-Z])" In the preview, it will show that right now, it's displaying the first capital letter in A2 which is an F
  • But what we want is to display where there are 2 capital letters next to each other. To request that it look for 2 characters in the range we set of [A-Z] next to each other, we need to add the quantifier {2} to that range. Keep it all within the parentheses and quotation marks. The formula will now read - =REGEXEXTRACT(A2,"([A-Z]{2})"
    Now we'll see that the preview window shows the result that we want which is NY, the state abbreviation that consists of two capital letters. 
  • Close the parenthesis for the whole formula and press Enter. NY will now be the visible text for B2
  • It may give you an option to just auto-fill  the formula down every cell in the column. If so, go with it. If not, all you need to do is copy the formula that you put in B2 by highlighting the cell and copying it. Then highlight every cell in the column that you want it to appear in and paste. 
  • Scroll through once it's done to make sure it always gave you the portion of the cells that you wanted it to. In this case, that's a yes. In other cases, you might either take a closer look at the formula to figure out how to tweak the regular expression to work on the cells where it does not work, or if the cells it fails for are few enough, you might just correct the ones that are wrong by hand. So if within a street name or city name there are other occasions that two capital letters are next to each other, you might have to rewrite the regular expression to make sure it was only getting two capital letters by themselves rather than within a word.
  • Just like with our other REGEXEXTRACT example, if you move the cursor down through these cells in Column B, you'll see that what they actually contain is the formula.  So if you did need to use these results as text , you'd need to copy them, and paste them elsewhere as values only. 

For the second table, let's say that we wanted to extract the year from the date that's in column D. If the people who'd recorded the date were consistent, this would be easy, however some dates have AD, BC or BCE at the end of them which means we can't use RIGHT to pull just the last 4 characters from the text in the column. Additionally we can't use LEFT to get to it from the other end because the months and days as they are written out here all have a different number of letters in them. So this is another occasion where REGEXEXTRACT can be very helpful. 

  • First, let's look at the data in column D and see how we could sum up what a year would look like as a regular expression. It is a group of digits, but we can't just ask it to look for a digit, since the day is also expressed as a digit. We see that a year is 4 digits in a row...although in this case we see that at least one year is 3 digits (D5).
  • Not to worry, just like how we were able to express that we wanted a specific number of symbols for our character class before, we can also express that we need a specific range of quantities of digits, by putting a range within those curly brackets. In the Quantifiers section we'll see that if we use {3,4} it will look for between 3 and 4 (inclusive) instances of whatever character we are asking it to look for. In our earlier section about REGEXEXTRACT we already learned the symbol for a digit -  \d
  • So, let's put this all together in column E of the table in ExamplesForRegex
  • We'll start out with =REGEXEXTRACT( then click into the column to the left since that's the text we want it to be searching for a match, and then add a comma. It'll now read =REGEXEXTRACT(D2,
  • Let's start off with the double quotation marks, since our regular expression will need to go in quotations, and then the parentheses where we'll be placing the regular expression we'll be using to search for a match 
    =REGEXEXTRACT(D2,"()"
  • Within those parentheses, let's first add the \d special character that will search for a digit, this will look like  
    =REGEXEXTRACT(D2,"(\d)" 
    In the preview, it will show that right now, it's displaying the first number in D2 which is a 7
  • However, we want it to only return digits that appear in a group of either 3 or 4 digits in a row, so we'll add a quantifier to the \d special character using the curly brackets. This will look like 
    =REGEXEXTRACT(D2,"(\d{3,4})"
    And in the preview, we'll see that it's now matching to what we want it to which is the year 1780 in column D2
  • Close the parenthesis for the whole formula and press Enter. 1780 will now be the visible text for E2
  • It may give you an option to just auto-fill  the formula down every cell in the column. If so, go with it. If not, all you need to do is copy the formula that you put in E2 by moving the cursor onto the cell and copying it. Then highlight every cell in the column that you want the result of the formula to appear in and paste. 
  • Scroll through once it's done to make sure it always gave you the portion of the cells that you wanted it to. In this case, that's a yes, it even found the year that had only 3 digits because of how you worded your formula. In other cases, you might either take a closer look at the formula to figure out how to tweak the regular expression to work on the exceptions too, or if the cells it fails for are few enough, you might just correct the ones that are wrong by hand. So if it turns out that there are occasions where there are double-digit years, you'd have to either correct those one by one or try to rewrite the regular expression to find digits that occurred later in the text.
  • Just like with our other REGEXEXTRACT example, if you move the cursor down through these cells in Column E, you'll see that what they actually contain is the formula.  So if you did need to use these results as text (like if you were say going to graph these results), you'd need to copy them, and paste them elsewhere as values only. 

With this module, we've gone over different formulas to be used to extract specific text from text used in a cell using  RIGHT or LEFT if the portion required is in a specific location and consists of a uniform number of characters or REGEXEXTRACT if more flexibility on location or character number is needed. We've also seen how CONCATENATE can allow us to combine the text content of multiple cells. If you've had any issues following along and want to see a copy of my finished work for this lesson, go to this link to make a copy of the spreadsheet and see all of the formulas and the finished product.

Learning Goals

Once the data within a spreadsheet is cleaned, it is easier to take stock of what you have in it. With this lesson, we'll use some of Sheet's functions to begin to get summaries of the data we have in the sheet. We'll see how this can be used to apply to text values and see many how many unique values are in a column using UNIQUE.  We'll also use COUNTA and COUNTIF to see how many times different terms are used within a column or row. 

Getting Started

  • Go to this linkIf you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson3_Start Choose Make a Copy. 
  • Scroll through the sheet after you have opened it.  Between the previous lesson and this one, I deleted the extra sheets that we used for exercises and deleted the extra columns that contained the formulas for Year2Text, Month and Year Text and Title Information Text. Having a lot of extra formulas on your sheet can drag down how well you can move through it and its load time in general.  I didn't think twice about deleting it, however, because I was still keeping the original copy that had the formulas in case I had any question about why a cell in these text columns was the way it was. I was also able to cut the formulas because the columns that the formulas were referencing weren't going to have any more changes made to them. If you are in a situation where the column you are referencing in a formula is going to keep changing (like if you're waiting to find out if the data in it is final or not) you wouldn't want to delete the column with the formula. 
  • I've added a few new columns for Total Writers on Issue  and Total Pencilers On Issue. They are blank and we'll be filling them in later in this module using formulas to total that up for the rows. 
  • There are also an additional sheets called AllTitles and AllMonths. They are blank for now, and you won't need to worry about them yet. 

Using UNIQUE to Find Out all the Values in a Column

Within column C - Title Information Text there is a row for every issue of a title that came out in 2001. For my project, I had to have a list of all the different titles published so that I could look the titles on that list up and get extra information on whether the superhero in it was a man or a woman. If I just wanted to get a list of every title included in this column, I couldn't just copy the whole column because there are a lot of duplicates. The UNIQUE function will just look at an entire column or row of data and just list all the unique values that are in it. So if I use that function on this column, no matter how many times "Deadpool" appears as the entry in column C it will only appear on the unique value list once. Awesomely, this even works if the column or row isn't sorted in alphabetical order. 

  • Go to the AllTitles section sheet in the workbook. In the top column, write =UNIQUE( and it will start to suggest that formula. You can click on the suggestion, and it will give a further help on how to set up the formula. Sometimes you'll need to select the little ? next to the formula up at the top to do this same thing.
  • What the formula requires in the parenthesis is the range of cells that we want it to make a list of unique values from. The second two values the formula wants parameters for are: whether the data is being filtered by columns (TRUE) or rows (FALSE) and whether to only return entries that don't have any duplicates.
    Both of those last two parameters are going to default to FALSE if we don't tell it which we want. This is good in our case because we are only looking at one column, so we want it to be checking if each row is unique (so by_column should be FALSE) and we want the second one to also be set to FALSE as we know nearly every one of these items is going to be listed more than once within the column because they're the titles of series.  
  • So let's give this formula the range that we want it to use. First, click anywhere on the sheet, and you'll see that it starts to populate the formula with that cell number, and that cell will appear highlighted.
  • Click into the IssueInfo  sheet, using the tab at the bottom of the workbook and highlight all of column C with the exception of the header. You'll see that the whole column gets highlighted, and the formula up top gets shifted so that it it includes the sheet number before it lists the cell range you want it to apply to. The location where the formula is stays in cell up at the top - AllTitles!A1 This is what will happen when you want a formula in one sheet to reference another sheet within your Google Sheets workbook, it will include the name of that sheet within the formula.
  • For the other parameters, the default values will work for us, so close the parenthesis.
    This will look like - =UNIQUE(IssueInfo!C2:C541)
  • Press Enter and it will execute the formula and take you back to the AllTitles sheet. All the unique values in the Title Information Text column will appear in a list below the cell that contains the formula, in the order that they appeared in the original column.
  • But the list that goes down that column is only there because of the UNIQUE formula, so if you want to do anything with these names, like alphabetize them or edit them, you'll need to highlight and copy and then paste them into a new columns as values. You'll also be unable to sort them, they can only appear in the order that they appeared in the sheet. 
  • A  way around this is to change the UNIQUE formula so that they are sorted. Go up to the cell that contains the formula, and after the =, add  SORT( . Then at the very end of the UNIQUE formula, add another parenthesis to close the parenthesis from the SORT formula that you've added so it encompasses the results of the UNIQUE formula.
    It will look like this: =SORT(UNIQUE(IssueInfo!C2:C541)) Press enter
  • Now your list is sorted alphabetically. How you have the formula set up now, it is dynamic, and so if you make any changes to any of the values in the range you listed, it will change how they are listed on this sheet. 
  • Let's do the same thing for the Month column. Go to the AllMonths sheet and in the first cell, type =UNIQUE(
  • Click anywhere within the sheet to trigger it populating in the cell range you select, then click into the IssueInfo sheet in the workbook and highlight the entire column of column F - Month, with the exception of the header row, so F2:F541
    The formula will look like  =UNIQUE(IssueInfo!F2:F541)
  • Close the parenthesis and press enter and it will take you over to the AllMonths sheet where all of the unique values in the Month column are now listed.

This might be a thing you want to do if you want to get a list of all the different values you’ll have to feed into an API or search for in some other way. Rather than copying the list, and then using Remove Duplicates, then Sort to get the list alphabetical, this formula accomplishes the same thing, just in one step.

Boiling down all the unique values in a column can also be a way that you might wind up finding problems with your data. For example, if you have days of the week that people wrote into a spreadsheet by hand, you might discover that some wrote out the day in full, some wrote it abbreviated, and some misspelled it. If it only happened a few times, you might miss it until it came time to graph your data and then you'd have to go back and correct it all one by one. Even within this example, on the AllMonths sheet, we can see July appears two times on the list, one with a space in front, one without. That's something we'd look into correcting, but we'd want to figure out which was the right one first.

Creating this kind of summary of what values exist your data can also be a good first step to coming up with a summary for how many times those values occur within your data, so lets see how many times each of the months appear as a publication date on our sheet. . 

Using COUNTIF to count the totals for each value

Now that we've created a sheet that contains all the publication months listed for issues that came out in 2001 we can see how many issues came out each month using COUNTIF, which is a formula you can use when you want Sheets to tell you how many cells fit a certain criterion within a certain range. 

This criterion can be numeric if the range you want to search is numbers, so, say if you had census data, you could use this function to count how many cells in a column for Age listed a number below 10 . If a column has alphabetical values, like in our example sheet, you can have it count how many cells in a row or column match a certain text either in its entirety, or if it begins or ends with a certain group of letters. 

If you have a list of all the values in a column (like we do because we just used UNIQUE to get one) getting a count of how many time each of them occurred is pretty easy with COUNTIF

  • Go to the AllMonths sheet and click into the cell next to the first value in your months list, B1. Start typing =COUNTIF and a popup will appear giving you the option to click on an arrow and find out more about this formula. Click on it. 
  • The help popup lets us know that the two items needed to execute the formula is first, the range, or what cells we want it to look through to count up how many times our search term or numeric condition is met, and second the criterion, the search term or numeric condition we want to be counted. 
  • In our case, the range we are going to search is the column of F, the same one that we searched to generate the list of unique values in the cell to the left of this one, so first, click anywhere on the sheet to let it know that we are going to select something within the sheet. Then click into the IssueInfo sheet of the workbook and highlight the column. Add a comma. The formula will now read
    =COUNTIF(IssueInfo!F:F,
  • We've told it the range that we want it to look through when it is counting, next we need to tell it what to count, the criterion. In this case, we want it to count how often each month appears.
  • Go back to the AllMonths tab. Click to the cell to the left, A1 where the word November is. You'll see the preview window pops up with the number 49. That means that November is the publication month listed for 49 of these rows. 

     
  • Close the parenthesis and press enter. Now you can copy this formula and paste it alongside the rest of the months. It will automatically update the criterion portion of the formula to use A2 when you put the formula in B2 and so on. (Note: if we hadn't highlighted the whole column, that is a thing we'd have to worry about for the rows in the first part of the formula. The range, if there were rows listed in what to search would switch to counting one row further down just like the criterion portion of the formula  - Keep this in mind when using a formula down a whole row or column.)

     
  • From this list we can see a couple of things. For one,  for those two different entries for July, the one without a space at the front only occurs once. That tells us that one is in error, so we'd want to make that and the other July match on the regular sheet where the information came from. We can also see that the number of issues per month falls during the summer, so maybe series tend to start at the last few months of the year and end at the beginning of the first few months of the year? That's a premise we could test if we had data for more than one year to see if it was consistent. So making these kinds of summaries can help you see what sort of trends your data shows.
  • It should be noted that for both columns of this AllMonths sheet, looking at it as a whole, it appears to contain the actual numbers, but in reality, the cells only have the formula so if you wanted to work with this information like it was the actual values, or paste it as a table somewhere else, you'd have to do the same copy and paste as values trick from the previous modules.  

Using COUNTA to count the number of cells with values in a row

If instead of wanting to count when a certain condition is met, we simply want to count how many cells in a row or column have values in them, whatever those values are, we can use COUNTA

In columns J through N, the writers credited for an issue are listed, one name per cell and in columns O through V, the names of the pencilers credited for an issue are recorded the same way. If we want to find out the total number of writers or pencilers credited for each issue, we can use COUNTA to get that total for each row across each of those columns.

  • Go to the column where the totals for each issue's writers will go, column H and start typing the formula =COUNTA( 
    The formula will pop up as one of the options for you to select. Click on it, and Formula Help will give you more information about the formula. 

  • It lets you know that you just need to give the formula a range (or ranges) to look through, and it will count up the number of cells with any text or numbers in them that are in that range. 

  • So since we want the total for the number of writers on this issue, and we know that data is stored in columns, J-N we'll use that as the range for the counting to occur in. Click on J2 and then stretch the box all the way over to N2. The preview will let you know that the result of the formula is 1 which will match with what you see by looking at what the contents of those columns in this row are. Close out the formula by using an end parenthesis and hit enter. The final formula will look like:
    =COUNTA(J2:N2)

  • Let's do the same thing but for the pencilers. The range that contains the names of the pencilers is columns O-V. So click in cell I2 and start another COUNTA formula. This time the range you will click and drag across will be O2:V2.The preview will tell you that the result is going to be 1 which matches with what you see when you look at this row. The final formula will be:
    =COUNTA(O2:V2)

  • Highlight and copy H2 and I2. Then highlight the rest of the cells in those columns all the way down to where the data ends at row 541 and then paste. It will automatically amend the rows in the range so that the formulas are looking to the columns J-N for the writer total column and O-V for the penciler column I, but with row number changed to refer to that particular row.

  • Once you've broadly applied it, scroll through and spot check some lines here and there to make sure that the results that the formula gave makes sense with the amount of writers or pencilers that you can count in that row.

These are just some of the formulas you can use to create numerical summaries of your data. COUNTUNIQUE will count the number of unique values in a range. SUM will total up a range of numbersSUMIF works similarly to COUNTIF, and will give you the sum of whatever cells within a range match your chosen criteria. 

If you had any issues with following along with the formulas we've explored here, check out this finished form of the sheet I made by following these steps. If you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson3_End Choose Make a Copy. You can look through this copy of the sheet to see what my version of this with all the formulas looks like. 

Learning Goals

Once the data has been cleaned and organized, next steps can be taken. One step we can take is to augment the data in this table with data from another source. Using VLOOKUP, we'll be setting up a relationship between our main sheet and a sheet with additional title data in order to get information from that supplementary sheet onto the main sheet. With this accomplished we'll use the Explore and Column Stats function in order to get summaries of our data. We'll also see how to create a Pivot Table to put together our summarized data and see how to visualize that data in a Chart. These are key last steps to converting this collection of data into something understandable to an audience who just wants a summary of what you've found rather than all the raw and unprocessed numbers.

Getting Started

  • Go to this linkIf you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson4_Start Choose Make a Copy. 
  • You'll see that it contains most of the same information as the one we had at the end of the last lesson, with a few exceptions. For the columns where we had our totals for pencilers and writers, I've converted them from being formulas to just being the values we found with those formulas so the file doesn't lag too much. The same thing is true for the listing of issues produced per month in the AllMonths tab (where I also cleaned up the fact that there were two entries for July, one with an extra space at the beginning, one without)
  • I've also added a new column in the main sheet IssueInfo just to the right of our Title Information Text column, called Gender of Title Protagonist(s). It's blank for now, but we'll be feeding information into it from the sheet AllTitleHeroGender that's also been added to this workbook from the one we made in the previous lesson. That contains all the different titles on this sheet in one column and what gender their title protagonist(s) are or the gender makeup of the title team in another. 

Using VLOOKUP to Connect Data From Different Tables

Some of you may be familiar with the concept of joining a map layer or table to another with ArcGIS, some may have worked with a database program like Microsoft Access in order to connect one sheet to another but Google Sheets has a function called VLOOKUP that lets you create something like a relational database to link data from different sheets on the same entity together. 

Basically what a relational database lets you do is, if there is a key value that exists on two different sheets, like say, the name of the high school, you can take a value associated with that high school on one sheet, and also associate it with that name on the other sheet. So, let's say we had a table that had graduation rates by year for a bunch of different high schools that we got from the Board of Education, but we wanted to see what this information looked like on a map. The Board of Education graduation rate data didn't have coordinates for any of the schools recorded, so we couldn't do that with the information on the existing sheet. Next,  we found a listing of coordinates for each of these high schools. But, not surprisingly, the table with the geographic information, didn't have graduation rates. With a relational database, the two tables can linked to each other with the name of the high school serving as a key, and the geographic information for say, Midtown High School can be brought into the table about its graduation rates each year.

This can be a very handy thing to know how to do when you are trying to unite data about the same person, place or other entity that you got from different sources. 

In this instance, I have one sheet where I have the information about when each issue of Marvel Comics was published in 2001 and who worked on it: IssueInfo and another sheet where I have information on each title about what gender the superheroes in that title are: AllTitleHeroGender. Now if I didn't know about relational databases I could just go back and forth between one sheet and the other and cut and paste in the information, but it would take forever and I would inevitably make some mistakes.  

Or we can use VLOOKUP (Vertical Lookup) and have a formula do the same thing at a much larger scale: look up the value we want to find on the AllTitleHeroGender sheet, and when it finds the row that value is in in the _Title Column, move over to the next column we want it to and return the value that is in the same row as the title we were looking for but in the _Title Protagonist(s) Gender column.

  • Go into the AllTitleHeroGender sheet and take a look at it. You'll see that in column there is a the list of unique titles that we made last time around and in column B there is a designation of who the hero or heroes of the book are. Whether it's about a single character, a duo or a group and what the gender of the hero or heroes are, or what the gender makeup of the group is. You might notice that the names in the header row have an underscore (_) at the front of them. This is because the column of values to be searched through has to be in alphabetical order for VLOOKUP to work correctly. If you have a table that already comes with headers, just put a _ at the front of the version that you've pasted into Google Sheets and as long as the rest of your data doesn't have an underscore at the front of any of the entries, the formula will read it like it's already in alphabetical order. 
  • Go to the Gender of Title Protagonist(s) column in your IssueInfo sheet and start to type =VLOOKUP( and click on the popup when it suggests VLOOKUP. Take a look at the parameters it wants you to give it. 
  • The items that the formula wants, in order, are: 
    • search_key - the thing you want it to search for  
    • range - the range of cells you want it to look through, with the first column in that range being the one it will look for the search key in, so if the range of cells you want the formula look through doesn't have the column you want it to search through at the far left of the range, you'll have to rearrange the table to make it that way.
    • index - the column in relation to your start column that you want returned when the value is found. 1 will correspond with the same column that it searched through, 2 will be the column to the immediate right of it, 3 will be the column over from that, and so on
    • is_sorted which will have a TRUE or FALSE response, TRUE meaning the column at the start of the range is in alphabetical order, FALSE meaning it is not. Even though I said above that the range column had to be in alphabetical order for the formula to work right, writing FALSE for the last parameter here is something you can try if for some reason you can't alter your table to put the first column into alphabetical order. However, I've noticed that this is buggy and will sometimes return an error as a result if you try to put FALSE for that last statement so avoid a situation where you need to do that if you can. 
  • Let's start filling the formula out, with the scenario for this sheet. For the first part, search_key, what you want to look for is the contents of the cell directly to the left of this column, C2 so click on that cell. Add a comma and it will prompt you to next select the range
    The formula now looks like - =VLOOKUP(C2,
  • Click into the AllTitleHeroGender sheet and highlight columns A and B. You are making  the column that you want it to search through and telling it that the range of columns you'll want it to look for information in or return information from are just going to be A and B. If you are dealing with a sheet with a whole lot more than two columns, just make sure that the range that you select includes the column you want it to search for information in as the first column in the range, and whatever column you want it to return information from as one of the columns included in the range. Add a comma after you've done this.
    The formula now looks like:  =VLOOKUP(C2,AllTitleHeroGender!A:B,
  • Next you want to give it the index or what column in relation to the starting column you want it to give you the value from. In this case you want it from column B which is the next column over from A, so you'll choose 21 is what you would use if you wanted it to just return the same value for some reason. The preview window now shows that it is returning Group-Mixed which since the first cell you are experimenting on this with is 100 Greatest Marvels of All Time and that preview result does actually correspond with what is in column B next to that title it looks like the formula is working.
    The formula now looks like - =VLOOKUP(C2,AllTitleHeroGender!A:B,2
  • Lastly, let's specify whether or not the first column is sorted alphabetically. Write TRUE and close the parenthesis, and hit enter. The final formula will look like this:
    =VLOOKUP(C2,AllTitleHeroGender!A:B,2,TRUE)
  • Sheets will take you back to the IssueInfo sheet where the formula is located and show you the result which is Group - Mixed
  • Copy the cell that the formula is in, and highlight the first 60 or so group of cells in column D and paste. Scroll through the results and compare them to the information in the AllTitleHeroGender sheet. You'll see that the results we got using VLOOKUP to search for the values in the Title Information Text column match with what the results are in column B for that title on the AllTitleHeroGender
  • Now that you've double-checked if your formula is working, apply the results to the whole column by copying that cell, highlighting the rest of the column and pasting the formula for it. Scroll through and spot check to make sure that it's still delivering the results you expect. 
  • Just as in previous times that we've used formulas what's actually in the cell is the formula not the actual text, so if you need to do anything else with this column, you'll have to copy and paste it as values only. 

Finding Summaries of Your Raw Data with Column Stats and Explore

Whether you are trying to find out exactly what your data says for purposes of seeing what it says for your hypothesis or finding out what other questions it may answer, or whether you are at the stage where all of those questions are pretty answered and you just want to be able to have the right numbers to assign to your findings or tables to add to your paper or presentation, there are some quick ways you can make organized summaries in Sheets.

Column Stats will let you just create a quick chart or table giving you the information about what is in a column. Explore will let you create that kind of quick table, plus let you ask other questions and combine the information in multiple columns. 

  • Highlight the column you just created, column D. Right-click anywhere on it and from the list of options choose Column Stats. A pop-up window will appear on the right side of the sheet and display a chart letting you know the amount that each of the values in the column appeared. If you scroll down the same summary will appear as a table listing the most and least frequent values for that column. At the bottom, there's also statistics about how many cells appeared in the analysis. 
  • This can be a step you can take if you have an overwhelming amount of data on your sheet and just want to get a sense of which items occur most or least frequently in a given column.
  • You can ask questions that combine the values in multiple columns or perform other mathematic functions using Explore which is the little green button at the bottom right. Click on it, and a pop-up will appear with several different options.
    • At the top is the range that it is being considered for this Explore pop-up which you can alter if you only want to look at a sub-section of the data. 
    • Below that is Answers which is a function to let you just type in a plain-language question and it will answer it for you based on your data. It's frankly, pretty awesome and intuitive to use, which more on how to use it in a second, but it's important to keep in mind that it's not psychic and you need to ask it questions that use the language of the column names or text in your data. 
    • After that is a Pivot Table that Explore has auto-generated for you of a summary it guessed you might find interesting based on your data. You can click More to see a selection of other possibilities. We'll get into how to make a pivot table from scratch later in this module
    • Under Analysis there are some graphs that it has auto-generated for you, but we'll get into how to make these from scratch in a few. 
  • For Answers let's try one of the suggested topics. For me, one of them was Most frequent Writer1, and I'll click on it. The Explore window shifts to show me the answer to my question, which is Brian Michael Bendis with a count of 36. If I click on Show Formula it will show me the formula it used to find that answer so I can modify it if I want to use it somewhere else on the sheet for a similar question.
  • You can type in your own questions you want to have answered and it will suggest some options of what it thinks you mean but remember, you have to use the same language as the column names and cell values in the sheet. If I want to find out what percentage of issues have a superheroine and I start typing in How many issues with lady heroes I'll get a "Sorry, couldn't understand your question". This is because the designation used for a comic book with a woman lead that appears in the sheet is actually Female Protagonist and the column that information is stored in is called Gender of Title Protagonist(s) and so the words issues, lady and heroes don't appear anywhere in the columns that can be used to answer my question.
  • If instead I follow the prompts of the suggested questions it provides and write what percentage of Column D is Female Protagonist it will answer my question and it turns out the share of Column D that lists issues about a female protagonist is 6.48%
  • You can use Explore to ask questions that combines information from multiple columns. To find out how many writers were credited on an issue each month I'd write Sum (because I wanted the total) of Total Writers on Issue (since that's the column I wanted totaled) by Month (since that's the column I wanted it broken out by.
  • However as you may find by playing with this Answers function, it can sometimes be difficult to tell it in plain language what you want while still making it a request it can understand.  You might get the same table created more easily by just making a Pivot Table

Making a Pivot Table with Your Data

Pivot Table is a function that a lot of spreadsheet programs have that lets you create a table to summarize the data you've accumulated into a bunch of smaller categories. You can use it to create sums or averages. It's a way to reorganize your data to make it clearer or to cross reference the contents of a few different columns.

  • Go to Data and choose Pivot Table. It will ask you if you wanted to add the table to a new sheet, or add it to one of our existing sheets. Choose New sheet since we don't need it cluttering up this one, and click on Create
  • It will take you to a new tab that's basically a blank sheet, and will let you populate the chart how you see fit. To do this, think of what you want to see. Do you want to see what the breakout is for the Gender of the Title Protagonist is by Month? 
  • Click on Month and drag it into Rows, and then click on Gender of Title Protagonist and drag it underneath Month in the Rows. You'll see that the setup of the sheet is there, but there are no numbers. That's because you haven't told it what Values you want 
  • Click on Add next to Values and choose Gender of Title Protagonist. By default it will choose CountA and the count of how many issues for each title protagonist gender were released for each month will appear on the pivot table.
  • Let's say you wanted to see how many writers and pencilers there were per month. Drag Month into Rows, and Total Writers on Issue and Total Pencilers on Issue into Values and choose that you want to display them as sums and as columns
  • If you only wanted to display a subset of the data, like if I say just wanted to see how many writers and pencilers are working on titles with male protagonist, we can filter it. 
  • Click on Add next to Filters, and from the list of what you want it to select by, choose Gender of Title Protagonist(s).  It will give you a list of all the options within that column with every one with a check mark next to it. Un-check every one except for Male Protagonist and see how the totals all change.  
  • If you want to put any of these pivot tables in a presentation or paper, you can just screen-shot it, or go to File -> Download and choose that you want to download the current page as a PDF. 

Making a Chart with Your Data

Maybe you want to have a more flashy way to display your data than a table. If so, Google Sheets has a built in Charts function that can help you out. Just so you know, there are limits to how much customization you can do with these Charts, so if you're looking to do more, something like Tableau might be more your speed. 

  • Go back to the main IssueInfo sheet of the workbook, and then choose Insert from the menu up top and then choose Chart
  • It will generate a pie chart that by default chooses the Volume Information column as the information it displays. Let's say we want to see what the breakout is on the Gender of Title Protagonist(s) column instead. Go to where it says Data Range  and click on the little box next to where it says B1:B541 is the current data range.
  • In the Select a data range pop up that appears, delete the existing information and then click on the D column. The box will change to display D:D. Choose OK
  • You'll see that the chart now changes to show the pie chart for the information in that column. Click on any element in the chart to see what you can configure about that element. For instance if you click on the title, it'll take you to the Chart & axis titles section of the Customize section of the Chart Editor and you can change the title to Gender Breakout of Title Protagonists
  • If we want to change the color of a slice of the pie, we can do that by double clicking on the slice, letting it take us to the Pie slice section of the Customize menu and then selecting a new color for that slice from the dropdown
    Customize menu. The pie slice for Group - Mixed is now purple
  • Explore the other options on the Chart Editor window. Anything under Setup will change what range of information the chart covers and what kind of chart it is. Underneath Customize you can change colors, titles, fonts and other aspects of its appearance.
  • The three dots at the top corner of your chart provide you with other options like to Download it as an image file,  Copy it, give it Alt Text or the one I'll choose right now to get it out of the area of my Sheet where all the data is, Move to own Sheet
  • So once you've cleaned and organized your data and you want to sum some aspect of it visually, adding a Chart allows you to do that.

If you had any issues with following along with the formulas we've explored here or with creating the pivot tables or the chart, check out this finished form of the sheet I made by following these steps. If you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson4_End Choose Make a Copy. You can look through this copy of the sheet to see what my version of this with all the formulas and the pivot table and chart looks like. 

That's the end of this tutorial about how to use Google Sheets to clean, organize, summarize and eventually visualize your data. If there's anything that isn't covered that you want to know if you can do to your data in Sheets, please feel free to contact your instructor and me, your DAsH librarian. The contact information for me is on the side of the page.