If you are working on a DAsH project that has involved you grabbing more data than you know what to do with, Google Sheets has a lot of tips and tricks that can help you make sense of that data. You have access to Google Sheets by virtue of your Google Drive at the college so it is free, but it should be noted that the information in it does live in Google's server, so if you have confidential information that you want to keep only on your computer, this isn't the way to go. Additionally, while in their help section Google says say they support up to 5 million cells in Sheets, in reality, if you will be using things like formulas and conditional formatting (as we will be doing below) it will get slow at a lot lower of a cell count than that. However, I have chosen this program to use to talk about helpful formulas and formatting tricks since it's free and relatively easy, and a lot of the formulas being used have some equivalent in Excel or OpenOffice.
Sometimes the data that you will want to use with your project will be messy. In the case of the project that I will be using as an example for these tutorials, the data is messy because it was web-scraped from a site where the information I wanted wasn't set up in a nice neat table. In other cases, you'll be trying to put together data that came from different sources, data that may contain duplicates or data that was collected without concern for consistency in what labels were used for different events or classifications.
In order to take that raw data and turn it into something that can be graphed or analyzed in other ways, your data has to be organized. Figures in a given column all have to mean the same thing. Choices have to be made about which rows are duplicates and should be deleted from the sheet. Extra formatting characters that just confuse things need to be gotten rid of before they confuse your analysis programs or your readers.
In this lesson, we'll use Search and Replace to get rid of extra characters cluttering up our data and Conditional Formatting to make patterns in our data visible with colors. We'll also talk about how Sort and Filter differ, and how each can be used in different ways to help you see what you have in your sheet.
You may be familiar with using Search to look for keywords in a document, but Sheets (along with most other editing softwares) will let you not only look for a word or character, but will let you replace it with something else. Throughout this sheet, there are lots of formatting peculiarities due to the information originally being web addresses that are consistent both in how they appear and that they don't serve any useful function, so they are ideal to take out with a Find and Replace.
Now that we've taken out parts of the text in this sheet that we do not need, there are other parts of it that we want to call attention to. We can use conditional formatting to do so. This is a feature on a Sheets as well as other spreadsheet programs that will format cells that have certain kind of content in them differently from others that don't have duplicates. We'll use it in this instance to make cells that contain duplicates of other cells in their column in one case (and row in another) a yellow color so we can decide what to do with the duplicates. We'll also use conditional formatting to make cells with Writer or Penciler in them different colors which will make it easier for me to do the next steps of getting the writers in one set of columns, and pencilers in another.
Now that we've applied Conditional Formatting to make it easier to see the cells of interest to us as we edit, let's use the Sort and Filter tools to organize our sheet in a different order or to only show rows with certain values in them.
Sometimes you need to change how the rows in your sheet are ordered to edit it more quickly or make things about your data more clear. A lot of you will have probably used Sort at some point to put items in your spreadsheet into alphabetical or numerical order, but just to quickly run down how, let's say that we wanted order the sheet by column A in alphabetical order so we'd know that any duplicate cells in that column will be next to each other.
Sorting changes the order that the rows are in on your sheet. But if you just want to focus in on rows with only certain values in a given column or columns without changing the order that rows are actually in on the sheet, you can use filters.
With this module, we've looked into how you can use Find and Replace to make batch-editing changes to clean up your data, and Conditional Formatting, Sort and Filter to better see what other places in your spreadsheet might need more attention. If you had any issues throughout the process and want to see a copy of what I did, go to this link and select that you want to make a copy of the spreadsheet Marvel 2001_Lesson1_End. Once you've made a copy, you'll be able to see what I did on my version of the sheet.
In the previous module, we've gone over some basic tools to help you clean up your data that's already there. With this section, we'll work on what you do when you want to make new columns with data based on information you already have on your sheet. Sometimes you got the data in whatever format you could and now you need to make it consistent for whatever analysis or graphing program you are using. We'll work on how to use formulas like RIGHT and LEFT in order to derive a set amount of text from the beginning or end of a cell and place that text into a new cell. We'll see how REGEXEXTRACT can be used in situations where the amount or placement of the text you want to select within a certain cell is inconsistent, in order to take a portion of text from within one cell and place it within another. We'll also work with CONCATENATE to join the values of cells together.
Sometimes because of how you got your data it has the information you need but it isn't formatted 100% correctly. If it's only a few cells, it's not hard to correct it by hand, but if it's whole columns where each cell is clumsily formatted in exactly the same way, you may be able to use some of Sheets' built-in formulas to pull out the information you want. The formula RIGHT will let you pull letters or numbers from the end of the text in a cell (the right-hand side) and LEFT will let you pull letters or numbers from the beginning of text in a cell (the left-hand side). Let's see how that will work on the data on this sheet.
At other times, the characters that you want to grab from a cell could be at the beginning of the text in that cell. This is where LEFT comes in. There wasn't a good place with this data to use it, so I created a new sheet within this project for you to practice that formula with.
Maybe instead of data that you want to separate from one column into two, you could have data in two separate columns that you want to make into one column. A first and last name or a street address, city, state and zip code, are some examples that come to mind of pieces of data that might start out being recorded separately, but that you might want to pull together. In this case of this data, the year and month of publication are being stored in separate columns. That is fine when we look at it, we can mentally put the two together and know what it's saying, but if we wanted to put this data into a graphing program, it wouldn't associate the two pieces of information together unless they were in the same column. Fortunately, we can use CONCATENATE to join them together.
With the previous two modules, we've seen how to get a specific amount of characters from the start or end of text in one column and push it into a new column as well as how to take the contents of two columns and put them together. However, there may be some circumstances where you want to take data from a column and put it into a new one but one or both of these complications means you can't just use RIGHT or LEFT
This is where Regular Expressions (or regex) can come in. I get more into Regular Expressions in another tutorial on the DAsH research guide, but basically it's a method supported by many programs and programming languages that allows you to not only search for certain exact key words or phrases but also for certain patterns of characters within a text. So in the case of that month example, if I knew that all the months in the column that I want to be able to pull the month from are the only places in their cell that use letters instead of numbers, I could use the REGEXEXTRACT to deliver only the group of letters from within that column.
A good quick reference sheet for Regular Expressions has been created by MIT and it will be the one I use throughout this tutorial. We'll keep it relatively simple here, limiting our exercise to just a few of regex's possibilities, but truly, there are a lot of things you can use Regular Expressions for.
So on this sheet Column B - Volume Information has the information about what volume of a comics title the issue is in. The thing is, it’s got the volume number attached, so a sample cell says Amazing Spider-Man Vol 2, but the only information I care about in that cell is the title Amazing Spider-Man . That Vol 2 will trip me up if I want to just see how many times that title appears across all its volumes or if I want to create a program that searches for that title name on a website to give me more information about it.
I could use the same Find and Replace feature from the last lesson if the thing I wanted to get rid of was only ever Vol 2, but the problem with that it isn’t just Vol 2 that is at the end of the string, some have Vol 1, some Vol 3. While I guess we could perform Find and Replace for each of those iterations until we’d covered all of them, that’s a lot of extra work, so its fortunate that we can use a regular expression to do some of the heavy lifting for us.
Using what we found with the cheatsheet, we can build a regular expression that will look for any kind of text that occurs in a cell before Vol and a digit.
To get a little more practice with regular expressions, let's click over to the sheet ExamplesForRegex. I've put in a couple of different tables that we can use to practice a few different scenarios for these.
In the first table, there are bunch of different addresses under where it says Street Address and a blank column next to it marked State Abbreviation. Let's say we wanted to quantify how many of these addresses were in each state, but when we collected the addresses, we didn't make a separate column for the abbreviation for the state. We can't use RIGHT or LEFT to get to the state since they're within the middle of the text, and they aren't consistently the same length from either end.
For the second table, let's say that we wanted to extract the year from the date that's in column D. If the people who'd recorded the date were consistent, this would be easy, however some dates have AD, BC or BCE at the end of them which means we can't use RIGHT to pull just the last 4 characters from the text in the column. Additionally we can't use LEFT to get to it from the other end because the months and days as they are written out here all have a different number of letters in them. So this is another occasion where REGEXEXTRACT can be very helpful.
With this module, we've gone over different formulas to be used to extract specific text from text used in a cell using RIGHT or LEFT if the portion required is in a specific location and consists of a uniform number of characters or REGEXEXTRACT if more flexibility on location or character number is needed. We've also seen how CONCATENATE can allow us to combine the text content of multiple cells. If you've had any issues following along and want to see a copy of my finished work for this lesson, go to this link to make a copy of the spreadsheet and see all of the formulas and the finished product.
Once the data within a spreadsheet is cleaned, it is easier to take stock of what you have in it. With this lesson, we'll use some of Sheet's functions to begin to get summaries of the data we have in the sheet. We'll see how this can be used to apply to text values and see many how many unique values are in a column using UNIQUE. We'll also use COUNTA and COUNTIF to see how many times different terms are used within a column or row.
Within column C - Title Information Text there is a row for every issue of a title that came out in 2001. For my project, I had to have a list of all the different titles published so that I could look the titles on that list up and get extra information on whether the superhero in it was a man or a woman. If I just wanted to get a list of every title included in this column, I couldn't just copy the whole column because there are a lot of duplicates. The UNIQUE function will just look at an entire column or row of data and just list all the unique values that are in it. So if I use that function on this column, no matter how many times "Deadpool" appears as the entry in column C it will only appear on the unique value list once. Awesomely, this even works if the column or row isn't sorted in alphabetical order.
This might be a thing you want to do if you want to get a list of all the different values you’ll have to feed into an API or search for in some other way. Rather than copying the list, and then using Remove Duplicates, then Sort to get the list alphabetical, this formula accomplishes the same thing, just in one step.
Boiling down all the unique values in a column can also be a way that you might wind up finding problems with your data. For example, if you have days of the week that people wrote into a spreadsheet by hand, you might discover that some wrote out the day in full, some wrote it abbreviated, and some misspelled it. If it only happened a few times, you might miss it until it came time to graph your data and then you'd have to go back and correct it all one by one. Even within this example, on the AllMonths sheet, we can see July appears two times on the list, one with a space in front, one without. That's something we'd look into correcting, but we'd want to figure out which was the right one first.
Creating this kind of summary of what values exist your data can also be a good first step to coming up with a summary for how many times those values occur within your data, so lets see how many times each of the months appear as a publication date on our sheet. .
Now that we've created a sheet that contains all the publication months listed for issues that came out in 2001 we can see how many issues came out each month using COUNTIF, which is a formula you can use when you want Sheets to tell you how many cells fit a certain criterion within a certain range.
This criterion can be numeric if the range you want to search is numbers, so, say if you had census data, you could use this function to count how many cells in a column for Age listed a number below 10 . If a column has alphabetical values, like in our example sheet, you can have it count how many cells in a row or column match a certain text either in its entirety, or if it begins or ends with a certain group of letters.
If you have a list of all the values in a column (like we do because we just used UNIQUE to get one) getting a count of how many time each of them occurred is pretty easy with COUNTIF
If instead of wanting to count when a certain condition is met, we simply want to count how many cells in a row or column have values in them, whatever those values are, we can use COUNTA
In columns J through N, the writers credited for an issue are listed, one name per cell and in columns O through V, the names of the pencilers credited for an issue are recorded the same way. If we want to find out the total number of writers or pencilers credited for each issue, we can use COUNTA to get that total for each row across each of those columns.
Go to the column where the totals for each issue's writers will go, column H and start typing the formula =COUNTA(
The formula will pop up as one of the options for you to select. Click on it, and Formula Help will give you more information about the formula.
It lets you know that you just need to give the formula a range (or ranges) to look through, and it will count up the number of cells with any text or numbers in them that are in that range.
So since we want the total for the number of writers on this issue, and we know that data is stored in columns, J-N we'll use that as the range for the counting to occur in. Click on J2 and then stretch the box all the way over to N2. The preview will let you know that the result of the formula is 1 which will match with what you see by looking at what the contents of those columns in this row are. Close out the formula by using an end parenthesis and hit enter. The final formula will look like:
Let's do the same thing but for the pencilers. The range that contains the names of the pencilers is columns O-V. So click in cell I2 and start another COUNTA formula. This time the range you will click and drag across will be O2:V2.The preview will tell you that the result is going to be 1 which matches with what you see when you look at this row. The final formula will be:
Highlight and copy H2 and I2. Then highlight the rest of the cells in those columns all the way down to where the data ends at row 541 and then paste. It will automatically amend the rows in the range so that the formulas are looking to the columns J-N for the writer total column H and O-V for the penciler column I, but with row number changed to refer to that particular row.
These are just some of the formulas you can use to create numerical summaries of your data. COUNTUNIQUE will count the number of unique values in a range. SUM will total up a range of numbers. SUMIF works similarly to COUNTIF, and will give you the sum of whatever cells within a range match your chosen criteria.
If you had any issues with following along with the formulas we've explored here, check out this finished form of the sheet I made by following these steps. If you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson3_End Choose Make a Copy. You can look through this copy of the sheet to see what my version of this with all the formulas looks like.
Once the data has been cleaned and organized, next steps can be taken. One step we can take is to augment the data in this table with data from another source. Using VLOOKUP, we'll be setting up a relationship between our main sheet and a sheet with additional title data in order to get information from that supplementary sheet onto the main sheet. With this accomplished we'll use the Explore and Column Stats function in order to get summaries of our data. We'll also see how to create a Pivot Table to put together our summarized data and see how to visualize that data in a Chart. These are key last steps to converting this collection of data into something understandable to an audience who just wants a summary of what you've found rather than all the raw and unprocessed numbers.
Some of you may be familiar with the concept of joining a map layer or table to another with ArcGIS, some may have worked with a database program like Microsoft Access in order to connect one sheet to another but Google Sheets has a function called VLOOKUP that lets you create something like a relational database to link data from different sheets on the same entity together.
Basically what a relational database lets you do is, if there is a key value that exists on two different sheets, like say, the name of the high school, you can take a value associated with that high school on one sheet, and also associate it with that name on the other sheet. So, let's say we had a table that had graduation rates by year for a bunch of different high schools that we got from the Board of Education, but we wanted to see what this information looked like on a map. The Board of Education graduation rate data didn't have coordinates for any of the schools recorded, so we couldn't do that with the information on the existing sheet. Next, we found a listing of coordinates for each of these high schools. But, not surprisingly, the table with the geographic information, didn't have graduation rates. With a relational database, the two tables can linked to each other with the name of the high school serving as a key, and the geographic information for say, Midtown High School can be brought into the table about its graduation rates each year.
This can be a very handy thing to know how to do when you are trying to unite data about the same person, place or other entity that you got from different sources.
In this instance, I have one sheet where I have the information about when each issue of Marvel Comics was published in 2001 and who worked on it: IssueInfo and another sheet where I have information on each title about what gender the superheroes in that title are: AllTitleHeroGender. Now if I didn't know about relational databases I could just go back and forth between one sheet and the other and cut and paste in the information, but it would take forever and I would inevitably make some mistakes.
Or we can use VLOOKUP (Vertical Lookup) and have a formula do the same thing at a much larger scale: look up the value we want to find on the AllTitleHeroGender sheet, and when it finds the row that value is in in the _Title Column, move over to the next column we want it to and return the value that is in the same row as the title we were looking for but in the _Title Protagonist(s) Gender column.
Whether you are trying to find out exactly what your data says for purposes of seeing what it says for your hypothesis or finding out what other questions it may answer, or whether you are at the stage where all of those questions are pretty answered and you just want to be able to have the right numbers to assign to your findings or tables to add to your paper or presentation, there are some quick ways you can make organized summaries in Sheets.
Column Stats will let you just create a quick chart or table giving you the information about what is in a column. Explore will let you create that kind of quick table, plus let you ask other questions and combine the information in multiple columns.
A Pivot Table is a function that a lot of spreadsheet programs have that lets you create a table to summarize the data you've accumulated into a bunch of smaller categories. You can use it to create sums or averages. It's a way to reorganize your data to make it clearer or to cross reference the contents of a few different columns.
Maybe you want to have a more flashy way to display your data than a table. If so, Google Sheets has a built in Charts function that can help you out. Just so you know, there are limits to how much customization you can do with these Charts, so if you're looking to do more, something like Tableau might be more your speed.
If you had any issues with following along with the formulas we've explored here or with creating the pivot tables or the chart, check out this finished form of the sheet I made by following these steps. If you aren't already logged into your Manhattan College Gmail account or another Gmail account, it will prompt you to do so. After you have logged in, a prompt will appear on the screen asking you if you want to make a copy of Marvel 2001_Lesson4_End Choose Make a Copy. You can look through this copy of the sheet to see what my version of this with all the formulas and the pivot table and chart looks like.
That's the end of this tutorial about how to use Google Sheets to clean, organize, summarize and eventually visualize your data. If there's anything that isn't covered that you want to know if you can do to your data in Sheets, please feel free to contact your instructor and me, your DAsH librarian. The contact information for me is on the side of the page.