Beautiful Soup is a web-scraping library for Python. Web-scraping is the term for writing a program that visits one or more webpages and copies whatever information from those pages the code specifies. It lets you automate the process of obtaining data from the web rather than doing it manually.
For example, you may be looking to collect all the song lyrics of an artist so you can do a word frequency count. If the lyrics for a whole album were stored on a single web page, you'd be in luck, and would only need to spend a few minutes copying and pasting them. But if they were stored on a separate page for each song, that copy-and-paste method would take an unnecessarily long time. The good news is that most sites use a template when making multiple pages, and you can use Beautiful Soup to pull the information out of pages built on that template and print it to your Python console or put it into a text file.
If you know the URLs for each of the albums on the song-lyric site, and know from looking at developer tools that the lyrics always sit between tags labeled 'lyrics', then you can write a script that goes to those URLs, copies everything between the tags labeled 'lyrics', and writes it to a text file for you.
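As a preview of where we're headed, here is a minimal sketch of such a script. The URLs, the div tag, and the 'lyrics' class name are all placeholders standing in for whatever the real site uses; the Beautiful Soup commands themselves are covered step by step in this tutorial.

import requests
from bs4 import BeautifulSoup

# Placeholder URLs -- on a real site these would be the individual song pages.
song_urls = [
    "https://example.com/album/song-1",
    "https://example.com/album/song-2",
]

with open("lyrics.txt", "w", encoding="utf-8") as outfile:
    for url in song_urls:
        page = requests.get(url)                        # download the page's HTML
        soup = BeautifulSoup(page.text, "html.parser")  # parse it into a searchable object
        lyrics = soup.find("div", class_="lyrics")      # assumed container holding the lyrics
        if lyrics is not None:
            outfile.write(lyrics.get_text() + "\n\n")   # save just the text, not the HTML tags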
Some sites block web-scraping, but most do not. However, if your script asks a site for too much information too quickly, it might be disconnected for putting a strain on the server. We'll discuss how to try to avoid that.
You'll want to know the basics of both HTML and Python for this tutorial. It's intended for people who have some idea of how those languages are structured, just not how to create a web-scraping script. Both Codecademy and Lynda (which you can access with an NYPL card) have intro exercises for Python that take a few hours to complete and will give you enough of a background for this tutorial (and be useful in other ways). W3 has a great introduction to HTML, and Codecademy can help you out with that as well.
In this beginning section of the Beautiful Soup tutorial, we'll install the Beautiful Soup module for Python. We'll put it to the test by using it to parse through the HTML that makes up a sample web page, and see how different commands in Beautiful Soup can retrieve the elements we request from that web page. Building a script is all about taking a scaffolding approach: first making code that does the broad-strokes version of what you want, and then refining it until it does the more specific task you need it to. In this section, we'll see how Beautiful Soup lets us do the broad-strokes version of what we want to accomplish: taking information stored in elements of a web page and printing it to the console.
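To give a sense of what that looks like (the sample HTML here is made up for illustration): after running pip install beautifulsoup4 in a terminal, or !pip install beautifulsoup4 in a Jupyter Notebook cell, you can do something like this:

from bs4 import BeautifulSoup

# A tiny, made-up page so the example is self-contained.
html = "<html><head><title>Sample Page</title></head><body><p>Hello there.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")   # parse the HTML into a searchable object

print(soup.title)           # prints: <title>Sample Page</title>
print(soup.title.string)    # prints: Sample Page
print(soup.p.get_text())    # prints: Hello there.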
By the end of this whole tutorial, we'll have written code that will extract the data on who wrote or drew different issues published by Image Comics from that publisher's website and send that data to a file that we'll be using in later analysis. That script is the final product, and we need to build up to get there starting from the basics. You'll be learning in the next tab of the tutorial how to write code that prints out your specific target data from a website. In this section though, we'll be seeing how to tell Beautiful Soup to get information for us from a given website. We'll be running some basic code that extracts different elements from the web page and prints them to the console.
In the text below, anything written in blue is code that you will be putting into your Jupyter Notebook console.
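As a representative sketch of the kind of link-printing code this step uses (the two-link HTML is invented so the example is self-contained; the real notebook is linked at the end of this section):

from bs4 import BeautifulSoup

html = '<a href="/one">First link</a> <a href="/two"><img src="cover.jpg"/></a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):   # every <a> tag on the page
    # works for text links, but link.string is None for a link that holds
    # only an image, and calling .strip() on None raises an AttributeError
    print(link.string.strip())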
It does what we asked and prints the text between the link tags, but it still triggers an error. If these links were actually what we needed for our project, we could backtrack through the original variable to see which link the code went wrong on, and change our code so it knew what to do with that link.
We could also set up a try and except block in our code, to tell it what to do if it reaches something it doesn't know how to handle. These blocks let you tell your code to take some different course of action when it reaches something on the page it doesn't understand, and then continue through the rest of the page.
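Carrying on with the invented two-link sample from above, a try and except block might look like this sketch:

from bs4 import BeautifulSoup

html = '<a href="/one">First link</a> <a href="/two"><img src="cover.jpg"/></a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    try:
        print(link.string.strip())
    except AttributeError:
        # link.string was None (an image link, for instance);
        # note the problem and keep going with the rest of the page
        print("Skipped a link with no text")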
You've now seen how to use Beautiful Soup to extract and print all of a certain kind of element on the page, so it's time to move on to the next step: looking at the source code of a page to find where the elements you want are, and printing just those elements. Below is a copy of the Jupyter notebook I've set up for this lesson. I've put in comments to make each of the steps clear, so take a look at it if you had any issues following along.
In the previous section of this tutorial, you learned how to print all of a certain kind of element from a sample web page. But you can get more specific than that. Instead of printing all of the links or all of the text, you can target and print just one specific element, or one group of elements, by looking at the structure of the page using the developer tools in Google Chrome. These tools let you see where the information you want sits within the web page's HTML source. This skill will help you locate the unique container that holds the data you want, so that you can target that container with your Beautiful Soup script. This tutorial is written for people who have some idea about how HTML is structured and how Python scripts work, so if you are unfamiliar with either, you'll get the most out of it if you first go through an HTML tutorial and a Python tutorial at one of the links at the top of the page.
First, we'll see how we can use Beautiful Soup to extract the link to the cover image of the issue described on the page; then we will see how to extract the creative team (the writers, artists, and cover artist) on the right side of the page. Obviously, these bits of information would be easy to copy and paste on their own if we only wanted them from this one webpage, but the idea is that if we can write a script that works for this one webpage, then we can ask this script to do the same on tens or hundreds or thousands of the webpages at Image Comics that contain this same information.
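In outline, the cover-image step looks something like the sketch below. The class name 'cover' is a stand-in; the whole point of the developer-tools work in this section is to find the attribute the site actually uses to mark that image.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://imagecomics.com/comics/releases/cemetery-beach-1")
soup = BeautifulSoup(page.text, "html.parser")

# 'cover' is a hypothetical class name -- check developer tools for the real one.
cover = soup.find("img", class_="cover")
if cover is not None:
    print(cover["src"])   # the URL of the cover image file
else:
    print("No match -- inspect the page's HTML again")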
Now that we have figured out how to get the single unique element we want out of the page, let's see if we can get a group of elements that we'd like. Remember, the eventual goal of this exercise is to get the listing of writers and artists for this issue (so we can then apply that script to many issue pages on Image Comics' website).
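A sketch of that step, assuming (as the output below suggests) that each role on the page is an h2 heading such as 'Writers' or 'Artists', followed by an aside element holding the creators' names and links:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://imagecomics.com/comics/releases/cemetery-beach-1")
soup = BeautifulSoup(page.text, "html.parser")

for heading in soup.find_all("h2"):               # each role heading on the page
    aside = heading.find_next_sibling("aside")    # the block of creators right after it
    if aside is not None:
        print(heading)
        print(aside)

Printed to the console, one of those role blocks comes back looking something like this: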
<h2>Writers</h2>\n<aside>\n<img alt="Leigh Alexander" src="https://cdn.imagecomics.com/assets/img/default-user-square.svg"/> <a href="https://imagecomics.com/creators/leigh-alexander">Leigh Alexander</a>\n</aside>
https://imagecomics.com/comics/releases/cemetery-beach-1 | Writer | Warren Ellis | Artist | Jason Howard |
https://imagecomics.com/comics/releases/paper-girls-25 | Writer | Brian K. Vaughan | Artists | Cliff Chiang |
https://imagecomics.com/comics/releases/cemetery-beach-1 | Writer | Warren Ellis | ||
https://imagecomics.com/comics/releases/cemetery-beach-1 | Artist | Jason Howard | ||
https://imagecomics.com/comics/releases/paper-girls-25 | Writer | Brian K. Vaughan | ||
https://imagecomics.com/comics/releases/paper-girls-25 | Artists | Cliff Chiang |
When you have the code working on a sample page that's representative of the larger site you want to scrape your data from, the next step is to see if you can apply it across multiple pages. I've uploaded the Jupyter Notebooks that I used below, in case you run into any problems and need to see how I accomplished certain tasks.
In the previous section of this tutorial, we used Beautiful Soup to extract the data that we wanted from the sample webpage of one issue of an Image Comics publication, and print it to the console. This was crucial to figuring out how to do this sample project, as it was proof of concept that a script could do this on a page like all the others that we need to get information from. However, the reason that web-scraping scripts are handy isn't that they can do the same kind of copy-and-paste work we could do ourselves on one page; it's that we can apply a script to multiple pages and receive the data back in a format that is easily usable for number-crunching or text analysis purposes. In this next section, we'll rewrite the script so that it performs the same function it did on the sample webpage (extracting the creative team's names and roles) on multiple web pages in a row, and exports the data that we have found into a text file.
In general, it's a good idea to start with something that you know works, and go step by step to add new capabilities to your code.
If we right away throw in changes to our code that:

1. loop through multiple web pages instead of just one,
2. write the results to a file instead of printing them to the console, and
3. pause between requests so we don't strain the site's server,

and then get an error when we execute that new code, we won't know which of the three things we added went wrong. So let's start from something we know works, and add one thing at a time. That way, if a new part breaks the code, we'll know which part needs to be fixed.
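The first of those additions, the loop over multiple pages, might look like this sketch (the two URLs are the sample issue pages from earlier; the extraction step is abbreviated):

import requests
from bs4 import BeautifulSoup

urls = [
    "https://imagecomics.com/comics/releases/cemetery-beach-1",
    "https://imagecomics.com/comics/releases/paper-girls-25",
]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # ...the same extraction code that already worked on the single sample page goes here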
If you get tripped up anywhere in the process, a Jupyter notebook file is at the bottom of this section.
If we only had a small number of web pages to deal with, we could just continue to print all our data to the console and then copy and paste it into whatever program we'd be using to analyze the data. On the other hand, if we were writing this script to apply it to hundreds or thousands of pages, we'd want the data dumped into a document, both to keep it organized and to give us something more permanent to look back on as we continue analyzing our data. In this next section, we'll add a write function to our code so that the information we are getting goes to a file. We'll also put a time delay into the code so that we don't get kicked out by the site's server for making so many requests.
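Put together, those two additions might look like the sketch below. The file name and the two-second pause are my own choices here, and the h2/aside pattern is the same assumption as in the earlier sketches; the output lines follow the pipe-delimited style shown in the samples above.

import time
import requests
from bs4 import BeautifulSoup

urls = [
    "https://imagecomics.com/comics/releases/cemetery-beach-1",
    "https://imagecomics.com/comics/releases/paper-girls-25",
]

with open("creative_teams.txt", "w", encoding="utf-8") as outfile:
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        for heading in soup.find_all("h2"):
            aside = heading.find_next_sibling("aside")
            if aside is not None:
                role = heading.get_text(strip=True)
                names = aside.get_text(" ", strip=True)
                # one pipe-delimited line per role and page
                outfile.write(url + " | " + role + " | " + names + " |\n")
        time.sleep(2)   # wait two seconds between pages so we don't strain the server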
If you ran into an issue at any part of the script, take a look at the Jupyter notebooks I've created below, which contain this code.
At the end of the last section, we took our script that could get the information that we wanted from one webpage, and transformed it into a script that took that same information from multiple webpages, and sent that information to a file. We're not done though, because that file still holds its data in a pretty messy format. We'll take the data that we got in the last lesson and transform it into a well-ordered text file that can be pasted into a spreadsheet. This is quick to do using shortcuts available to us in the text-editor program Notepad++.
With an extremely well-polished script, we could get better-organized data straight away, but with how quick this clean-up is, I'd sooner take the time I would have spent fine-tuning my script to get the data looking the way I wanted, and spend it on some other aspect of my project. It would certainly have taken me longer to get this script to export the data perfectly than the 10 minutes this cleanup will take.
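If you'd rather stay in Python than switch over to Notepad++, a few lines can do an equivalent cleanup. This is an alternative to the Notepad++ route described in this lesson, not a replacement for it, and it assumes the pipe-delimited file written by the earlier sketch.

# Convert the pipe-delimited file into tab-separated lines that paste
# cleanly into a spreadsheet.
with open("creative_teams.txt", encoding="utf-8") as infile, \
     open("creative_teams_clean.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        fields = [field.strip() for field in line.split("|")]
        fields = [field for field in fields if field]   # drop the empty columns
        outfile.write("\t".join(fields) + "\n")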
You've now created a script that takes the information you want from a website and, with a little modification, turns it into an organized table for you. If you want to see what the text file looks like now, check the file below.
This full lesson is applicable to all kinds of webpages that you want to get information from. When you want to use Beautiful Soup to scrape a web page, you just need to:

1. look at the page with your browser's developer tools to find the elements that hold the data you want,
2. write a script that extracts those elements from one representative page and prints them to the console,
3. extend that script to run across all the pages you need, writing its results to a file and pausing between requests, and
4. clean up the resulting file so it's ready for your analysis.
There can be roadblocks along the way, and not all sites are organized as straightforwardly as this one, but with these general steps, gathering your data from the web with Beautiful Soup can be a nice orderly process.