
DAsH

Research Guide for DAsH (or digital humanities) resources and tools

What is BeautifulSoup?

BeautifulSoup is a web-scraping module for Python. Web-scraping is the term for writing a program that visits one or more webpages and copies whatever information from those pages you specify in your code. It lets you automate the process of gathering data from the web rather than collecting it by hand.

For example, you may be looking to collect all the song lyrics of an artist so you can do a word frequency count. If the lyrics were stored on a single web page as a whole album, you'd be in luck, and would only need to spend a few minutes copying and pasting them. But if they were stored on a separate page for each song, that cut and paste method would take an unnecessarily long time. The good news is that most sites use a template when making multiple pages, and you can use BeautifulSoup to pull the same pieces of information from every page built on that template, then print them to your Python console or put them into a text file.

If you know the URLs for each of the albums on the song lyric site, and know from looking at developer tools that the lyrics are always between tags labeled 'lyrics', then you can write a script that goes to those URLs, copies all the information between the tags labeled 'lyrics', and writes it to a text file for you.
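To make that concrete, a first rough sketch of that lyrics script might look something like the below. The URLs, the 'lyrics' class, and the lyrics.txt filename are all made up for illustration; don't worry about the details yet, since the rest of this tutorial builds up a real, working version of this same pattern.

    from bs4 import BeautifulSoup
    import urllib2

    # These album URLs and the 'lyrics' class are hypothetical placeholders
    album_urls = ["http://example.com/album-1", "http://example.com/album-2"]

    output = open("lyrics.txt", "w")
    for url in album_urls:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        # Grab every tag whose class is 'lyrics' and write its text to the file
        for block in soup.findAll(attrs={'class': 'lyrics'}):
            output.write(block.text.encode('utf-8') + "\n")
    output.close()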

Some sites block web-scraping, but most do not. However, if your script asks a site for too much information too quickly, it might get disconnected for putting a strain on the server. We'll discuss how to avoid that.

You'll want to know the basics of both HTML and Python for this tutorial. It's intended for people who have some idea of how those languages are structured, just not how to create a web-scraping script. Both Codecademy and Lynda (which you can access with an NYPL card) have intro exercises for Python that take a few hours to complete and will give you enough background for this (and be useful in other ways). W3 has a great introduction to HTML, and Codecademy can help you out with this as well.

Using Beautiful Soup

Lesson Goal

With this beginning section of the Beautiful Soup tutorial, we'll install the Beautiful Soup module for Python. We'll put it to the test by using it to parse through the HTML that makes up a sample web page, and see how different commands in Beautiful Soup can retrieve the elements we request from that web page. Building a script is all about taking a scaffolding approach, first making code that does the broad-strokes version of what you want, and then refining it until it does the more specific task that you need it to. With this section we'll see how Beautiful Soup lets us do the broad-strokes version of what we want to accomplish, taking information stored in elements from a web page and printing it to the console.

Getting Started

  • Make sure that you have Anaconda with Python 2.7 installed on your computer. Beautiful Soup also works in Python 3, but Python 2.7 is a bit easier to explain and so will be used in this tutorial. We'll be using Jupyter Notebook from the Anaconda Python platform throughout this lesson, so you can install both it and Python by downloading Anaconda from Anaconda's site. Choose the 2.7 version.
  • Go to the Beautiful Soup download site, and download the file to your Python 2.7 folder.
  • If you have experience using Python and installing modules with pip, you can install it from your command prompt terminal with the simple command of
    pip install beautifulsoup4
  • There are more detailed instructions on the installation process on the Programming Historian website. Again, this tutorial is meant for people who have some familiarity with Python, so we'll be focusing more on how to use Beautiful Soup once it has been installed.
  • If Anaconda has been installed correctly, you can find Jupyter Notebook in your start menu if you have Windows or Finder if you're on a Mac.  You can also start it by opening your command prompt terminal (search for Command Prompt) in Windows and typing in
    jupyter notebook  
    This will open Jupyter Notebook in your default browser. Jupyter Notebook is an application that lets you write code in an environment where you can execute it easily and save your results, unlike writing your code directly in a command prompt window or just saving it as a .py file, which would keep the code but not the results. For this reason it's ideal for tutorial situations like this. If you're more comfortable writing and executing Python in another program, feel free to use that, but this tutorial will be written assuming that you are working in Jupyter Notebook.
  • When you have started up your copy of Jupyter Notebook, select the New  dropdown in the upper right corner and choose Notebooks -> Python 2.
  • This will open up a new Notebook and have a blank console ready for you to type in information. It's important to remember that you press enter to give yourself a new line in the text interface. You press shift-enter to execute the code.

Writing Basic Code in Beautiful Soup

By the end of this whole tutorial, we'll have written code that will extract the data on who wrote or drew different issues published by Image Comics from that publisher's website and send that data to a file that we'll be using in later analysis. That script is the final product, and we need to build up to get there starting from the basics. You'll be learning in the next tab of the tutorial how to write code that prints out your specific target data from a website.  In this section though, we'll be seeing how to tell Beautiful Soup to get information for us from a given website. We'll be running some basic code that extracts different elements from the web page and prints them to the console. 

In the text below, anything set off on its own line as code is what you will be typing into your Jupyter Notebook console

  • Though Python without any extra modules has great functionality, for most code you write, you'll need to import modules. It's a good idea to do this at the start of your script to get it out of the way. So start by importing Beautiful Soup, along with urllib2 which is a Python module that lets your code interact with URLs on the internet. Press shift-enter. That will execute this code, and make it so that when we use functions, commands, methods or variables from these modules later in our code, they will be recognized and not spark an error.
    from bs4 import BeautifulSoup
    import urllib2
  • We'll create a variable to store the web address of the page that we'll be interacting with to see the capabilities of Beautiful Soup. Since Python allows us to name things pretty broadly, let's use that to give it the easily understandable name of url (as it is the URL). We'll set it to be our sample page. This page is on the site I want to be scraping data from, and is set up the same way as the other pages that I need data from, so I'm using it to test my script. Run the code to set this variable.
    url="https://imagecomics.com/comics/releases/24-panels-tp"
  • Once the url variable has been defined, we can interact with it. The code below first opens that url using the module urllib2's urlopen function and stores it as the variable page. The second line here will run the page that's been opened through the BeautifulSoup parser and stores it as the variable soup.
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
  • If executing the code gives you red text like the below, don't worry, that's an okay warning to trigger.
    A warning about html parsers in a Python console
    The console is just letting us know that since we didn't specify which of its available HTML parsers to use, Beautiful Soup is using the default. An HTML parser is what Python uses to identify what kinds of elements are in a web page's source code, where they start and end, and how the elements are structured. For a webpage like this, we don't need to worry about which parser to specify, but if you had an HTML or XML page with special qualities, you could specify a different parser in the code using the formatting described in the warning (there's a one-line example of this near the end of this section).
  • The web page has been parsed into its component parts and stored in the variable soup. By interacting with this variable you can look for different parts of the page and work with them. For instance, let's say that you wanted to find all the headings on the page: you could use the findAll function to search for everything within the h2 tag and then store the results in the variable headings. You can find other commands in Beautiful Soup's documentation.
    headings = soup.findAll('h2');
  • To see what you've actually stored in that variable, you can print its contents to the console. Execute the below command.
    print headings
    Contents of headings printed to console between tags
  • The variable we created does have a result we can print, but it is still a little messy. The formatting tags from the HTML are still there, and instead of each heading being its own separate entity, they are grouped together in a clump. If we didn't have Beautiful Soup, we could still use text editing to strip out those tags, but we can do it a lot more neatly with this module.
  • To treat each of these headings as an individual item, we will use a for loop. A for loop is a structure used in Python to split a group into its component items and perform the same tasks, one after the other, on each of those items. We'll use the local variable h for each of the individual items that make up the variable headings. Then, on each of those items h, we'll use .text to extract the text (what's written between the <h2> </h2> tags) and print it. Run this code
    for h in headings:
        print h.text
  • Our results should be the below, and you'll see that the text within each of the <h2> tags (like what was <h2> Writers </h2>) has been printed on its own line in the console. The reason this is one of the first steps I'm showing you is that the .text method is often what you'll use to get the human-readable data off the page so you can gather it together into a chart.
    Headings written to the console without their tags
  • Sometimes, though, you'll want to get at data that is invisible to you on the page, like the address behind a hyperlink or an embedded image. Those are stored in the a and img elements. This is what you'd want to do if this were a page that had a lot of hyperlinks on it (maybe a search results page), but the links weren't visible on the page for you to copy and paste. You'll create a new variable links and use the findAll method again, but this time you'll be looking for the element a. Run this code.
    links = soup.findAll('a');
  • If you try to print the links, you'll see that it includes a WHOLE lot of other information that you probably don't need. It extracted all of the information on the page that was included within the a tag, whether it was the style, text or the href link.
    print links
    The entirety of links on the page printed to console
  • When we take a look at the results, we'll find that the part that actually controls where a link clicks to or draws information from is in the a tag itself, following href=.
    <a class="takeover-link" href="https://imagecomics.com/comics/series/sea-of-stars">
  • If we try to get at that component directly for the entire variable of links, we'll get an error. findAll returned a ResultSet, which is essentially a list of tags, and while many of the individual tags in that list have href as one of their attributes, the list as a whole does not, so Beautiful Soup doesn't know what to do when we ask for it.
    print links.href
    Text says Attribute Error: 'ResultSet' Object has no attribute 'href'
  • Like we did in our exercise with headings, we'll set up a for loop to break the variable links up into its component parts, and then perform an action on each of those parts individually, one after the other. We'll use attrs to pull out the href attribute from within each a tag.
    for a in links:
        print a.attrs['href']

    The links printed to the console for several lines before an error message is sparked.

It does what we asked and prints the href for each link, but it still triggers an error partway through, most likely because it hit an a tag that has no href attribute at all. If these links were actually what we needed for our project, we could backtrack through the original variable, see which link the code went wrong on, and change our code so it knew what to do with that link.

We could also set up a try and except block in our code, to tell it what to do if it reaches something it doesn't know how to handle. A try and except block lets your code take some different course of action when it reaches something on the page it doesn't know what to do with, and then continue on through the rest of the page.
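As a rough sketch of what that could look like here (this version just assumes the error came from a tags with no href attribute and skips them; adapt the except branch to whatever fallback makes sense for your project):

    for a in links:
        try:
            print a.attrs['href']
        except KeyError:
            # this a tag has no href attribute, so note it and move on
            print "no href on this tag, skipping it"

However you handle it, you've now seen how to use Beautiful Soup to extract and print all of a certain kind of element on the page, so it's time to move on to the next step and work on how to look at the source code of the page to find where the elements that you want are, and print just those elements.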

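One more small aside before moving on: if you want to silence the parser warning we saw earlier, you can name the parser explicitly when you create the soup variable. For example, this uses html.parser, the parser built into Python (lxml is another common choice if you have it installed):

    soup = BeautifulSoup(page.read(), "html.parser")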
Below is a copy of the Jupyter notebook I've set up in this lesson. I've put in comments to make each of the steps clear, so take a look at it if you had any issues following along.

Learning Goals

In the previous section of this tutorial you learned how to print all of a certain kind of element from a sample web page. But you can get more specific than that. Instead of printing all of the links or all of the text, you can target and print just one specific element or one group of elements, by looking at the structure of the page using the developer tools in Google Chrome. These tools let you see where the information you want sits in the web page's HTML. This skill will help you locate the unique container that holds the data you want, so that you can target that container with your Beautiful Soup script. This tutorial is written for people who have some idea about how HTML is structured and how Python scripts work, so if you are unfamiliar with either, you'll get the most out of it if you first go through an HTML tutorial and a Python tutorial at one of the links at the top of the page.

First, we'll see how we can use Beautiful Soup to extract the link to the cover image of the issue described on the page, then we will see how to extract the creative team (the writers, artists and cover artist) on the right side of the page. Obviously, these bits of information would be easy to copy and paste on their own if we only wanted them from this one webpage, but the idea is that if we can write a script that works for this one webpage, then we can ask this script to do the same on tens or hundreds or thousands of the webpages at Image Comics that contain this same information.

Getting Started

  • Open up the same sample URL that we were using in the last lesson - https://imagecomics.com/comics/releases/24-panels-tp in Chrome
  • Go to the cover image for the issue (the large image below to the left of the blurb and title of the issue), right click and choose Inspect from the menu. A box will open up at the bottom (or right side) of the browser that contains the HTML for the page. Since we clicked originally on that image, it automatically takes us to the part in the script that corresponds with the image that we clicked on.
  • This console is interactive. By hovering over parts of the HTML in the console, we see those parts get highlighted on the web page along with the container they are held in. If you move back to the regular web page, you can right-click on an element and choose Inspect, and the HTML panel will scroll to the section that pertains to that element on the page. Move around and try hovering over different parts of the HTML or right-clicking on different objects on the page until you have a good sense of how the display on the page and its corresponding section of the HTML are linked.
  • Go back to look for the part of the HTML that corresponds with the cover image on the page.
    Developer console highlighting the part of the source code that contains the image
  • We are trying to look for an element that is unique to the part of the page that we are looking to extract. The image itself is put on the page with an img tag, but you can see by jumping and clicking around on the page that there are a lot of items on the page that are held within an img tag. 
  • When the most specific element that we want on the page isn't located in a unique enough container for us to extract only that element and not all other elements of the same type, the next step is to move up one level in the script and see if that portion will help us. The img tag is housed within an a tag. That tag also doesn't seem to have anything unique about it, no class or id that would differentiate it from the other a tags on the page, so let's go up to the next level.
  • Now we've found something: the div tag one level above has a class attribute that looks unique. Since the a tag that wraps the img tag we want is nested inside it, we can try pulling our element from within this div, which we can call up by its (hopefully) unique class.
    <div class="cover-image image-link-gallery">
  • The div and its class seem like a good bet for how we can get this element on the page, so let's try to target that specific container and extract the img tag within it.

Using Asset location in Code

  • Let's open a new Jupyter notebook to write a script using this asset location that we've found to print out only this one image tag from our sample page.
  • Start off by running the same beginning parts of our script as in the last exercise. This beginning section imports the modules needed to run the script, specifies the URL we'll be looking at, then opens and reads the page in as the variable soup. For more on the thought process behind this beginning code, go back to the last section of the tutorial.
    from bs4 import BeautifulSoup
    import urllib2
    url="https://imagecomics.com/comics/releases/24-panels-tp"
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
  • We'll get the same  warning as last time that we didn't define a parser, so it will use the default. Ignore it.
  • Just to show why we need to target a specific location to find the img tag that we want, let's see what happens when we look for all the images on the page. We'd do this by using the findAll function to search for all the instances of img. Try executing this code.
    main_image = soup.findAll('img');
    print main_image
  • This command does indeed list many more images than the one that we want, so let's proceed to target the div that we know that the cover image lives in.
  • From the section above, where we played around with the console to read the HTML structure of the page, we found that the img tag for the cover image was within an a tag, which was within a div with the class of 'cover-image image-link-gallery'. Let's see what we get if we target that div, and print it.
  • Up until now, we've only targeted elements, or the text and other items associated with them. However, you can use the findAll function to look for elements that have certain attributes, by adding those in as a parameter after the element that you are looking for, using the format attrs={'key': 'value'}. This works for all kinds of things: classes, ids, styles, anything contained in the opening tag of an element that has both a key (class, id, font-size) and a value assigned to that key.
  • In this case, you want it to look for the div that has the attrs of the 'class': 'cover-image image-link-gallery', so run the following code
    cover_image = soup.findAll('div',attrs={'class':'cover-image image-link-gallery'});
    print cover_image
  • This prints out all of the contents within that div, so you can see that you have targeted an item on the page that actually exists, and only exists in one place. Next we'll try to get just the img tag from within the output below.
    [<div class="cover-image image-link-gallery">\n<a href="https://cdn.imagecomics.com/assets/i/releases/390097/24-panels-tp_cf329f75a5.jpg">\n<img alt="24 Panels TP" class="with-border" src="https://cdn.imagecomics.com/assets/i/releases/390097/24-panels-tp_19ac80b44e.jpg"/>\n</a>\n</div>]
  • Like we did in the previous section of the tutorial, we'll create a for loop to iterate through each of the sections of this div stored as cover_image and look for all the img tags.
    Then, from within each img tag, we'll use a second for loop to print just the src attribute, since a look at the output above from printing cover_image shows that the URL we want comes in the part of the img tag that follows src=. Run this code.
    for div in cover_image:
        links = div.findAll('img')
        for img in links:
            print img.attrs['src']
  • Execute this code and the returned information should be the image source, which if we paste into the browser bar, should take us directly to the cover image by itself.
    https://cdn.imagecomics.com/assets/i/releases/390097/24-panels-tp_19ac80b44e.jpg
  • If you want to see the code for this in a saved Jupyter Notebook, it is at the bottom of this tutorial. An optional, slightly more compact way of writing this same lookup is also sketched just below.
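As an aside, Beautiful Soup also has a find function that returns just the first matching tag instead of a list, so once you know the div you want only appears once on the page, the same lookup can be written a little more compactly. This is just an optional alternative to the loops above, not something the rest of the tutorial depends on:

    # find returns the first matching tag, or None if there is no match
    cover_div = soup.find('div', attrs={'class': 'cover-image image-link-gallery'})
    if cover_div is not None:
        img = cover_div.find('img')
        if img is not None:
            print img.attrs['src']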

Finding Multiple Elements

Now that we have figured out how to get the single unique element we want out of the page, let's see if we can get a group of elements. Remember, the eventual goal of this exercise is to get the listing of writers and artists for this issue (so we can then apply the script to many issue pages on Image Comics' website).
Location of artist and writer data on the issue page.

  • Right click on the Writers heading and choose Inspect. When you open the developer tools, you'll see that it is an h2 element within a div with the class of role. But when you hover over the div of role, you'll see that this particular div only includes the writers and not the artists. When you look, you'll see that writers and artists both have a div with that class, but that they aren't both in the same div.
    Location of writer's information in the source code
  • Let's see what happens when you target just divs with the class of role
  • Open a new Jupyter notebook
  • Start off your script the same way with importing the modules you'll need, saving the URL you want as a variable, then using Beautiful Soup to read the page. Then execute this starter code with shift-enter or by running it.
    from bs4 import BeautifulSoup
    import urllib2
    url="https://imagecomics.com/comics/releases/24-panels-tp"
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
  • You'll get that parser warning again, which you can ignore. Next, let's try to target the div with the class of role using the findAll method and print all of its results
    creative_team = soup.findAll('div',attrs={'class':'role'});
    print creative_team
  • You'll see creative_team includes all the information that's within a div with a class of role, and by using Ctrl+F you'll see that there are in fact 3 different divs that have this class, and it's printed each of them.
    The word role highlighted 3 times within the results printed to the console
  • If we were only searching this one page, this would be fine; after all, combined, the three divs do include all the information that you want, they're just in separate groupings. But we are planning on using this script on a bunch of different pages, so ideally there will be only one entry per page we are scraping, so we don't have to do extra work collating it. This means we may have to revise this code to work better for use with multiple pages.
  • Let's first make sure we can target the text we want within these divs. Similar to above, where what we really wanted was just the one img src link so we had to use a for loop to get to it, what we want here is just the text within each tag. From the sample below, that means we only want the text between the tags, like Writers and Leigh Alexander. We'll use the .text method from the last section to get the text between the tags on all the tags that have text between them.
    <h2>Writers</h2>\n<aside>\n<img alt="Leigh Alexander" src="https://cdn.imagecomics.com/assets/img/default-user-square.svg"/> <a href="https://imagecomics.com/creators/leigh-alexander">Leigh Alexander</a>\n</aside>
  • We can use the print function to see how many different entries we are actually printing to the console, by printing not just the text between the tags, but also the url variable that represents the page. So we're using a for loop to iterate through each different part of creative_team and perform the command below on each part (represented by the local variable c)
    for c in creative_team:
            print url + c.text
  • Unfortunately, it looks like each div with the class of role is coming up as its own entry, which would be a problem for later when we try to create a table from this data.
    Results printed to console. After each image url there is a series of names separated by blank lines.
  • The reason this is a problem is we eventually want to just be able to apply a few formatting tricks to our data and have it look like this
    https://imagecomics.com/comics/releases/cemetery-beach-1 Writer Warren Ellis Artist Jason Howard
    https://imagecomics.com/comics/releases/paper-girls-25 Writer Brian K. Vaughan Artists Cliff Chiang
  • But if each div is its own entry it will instead look like the below. You'll be separating data that you want to keep all categorized by the same URL which will mean much more data cleanup.
    https://imagecomics.com/comics/releases/cemetery-beach-1 Writer Warren Ellis    
    https://imagecomics.com/comics/releases/cemetery-beach-1 Artist Jason Howard    
    https://imagecomics.com/comics/releases/paper-girls-25 Writer Brian K. Vaughan    
    https://imagecomics.com/comics/releases/paper-girls-25 Artists Cliff Chiang    
  • It is fine that our first attempt didn't work. Writing code is about revising and adapting, so let's try moving up to the unit that the div with the class of role is nested in. By scrolling up to the next level of code in the console, we'll see that the div with the class of cell medium-5 large-4 credits contains the three divs that we targeted before, and it doesn't look like it contains anything extra that we don't need. Let's test that out by taking the version of our script that targeted the div with the class of role and changing it to target the new class of cell medium-5 large-4 credits.
    The part of the Issue entry page containing the creative data is highlighted and designated the div name mentioned.
  • Adjusting a script we already wrote to look for a new parameter is easy: we just change role in the script we wrote to cell medium-5 large-4 credits and run it
    creative_team = soup.findAll('div',attrs={'class':'cell medium-5 large-4 credits'});
    print creative_team
  • We'll get a result that looks a lot like the first one, but now it's all wrapped within the tag of <div class="cell medium-5 large-4 credits">
  • We kept the same variable of creative_team and just changed the value that was stored in it, so we can rerun the code we already wrote that iterates through all the elements within it and pulls the text between the tags.
    for c in creative_team:
            print url+c.text
  • Now you'll see that the URL for the page is only printed once to the console, with the rest of the data that we pulled following it, which is just what we want.

When you have the code working on a sample page that's representative of the larger site that you want to scrape your data from, then the next step is to see if you can apply it across multiple pages. I've uploaded the Jupyter Notebooks that I used below if you had any problems and needed to see how I accomplished certain tasks.

Learning Goals

In the previous section of this tutorial, we used Beautiful Soup to extract the data that we wanted from the sample webpage of one issue of an Image Comics publication, and print it to the console. This was crucial to figuring out how to do this sample project, as it was proof of concept that a script could do this on a page like all the others that we need to get information from. However, the reason that web-scraping scripts are handy isn't that they can do the same kind of copy and paste work that we could do ourselves on one page, but that we can apply the same script to multiple pages and receive the data back in a format that is easily usable for number-crunching or text analysis purposes. In this next section, we'll rewrite the script so that it performs the same function it did on the sample webpage (extracting the creative team's names and roles) on multiple web pages in a row, and exports the data that we have found into a text file.

Getting Started

  • Open the sample webpage that we have been using: https://imagecomics.com/comics/releases/24-panels-tp . Next we'll be adding a few other pages to the mix so in different tabs, open these 3 more web pages: https://imagecomics.com/comics/releases/cemetery-beach-1, https://imagecomics.com/comics/releases/paper-girls-25, https://imagecomics.com/comics/releases/monstress-18.
  • Click over to one of the new webpages. Using developer tools, right click on the place where the writer and artist credits are listed to verify that they are stored within a div with the same class as on the trial page: cell medium-5 large-4 credits. Most sites, luckily including this one, use the same template on all their pages that hold the same kind of data, so they can do things like apply CSS styles to elements of the same class or ID. Some won't change over older content when they update their site, in which case you'd probably have to write a different script for the older pages. It is always a good idea to confirm that the page you are targeting uses terms that are also used on the other pages you want to get data from, so you aren't writing a script that only works on an individual page rather than on all the pages that you want to scrape.
  • Once you have confirmed that the script that you wrote for that first sample page will work on these new ones too, open a new Jupyter notebook for this next project

Applying the Code to Multiple Pages

In general, it's a good idea to start with something that you know works, and go step by step to add new capabilities to your code.

If we right away throw in changes to our code that:

  1. Change how many web pages the script applies to
  2. Change where it writes the data to
  3. Change what information the script includes in the data that it writes out

and then when we execute that new code we get an error, we won't know which of the three things that we added went wrong. So let's start from something we know works, and add one thing at a time. That way if the new part breaks the code, we'll know which part of the code needs to be fixed.

  • In the new notebook, add in the code that we have written already that we know works to print the creative team's information into the console.
    This script imports the modules needed to execute the code, adds a url variable that holds the web page you want to harvest the information from, looks in that page for the div that contains the credits, then prints the text within each of the elements of that div.
    from bs4 import BeautifulSoup
    import urllib2
    url="https://imagecomics.com/comics/releases/24-panels-tp"
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    creative_team = soup.findAll('div',attrs={'class':'cell medium-5 large-4 credits'});
    for c in creative_team:
            print url + c.text
  • Execute this code to make sure that it works, and we'll see the creative team on that trial page print to the console, just like it did at the end of the last exercise.
  • Our next step is to run the same script that we did for the 24 Panels comic on the other issue URLs that we just opened in new tabs. Now, we could just copy the code and keep changing the page that the url variable points to, but that would defeat the whole purpose of writing a web-scraping script and automating the process. Instead, we accomplish the same thing by adding a for loop. Similar to what we did when we wanted the .text function to go one by one through the different elements in the div, a for loop will apply the code we wrote for our sample URL to each URL in a new list, one after the other.
  • We'll change the url variable to instead hold a list of the URLs for the pages we want, and change its name to urls, like the below. Each URL will need to be in quotation marks and separated from the next by a comma. Because this new group is a list rather than a single value, it is placed between brackets [ ].
    urls = [ "https://imagecomics.com/comics/releases/24-panels-tp", "https://imagecomics.com/comics/releases/cemetery-beach-1", "https://imagecomics.com/comics/releases/paper-girls-25", "https://imagecomics.com/comics/releases/monstress-18"]
  • Above this we'll type in the beginning of a new for loop, making the local variable url. Since that's the same variable name we used when writing the code that only looked at the sample page, we don't need to change much else about our code. In effect, we are now telling the code to take all the actions we previously had it take on the 24 Panels URL, and then, when it finishes, to move on to the next URL in the list stored in the variable urls.
    for url in urls:
  • Below this new line, indent the rest of the code that's already written, including the other for loop. Indentation is how Python knows which lines of code belong inside the loop and should be repeated for each of these urls. It should look like this, except the urls list should all be on one line.
    from bs4 import BeautifulSoup
    import urllib2
    urls = ["https://imagecomics.com/comics/releases/24-panels-tp", "https://imagecomics.com/comics/releases/cemetery-beach-1", "https://imagecomics.com/comics/releases/paper-girls-25", "https://imagecomics.com/comics/releases/monstress-18"]
    for url in urls:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        creative_team = soup.findAll('div',attrs={'class':'cell medium-5 large-4 credits'});
        for c in creative_team:
            print url + c.text

     
  • Execute the code, and when you scroll through, you'll see that after that first group of writers and artists, there are other groups printed to the console too
    The results printed to the console. The role is followed by the creative's name, and it's repeated.
  • Let's make another small change to make it clear in the printout where one webpage's content starts and another ends. This will also make sure that the url text doesn't run into the creative_team text. The print function works on variables or parts of variables like url and c.text, but it can also print out other formatting text.
  • We'll add a pipe character "|" to the beginning of the printout to indicate when a new page's data is beginning. Then we'll keep in the url variable, so that you'll be able to tell which page the information came from, then add a space, expressed by " ", and end with the c.text that gives you the creative team's data from the page. Anything that isn't a variable needs to be written in quotation marks. The reason we used the | character is that it is extremely unlikely to show up as part of our data, so we can easily replace it when it comes time to clean up our data without damaging anything important.
    So the last line of your code will now look like the below
    print "|" + url + " " + c.text
  • Go ahead and execute your code. Now the output indicates when a new page's data is appearing by putting the pipe character in front of the URL. This will come in handy later when we have to organize the data that we send into a file into a proper spreadsheet table.

If you were tripped up anywhere in the process, a Jupyter notebook file will be at the bottom of this section.

Exporting your Data to a File

If we only had a small number of web pages to deal with, we could just continue to print all our data to the console and then copy and paste it into whatever program we'd be using to analyze the data. On the other hand, if we were writing this script to be applied to hundreds or thousands of pages, then we'd want the data dumped into a document, both to keep it organized and to give us something more permanent to look back on as we continue analyzing our data. In this next section, we'll add a write function to our code so that the information we are getting goes into a file. We'll also put a time delay into the code so that we don't get kicked out by the site's server for making so many requests.

  • Open a new notebook, paste in the below code we've already written and execute it. Like in the last section, we'll want to start from a point where we know our code works before we start making changes to it, and preserve that working code so it will be easier to backtrack to if we have an issue somewhere in the process of adding new functions. The urls should all be on one line; they only wrap here because of the formatting of this guide.
    from bs4 import BeautifulSoup
    import urllib2
    urls = ["https://imagecomics.com/comics/releases/24-panels-tp", "https://imagecomics.com/comics/releases/cemetery-beach-1", "https://imagecomics.com/comics/releases/paper-girls-25", "https://imagecomics.com/comics/releases/monstress-18"]
    for url in urls:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        creative_team = soup.findAll('div',attrs={'class':'cell medium-5 large-4 credits'});
        for c in creative_team:
            print "|" + url + " " + c.text
  • At the top of the script, with the other modules that you are importing, add a line that says the below. This is a module that lets Python access time-related functions.
    import time
  • The time module will let us make an addition to the for loop that has the program rest for a second before moving on. Add this to the bottom of your script, after your print command. It will have the script sleep (pause, basically) for a second after it finishes harvesting the data from each URL, before it moves on to the next one.
    time.sleep(1)
  • Speaking of URLs, let's expand the number of URLs in that variable so we can better test whether our script is going to work on a larger project. Replace the urls variable with the below. Remember, do not include line breaks; keep all the URLs on one line. This just adds other pages that have the same kind of data and setup that we wrote the earlier script for.
    urls = ["https://imagecomics.com/comics/releases/24-panels-tp", "https://imagecomics.com/comics/releases/cemetery-beach-1", "https://imagecomics.com/comics/releases/paper-girls-25", "https://imagecomics.com/comics/releases/monstress-18", "https://imagecomics.com/comics/releases/rock-candy-mountain-7","https://imagecomics.com/comics/releases/saga-54","https://imagecomics.com/comics/releases/elsewhere-5","https://imagecomics.com/comics/releases/rose-9","https://imagecomics.com/comics/releases/skyward-1","https://imagecomics.com/comics/releases/the-fix-13"]
  • We've made a few changes to the code now so execute it to see if it still works. Now instead of all of the results appearing at once, it takes about 10 seconds. We've added more urls that it is collecting data from and have added in the code to get it to rest for a second in between each for loop's execution, so the longer time to execute makes sense.
  • To tell the program to export our data to a file, we first need to tell it which file to export that data to. Above the urls section of your code, add in this new variable f, and set it equal to the below. This uses the open function to open a text file, and the "w" tells the code to write to it. If the file doesn't exist, the open function will create it. I'm naming this file Comics_Scrape_Test.txt, but if you wanted you could name it something else; what matters is that there are no spaces in the name and that it ends in .txt
    f = open("Comics_Scrape_Test.txt","w")
  • At the very bottom of your code, between the line with the print command and your new bit of code telling the program to sleep, we'll add a new command to take what we are printing to the console and write it to the file we asked our code to open at the top of the script. The line below tells the script to also write that same information that we're printing into the file you created when you set up your f variable. Notice how it contains the exact same text that follows print in the script we executed previously.
    f.write("|"+ url + " " + c.text)
  • At the bottom of the code, outside the indentation created by the for loops, write the below. This closes the file when everything has been written to it (note the parentheses that make it a function call). If you don't close f, then as long as the Jupyter Notebook is open, your file will look blank to you when you open it. An alternative that uses a with block to close the file automatically is sketched after this list.
    f.close()
  • Some people like to get rid of the print function once they see that they're getting the data that they need, but I'll leave it in the code for now just so I know that it's getting the data from the website. That way, if the data that I'm trying to get doesn't show up in my text file correctly, I'll know that the right data is there and that there's just a problem with the command to write it to the file.
  • Once your code looks like the below, execute it.
    from bs4 import BeautifulSoup
    import time
    import urllib2
    f = open("Comics_Scrape_Test.txt","w")
    urls = ["https://imagecomics.com/comics/releases/24-panels-tp", "https://imagecomics.com/comics/releases/cemetery-beach-1", "https://imagecomics.com/comics/releases/paper-girls-25", "https://imagecomics.com/comics/releases/monstress-18", "https://imagecomics.com/comics/releases/rock-candy-mountain-7","https://imagecomics.com/comics/releases/saga-54","https://imagecomics.com/comics/releases/elsewhere-5","https://imagecomics.com/comics/releases/rose-9","https://imagecomics.com/comics/releases/skyward-1","https://imagecomics.com/comics/releases/the-fix-13"]
    for url in urls:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        creative_team = soup.findAll('div',attrs={'class':'cell medium-5 large-4 credits'});
        for c in creative_team:
            print "|" + url + " " + c.text
            f.write("|"+ url + " " + c.text)
            time.sleep(1)

    f.close()
  • When your code finishes executing, the bracket next to your script will have a number in it instead of an asterisk. Navigate in your Explorer or Finder to the folder that Jupyter is operating out of (in my case it's my user folder) and find the file you made, Comics_Scrape_Test.txt. Preview it or open it in a text editor and it should contain the same data that's in your console, though it's formatted a bit awkwardly.
  • If the file is blank, try restarting the kernel on your Jupyter notebook. This is under Kernel, choose Restart.
  • The resulting file still does look a bit of a mess though, so the next module will walk you through how to turn this file into something closer to a table or chart.
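As mentioned above, one alternative to calling f.close() yourself is Python's with statement, which closes the file automatically when the block ends, even if the script hits an error partway through. A version of the writing portion of the script using with (same imports and urls list as above, with the print line dropped for brevity) might look like this:

    with open("Comics_Scrape_Test.txt", "w") as f:
        for url in urls:
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page.read())
            creative_team = soup.findAll('div', attrs={'class': 'cell medium-5 large-4 credits'})
            for c in creative_team:
                f.write("|" + url + " " + c.text)
            # rest for a second before moving on to the next URL
            time.sleep(1)
    # no f.close() needed -- the with block closes the file for us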

If you had an issue at any part of the script, take a look at the Jupyter notebooks that I've created below with this code.

Learning Goals

At the end of the last section, we took our script that could get the information that we wanted from one webpage, and transformed it into a script that took that same information from multiple webpages, and sent that information to a file. We're not done though, because that file still holds its data in a pretty messy format. We'll take the data that we got in the last lesson and transform it into a well-ordered text file that can be pasted into a spreadsheet. This is quick to do using shortcuts available to us in the text-editor program Notepad++.

With an extremely well-polished script, we could get better-organized data straight out of the scrape, but given how quick this cleanup is, I'd sooner take the time that I would have spent fine-tuning my script to get the data looking the way I wanted, and spend it on some other aspect of my project. It would certainly have taken me longer to get this script to export the data perfectly than the ten minutes this cleanup will take. (If you'd rather script the cleanup anyway, a rough Python version of the same steps is sketched at the end of this section.)

Getting Started

  • Download Notepad++ if you do not have it already. It is a free text editor that has a lot of helpful tricks and can be used to write normal text notes as well as a variety of scripting formats. It also has lots of ways you can batch edit your document, as you'll see now. Unfortunately, it's not available for Mac, but Atom or Sublime Text have similar functions.
  • Locate the file you created: Comics_Scrape_Test.txt in Windows Explorer. Right-click on the file and choose to open it in Notepad ++
  • If you did not complete the last section and don't have this file, you can download it below.

Organizing your Text File

  • Turn on the Paragraph function (the backwards P sign at the top of the window) to get a more exact idea of how what we are looking at is formatted.
  • There are a lot of empty lines, and the data for each issue is in a jumble down the page after that issue's URL, not in a neat row following it.
    The results in a Notepad++ document, each name is separated by several lines.
  • Let's start by getting rid of all of those empty lines. Select everything in the file with Ctrl +A and then go to the Edit menu and choose Line Operations > Remove Empty Lines
  • All of the empty lines in between your data are now gone, but there is still a line break between each piece of data. Let's get rid of those line breaks (\r\n in formatting speak) and for now replace them with a tab, rendered as \t in the notation Notepad++ uses. The reason to use a tab is that when you paste this into a spreadsheet program, that program will read the tab as a command to separate one block of text from the next by putting them in different cells, which is what you'll want it to do.
  • We can do this easily with Replace. When you're editing a whole bunch of text and want to make the same change over and over again, this function will be a huge aid to you. You'll find it in the Search menu as Replace
  • Go to the bottom of the window that opens up, and make sure that Search Mode has the button selected next to Extended. That will mean that as it searches through your document, it's not just searching through characters but also formatting, like in this case the line breaks.
  • You have the Paragraph view on so you can see that there's a carriage return and line break at the end of each line. This is rendered as \r\n in formatting and for now you'll be replacing that whenever it appears with a tab or \t. Your Replace window should look like the below, and when it does, click on Replace All.
  • You'll see that now your document contains only one line, with little arrows representing tabs between everything that was on separate lines before your search and replace. Scroll through it to make sure that nothing important is missing. Replace All lets you make huge changes to your document really fast, but that also means you want to catch any unintended side effects of these sweeping changes right after they occur, instead of realizing they caused an issue many steps in and having to start over from the original file.
  • Let's separate this block of text so that each line starts with the URL of the page its data came from. The reason that you put a pipe | in front of the url is that it is a character that is rarely, if ever, used, so you can be certain that if you replace the pipe with something else you aren't damaging your data. In this case, since the pipe sits at the beginning of the URL, and thus at the beginning of a new page's data, you'll replace it with a line break, or \r\n, so that each web page's data ends up on its own separate line.
  • Go to Search and Replace, set what you want to find to the pipe |, and set what you want it replaced with to \r\n.
  • Click on Replace All, and each page's data should now be its own line with tabs separating each name or designation (Writer, Artist etc) from the one before.
    Each line starts with a URL, then the roles, and the names associated with it, each separated with a tab.
  • Let's confirm that your data is now in a state where you can essentially treat it as a spreadsheet, by pasting it into one. Select everything on the page, and use Ctrl+C to copy it.
  • Open a new Google Sheets file in your Google drive. This next step will also work in Excel. Paste in your modified text file. It should automatically read each tab as an indication to create a new cell.
    The text in notepad++ has been mapped onto a Google sheets as an orderly table.
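As mentioned earlier, if you'd rather script this cleanup than do it by hand, the same three operations (remove the empty lines, swap the remaining line breaks for tabs, and turn each pipe into a new line) can be done in a few lines of Python. This is only a rough sketch, assuming the Comics_Scrape_Test.txt file from the previous section and a made-up output filename; the Notepad++ route above is still the quicker option for a one-off job.

    # read in the raw scraped file
    raw = open("Comics_Scrape_Test.txt").read()

    # drop the empty lines, then join what's left with tabs instead of line breaks
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    one_line = "\t".join(lines)

    # each pipe marks the start of a new page's data, so turn it into a line break
    cleaned = one_line.replace("|", "\n")

    # write the result out to a new file (the name here is just an example)
    out = open("Comics_Scrape_Test_clean.txt", "w")
    out.write(cleaned)
    out.close()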

You've now created a script that takes the information that you want from a website, and with a little modification, turns it into an organized table for you. If you want to see what this text file looks like now, check the below file

This full lesson can be applicable to all kinds of webpages that you want to get information from. When you want to use Beautiful Soup to scrape a web page, you just need to:

  1. Locate where the information you want is in the web page's HTML
  2. Write code to target that location and scrape the element that you want from it
  3. Confirm that it's giving you all the information you want and only the information that you want by printing the results to your console 
  4. Scale up your script so that it works for multiple pages
  5. Alter the code to print the information to a file, not just your console
  6. Confirm that you can turn that file into something that is usable in a spreadsheet or other useful format for you by cleaning your data

There can be roadblocks along the way, and not all sites are organized as straightforwardly as this one, but with these general steps, gathering your data from the web with Beautiful Soup can be a nice orderly process.