DAsH

Research Guide for DAsH (or digital humanities) resources and tools

What is an API?

API is the abbreviation for application programming interface. Companies and websites build these interfaces on top of the servers where they store their data so that the data can be accessed and queried quickly. Sometimes these groups make their interface public, and sometimes they sell access to it. Publicly accessible APIs can be a great place for you to pull data from.

For example, if you wanted to search Twitter to see how many people were tweeting the hashtag #manhattanhenge on the date of that astronomical event each year, you could search Twitter's regular advanced search interface and then copy and paste all of the results manually. But if you wanted to do this efficiently, you could use Twitter's API to pull this information for you and return it in a format (usually JSON) that you can use to graph or otherwise analyze.

Developers also use these APIs to build their own programs. For instance, someone at a ticketing company could build a program that displays the expected weather for the outdoor events it sells tickets to. The program would take the event's zip code, date, and time, query the National Weather Service's API with that information to see what the weather is forecast to be at that place and time, and then display that forecast to a user looking to buy tickets.
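To make that concrete, here is a rough sketch of that kind of lookup in Python, using the National Weather Service's public API at api.weather.gov. The coordinates below are placeholders standing in for a geocoding step (the NWS API takes latitude and longitude rather than zip codes), and the User-Agent string is an arbitrary label, so treat this as an illustration of the pattern rather than a finished program.

    import requests

    # Placeholder coordinates for the event venue; a real program would first
    # geocode the event's zip code to get these.
    lat, lon = 40.75, -73.99

    # The NWS asks that requests identify themselves with a User-Agent string.
    headers = {"User-Agent": "weather-for-events example"}

    # Step 1: map the point to its forecast endpoint.
    point = requests.get("https://api.weather.gov/points/%s,%s" % (lat, lon),
                         headers=headers).json()

    # Step 2: request the forecast itself and show the next two periods.
    forecast = requests.get(point["properties"]["forecast"], headers=headers).json()
    for period in forecast["properties"]["periods"][:2]:
        print period["name"] + ": " + period["detailedForecast"]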

With these tutorials, you'll see how to access an API using a web interface, how to query one manually with a URL, how to query one with a Python script, and how to use a Python wrapper or library that someone else built for a specific database. These exercises assume you already know the basics of how to write and understand Python code. You can learn basic Python for free at Codecademy, or, if you have a New York Public Library card, through the Python tutorials on Lynda.

Querying APIs

Learning Goals

Some websites built on databases make their API available through a web interface. This makes it easier for someone who is less code-savvy to still get information from the site. An API is queried by feeding it a URL containing the keywords or categories to be searched for, as well as any other parameters needed to narrow the search.

An API query looks something like the example below. It consists of the API's URL, plus additional information about what kind of search to do, what to search for, and any parameters or limits to apply.

https://marvel.fandom.com/api/v1/Articles/List?category=X-Men_members&limit=25
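Broken into its parts, that query reads:

https://marvel.fandom.com/api/v1 - the API's base URL
/Articles/List - the kind of search to do (list articles)
?category=X-Men_members - the first parameter: only articles in the category X-Men_members
&limit=25 - a second parameter, separated by an ampersand: return at most 25 results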

If there is a web interface, people querying that API don't need to know how to format a query URL or how to automate that process. With a web interface, we can just fill in some boxes and the site will do the rest for us. Some APIs also require a key; for this example you will not need one. For this exercise, we'll be using the Marvel fandom wiki to pull all the articles for comic book issues that came out in 2018. This is something I did while working on a large-scale project about comics; in my case, I was grabbing these articles' URLs so I could then scrape information from the article pages. For more information about web-scraping, see the DAsH web-scraping tutorial.

Getting Started

  • Go to the website whose API we will be using - https://marvel.fandom.com. This is the front end of the site, or what you see as the user. How it is organized will tell us something about the kinds of categories and distinctions the site uses to classify its contents. This is a fan-curated database for Marvel Comics, containing articles written by fans on characters, titles, issues, creators - all different aspects of the publisher.
  • We want to find all of the pages about comic book issues that were published in 2018, so let's see if there's any space where comics are separated by year. Go to Comics in the menu bar and choose All Comics.
  • This takes us to a page that has an index of comics by title and then a section by date. Click on 2018 since that's the one we want.
  • This takes us to an index, but with signage at the top saying that it is a category page.
    [Image: An index of the titles in the 2018 category, with thumbnails for the different comics of that year.]
  • When we look at the URL, we see that it also has category in it.
    https://marvel.fandom.com/wiki/Category:2018
  • Go back to the previous page - All Comics, and click through a few different years. You'll see that this is consistent. Each year's link takes you to a landing page that has the category page label at top, and the URL lists Category:2018 or Category:2009 or whichever year's landing page you are at. Now we have an idea of how these articles are categorized by year, and can proceed to the API to see if we can search by this category.

Using the API

  • Open a new tab and go to the API's landing page -  https://marvel.fandom.com/api/v1. APIs will have a documentation page that explains how to start the URL for each search that you want to do, what parameters you can specify, and what kind of information you can expect to get back. I'm using this one as an example because it is formatted in a pretty user-friendly manner, and will let you try out different searches on the website itself.
  • Each of the different items on this list is a different kind of search you can do. The list items are formatted like /Activity and /Articles since that's how the URL for the query to the API will start off. If you click on any of the list items, you'll get more information about those searches.
  • The first option is /Activity, which has a blurb letting you know that it's about user activities. This might be interesting for you if you were doing a project about what kinds of pages got edited the most often. This isn't our eventual goal however.
  • The next one is /Articles, which the blurb tells us will give us article contents. That sounds like what we want, so let's click on the expand option on the side, which takes us to another level of possible searches.
    [Image: /Articles is highlighted, and below it is a list of other searches, including article content, details, and articles list.]
  • You'll see that each option has a different description of what the search will do at the top, and below that is part of a URL which lets you know how a query to that kind of search will start. We need to find the search that returns the information that we want (the url of the article) and takes the parameters that we want to search by (the category articles are in).
  • Let's click on the second option, Get details about one or more articles. When we do this, a new section opens up on the page, with further information about what we can do with that search.
    [Image: A list of the data that a search will return.]
    At the top, there is a section called Model. This is the information that a search will give us. It may look daunting, but roughly translated:
    • ExpandedArticleResultset - how many items it finds, and the base path
    • ExpandedArticle - a lot of information here, including the URL which is what we want to find for this article, the title, the URL for the image thumbnail
    • OriginalDimension - this relates to the thumbnail image. If you were writing a program that wanted to find the images associated with each article, you'd be interested in this aspect of the search results
    • Revision - this provides data about who revised the article
  • To the left of Model is a tab that says Model Schema. Click on it and you'll see an example of how the query return will be structured.
    JSON code written below the heading Model Schema
    This is written in a format called JSON - JavaScript Object Notation. JSON is a way of organizing data by nesting it at different levels. It looks intimidating, but for our purposes you'll be interested in only a few of these items, which you can easily find with Ctrl+F without fully having to understand how JSON is set up, promise. If you do want a bit more of an idea of how this format works, check out the W3Schools tutorial on the topic.
  • Keep scrolling down, and you'll see that there are some editable boxes underneath the heading Parameters. These are the parameters that the API will use to search. Think of them like the search terms you put into a search engine box, plus any filters you would check off about what kinds of entries you wanted - like searching images.google.com for pictures of cats, but only above a certain image quality. This is the more direct version of doing that search.
  • The parameters that it accepts are ids (the article ID) and titles (the title of the article). There are also parameters for abstract, width and height, which are more like limiters than search terms: they let you pick how long a summary you want for each article and how big you want the thumbnail to be. You won't get an actual image in your search results, just a URL that will lead you to one.
  • Since we already know that what we actually want to search for is everything in the category of 2018, we know this isn't the option we want since it will only let us search by title or by article ID. Click the arrow to roll this search option back up, and let's try the next option down, Get articles list in alphabetical order.
  • When we check out the Model data for this option, we'll see that, like the first option we tried, it will give us the url, which is what we ultimately want, as well as the article's id and title.
  • Let's keep scrolling, and fortunately, there is a space in the parameters where you can search by category. Type 2018 into that box. The other parameters are namespaces (which you don't need to worry about - you can leave it blank), limit (more on that in a sec) and offset (don't worry about it either).
  • Limit is set to a default of 25. This means it will only give you the first 25 results. What we really want is ALL of the articles that have the category of 2018 assigned to them, so let's set it to a generously large number like 10000.
  • When your Parameter set up looks like the below, with 2018 in the box next to category, and 10000 in the box next to limit and the other boxes left blank, scroll down.
  • Below this is a section on Error Status Codes, but we can ignore it; it's just information on what the return will be if the query fails.
  • Speaking of the query, go ahead and click the button Try it out!, as this will make the query to the API that you've set up.
  • This opens another section onto the page, which gives you the results. First, there is the Request URL.
    https://marvel.fandom.com/api/v1/Articles/List?category=2018&limit=10000
    That is the exact way that the query was set up, and if you paste that into the browser window, it will give you the same results, just not in a box like it is below. Next to that is the Response Code, in this case 200, which means that it worked.
  • Scroll down, and there is the Response Body, which is the series of items that are your results. Each group of text within the same {} brackets is one item. It has that item's id, title, url,  and ns (namespace).
  • To get a better idea of what we actually have here, let's highlight all the text in the Response Body and copy it. 
  • In a new browser tab, navigate to http://www.convertcsv.com/json-to-csv.htm. This site will take the JSON results that you got and convert them to a CSV file. A CSV file is sort of like a spreadsheet file, but one that can be opened by any kind of spreadsheet program - Google, Microsoft, LibreOffice, whichever. CSV stands for Comma Separated Values, and spreadsheet programs read each comma as the division between one column and the next. This converter will take the JSON results you got and turn them into a table.
  • Paste the JSON results that you got in the Response Body into the field marked Step 1. Click on the button that says Format JSON.
  • Scroll to the part of the site that says Step 3 and click on the button that says Convert JSON to CSV. This will take the JSON file that we pasted in above, and make it into a CSV file, and if you keep scrolling, you'll see what that looks like as a table.
    [Image: The JSON appears as a CSV file and a table.]
  • At the top of that table, you can name this CSV file that you made Marvel_2018, and then download the result. The URL column just has the end part of each url, a relative path, but when we put it into a program, we can easily do a search and replace to add the site's base path - https://marvel.fandom.com - to the front and get a working URL.
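If you'd rather not rely on the converter site, the same conversion can be scripted. Below is a minimal sketch in Python (the language used in the rest of this tutorial) that assumes you saved the copied Response Body to a file named response.json; both file names here are placeholders.

    import json
    import csv

    # Load the JSON that was copied out of the Response Body.
    with open("response.json") as jsonfile:
        data = json.load(jsonfile)

    # Write one row per article; the keys match the fields in the response.
    with open("Marvel_2018.csv", "wb") as csvfile:  # "wb" for Python 2's csv module
        writer = csv.writer(csvfile)
        writer.writerow(["id", "title", "url", "ns"])
        for item in data["items"]:
            writer.writerow([item["id"],
                             item["title"].encode("utf-8"),
                             item["url"].encode("utf-8"),
                             item["ns"]])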

You've just used this website's API interface to get a list of all of the URLs for the comics that came out in 2018. Next we'll see how to make a similar query without having to go through the full interface to do so.

Manually Querying an API

  • When you did your query using the site's interface, one of the things it returned was the Request URL, the actual query that was sent to the API:
    https://marvel.fandom.com/api/v1/Articles/List?category=2018&limit=10000
  • Copy that URL and paste it into a new tab in your browser, then press enter to go to that address.
  • You'll see that the same data that came up in Response Body when you ran your search previously is now coming up as plain text within the window.
    [Image: The full JSON code in a browser window.]
  • So if you wanted to search for the 2017 comics instead of 2018, you could just change the URL to
    https://marvel.fandom.com/api/v1/Articles/List?category=2017&limit=10000
  • Change the URL in your browser to this, and press enter to do that search. You'll see all new data pop up.
  • Go ahead and paste this into the CSV converter from the last section, and you'll see that the titles and URLs involved are now different.
  • For websites like this where there is an interface, you can do a trial search to construct a sample query, and then take the URL it created, change a few parameters, and do a new search.
  • Knowing how to structure this URL is also key to writing a Python script that will do multiple API queries for you. You can always test a URL by pasting it into a browser window, and seeing if it works to give you information.
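Put another way, every query against this endpoint follows one template, and only the values change:

    https://marvel.fandom.com/api/v1/Articles/List?category=<CATEGORY>&limit=<LIMIT>

Swap <CATEGORY> for any category name the wiki uses (2017, 2018, and so on) and <LIMIT> for the number of results you want back.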

In the next section, we'll explore how to use Python to construct an API query. We'll see how using Python will allow you to only pull certain parts of the JSON results, such as a title and a url, and leave off extra information. This can lead you to having data that is easier to use.

Learning Goals

In the last exercise, we learned how to use an API that had a built-in interface, but not all are that user friendly. Sometimes you'll have to manually create your own query URL to feed into the API. You can do this right in the address bar of your browser, as we saw at the end of the last section, but you can also do it with Python. One advantage of using Python is that you can automate the process: you can write a sample script that you plug multiple searches into, make multiple queries in succession, or have the script store the results of your query as a nicely organized table. We'll be using the Marvel fandom wiki's API again for this exercise, this time to pull a list of female characters' names and article URLs and save them to a file.

Getting Started

  • Make sure that you have Anaconda for Python 2.7 installed on your computer. The kind of process we are going to use will also work in Python 3 with a few changes, but Python 2.7 is a bit easier to explain and so will be used in this tutorial. We'll be using Jupyter Notebook from the Anaconda Python platform throughout this lesson, so you can install both that and Python by downloading it from Anaconda's website. Choose the 2.7 version.
  • If Anaconda has been installed correctly, you can find Jupyter Notebook in your start menu if you have Windows, or in Finder if you're on a Mac. You can also start it by opening your command prompt (search for Command Prompt in Windows) and typing in
    jupyter notebook
    which will open Jupyter Notebook in your default browser.
  • Jupyter Notebook is an application that lets you create code in an environment where you can execute it easily and save your results, unlike if you were writing your code either directly in a command prompt window, or just saving it as a .py file which would include the code but not the results. For this reason it's ideal for tutorial situations like this. If you're more comfortable in another program, feel free to use that but this tutorial will assume that you are working in Jupyter Notebook.
  • When you have started up your copy of Jupyter Notebook, select the New dropdown in the upper right corner and choose Notebooks -> Python 2.
  • This will open up a new Notebook and have a blank console ready for you to type in information. It's important to remember that you press enter to give yourself a new line in the text interface. You press shift-enter to execute the code.
  • Eventually we'll be editing the information that we have saved off the site to a text file. For this we'll be using the program Notepad++. Download Notepad++ if you do not have it already. It is a free text editor that has a lot of helpful tricks and can be used to write normal text notes as well as a variety of scripting formats. It also has lots of ways to batch edit your document, as you'll see shortly. Unfortunately, it's not available for Mac, but Atom and Sublime Text have similar functions.
  • In another tab, go to the Marvel fandom website. In the last exercise, we found that there was a separate category for issues released in each year. Now in this exercise, we want to see if there's somewhere on the site where all of the female characters have been put into a single category. This is similar to something that came in handy when I was working on a large-scale comic book project and wanted to see who an issue was about. 
  • At the menu bar on the top there is a category called Characters. Click on that link.
  • Scroll through the options and click on the link for Category: Characters by Gender.
  • On that page, there is a link to Category: Female Characters
  • Look at the URL for that page and see that it is 
    https://marvel.fandom.com/wiki/Category:Female_Characters
  • In our previous exercise, we were able to use the term after Category: in the url as the category that we were searching using the API, so let's see if we can use Female_Characters the same way that we used 2018, to find all the pages in that category.

Confirming the API's Query Structure

To start with, let's go back to the API documentation and make sure that we know all the terms used for the parameters and the base URL of the query.

  • Return to the API's landing page - https://marvel.fandom.com/api/v1
  • Go to the /Articles option, and click to expand the options available to you off of this kind of query. As mentioned in the last tutorial, we are looking for a list of articles that fall within a certain category.
  • Click on the option for Get articles list in alphabetical order, and confirm that this option will let us put in a value to search for in a certain category by scrolling to the Parameters section.
  • Check the Model to confirm that the article title and url are part of the results, since that's the information that you want to get out of this API query; we'll see that they are both listed in the model.
  • Because we have the interactive interface for this API, let's do a trial run of the query that we want to do: put Female_Characters in the value field for category, leave the limit as 25, then click on Try it out!
  • We'll get some results, but what we'll find most useful is the Request URL that it displays, showing exactly what request got us these results:
    https://marvel.fandom.com/api/v1/Articles/List?category=Female_Characters&limit=25
  • We can use this Request URL as the model for making our request to this API with Python.
  • In our Jupyter notebook, we'll now substitute each of the components of the URL with variables, and get the results printed to our console in a much neater and more organized way than the JSON response we get within the interface of the page. We'll also raise the limit on the number of responses that we are looking for.

Querying an API with Python

  • Let's start off the script by importing the modules that we'll need.  The first module, requests, will let us make a query to an API and the second, json, will let us parse the JSON results that we get in response.
    import requests
    import json
  • With this script we'll be recreating the request URL that we used to query Marvel Fandom's API, but using variables. Using variables means your code is more easily customizable: you could write in a bunch of different variables at the beginning of the code, swap them into the code each time you execute it, and get different results without having to rewrite everything. Let's start with the url at the beginning of the request and call it base_article_url.
    base_article_url = "https://marvel.fandom.com/api/v1/Articles/List?"
  • We need to add in the values that we plugged in as our parameters for the query. We want to look for articles in the category of Female_Characters, and let's increase the limit to 10000. We'll create variables for them.
    ladies = "Female_Characters"
    howMany = 10000
  • You'll combine those into a variable payload. The payload is what you use to designate the parameters that are fed in when you make your API request. If you wanted to swap in a bunch of different categories, one after the other, the fact that you made the category a variable instead of a string will make this automation easier.
    payload = {'category':ladies,
              'limit':howMany}
  • We'll use the get function of the requests module to query the API with requests.get, specify that we want it to start with the base_article_url that we set, and then feed in the payload variable that we set up as our parameters. r will store the response that is received when the API is queried.
    r = requests.get(base_article_url, params=payload)
  • Just to make sure we have it right, print the url that you sent with requests using the command below, and execute the code.
    print r.url
  • All you should get in return is the API request URL (because the URL is all you told it to print). It should match the URL that you got from the interactive API, just with a higher limit set.
    https://marvel.fandom.com/api/v1/Articles/List?category=Female_Characters&limit=10000
  • If you try to print the variable you've stored the request in, r, all you'll get is the response (200, showing that the query went through), but not the results.
    print r
    <Response [200]>
  • Convert the data that you got in your response to be read as JSON with the below command from the json module. Then print the results
    data = r.json()
    print data
  • It's important to note at this point that you already have all the API results stored in memory in the variable r. This means you can keep working with them even while offline, and you are no longer making requests of the API unless you use requests.get to query it again.
  • You'll see the JSON results of your query, but they're rather garbled and not all that neatly formatted. The response is actually a group of nested dictionaries, and you can parse through it like you would any other nested dictionary in Python; you just need to know how the nesting is structured. For more on Python dictionaries, take a look at the W3Schools documentation.
  • Go back to the API documentation to get a better handle on how to parse through the return that you are getting.
    [Image: The nesting structure of a response to an API request.]
  • You'll see that everything in the results is housed within a list called items and that within each of those, you have 4 different attributes, an id, title, a url and ns (which means namespace).
  • Let's drill down one level, accessing it as a dictionary.
    print data['items']
  • It prints out the full JSON response again, but now it goes down that one extra level and starts with the first item within that list.
  • But we'll want to print out each of our items individually, not as a large mass. We'll also want to format the information available for each item, picking some things and leaving out others. Let's try to print out just one of the entries in this dictionary, one level below items. We'll move down with an index, asking for the dictionary at index 2 within items.
  • Use the variable test, then ask for the 2nd indexed item in data. Note that this will find the 3rd item in the list, because the index starts at 0. Then print the variable you've created.
    test = data['items'][2]
    print test
    [Image: The JSON for the item at index 2 printed to the console.]
  • This returns the full JSON entry at index 2 under items. It has both the key and the value for the url, the id, the title and ns. This is still a bit messy for us, especially since we started out looking for just the names and the URLs. Fortunately, looking at the JSON, the character's name is the title of the page. Let's go into that item at index 2, go down one more level, and say that we just want the value stored with the key title.
  • The below code will do this. Each set of brackets moves us down one level in the dictionary. So the below says: look in our variable data, go down one level into items, find the entry at index 2, and within it return the value stored under title. Once we do this, let's print the result to be sure we have the right information.
    name = data['items'][2]['title']
    print name
  • With that, we'll get a printout of the title listed for the item at index 2.
    [Image: The title indicated above is now printed to the console.]
  • Now that we know how to target the data we want within each item in the JSON, let's scale up. The point of this kind of shortcut is to gather data at a larger scale and get all the names in your results. To do this, we basically just need to replace that number in the middle that tells the script which indexed item it wants out of items. We can do this with a for loop and the range() function.
  • A for loop lets us tell our script to repeat the same operation on each object in a sequence. These objects could be strings - for instance, if we wanted it to look up a group of locations and return their latitude and longitude. They can also be numbers, and in that case we can use the range() function to specify what numbers we want to sub in for a variable, as a range from, say, 5-10 or 0-50. That way we don't have to tell it in separate lines of code to look at 0, then 1, then 2, etc. If you just want to specify an end point, you put that number in the parentheses after range. There are other things you can do with range(), and for more tips, check out the range() function tutorial on W3Schools.
  • Mostly when people do this, they use i as the local variable in the for loop, probably because it reads easily as index or integer. Let's set up the for loop with i as the local variable - the stand-in for the object the task will be performed on - and range(50) as the sequence that we want the for loop to iterate through.
    for i in range(50):
        names = data['items'][i]['title']
        print names
  • This will print out the first 50 names from the results that you got.
  • In addition to the names from our search, we also want the URL that goes to the page about each character. But when we look at the JSON results, we see that it isn't formatted like http://... but just starts with /wiki/115_(Legion_Personality)_(Earth-616). Even though that line is listed as url in your results, that address won't work if you just plug it into the address bar of your browser. That's because this API stores the urls as relative paths, not absolute ones. This saves space on the website's end: the site just joins the base path - https://marvel.fandom.com - to the front of each relative url, and voila, the link works.
  • We can add this base path to the front of the URL when we use the code to locate it in our JSON results and print it. First, though, let's make sure that we can locate and print the url entry for the item we've been working on. As we did for title, we ask the code to look in the data variable, go down to the items level, go to the item at index 2, and locate the value stored with the key url.
    print data['items'][2]['url']
  • Executing this code yields the url, but it starts with /wiki/, so it won't work in a browser as-is.
  • I mentioned the base path above - https://marvel.fandom.com - without really explaining where I got it. If you're dealing with an API that gives you just the relative part of the URL, you'll have to figure out what to put in front. In this instance, here are some places it shows up:
    • On the initial category page where we found out that the category was Female_Characters, you can click on a few of the results of the page, and take note of the URLs of the results to confirm they all start out that same way
    • If that isn't where you originally entered the site, you can start from your JSON results, and use the search bar to locate some of the articles in the results, and then see how the URL for those articles starts off.
    • Lastly, at the very bottom of the JSON results that you first printed to the console, you'll see that the entry right after all of the items ends is for 'basepath' and lists that URL.
      [Image: The basepath within the JSON.]
  • Fortunately, in Python we can join strings (or text) and the contents of variables together. So we can tell it to print both the base path and the url that comes from the item's entry in the JSON dictionary. Let's store the result as the variable fullURL, connecting the base path we discovered, "https://marvel.fandom.com", to our item data['items'][2]['url'] with a plus sign. Print the result to make sure we have it right, and then execute the code.
    fullURL = "https://marvel.fandom.com" + data['items'][2]['url']
    print fullURL
  • The result will be the base path combined with the url from the item: https://marvel.fandom.com/wiki/115_(Legion_Personality)_(Earth-616)
  • Copy the URL and place it in the address bar in a new tab in your browser window. If you did everything correctly, this should take you to the article associated with this item.
  • Time to scale it up with the same automation that we did with the names. We'll put a for loop ahead of the code to create the fullURL variable, and then replace the [2] index place holder with the [i] variable.
    for i in range(50):
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print fullURL
  • This prints a full list of URLs for that whole range without a problem. So we know that we can find the url, alter it to work for our purposes, and then do the same task at scale.
    [Image: The script above and the resulting list of urls from it.]
  • Now that we have two different scripts that find the data that we want, let's combine them and get the data for all of the results that we found with our original API query. We'll change the range(50) function to range(10000) in our for loop since that's how high we set the limit. We'll also change that final print function to combine both of the items that we want printed, separated by a comma.
    for i in range(10000):
        name = data['items'][i]['title']
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print name + ", " + fullURL
  • When you execute this code, you'll get the entire list printed to the console, but it stops at names beginning with M. Maybe there are more than 10,000 names in the database. You could count and see, but that would take forever. Instead, let's run the code again, but along with the variables for the name and fullURL, let's also print the i variable that tells us what number each item is on the list.
    for i in range(10000):
        names = data['items'][i]['title']
        fullURLs = "https://marvel.fandom.com" + data['items'][i]['url']
        print i + "," + names + ", " + fullURLs
  • Alas, it's not that easy. You get an error message telling you that you're trying to combine int (an integer) and str (a string) in the same operation, which Python does not like.
    [Image: Console error - TypeError: unsupported operand type(s) for +: 'int' and 'str'.]
  • We can just convert the i variable to be read as a string with the str() function, which converts whatever you put in it into a string. Let's try the code again with that little change.
    for i in range(10000):
        name = data['items'][i]['title']
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print str(i) + "," + name + ", " + fullURL
  • Now you'll see that it prints out all of the same information it did before, but also tells you the index number for each entry. By doing this you can see that the script did indeed go through the entire JSON results, and that if we do want all of the characters indexed in the API, we'll have to set the limit higher for how many results we want to get.
    [Image: The results printed to the console, with their index numbers at the beginning.]
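One side note before moving on: rather than hard-coding range(10000) and guessing how many results came back, you can ask Python to count them. A small sketch using the same data variable as above:

    # How many results did the query actually return?
    print len(data['items'])

    # Looping over the list directly visits every item, however many there are,
    # so there's no risk of an IndexError from a range() that's too long.
    for item in data['items']:
        print item['title'] + ", https://marvel.fandom.com" + item['url']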

We've used Python to query an API, and then to parse through the results that we get back to turn them into this nice orderly list. But this information now lives only within the console. If the program restarts, you would lose it. Additionally, if you wanted to hand this information to another program - say, a text-analysis program, so you could see which names were the most common - you couldn't do that unless it were its own separate file, like a text or csv or xls file. Next up, we'll take these results and write them to a file, so we have a separate copy of all the information that we have found.

Writing a Python Script to Save the API Results

The above script works great if we are just looking to get quick answers to a console from an API. But the next step if we want to use this wide swath of information to continue our research project is to save it to a file. Next up, we'll be altering the script that we wrote, so that it doesn't just print the organized results of our API query to the console, but also writes them to a file.

  • We'll start out the same way as we did previously, importing the modules that we'll need, and adding the base url that starts off our API query as the variable base_article_url.
    import requests
    import json
    base_article_url = "https://marvel.fandom.com/api/v1/Articles/List?"
  • Here is where we begin the changes from the script we previously used. We'll add a new variable as a stand-in for the file that we'll have Python write our data to. We'll use the open function to create (and later re-open if we want) the file API_response.txt. You can name it something different if you want; the important thing is that the file name has no spaces and ends with .txt. We'll also choose to append - "a" - the information to this file. Sometimes you'd want "w" for write instead, but with append, every time you run the program it just adds more information to the file, whereas with write, it deletes everything already in the file and rewrites it entirely. We'll use this variable f later whenever we want the program to add something to the file.
    f = open("API_response.txt", "a")
  • Proceed with the rest of the script as we did before. We're assigning variables to the information that we want included in the API query, and then making the request to the API with the get function.
    ladies = "Female_Characters"
    howMany = 10000

    payload = {'category':ladies,
              'limit':howMany}

    r = requests.get(base_article_url, params=payload)
  • We'll print the URL used to make the request, and convert the new data that we received to be read as JSON.
    print r.url
    data = r.json()
  • When we execute this code, we should just get the URL which will look like this: https://marvel.fandom.com/api/v1/Articles/List?category=Female_Characters&limit=10000
  • All is on track like last time, so let's start the for loop that iterates through each of the results and prints out the title and url of each article. We can copy in the same script as last time, but with a few key modifications. For the first change, add a pipe, or |, to the beginning of what you want printed. In the console, each item is placed on its own line, but when it is written to the file it will be much more jumbled. Fortunately, with Notepad++, we'll be able to replace each instance of | with a line break, and put each item on its own line that way. The | character doesn't appear in this data anywhere other than where we put it, so we can be sure that we are not interfering with the parts of the data that we want.
    for i in range(10000):
        name = data['items'][i]['title']
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print "|" + str(i) + "," + name + ", " + fullURL   
  • Next, start a new line with f.write(). This will write anything that you put in the parentheses to your file. Paste in the same thing that you were having printed to your console between the parentheses.
    f.write( "|" + str(i) + "," + name + ", " + fullURL)
  • Finally, at the very bottom of the code, outside of the for loop, close the file that you've created with f.close(). Note the parentheses: f.close() actually closes the file (and flushes anything still waiting in the buffer to disk), while f.close without them does nothing. Execute this code.
    f.close()
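Assembled in one place, the script at this stage looks like the below. It's the same code as the steps above, gathered together; as we're about to see, it will run into trouble partway through.

    import requests
    import json

    base_article_url = "https://marvel.fandom.com/api/v1/Articles/List?"
    f = open("API_response.txt", "a")

    ladies = "Female_Characters"
    howMany = 10000

    payload = {'category':ladies,
              'limit':howMany}

    r = requests.get(base_article_url, params=payload)
    print r.url
    data = r.json()

    for i in range(10000):
        name = data['items'][i]['title']
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print "|" + str(i) + "," + name + ", " + fullURL
        f.write("|" + str(i) + "," + name + ", " + fullURL)

    f.close()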

The console fills up with the same information that you were seeing before, but as you scroll you'll see that it runs into an error message after item 196.
[Image: The results in the console, cut off at a UnicodeEncodeError with an explanation that the ascii codec can't encode a certain character.]

The only major thing we really changed versus last time is writing our results to a file instead of just printing them to the console. When we just printed them earlier, the script went all the way to the end of the list. Let's try opening the file to see if there's anything that can tell us what went wrong.

  • With Explorer or Finder navigate to the same folder that your Jupyter notebook opens onto. In my case, it's my Users folder in my C: drive
  • Locate the file you just created, API_response.txt, right-click on it, and choose Open With > Notepad++.
  • When you open the file, you'll see that the API results are indeed all on one line.
  • Notepad++ has great find and replace capabilities that you can use to batch edit text in your document. We are going to use those capabilities now to take that | character that we added to what we wanted printed and replace it with a line break. Once we do this we'll be better able to see the results that we got, and can better diagnose where the problems are in our code.
  • Go up to the Search menu option at the top and choose Replace. You can also do this with Ctrl +H.
  • Make sure that Extended is selected in the Search Mode box. This means that when it searches for matches and replaces text, it isn't just concerned with the plain text but also with the document's formatting characters.
  • In the box next to Find what: enter |, and in the box next to Replace with: put in \r\n. \r\n is the formatting code for a carriage return and line break. When you finish, select Replace All.
  • Now your document has a different line set up for each of the results that you got, and if you scroll, it ends with item 195.
    Each line is an API result starting with an index number, then the name, then a URL
  • However, our results in the console lasted until 196 before giving us a unicode error. Item 196 had an accent mark over the e, which seems to mean that our code can print that accent mark to the console, but hits an error when it tries to write it to the file.
  • We have a couple of options at this point. One, we could do a bunch of Googling to figure out what to do when this error comes up and use trial and error to implement these solutions in the code. Or two, we could see if we can have our code skip over these exceptions and then continue through the rest of the results using a try and except block.
  • I will opt for the latter, in part because it's important that you know this technique. Sometimes, due to irregularities in the materials you are dealing with, you won't be able to give your code instructions for every scenario that might be thrown at it, and you'll need to use try and except to tell it to just move on when it comes to an error.
    The perfect is the enemy of the good in coding, as is explained in this wonderful XKCD cartoon.
    [Image: A humorous cartoon pointing out that in theory, once you write an automating script it will do your task for you, but in reality, the troubleshooting and debugging will take more time than the original task.]
    Credit: https://xkcd.com/1319/
  • The way our code is written now, we have given it very literal and limited instructions for what it is supposed to do. Whenever something prevents it from doing what it was told, or it encounters something it doesn't have instructions for, it stops and tells you what the error is. A try and except block is a way to add some flexibility: when the code hits something that goes wrong, it does a different group of actions and then resumes. You get to tell it: "Try this first bit of code, except if that doesn't work, do this other bit instead."
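In skeleton form, a try and except block looks like the below. This is a generic toy example, just to show the shape of the structure before we apply it to our script: int() can convert some of these strings to numbers but not all of them, and the loop keeps going after each failure instead of stopping with an error.

    for value in ["3", "five", "7"]:
        try:
            print int(value)                          # works for "3" and "7"
        except:
            print "could not convert " + value        # runs when int() fails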

Let's revise this with a try and except block for the part about writing to the file. This way the script can continue through the rest of the results when it runs into an issue with one of the individual items we got from the API.

  • Go back up to the top of your code and change the f variable to go to API_response_revision.txt, so that you can compare this new file to your first one.
    f = open("API_response_revision.txt", "a")
  • We'll start off the for loop the same, up to and including the line with the print command. Remember, it did print out all 10,000 names when we tried the code before, so we know the problem doesn't occur until we write to the file.
    for i in range(10000):
        name = data['items'][i]['title']
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print " |" + str(i) +","+  name + ", " + fullURL
       
  • Here is where we start the try and except block. After your print function, set up a new line that just says try:
    Then press enter and indent the command that writes the same text to the f file that is being printed to the console.
    try:
        f.write(" |" + str(i) + "," + name + ", " + fullURL)
  • Next, we'll put in the except statement that tells the code what to do if it cannot execute that first command. We'll tell it to write the | character to the file so this entry still gets its own line, the index number i so that we know which result we are missing data for, and then the string ", bad text", so we have something to search for in our document to see how many of these errors we have to deal with. If it's a huge number, we may decide to go back and fix whatever part of the code is preventing results with accent marks from being written to the file.
    except:
        f.write("|" + str(i) + ", bad text")
  • Leave the concluding statement that closes the file the entries are being written to.
    f.close()
  • Your full code should look like the below, and when it does, execute it.

    import requests
    import json
    base_article_url = "https://marvel.fandom.com/api/v1/Articles/List?"
    f = open("API_response_revision.txt", "a")
    ladies = "Female_Characters"
    howMany = 10000
    payload = {'category':ladies,
              'limit':howMany}
    r = requests.get(base_article_url, params=payload)
    print r.url
    data = r.json()

    for i in range(10000):
        name = data['items'][i]['title']
        fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
        print " |" + str(i) +","+  name + ", " + fullURL
        try:
            f.write(" |" + str(i) +","+  name + ", " + fullURL)
        except:
            f.write( "|"+  str(i) + ", bad text")
    f.close()

  • When we execute this version with a try and except statement, the for loop keeps going until it reaches the end of the 10000 range that we set, because we told it what to do when it couldn't write the name and url to the file like we wanted it to.

  • Navigate to your new file - API_response_revision.txt in Explorer or Finder and right-click on it. Choose Open With > Notepad++ (or your chosen text editor for Mac).

  • Do the same find and replace that you did before, telling it to find | and replace it with \r\n.

  • Our file now goes almost to the same extent as the console (the last few were left off on mine - usually a sign the file wasn't fully flushed before being closed, which is exactly what f.close() with the parentheses takes care of). Let's see how it wrote out number 196 that we had trouble with earlier. Press Ctrl+F and search for 196,

  • Now in the text file, that item is listed as 196, bad text

  • Because you printed your results to the console, you could just fix this one manually by copying what's in the console and pasting it into your text file.
    [Image: 196 is an error in the text file, but had a full entry in the console, so the console text can be copied into the text file.]

  • Before we decide that we're just going to do this all by hand, let's see how many times this error pops up in our file. If it's happening a significant chunk of the time, we might decide to do some more fiddling with our code to make sure this error doesn't pop up so frequently. Do a search in your API_response_revision.txt file for , bad text and click on Find all in Current document.

  • The window that opens below shows how many times that phrase appears (33) and which lines it appears on. For me, 33 things to fix by hand with a simple copy and paste out of 10,000 results is efficient enough, but if that number were a few hundred, then I might start poking around on Stack Exchange to figure out how to get my code to write accent marks into a text file.
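For what it's worth, the usual fix for this error in Python 2 is to encode the text as UTF-8 before writing it, so accented characters survive the trip into the file. That would be a one-line change to the f.write call inside the try block; a sketch worth testing on your own results:

    # Encode the whole assembled line as UTF-8 bytes before writing it,
    # so characters like an accented e no longer raise UnicodeEncodeError.
    f.write((" |" + str(i) + "," + name + ", " + fullURL).encode("utf-8"))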

With this section of the tutorial, you can now use Python to query an API, print those results to a console, parse through those results to pull just the data that you want, and then finally write that data to a text file that you can then easily edit for use with the next phase of your project. Keep in mind, this particular API was very user friendly. It had an interactive version you could try out on the website. It didn't require you to get a key. Finally, it let you make queries of it directly with Python. Not all APIs that you want to get data from will be that straightforward.

In the next tab, we'll use a different API, Comicvine, that requires you to use what's called a wrapper to make queries to it with code, and requires you to get a key to make those queries.

Below is the Jupyter notebook file that I used for this section of the tutorial, if you want to see how I did it.

Learning Goals

In the previous sections of this tutorial, you learned how to use Python to query an API, sort through the results, and write them to a file. However, not all APIs are set up to take direct Python calls. In this case, you'll be accessing an API that doesn't accept requests from Python, so you'll use a wrapper that someone created for it: you write Python code on your end, and the wrapper converts it into a query that the API will accept. Incidentally, this API also requires a key, so you'll also find out how to query APIs that require one.

In this section, we'll be working with the API for the website Comicvine in order to be able to use the site to look up a long list of writers and artists and get a list of those creatives' genders.

We'll get an API key from Comicvine and do a couple of manual queries to ensure that it has the information that we need. After that, we'll write code using a Python wrapper called pycomicvine, written by users of the database website, since querying it directly with Python doesn't work - as far as I can tell from their message boards, due to concerns about excess scraping slowing down the site. We'll install the code from GitHub, then use the pycomicvine wrapper to feed the API a list of names, and our code will look up those people and tell us their genders.

Getting Started

  • The first thing that you'll want to do if you are looking to use a site's API to harvest its data is make sure that it collects the data that you want to use. After all, if Comicvine collected artists' and writers' names and birthdates, but didn't have any standardized place or way to record their gender, then I'd have to look for that information somewhere that did store it in a standardized fashion. Let's go to the main site at https://comicvine.gamespot.com/ and use the search bar to look for a comic book writer whose gender I already know: Gail Simone, who is a woman.
  • She comes up in the search suggestions in the category of Person. Select the link to the page to see if the data that you want is there.
  • Reading the page as humans, we get the answer from the headlining paragraph and its use of "she" when summarizing Simone's career. However, we want to set up a script that can have a machine read a whole group of these pages, and so the information needs to be stored in the same consistent way across every page of its type. Fortunately, as we scroll down her page, there is a box marked General Information, with a group of properties that Comicvine records for each Person and the categories or values it has matched to those properties. We'll see, fortunately for our project, that gender is one of the properties the database records.
    [Image: The General Information box for Gail Simone.]
  • It looks like this is recorded in a somewhat standardized way, since the box is set apart from the rest of the page. Try clicking on some of the other names highlighted in her summary (Grant Morrison, Nicola Scott, Geoff Johns) and confirm that their pages also have the same looking General Information box with the same uniform properties recorded as well. That makes it seem like everyone with a Person category page will also have this same box with the same information listed.
  • Let's go to the API's documentation page to try and confirm that this is truly a property that we can request information on for any article on the site in the Person category. Just because a site has a standard way of listing data, doesn't mean that it will allow any user to look up that data.
    https://comicvine.gamespot.com/api/
  • At this API's documentation page, you'll need to get yourself an API key. This key will need to be included in all your code; it's your unique key that lets you into the database. Some sites use keys so they can monitor how much data each person is pulling and see what kind of activity is being undertaken.
  • Click on Sign Up, and after you put in a bit of information, including an email address, you'll receive your key via email. The API key will be a long string of letters and numbers. Save or star that email and/or save your key somewhere you'll be able to find it.
  • Now log into the page with your new username and password and then click on the link to List of Resources that will take you here:
    https://comicvine.gamespot.com/api/documentation. This is where you will find out what categories of searches you can do with the API and what kind of information will be available for you.

Getting Familiar with the API

  • Scroll through the documentation, to see what sort of actions and queries we can make with this API. From your previous look at the site, we know that the article for the sample creator we picked is on a page in the category of person and that whether they were male or female was in the General Information box as Gender. Click on the right hand side of the page to proceed to searches that can be made for a person article.
    [Image: The options for kinds of pages that can be searched, and what can be searched for within them.]
  • There are two categories of information here: Filters, the parameters that can be used in the search, and Fields, the information that you'll receive when you make a search.
  • Under person, we see that gender is included as one of the fields that a search returns, as is name, which will let us connect those two pieces of information about the creators we are looking up. However, it isn't clear from the Filters column how to search by name.
  • Scroll down to the documentation for the search term for people, and you'll see that its filters section is a lot larger and lets you filter by any of the different fields.
    [Image: The filters that can be applied for the category of people and what they entail. Name is one of the options.]
  • Since we can filter by name with the people search option, that will be the one that you use to make your API query with.

Constructing the API Query

  • Open a plain text editor like Notepad++ and use it to construct the query: the URL we'll send to the API in order to get our desired response.
  • Start with the below, which points the query at the API action that we want - the one that searches for information on people.
    https://comicvine.gamespot.com/api/people/?
  • After the ?, add in your API key formatted as follows, substituting API_KEY_GOES_HERE with your key, and then separating it from the next bit with an ampersand (&).
    api_key=API_KEY_GOES_HERE&
  • We'll do a filter search for the name of the person we are interested in: start with filter=, then the filter we want to use, name:, and after the colon put in the name that we want, Gail Simone, substituting %20 (the URL encoding for a space) for the space. Separate this section from the next part of the url with an ampersand.
    filter=name:Gail%20Simone&
  • For the last part, we'll specify the format of the return we want with
    format=json
  • Copy the URL that you made, place it in a new tab's address bar, and press enter. It should look like the below, but with your API key where it says API_KEY_GOES_HERE. Your browser screen should fill with JSON. This is similar to the category search that we made in the last section; we've just added the use of an API key.
    https://comicvine.gamespot.com/api/people/?api_key=API_KEY_GOES_HERE&filter=name:Gail%20Simone&format=json
  • The JSON is pretty dense, but if you use Ctrl+F to find the word gender, you'll see that it's in there, listed as 2. Databases often store categorical values as numeric codes, so you can make the educated guess that 2 means female.
    [Image: The JSON results; they do contain information on gender.]
  • To see this a little clearer, let's take this script, and put it into the JSON to CSV converter that we used in the first part of this tutorial at:
    http://www.convertcsv.com/json-to-csv.htm
  • When we do this, we see a single row with many columns; scroll over to the one labeled gender and you'll see that it is marked 2.
    The CSV table version of the JSON results.
  • The documentation for person said that the gender options were Male, Female, and Other. Considering that Gail Simone, a woman, got 2, could the coding for these attributes be Male (1) and Female (2), with Other as something else?
  • To test this, let's swap out Gail's name for that of a male comic book writer and try this URL:
    https://comicvine.gamespot.com/api/people/?api_key=API_KEY_GOES_HERE&filter=name:Warren%20Ellis&format=json
  • A new crop of JSON pops up on the screen. Do a search for gender; within the code, gender is marked as 1. We could run through several other creators whose gender we know to test this theory, but for now we'll proceed on the assumption that 1 is used for male creators and 2 for female.
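As an aside, since the query is just text pieced together from the API's base address, your key, a filter, and a format, you can assemble the same URL with a few lines of Python. This is a minimal sketch in the same style as the rest of this tutorial; the key below is a placeholder for your own.

# Build the Comicvine people query from its parts.
base_url = "https://comicvine.gamespot.com/api/people/?"
api_key = "API_KEY_GOES_HERE"    # placeholder -- substitute your own key
name = "Gail Simone"

# %20 is the URL encoding for a space, as described above.
query = (base_url + "api_key=" + api_key
         + "&filter=name:" + name.replace(" ", "%20")
         + "&format=json")
print query    # paste the printed URL into your browser's address bar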

After going through these steps, it seems this resource can get us the information we need, so it's time to automate it, right? Unfortunately, when I went through the same steps as in the last section to automate the process of querying the API, I found my requests blocked, even though the URL I had generated still worked when I pasted it back into the browser's address bar.

At most levels of DAsH coding or other programming-related work, you are going to run into issues, and code work is just as much about knowing how to ask for and look for help as it is about the skills and tricks you already know. So I went to the developer forum for the Comicvine API, saw that other people had experienced this issue, and found that one of the answers contained a link to a Python wrapper called pycomicvine. Its modules let users call simple functions that automate the process of searching the API. So I decided to download and use that wrapper.

Installing the Pycomicvine Wrapper

A Python wrapper is a set of modules someone creates to simplify the code users have to write on their end to get their desired result. The Comicvine API was tricky to use with Python directly, so someone wrote a wrapper that lets the user issue simple commands; functions inside the wrapper then jump through the necessary hoops to query the API the way the user wanted. It's a kind of mediator between the user and the API. This wrapper package lives on a site called GitHub, where people post code they have written to share it publicly so that other people can use it (and test it too).

  • Go to the Github site at:
    https://github.com/miri64/pycomicvine
  • Scroll down through the ReadMe to see the usage; you can see that it uses the same filters as the regular API, and there are a number of sample use cases. This is the documentation you'll consult when figuring out how to use the code, but for now you're just checking, before you bother downloading and installing, that it does what you need: let you type a few commands on your end and run the API query you want.
  • Download the zip file of the code by clicking the Clone or download button and choosing Download ZIP.
  • When the download finishes, move the pycomicvine folder inside the zip file into the folder your Jupyter Notebook opens onto when you start the program (the Home folder for Jupyter Notebook). This ensures Python can find the module when you use it to send requests to the API.
  • Go into the pycomicvine-master folder and open a new notebook.
  • Execute
    import pycomicvine

If this works without sparking an error, you can start using the wrapper. If not, make sure your notebook is open in the folder containing pycomicvine. In principle the module should be installable and usable from anywhere on your computer, but depending on their setup, different machines can be buggy about how they handle importing user-added modules.
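If the import keeps failing no matter where your notebook is open, one common workaround is to tell Python explicitly where the folder lives before importing. This is just a sketch; the path below is hypothetical, so substitute wherever you actually put pycomicvine-master.

import sys

# Hypothetical location -- point this at your own pycomicvine-master folder.
sys.path.insert(0, "C:/Users/you/Downloads/pycomicvine-master")

import pycomicvine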

Using the Comicvine Package

In this section, you'll use the pycomicvine package to query the Comicvine database to find different comics creatives and their gender.

  • Let's start by looking at the ReadMe section of the GitHub site for the package we just downloaded. When people upload a code package to GitHub, they also create a ReadMe, which can vary in length and thoroughness but gives an overview of how to format code from the package and what its capabilities are.
  • After the installation information, the Usage section of the ReadMe explains the crucial first steps: how to import the module and how to add your API key to your queries.
    import pycomicvine
    pycomicvine.api_key = "YOUR_API_KEY_HERE"
  • The documentation explains that you can use the same filter=name: parameter you used in the manual query, and that you can also search by an article's ID. When you're searching a list resource (e.g. Objects, Characters), you can search by name. When you're searching a singular resource (e.g. Object, Character), you'll need that resource's ID.
    Instructions on the Github about how to do queries
  • There is also sample code showing how to move down levels in the JSON to get the data stored within each item: the name, volume, etc. This will come in handy, as we eventually want to search the People resource for the writer or artist we are looking for and then find their gender.
    Example from the GitHub of defining an issue variable from an ID, then using built-in attributes such as .name to get the name.
  • There are also instructions on how to do each different kind of search (for People, Character, Issue, Object, etc.) using different classes.
  • From the examples in this ReadMe, let's construct our own queries. We'll start by importing pycomicvine and adding in the API key.
    import pycomicvine
    pycomicvine.api_key = "YOUR_API_KEY_HERE"
  • Let's first do a search for the character Harley Quinn. We'll start using the Characters resource, because we are trying to filter through a list
    pycomicvine.Characters(filter='name:Harley Quinn')
  • This will return this object
    Characters[<Character: Harley Quinn [1696]>]
  • Now that we have the ID for this article, we can do a more specific search with the Character class by putting the ID in as the parameter to be searched. Let's do that and store the result in the variable Harley.
    Harley = pycomicvine.Character(1696)
  • Once we have the results of that search stored as a variable, we can query any of the attributes available for that resource. You can find the possibilities listed in the API documentation on the site. We'll keep going with this character for now, since there are a lot of interesting attributes to explore, and we'll get back to searching for creators shortly.
    A list in the documentation of the different attributes that can be searched for a character object.
     
  • As we saw in the ReadMe instructions, the way to access any of the attributes of this Character object is to place the name of the attribute after the variable, separated by a period. pycomicvine will then return the information stored for that attribute. For example, to find which issue was this character's first appearance, we'd execute this code:
    Harley.first_appeared_in_issue
  • Executing this code gives you the information on that first issue. From here you could decide to follow that thread and use the issue's ID number to find out more about it.
    <Issue: Batgirl: Day One #12 [37736]>
  • Let's create a variable to store the data and then do a query on the Issue. The object that came back started with Issue:, so you know it's that sort of resource.
    Harley_Issue1 = pycomicvine.Issue(37736)
  • We can now look at the attributes available for an Issue and see that one of them, cover_date, is "The publish date printed on the cover of an issue." Just as above, we can search within our variable for that attribute by connecting the variable we want with the attribute we want.
    Harley_Issue1.cover_date
  • Which returns
    datetime.datetime(1993, 9, 1, 0, 0)
  • This process lets us search the list resource for a category (Issues, People) to find an article's ID, use that ID to pull the specific singular resource we are looking for (Issue, Person), store it as a variable, and then call and print the attributes we want for that resource. To continue with our scenario, let's do a search in the People category for a certain writer.
  • We'll start with the variable we want to use, Creator, and set it equal to the pycomicvine function that does a filtered search in People for someone named Kelly Thompson. We'll then print the results.
    Creator = pycomicvine.People(filter="name:Kelly Thompson")
    print Creator
  • The return below gives the same name, so we know it found a match, and between the brackets it gives us the ID number of the page connected to this person.
    [Kelly Thompson [83841]]
  • As we did above, we'll take that ID and use it as the parameter when looking up the data we want from that Person resource. In this case we want the gender, so we use that field name to call the information in that field.
    creatorGender = pycomicvine.Person(83841).gender
    print creatorGender
  • When I do this search, instead of getting a 1 or 2, I get the symbol for woman. Your results may vary; the Jupyter notebook is somewhat inconsistent about whether it displays the number or the symbol. Don't worry: when we later rewrite this script to send the data to a file, it will use the 1 and 2 coding.
  • Let's try this with a creator we already know is male, to make sure we get a different symbol back. We'll format the inquiry the same way, just adding a 2 to the end of each variable name, so: Creator2 and creatorGender2.
    Creator2 = pycomicvine.People(filter="name:Eduardo Risso")
    print Creator2
  • This yields a return of:
    [Eduardo Risso [18608]]
  • Next, take the returned ID for Eduardo Risso, search for the gender of the Person resource with that ID, and print the gender.
    creatorGender2 = pycomicvine.Person(18608).gender
    print creatorGender2
  • This returns the symbol for male, so we've confirmed that this kind of search works to get us the data that we want.
  • Let's also get the URL for this person's page on the site; that way we can later scrape any data we want from the web page, for anything the API won't give us. We'll search for Eduardo Risso's data again, formatted just like the gender search, but since we want the URL we'll use site_detail_url as the attribute.
    Creator2URL = pycomicvine.Person(18608).site_detail_url
    print Creator2URL
  • This returns the URL for this person
    https://comicvine.gamespot.com/eduardo-risso/4040-18608/
  • We now know that the way to get the data we want is to look up a person by name, get their ID, and then use that ID to get more information about them. (The sketch below pulls these steps together in one place.) That's the process we'll automate with a script written for this Python wrapper. If you want examples of how I formatted this script, please see the Jupyter notebook below, but note that simply executing the scripts won't work unless you put in your own API key at the top where it says "YOUR_API_KEY_HERE".
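Pulled together in one place, the name-to-gender chain we just walked through looks like the sketch below, using the Eduardo Risso values from our test run; swap in your own API key before running it.

import pycomicvine

pycomicvine.api_key = "YOUR_API_KEY_HERE"

# Step 1: search the People list resource by name to find the article's ID.
print pycomicvine.People(filter="name:Eduardo Risso")    # [Eduardo Risso [18608]]

# Step 2: use that ID to pull the singular Person resource.
creator = pycomicvine.Person(18608)

# Step 3: call the fields we want from that resource.
print creator.gender             # 1 (male) / 2 (female), per the documentation
print creator.site_detail_url    # the article's URL on the site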

Automating with a Python Script

We've found a way to use the pycomicvine wrapper to look up the creators whose gender we want to know, and it has worked for the creators we've tried. Using the commands in pycomicvine, we can search for a name in the People resource, take the ID we get from that, use it as a parameter to download the Person resource, and finally call the data in the gender and site_detail_url fields for that person.

Now we want to try and automate that process by creating a script that will take a single name from a group of names, go through that exact process, print the result, and then move on to the next name.

  • Open a new Jupyter notebook in the same folder your pycomicvine is in. Start your script by importing pycomicvine along with two other modules. The first is time: since we'll be feeding in a list of names, we want to make sure the server doesn't get annoyed by all the requests and boot us, and time gives us a command for telling the script to pause between lines of code. The second is re, which lets us use regular expressions (regex), a way of having Python look for text patterns. More on that later.
    import pycomicvine
    import time
    import re
  • Next, set up your API key like so:
    pycomicvine.api_key = "YOUR_API_KEY_HERE"
  • Eventually, when you write a script that takes a list, you'll want each name to be represented by a variable (so you can iterate through the list of names you feed the script). So let's automate the list of tasks for one person and have her name be represented by the variable Artist.
    Artist = "Fiona Staples"
  • First, let's query the database for her name. In the previous exercise you did this by using the .People function to filter for the name of the person you were looking for, written as "name:Kelly Thompson"; this time, let's substitute the variable Artist for the quoted name.
    creator = pycomicvine.People(filter="name:"+Artist)
    print creator
  • This gives you the response of
    [Fiona Staples [52884]]
  • Last time, we manually cut and pasted the ID number between the brackets and used it as the parameter in the .Person search so we could find the gender value for that person. Since we want to automate this process, we want the code to grab the number between the two brackets for us and turn it into a variable we can use in the next step.

To do this, we can use what's called a regular expression, or regex. Put succinctly, a regular expression is a way of telling our script what pattern of characters we are looking for. We could ask for something in all caps, for words that start with B, or for the first two words.

In this case we could ask for the third word, but that wouldn't work reliably: some artists have one name, some have three, and then what the script prints out as the ID would be wrong. We do know from our previous trial runs that the ID consists only of numbers, no letters or spaces, and that it sits between two brackets.

We are going to write an expression that will look at the return below, which comes back when we query the database, and know to give us just the ID number.

[Fiona Staples [52884]]
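To preview what that expression will do, here is a small standalone demo you can run on the sample return above; each piece of the pattern is explained step by step in the list that follows.

import re

s = "[Fiona Staples [52884]]"        # the search return, converted to a string
m = re.search("\[([0-9]+)\]", s)     # digits sitting between a [ and a ]

print m.group(0)    # [52884] -- the whole match, brackets included
print m.group(1)    # 52884   -- just the digits captured by the parentheses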

  • You've already imported the re module, which is what Python uses to work with regular expressions. You'll need to convert your result to a string, since regular expressions will only search a string and the return from the API is an object. We can convert it with the str() function and store the result as the variable s.
    s = str(creator)
  • We'll use the re.search function to search for a regular expression. This function takes two parameters: the regular expression representing the text pattern you are searching for, and the text string or variable (in this case s) that you want to search.
  • We are looking for numeric characters placed within brackets. For the numeric characters we can use
    ([0-9]+), which says we want any number of characters in a row, but only characters that are digits.
    The brackets are special characters: in a regular expression, [ and ] are used to mean something else. You'll have to "escape" them with \ so the code knows you mean an actual [ or ], not what [ or ] means in a regular expression. The whole expression goes in quotation marks, and we'll store the result as the variable m. Put together, it looks like this:
    m = re.search("\[([0-9]+)\]", s)
  • To make sure you got it right, tell it to print the new variable m, but only group 1. Group 0 is the entire match, brackets included, so it would give you [52884] instead of 52884; group 1 is just the digits captured by the parentheses.
    print m.group(1)
  • You should get the number that was between the brackets. This is the same number that you had to cut and paste by hand in the previous attempt at coding, but now you've used a regular expression search to find it, which means that this is another step you can complete using variables and Python instead of doing it by hand.
  • Before, we were formatting the next part of the search like the below
    creatorGender2 = pycomicvine.Person(ID#here).gender
  • So let's try doing that but just using the expression that just printed out the ID number for us as what goes in as the parameter.
    creatorGender = pycomicvine.Person(m.group(1)).gender
  • Unfortunately, that gives us an error, because .Person needs to read the ID as a number, not a string, and we had just converted the results to a string so that we could run a regular expression search over them.
    Error message in console: ValueError: Unknown format code 'd' for object of type 'str'
  • No worries: just as we converted the response to a string using str(), we can convert the m.group(1) result to a number with int(). Let's do that, store it as a variable called creator_id, and print it so we're sure we have it right.
    creator_id = int(m.group(1))
    print creator_id
  • Now let's try putting creator_id in the same place within our code that we know will search for a Person by ID and print out their gender.
    creatorGender = pycomicvine.Person(creator_id).gender
    print creatorGender
  • This will give us the same result, printing out the Unicode symbol for woman.
  • Let's try the same variable creator_id as the parameter for the search that finds and prints the URL for this person's page on the site.
    creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
    print creatorCV_URL

You've now created an automated process. To test it, go back up to the line where you defined the Artist variable as "Fiona Staples" and switch in a male artist's name to see whether executing the script gives different results; I used "Jack Kirby". Execute each step of the code again, and you'll see all the results change. All you had to change was the one name stored in Artist, yet each step you set up along the way worked exactly as it did while you were writing and testing the code and gave you the answers for the new person. (The assembled single-name script appears below.) Now we're ready to set up a for loop to cycle through a list of names. You can also see the script in my Jupyter notebook below if yours hasn't been working; you'll have to replace YOUR_API_KEY_HERE with your API key for it to work, however.
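For reference, here is the single-name version assembled into one cell, exactly as built step by step above; replace YOUR_API_KEY_HERE with your key, and change the name stored in Artist to test other creators.

import pycomicvine
import re
import time

pycomicvine.api_key = "YOUR_API_KEY_HERE"

Artist = "Fiona Staples"                               # change this name to test others

creator = pycomicvine.People(filter="name:"+Artist)    # search the People list resource
s = str(creator)                                       # regex can only search a string
m = re.search("\[([0-9]+)\]", s)                       # grab the digits between brackets
creator_id = int(m.group(1))                           # .Person() needs a number, not a string

creatorGender = pycomicvine.Person(creator_id).gender
creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
print Artist, creatorGender, creatorCV_URL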

Scaling Up Your Python Script

  • First, let's open a new Jupyter notebook and start our code by importing the same modules we used to build the other script, along with the API key.
    import pycomicvine
    import re
    import time

    pycomicvine.api_key = "YOUR_API_KEY_HERE"
  • While our previous commands and scripts for the Comicvine API all worked on one name at a time, let's now create one that goes through a whole batch of names. We'll give it that batch as a Python list stored in the variable creatorNames. Each name is in quotation marks and separated from the next by a comma; keep them all on the same line, and make sure you close the bracket.
    creatorNames = ["Fiona Staples", "Stan Lee", "Brian K. Vaughan","Steve Ditko", "Chelsea Cain","Jack Kirby", "A.J. Jothikumar","Abigail Jill Harding", "Gail Simone", "Matt Fraction"]
  • We'll perform on each of these names the same actions we perfected in the last script, and we'll do so within a for loop that iterates through the list and runs the same steps on each name. The loop variable creatorName will stand in for each individual name in turn.
    for creatorName in creatorNames:
  • We'll do the search through the People resource to find the individual article that matches the person.
    creator = pycomicvine.People(filter="name:"+creatorName)
  • Transform that search result into a string with str(). Then use a regular expression search to find the ID in that string; this works because the ID will consistently be numeric characters between brackets, and we can use regex to search for exactly that.
     s = str(creator)
     m = re.search("\[([0-9]+)\]", s)
  • Transform the result of the search for the ID# into an integer.
    creator_id = int(m.group(1))
  • creator_id can then be used to look up the gender and site URL associated with the Person carrying that ID, and to print out the results. Other than the addition of the creatorName variable and the for loop, this is all the same code you ran before, so we've tested it and know it works.
    creatorGender = pycomicvine.Person(creator_id).gender
    creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
    print creatorName, creatorGender, creatorCV_URL
  • Make sure you have it set up as follows, and execute

for creatorName in creatorNames:
    creator = pycomicvine.People(filter="name:"+creatorName)
    s = str(creator)
    m = re.search("\[([0-9]+)\]", s)
    creator_id = int(m.group(1))
    creatorGender = pycomicvine.Person(creator_id).gender
    creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
    print creatorName, creatorGender, creatorCV_URL

  • The code runs fine at first (though again I'm getting the symbols instead of 1 or 2), but then it hits an error and shuts down, judging by the error message, probably because it ran into a name that isn't present in the database.
    Console has some results printed to it, and then AttributeError: 'NoneType' object has no attribute 'group'
  • As we did before, let's set up a try/except block. This gives our code an alternative bit of code to execute if it runs into any issue. First, indent our code block one level inside the for loop and put the word try: at its beginning.
    for creatorName in creatorNames:
        try:
            creator = pycomicvine.People(filter="name:"+creatorName)
            s = str(creator)
            m = re.search("\[([0-9]+)\]", s)
            creator_id = int(m.group(1))
            creatorGender = pycomicvine.Person(creator_id).gender
            creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
            print creatorName, creatorGender, creatorCV_URL
  • At the same indent level as try:, add a line that starts with except:. On the next line, we'll give the code something else to do when the code following try: fails: in this case, print the name followed by the words not found. That way we'll have a record of which matches did not work out, so we can find those names elsewhere.
        except:
            print creatorName + ", not found"
  • Now the code prints out results for each of the names on the list (even the ones it didn't find a match for) without sparking any errors. If you'd rather not rely on a bare except:, see the narrower sketch below.
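One caution: a bare except: will also hide unrelated problems (a typo, a dropped connection) behind the same not-found message. A narrower alternative, shown here only as a sketch and not what my notebook below uses, is to check whether the regex actually matched before moving on.

import pycomicvine
import re

pycomicvine.api_key = "YOUR_API_KEY_HERE"
creatorNames = ["Fiona Staples", "Jack Kirby", "A.J. Jothikumar"]    # any list of names

for creatorName in creatorNames:
    creator = pycomicvine.People(filter="name:"+creatorName)
    m = re.search("\[([0-9]+)\]", str(creator))
    if m is None:                          # no bracketed ID in the return means no match
        print creatorName + ", not found"
        continue
    creator_id = int(m.group(1))
    print creatorName, pycomicvine.Person(creator_id).gender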

In the next section, we'll move on to outputting this data to a file. If you had trouble with this part of the exercise, see the Jupyter notebook below.

Exporting Results to a Text File

Having the results written to a text file will get rid of the symbols, since that kind of file only accepts plain text. And once your data is in a file, you can work with it however you want.

  • Open a new notebook; import the modules you'll need to execute your code, add in your API key, and copy in the creatorNames list we previously used. You are adding the module time, so you can tell your code to pause, and the module csv, which helps keep writing to a file orderly (though this example will write its lines directly).

    import pycomicvine
    import re
    import time
    import csv
    pycomicvine.api_key = "YOUR_API_GOES_HERE"

    creatorNames = ["Fiona Staples","Stan Lee","Brian K. Vaughan","Steve Ditko","Chelsea Cain","Jack Kirby","A.J. Jothikumar","Abigail Jill Harding","Gail Simone","Matt Fraction"]

  • Above the for loop, add a with statement that opens the file you want to append ("a") your data to and stores it as the variable f. You can name the file whatever you want, as long as it ends with .txt.
    with open('Results_Comicvine.txt','a') as f:

  • From then on, the code stays the same up until the end of the try statement. After all, you're not changing anything about where and how you get your information; you're only adding another command to write the results to a file instead of just printing them to the console. Note that the for loop now sits one indent level inside the with statement.
        for creatorName in creatorNames:
            try:
                creator = pycomicvine.People(filter="name:"+creatorName)
                s = str(creator)
                m = re.search("\[([0-9]+)\]", s)
                creator_id = int(m.group(1))
                creatorGender = pycomicvine.Person(creator_id).gender
                creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url

  • Above your print command, add a line of code that writes the results to the file with the command f.writelines().
    There are a few differences between formatting what you write to the file and formatting the print command. You'll need a str() conversion before each variable, because you'll be mixing those variables with text: the "," that you'll want separating each value in your file. You'll connect each piece with a + instead of a comma, add a "," between each term, and put a "\n" at the end of what you export to the file.
    \n is the code for starting a new line, and you'll want to be able to tell each result apart. The end of the try block of your for loop should now look like this.
                f.writelines(str(creatorName) + "," + str(creatorGender) + "," + str(creatorCV_URL) + '\n')
                print creatorName, creatorGender, creatorCV_URL

  • At the bottom of this block, add the time.sleep() function, since we've been playing around with this site's API for a while and don't want to get booted off their server.
    time.sleep(1)
  • We'll change the except statement in a similar way. Below its print command, we'll add another writelines() command. This time, what we tell it to write will consist of just our creatorName variable, the phrase "not found", and "\n". That way we have a record of the result for each name, even if the result was that we didn't find it.
    At the end, we'll add the time.sleep() function, so the script waits a second, whether or not it found the name, before continuing through the rest of the code.
                f.writelines(str(creatorName) + ", not found \n")
                time.sleep(1)
  • Run your program and you'll see the names and results appear in your console.
  • Open the file Results_Comicvine.txt (the name you gave it in the with statement) and you should see one comma-separated line per name: the name, the gender code, and the URL.
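Assembled from the steps above, the whole export script should look like the below; replace YOUR_API_KEY_HERE with your key before running it.

import pycomicvine
import re
import time
import csv    # imported per the steps above, though this example writes its lines directly

pycomicvine.api_key = "YOUR_API_KEY_HERE"

creatorNames = ["Fiona Staples","Stan Lee","Brian K. Vaughan","Steve Ditko","Chelsea Cain","Jack Kirby","A.J. Jothikumar","Abigail Jill Harding","Gail Simone","Matt Fraction"]

with open('Results_Comicvine.txt','a') as f:
    for creatorName in creatorNames:
        try:
            creator = pycomicvine.People(filter="name:"+creatorName)
            s = str(creator)
            m = re.search("\[([0-9]+)\]", s)
            creator_id = int(m.group(1))
            creatorGender = pycomicvine.Person(creator_id).gender
            creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
            f.writelines(str(creatorName) + "," + str(creatorGender) + "," + str(creatorCV_URL) + '\n')
            print creatorName, creatorGender, creatorCV_URL
            time.sleep(1)
        except:
            print creatorName + ", not found"
            f.writelines(str(creatorName) + ", not found \n")
            time.sleep(1)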

Below is the Jupyter Notebook file that I created, in case you're having trouble. Note that you'll have to replace YOUR_API_KEY_HERE with your API key.
With these tutorials, you now know how to query an API, whether or not it has an interactive way of doing so on the site. You can automate the process with Python, using a wrapper if necessary, and print the results to a file. I've shown you how to do this on a few sites, but the principles will be similar for any other API. You just need that API's documentation to figure out how to format the query you need, and request away.