API is the abbreviation for application programming interface. Large companies and websites build these interfaces on top of the servers where they store their data so that the data can be accessed and queried quickly. Sometimes these groups make their interface public, and sometimes they sell access to it. Publicly accessible APIs can be a great place to pull data from.
For example, suppose you wanted to search Twitter to see how many people tweeted the hashtag #manhattanhenge on the date of that astronomical event each year. You could run Twitter's regular advanced search interface and copy and paste all of the results manually. But to do this efficiently, you could use Twitter's API to pull this information for you and return it in a format (usually JSON) that you can graph or otherwise analyze.
Some people even use these APIs to build their own programs, if they know how to code. For instance, someone at a ticketing company could build a program that displays the forecast for the dates of outdoor events with tickets on sale. The program would take the event's zip code, date, and time, query the National Weather Service's API with that information to see what the weather is expected to be at that location on that date, and then display the forecast to a user looking to buy tickets.
With these tutorials, you'll see how to access an API using a web interface, how to query one manually with a URL, how to use a Python script to query one, and how to use a Python wrapper or library that someone else built for a specific database. These exercises are designed for someone who is already familiar with the basics of Python, so for them to be useful to you, you'll need to know how to write and read basic Python code. You can learn basic Python for free at Codecademy, or, if you have a New York Public Library card, through the Python tutorials on Lynda.
Some websites built on databases make their API available through a web interface, which makes it easier for someone who is less code-savvy to get information from the site. An API is queried by feeding it a URL containing the keywords or categories to search for, along with any other parameters needed to narrow the search.
An API query looks something like the example below: it consists of the API's URL, plus additional information about what kind of search to do, what to search for, and any parameters or limits.
https://marvel.fandom.com/api/v1/Articles/List?category=X-Men_members&limit=25
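A query URL like this can also be assembled in code rather than typed by hand. Below is a minimal sketch (written in Python 3, using only the standard library) that builds the same URL from the base address and a dictionary of parameters taken from the example above:

```python
from urllib.parse import urlencode

# The base URL and the two parameters from the example query above
base_url = "https://marvel.fandom.com/api/v1/Articles/List?"
params = {"category": "X-Men_members", "limit": 25}

# urlencode turns the dictionary into "category=X-Men_members&limit=25"
query_url = base_url + urlencode(params)
print(query_url)
# https://marvel.fandom.com/api/v1/Articles/List?category=X-Men_members&limit=25
```

Building the URL from a dictionary this way means you can swap in a different category or limit without worrying about the `?`, `=`, and `&` punctuation.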
If there is a web interface, people querying the API don't need to know how to format a query URL or how to automate the process; they can just fill in some boxes and the site does the rest. Some APIs also require a key; for this example you won't need one. For this exercise, we'll use the Marvel fandom wiki's API to pull all the articles for comic book issues that came out in 2018. This is something I did for a large-scale project about comics: I grabbed these articles' URLs so I could then scrape information from the article pages. For more information about web scraping, see the DAsH web-scraping tutorial.
You've just used this website's API interface to get a list of all of the URLs for the comics that came out in 2018. Next we'll see how to make a similar query without having to go through the full interface to do so.
In the next section, we'll explore how to use Python to construct an API query. We'll see how Python lets you pull only certain parts of the JSON results, such as a title and a URL, and leave off the extra information, giving you data that is easier to use.
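As a preview of what that trimming looks like, here is a small sketch (Python 3) that parses a miniature JSON response and keeps only each item's title and URL. The field names follow the `items` list this API returns, but treat the exact shape as an assumption; the real response carries many more fields per item.

```python
import json

# A miniature stand-in for the API's JSON results; the real response has
# many more fields per item (the exact shape here is an assumption).
raw = """{"items": [
  {"id": 1, "title": "Uncanny X-Men Vol 1 1", "url": "/wiki/Uncanny_X-Men_Vol_1_1"},
  {"id": 2, "title": "Uncanny X-Men Vol 1 2", "url": "/wiki/Uncanny_X-Men_Vol_1_2"}
]}"""

data = json.loads(raw)

# Keep only each item's title and url, dropping the extra fields
trimmed = [(item["title"], item["url"]) for item in data["items"]]
for title, url in trimmed:
    print(title, url)
```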
In the last exercise, we learned how to use an API that had a built-in interface, but not all are that user friendly. Sometimes you'll have to create your own query URL to feed to the API. You can do this right in the address bar of your browser, as we saw at the end of the last section, but you can also do it with Python. One advantage of using Python is that you can automate the process: you can make a template script you can plug multiple searches into, make multiple queries in succession, or write a script that stores the results of your query as a nicely organized table. We'll be using the Marvel fandom wiki's API again for this exercise, this time to pull a list of female characters' names and article URLs and save them to a file.
To start with, let's go back to the API documentation and make sure that we know all the terms used for the parameters and base URL of the query.
<Response [200]>
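That `<Response [200]>` is what the requests library reports back when a query succeeds: the number is the HTTP status code, and 200 means "OK". A few other codes are worth recognizing when a query goes wrong. The lookup table below is a small illustrative sketch (the code descriptions are standard HTTP, not specific to this API):

```python
# What the number in "<Response [200]>" means: it's the HTTP status code.
# A few common codes you'll run into when querying APIs:
STATUS_MEANINGS = {
    200: "OK - the query succeeded and the body holds your results",
    403: "Forbidden - the API refused the request (often a missing key)",
    404: "Not Found - the base URL or resource path is wrong",
    429: "Too Many Requests - you are querying faster than the API allows",
}

def describe_status(code):
    """Return a human-readable note for an HTTP status code."""
    return STATUS_MEANINGS.get(code, "see an HTTP status code reference")

print(describe_status(200))
```

In practice you would check `r.status_code` on the response object before trying to use its JSON.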
We've used Python to query an API and parse the results into a nice orderly list. But this information now lives only in the console; if the program restarts, it's gone. Additionally, if you wanted another program to look at this information, say a text-analysis program that could tell you which names are most common, you would need it in its own separate file, such as a .txt, .csv, or .xls file. Next, we'll take these results and write them to a file so we have a separate copy of all the information we found.
The above script works great if we just want quick answers in the console. But the next step, if we want to use this wide swath of information in our research project, is to save it to a file. Next, we'll alter the script so that it doesn't just print the organized results of our API query to the console but also writes them to a file.
The console fills with the same information you saw before, but as you scroll you'll see that it hits an error message after item 196.
The only major change from last time is writing our results to a file instead of just printing them to the console, and when we only printed them earlier, the script ran all the way to the end of the list. Let's open the file to see if anything there tells us what went wrong.
Let's revise the script with a try/except block around the part that writes to the file. That way the script can continue through the rest of the results when one of the individual items from the API causes a problem.
import requests
import json

base_article_url = "https://marvel.fandom.com/api/v1/Articles/List?"
f = open("API_response_revision.txt", "a")
ladies = "Female_Characters"
howMany = 10000
payload = {'category': ladies,
           'limit': howMany}
r = requests.get(base_article_url, params=payload)
print r.url
data = r.json()
for i in range(10000):
    name = data['items'][i]['title']
    fullURL = "https://marvel.fandom.com" + data['items'][i]['url']
    print " |" + str(i) + "," + name + ", " + fullURL
    try:
        f.write(" |" + str(i) + "," + name + ", " + fullURL)
    except:
        f.write("|" + str(i) + ", bad text")
f.close()
When we execute this version with the try/except statement, the for loop keeps going until it reaches the end of the range of 10,000 that we set, because we told it what to do when it couldn't write a name and URL to the file the way we wanted.
Navigate to your new file - API_response_revision.txt in Explorer or Finder and right-click on it. Choose Open With > Notepad++ (or your chosen text editor for Mac).
Do the same find and replace that you did before, telling it to find | and replace it with \r\n.
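If you'd rather skip the text editor, the same find-and-replace can be done in a few lines of Python. The sketch below works on a small stand-in file written on the spot (the filename and the sample records are just examples; in practice you'd point it at API_response_revision.txt), and uses "\n" where the editor used "\r\n":

```python
# A tiny sample file in the same shape the script above produces: records
# separated by " |" instead of newlines. The filename is just an example.
sample = (" |0,Aaron Stack, https://marvel.fandom.com/wiki/Aaron_Stack"
          " |1,Abigail Brand, https://marvel.fandom.com/wiki/Abigail_Brand")
with open("API_response_sample.txt", "w") as f:
    f.write(sample)

# The same find-and-replace the editor performs: swap each " |" separator
# for a newline so there is one record per line.
with open("API_response_sample.txt") as f:
    contents = f.read()
with open("API_response_sample.txt", "w") as f:
    f.write(contents.replace(" |", "\n"))
```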
Our file now extends almost as far as the console output (on mine, the last 5 items are left off for some reason). Let's see how it wrote out number 196, which gave us trouble earlier. Press Ctrl+F and search for 196,
Now in the text file, that item is listed as 196, bad text
Because you also printed your results to the console, you can fix this one manually by copying what the console shows and pasting it into your text file.
Before we decide to do this all by hand, let's see how many times this error pops up in our file. If it happens a significant chunk of the time, we might do some more fiddling with our code to keep the error from appearing so often. Do a search in your API_response_revision.txt file for , bad text and click Find All in Current Document.
The window that opens below shows how many times that phrase appears (33) and which lines it appears on. For me, fixing 33 items out of 10,000 results by hand with a simple cut and paste is efficient enough, but if that number were a few hundred, I might start poking around on Stack Exchange to figure out how to get my code to write accent marks into a text file.
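The same tally can be made in Python with a one-line count. In the sketch below, the file is a small stand-in written on the spot so the example is self-contained; in practice you'd open API_response_revision.txt instead:

```python
# Tally the error marker the same way the editor's "Find All" does.
# This stand-in file has one good record and two "bad text" records.
with open("API_response_check.txt", "w") as f:
    f.write("|0, Ms. Marvel, https://marvel.fandom.com/wiki/Ms._Marvel\n"
            "|1, bad text\n"
            "|2, bad text\n")

with open("API_response_check.txt") as f:
    bad_count = f.read().count(", bad text")

print(bad_count)  # 2
```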
With this section of the tutorial, you can now use Python to query an API, print the results to the console, parse them to pull just the data you want, and write that data to a text file you can easily edit for the next phase of your project. Keep in mind that this particular API was very user friendly: it had an interactive version you could try out on the website, it didn't require a key, and it let you make queries directly with Python. Not every API you want data from will be that straightforward.
In the next tab, we'll use a different API, Comicvine, which requires what's called a wrapper to query it with code, and requires you to get a key to make those queries.
Below is the Jupyter notebook file that I used for this section of the tutorial, if you want to see how I did it.
In the previous sections of this tutorial, you learned how to use Python to query an API, sort through the results, and write them to a file. However, not all APIs are set up to take direct Python calls. In this section, you'll access an API that doesn't accept requests straight from Python, so you'll use a wrapper someone created for it: you write Python code on your end, and the wrapper converts it into a query the API will accept. This API also requires a key, so you'll find out how to query APIs that need one of those, too.
In this section, we'll be working with the API for the website Comicvine in order to be able to use the site to look up a long list of writers and artists and get a list of those creatives' genders.
We'll get an API key from Comicvine and do a couple of manual queries to make sure the site has the information we need. After that, we'll write code using a Python wrapper called pycomicvine, written by users of the database website, since querying the API directly with Python doesn't work; as far as I can tell from their message boards, this is due to concerns about excess scraping slowing down the site. We'll install the code from GitHub, then use the pycomicvine wrapper to feed the API a list of names, and our code will look up those people and tell us their genders.
After going through these steps, it seems this resource has the information we need, so it's time to automate, right? Unfortunately, when I went through the same steps as in the last section to automate querying the API, I found my request blocked, even though the URL I had generated worked when I pasted it back into the browser's address bar.
At most levels of DAsH coding or other programming work, you are going to run into issues, and code work is just as much about knowing how to ask for and look for help as it is about the skills and tricks you already know. So I went to the developer forum for the Comicvine API and saw that other people had run into this issue, and one of the answers contained a link to a Python wrapper called pycomicvine, whose modules give users functions that better automate the process of searching the API. So I decided to download and use that wrapper.
A Python wrapper is a set of modules someone creates to simplify the code users have to write on their end to get their desired result. The Comicvine API was tricky to use with Python directly, so someone wrote a wrapper that lets the user issue simple commands; those commands call functions inside the wrapper that jump through the necessary hoops to query the API the way the user wanted. It's a kind of mediator between the user and the API. This wrapper package lives on a site called GitHub, where people post code they have written to share it publicly so that others can use it (and test it, too).
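To make the idea concrete, here is a toy wrapper. Every name in it is hypothetical: fussy_api_query stands in for an API that insists on an awkward query format, and search is the friendly wrapper call that hides that fussiness, the way pycomicvine does for Comicvine.

```python
# A toy illustration of what a wrapper does. Everything here is
# hypothetical; it only mimics the idea, not the real Comicvine API.

def fussy_api_query(raw_query):
    """Stands in for an API that insists on an awkward query format."""
    if not raw_query.startswith("q="):
        raise ValueError("query must look like 'q=<term>&fmt=json'")
    term = raw_query[2:].split("&")[0]
    return {"result": term.title()}

def search(term):
    """The wrapper: a simple call that builds the awkward query for you."""
    return fussy_api_query("q=" + term + "&fmt=json")["result"]

print(search("jack kirby"))  # Jack Kirby
```

The user only ever types `search("jack kirby")`; the query formatting lives inside the wrapper where it can be fixed once for everyone.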
If this runs without raising an error, you can start using the wrapper. If not, make sure your notebook is open in the folder where you installed pycomicvine. The module should be installable and usable from anywhere on your computer, but depending on their setup, different machines can be buggy about how they handle importing user-installed modules.
In this section, you'll use the pycomicvine package to query the Comicvine database to find different comics creatives and their gender.
Characters[<Character: Harley Quinn [1696]>]
<Issue: Batgirl: Day One #12 [37736]>
datetime.datetime(1993, 9, 1, 0, 0)
[Kelly Thompson [83841]]
[Eduardo Risso [18608]]
https://comicvine.gamespot.com/eduardo-risso/4040-18608/
We've found a way to use the pycomicvine wrapper to search for the creators whose gender we want to know, and it has worked for the creators we've tried. Using the commands in pycomicvine, we can search for a name in the People resource, take the ID we get back, use it as a parameter to download that person's Person resource, and finally read the gender and site_detail_url fields for that person.
Now we want to try and automate that process by creating a script that will take a single name from a group of names, go through that exact process, print the result, and then move on to the next name.
[Fiona Staples [52884]]
To do this, we can use what's called a regular expression, or regex. Put succinctly, a regular expression is a way of telling our script what pattern of characters we are looking for: we could ask for something in all caps, for words that start with B, or for the first two words.
In this case we could say that we want the third word, but that wouldn't work: some artists have one name, some have three, and the script would then print out the wrong thing as the ID. We do know from our previous trial runs that the ID consists only of numbers, no letters, no spaces, and that it sits between two brackets.
We are going to write an expression that will look at the query result below and return just the ID portion, the number in the innermost brackets.
[Fiona Staples [52884]]
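Here is that pattern in action (written in Python 3 syntax). The string below is the literal text of the query result shown above. The expression `\[([0-9]+)\]` says: match a literal opening bracket, then one or more digits (captured as a group), then a literal closing bracket. The outer bracket before "Fiona" doesn't match because a letter, not a digit, follows it, so the search lands on the ID.

```python
import re

# The literal text of the query result shown above
result_text = "[Fiona Staples [52884]]"

# \[ and \] match literal brackets; ([0-9]+) captures one or more digits.
m = re.search(r"\[([0-9]+)\]", result_text)

# group(1) is just the captured digits, which int() turns into the ID
creator_id = int(m.group(1))
print(creator_id)  # 52884
```

Because the pattern keys on digits-between-brackets rather than word position, it works whether the name has one word or three.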
You've now created an automated process. To test it, go back up to the line where you defined the Artist variable as "Fiona Staples" and switch in a male artist's name to see whether executing the script gives different results; I used "Jack Kirby". Execute each step of the code again and you'll see all the results change. All you changed was the one name stored in Artist, but each step you set up along the way worked exactly as it did while you were writing and testing the code, and gave you the answers for the new person. Now we're ready to set up a for loop to cycle through a list of names. You can see the script in my Jupyter notebook below if yours hasn't been working; you'll have to replace YOUR_API_KEY_HERE with your own API key.
for creatorName in creatorNames:
    creator = pycomicvine.People(filter="name:"+creatorName)
    s = str(creator)
    m = re.search(r"\[([0-9]+)\]", s)
    creator_id = int(m.group(1))
    creatorGender = pycomicvine.Person(creator_id).gender
    creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
    print creatorName, creatorGender, creatorCV_URL
In the next section, we'll move on to outputting this data to a file. If you had trouble with this part of the exercise, see the Python notebook below.
Having the results written to a plain text file gives you a lasting copy instead of console output that disappears. Additionally, once your data is in a file you can work with it however you want.
import pycomicvine
import re
import time
import csv
pycomicvine.api_key = "YOUR_API_KEY_HERE"
creatorNames = ["Fiona Staples","Stan Lee","Brian K. Vaughan","Steve Ditko","Chelsea Cain","Jack Kirby","A.J. Jothikumar","Abigail Jill Harding","Gail Simone","Matt Fraction"]
Above the for loop, add a with statement that creates a variable f and opens the file you want to append ("a") your data to. You can name the file whatever you want as long as it ends with .txt.
with open('Results_Comicvine.txt','a') as f:
From then on, the code stays the same up until you get to the end of the try statement. After all, you're not changing anything about where and how you are getting your information, you're only adding another command for it to write the results to a file and not just print to the console.
    for creatorName in creatorNames:
        try:
            creator = pycomicvine.People(filter="name:"+creatorName)
            s = str(creator)
            m = re.search(r"\[([0-9]+)\]", s)
            creator_id = int(m.group(1))
            creatorGender = pycomicvine.Person(creator_id).gender
            creatorCV_URL = pycomicvine.Person(creator_id).site_detail_url
Above your print command, add a line of code that writes the results to the file with the command f.writelines().
There are a few differences between formatting what you write to the file and formatting the print command. You'll need a str() conversion before each variable, because you'll be mixing those variables with text: the "," you want separating each value in your file. You'll connect each piece with a + instead of a ,, add a "," between each value, and add a "\n" at the end of what you export to the file.
\n is the code for starting a new line, which lets you distinguish between results. The end of the try statement in your for loop should now look like this.
            f.writelines(str(creatorName) + "," + str(creatorGender) + "," + str(creatorCV_URL) + '\n')
            print creatorName, creatorGender, creatorCV_URL
Below I have the Jupyter Notebook file that I created if you are having trouble. Note that you'll have to replace YOUR_API_KEY_HERE with your API key.
With these tutorials, you know how to query an API, whether or not it offers an interactive way of doing so on its site. You can automate the process with Python, using a wrapper if necessary, and write the results to a file. I've shown you how to do this on a few sites, but the principles will be similar for any other API: you just need its documentation to figure out how to format the query you need, and then request away.