O'Malley Library: DAsH: Getting Plain Text from Databases

Getting Plain Text from Databases

Most of the time when you use our school's databases for research, you're looking for a PDF so you can read an article closely, analyze it individually, and use information or arguments from it in your paper. However, if you want to gather a large number of articles on the same topic so they can be analyzed as a group for patterns, PDFs can be less helpful, because many text-analysis tools only work with text pasted into an interface or with a .txt file. You may also want to gather a large number of articles that match your topic at once rather than download each one individually. For some of the material in our databases, this is possible.

Keep in mind that the tricks shown here tend to work for more modern magazine or newspaper articles. A lot of historical newspapers and magazines exist only as PDFs and have not been transcribed into plain text, so you won't be able to do this with everything you find. But if you wanted to, say, do a text analysis of modern media coverage of an event, or topic model more recent scholarly articles on a topic, this is where you could find source material in a plain-text format.

Downloading Plain Text from Databases As a Batch

Getting Plain Text from EBSCO
Getting Plain Text from ProQuest

Getting Plain Text from EBSCO

EBSCO is the platform behind several of our databases, including OmniFile, EconLit, and Discovery Search. It focuses less on newspapers and more on trade publications, academic journals and magazines, so keep that in mind when deciding whether to use it as the basis for gathering your sources for text analysis.

For this example, we'll see how to collect part of a corpus for comparing how comic books were written about in popular media in the 1990s vs. the 2000s vs. the 2010s. We want to do text analysis with tools like Voyant or IBM's Natural Language Tools demo, but first we need to get a large sample of magazine articles on the subject.

To do this, start with comic books as the subject term and run a search. Next, narrow the publication dates to 1991-1999, limit the results to Full Text, and limit the source types to Trade publications, magazines and reviews. You can do all of this using the filtering options on the left-hand side of the page.

[Screenshot: the filter panel, where Full Text can be selected as a Limit To option, the publication date range can be changed, and source types can be selected.]

Once the filters have been added, the number of results for that subject is 30. Next to each item in the search results there are two icons. The first, a magnifying glass over a sheet of paper, gives you more information about the article when you hover over it. The second is a plus sign atop a folder; if you click on it, the icon changes color and tells you that the item has been added to your folder.

A Folder is a way to make a collection of items in the EBSCO database. Most of the time we'd want to select items individually. However, sometimes we have reason to expect that all the results in a search are valid: the search might be for something very seldom written about, or the result set might be small enough that we've looked at each summary and determined that every item meets the criteria for the dataset. In that case, the full text of all search results can be saved by going to Share at the top of the search results and, underneath Add to Folder, choosing Results (1-30).

It's important to note that these results aren't in a folder permanently unless you've created a profile with EBSCO and are signed in, so don't expect them to always be there. But for the purposes of this example, we'll download them immediately.

Once we've populated the folder with the items we want the text of, select the Folder icon at the top of the screen. You'll see all the items within it. If there are any items you want to leave out, check their boxes and choose Delete Items. For instance, I'm going to delete the one that indicates that it does not, in fact, have full text.

Next, to save all of the items in the folder, check all of the boxes using Select / Deselect All, then go to the side and choose Save as File.

An Export/Save window will pop up with options for how we want to save it. I'll be deselecting HTML link(s) to articles since I plan on processing this data later to remove everything that isn't the text of the article, but depending on whether you'd like to be able to easily get back to the articles that you've found or not, you might leave it checked. Make sure that HTML Full Text (when available) is selected.

Once you have the options that you want selected, choose Save.

Clicking Save will take us to a page with the full text of all the selected articles. There are instructions at the top for how to save the information in different browsers. You can also just highlight and copy everything on this page and paste it into a plain-text editor such as Notepad++. That's what I'll do, since the only option in Firefox and Chrome is to save the page as HTML, which will include HTML code that isn't going to be helpful.
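
If you do end up saving the page from the browser as HTML, the markup can be stripped out afterwards. What follows is a minimal sketch using Python's standard html.parser module. The names _TextExtractor and html_to_text are made up for this illustration, and the sketch assumes nothing about how EBSCO lays out its page beyond it being ordinary HTML.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because each run of text between tags becomes its own line, a sentence containing bold or linked words will be split across several lines. That is usually harmless for word-level text analysis, but it is worth knowing before you read the output.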

You'll notice this text includes publication details and other information about the articles that might get in the way of text analysis, so you may want to delete it. Create a separate copy of this file for any editing like that. If you later decide to sort the information differently or need to go back to the original source, it will be far easier to open your original copy of the data and start a new file than to repeat your search, save the results again and find the information that you need.
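
Making that separate copy can be done by hand in your file manager, but if you are already processing files with a script, a small helper keeps the habit consistent. This is only a sketch: the function name make_working_copy and the "_working" suffix are choices made for this example, not anything EBSCO provides.

```python
import shutil
from pathlib import Path


def make_working_copy(original, suffix="_working"):
    """Copy `original` next to itself and return the path of the copy.

    The original file is never modified. Raises FileExistsError rather
    than overwrite an earlier working copy.
    """
    original = Path(original)
    copy_path = original.with_name(original.stem + suffix + original.suffix)
    if copy_path.exists():
        raise FileExistsError(f"{copy_path} already exists")
    shutil.copy2(original, copy_path)
    return copy_path
```

For a file called ebsco_raw.txt, this would leave ebsco_raw.txt untouched and create ebsco_raw_working.txt beside it for your edits.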

Getting Plain Text from ProQuest

Many of the articles and journals that we have access to reside in our ProQuest database. In this tutorial, you'll find out how to run a search and download the plain text of the results so that you can apply text analysis to them later. This database has scholarly articles and trade journals as well as newspapers and magazines.

For this example, we want a large group of articles to help us see the differences between how climate change was written about in popular news media in the 1980s vs. the 2010s, so we can later feed them into text analysis tools like Voyant.

Let's start on the Advanced Search page for ProQuest and choose Newspapers and Magazines as the Source Type we want. Next, to gather 1980s articles, go to where it says Publication Date and set the Start and End dates to 1980 and 1989. Finally, check the box to ensure that the results are limited to items with full text. Then select Search.

From the results page, use the dropdown on the right to sort by Relevance, then scroll through the first page to make sure nearly all articles seem like they'd be useful texts to include. In this case, most do, so it will be easier to select all of them and then just de-select the ones that aren't actually related afterwards. So, go up to the top of the results and choose the box that says Select 1-20, which checks the box next to all the options on this page.

Next, let's fine-tune by scrolling through the results and de-selecting any that seem unrelated to global climate change by un-checking their boxes. We can see a summary of an article by choosing the link that says Show Abstract. That's how to find out, for example, that this result is actually about how a marathon runner did less well in a climate that was a change from what she was used to, so... not related to the coverage we are looking for.

Once satisfied with the results that are selected on this page, scroll to the bottom and choose the link to page 2.

Follow the same procedure of selecting all the items, and then scrolling through the results to de-select the ones that aren't actually about global climate change. The number of selected items at the top of the results will shift to 35. That is a link that can be clicked to see what items are selected.

When the items we want are selected, go to the far-right icon atop the search page, the one with the three dots on it, and click on it. This will show you all the save options. In the pop-up window, choose Text Only, which is at the bottom under the heading Other Options.

Selecting the Text Only option will take you to the pop-up window for that save option. Make sure that Text Only is selected in the Output to: dropdown and that Full text is selected in the Content: dropdown.

Below that there is an anti-robot test with a group of letters you have to identify. Once you've done that, just click the Continue button.

It will take you to a window while it is processing, and then download the text file for you. The file should go to your downloads folder unless you've set a different default for downloads.

I recommend immediately naming the file you just downloaded something you can easily identify instead of the generic ProQuestDocuments and the date. In my case I'll go with PopularMedia_ClimateChange_1980-1989.
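
If you download many batches, building the names from the same few parts every time keeps them consistent. The sketch below follows the PopularMedia_ClimateChange_1980-1989 pattern used here. The function names, the argument names and the .txt extension are assumptions made for this example.

```python
from pathlib import Path


def descriptive_name(source_group, topic, start_year, end_year, extension=".txt"):
    """Build a file name such as PopularMedia_ClimateChange_1980-1989.txt."""
    return f"{source_group}_{topic}_{start_year}-{end_year}{extension}"


def rename_download(downloaded, new_name):
    """Rename a downloaded file in place and return its new path.

    Refuses to overwrite a file that already has the new name.
    """
    downloaded = Path(downloaded)
    target = downloaded.with_name(new_name)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    downloaded.rename(target)
    return target
```

Calling descriptive_name("PopularMedia", "ClimateChange", 1980, 1989) gives PopularMedia_ClimateChange_1980-1989.txt, which can then be passed to rename_download along with the path of the file in your downloads folder.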

When I open it in Notepad++ and scroll through, I can see that this file has additional information beyond the plain text of the article, like its URL and publication date. This might be problematic to have in the file when I try to do text analysis on it, so I'll delete it later. Please remember, however, to always create a new copy of a file like this to do your edits on, and leave the original intact. That way, if there is an issue or you decide you want to organize your information in a different way, you can go back to the original file and start from there instead of trying to duplicate the search that got you these articles.
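
Deleting that extra information by hand gets tedious for a large file, so here is one rough way to script it on your working copy. It is a sketch under a strong assumption: that each unwanted piece of information sits on a single line beginning with a recognizable label. The labels in the example are guesses for illustration, not a documented ProQuest format, so open your own file and use the labels you actually see there.

```python
def drop_labelled_lines(text, labels):
    """Remove every line that starts with one of `labels` (case-insensitive).

    `labels` should be the field names found in your own file. Fields that
    run over several lines are not handled by this simple approach.
    """
    lowered = tuple(label.lower() for label in labels)
    kept = [
        line
        for line in text.splitlines()
        if not line.strip().lower().startswith(lowered)
    ]
    return "\n".join(kept)


# Made-up sample text; the two labels are assumptions for illustration.
sample = (
    "Rising seas worry coastal towns.\n"
    "Publication date: 1988\n"
    "Document URL: http://example.com/1\n"
    "Scientists met this week."
)
cleaned = drop_labelled_lines(sample, ["Publication date:", "Document URL:"])
```

Here cleaned keeps only the two sentences of article text. Check the result against the original file before relying on it, since a line of article text that happens to begin with one of the labels would be removed too.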

In this example, we've downloaded everything we selected from this search as one group. In other situations, you may want to, say, sort the results from your search by date and save all the articles from one year together, or all the articles from one magazine in one file, or all the articles you found with one keyword vs. another as one group. You'd do this by selecting the articles that you want to group together into a file, downloading that file, then going back to ProQuest and using the Clear link at the top of the results page to wipe the slate clean and start selecting the next group of articles you want to download together as one text file.
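
Once you have one text file per group, a quick word count per group is a simple way to check that the files contain what you expect before loading them into a tool like Voyant. The sketch below is only that kind of sanity check, not a substitute for a real text-analysis tool. The function names are invented for this example, and the group labels and sample sentences are made up; in practice you would read the text of each group from its file.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def counts_by_group(texts_by_group):
    """Map each group label to a Counter of the words in that group's text."""
    return {label: word_counts(text) for label, text in texts_by_group.items()}


# Made-up sample text standing in for the contents of two downloaded files.
groups = {
    "1980-1989": "Greenhouse effect; the greenhouse gases",
    "2010-2019": "Climate change, climate crisis",
}
counts = counts_by_group(groups)
```

Each value in counts is a Counter, so counts["1980-1989"].most_common(10) would list the ten most frequent words in that group.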