Regular expressions (also called regex) are a method supported by many programming languages and text editors that lets you search not only for exact keywords or phrases but also for patterns of characters within a text.
This means that by using combinations of characters, wildcards and other symbols, they can be used to look for the following and more:
With ordinary search methods, you have to be more exact with what you are searching for because you can only ask it to look for an exact series of characters, rather than a pattern. Regex lets you cast a wider net.
For instance, let's say you had a transcription of someone’s diary and you wanted to find every mention of a phone number. However, while the author always wrote out the full phone number, they seldom mentioned the word "phone" or "call" beforehand.
Without that text clue at the front, you’d have a difficult time searching for phone numbers using typical search methods. You’d have to search for every three-digit combination from 001 to 999 to make sure you’d tracked down every phone number, and then check each result to make sure you hadn’t just found a three-digit number that the person wrote down for another reason.
With regular expressions, you can take a look at the kind of phone number that you want to find, such as 212-555-2261, and take note of the pattern within it: 3 digits followed by a dash, 3 more digits followed by another dash, and then finally 4 digits. Written out as a regex search, that is \d{3}-\d{3}-\d{4}
That may look strange, but it translates literally: find digit characters \d, repeated a set number of times { }, in this case 3 times {3}, followed by a -. Then the same thing again, \d{3}-, followed by another group of digits, but this time you want there to be 4 of them, so you change the number in the curly braces to a 4: \d{4}
\d{3}-\d{3}-\d{4}
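The same pattern can be tried outside a text editor. The following is a minimal sketch in Python, assuming a made-up diary excerpt (the text and the second phone number are invented for illustration); it shows that the pattern picks out full phone numbers while ignoring other three-digit numbers.

```python
import re

# Hypothetical diary excerpt; the wording and numbers are invented for illustration.
diary = (
    "Met Sam at 115 Main St. He said to ring 212-555-2261 after 6. "
    "Paid 300 for the repairs. Aunt May is now at 718-555-0199."
)

# Three digits, a dash, three digits, a dash, four digits.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

phone_numbers = phone_pattern.findall(diary)
print(phone_numbers)  # ['212-555-2261', '718-555-0199']
```

Note that 115 and 300 are skipped because they are not followed by the dash-and-digits pattern.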
Regular Expressions are a powerful method to find broad amounts of text that match a given pattern and so are often used for data validation or for find-and-replace operations in a document. The latter is what I’ll be using it for in the tutorials below, to show you how to use regex to find certain blocks of text that you want to cut from your document to make it easier to analyze, and then cut those sections as a batch rather than individually.
A word of caution though - it is a pretty easy mistake to write a regex that applies to text you don’t actually want cut and not even know you’ve gotten rid of it until a much later step of the process. So a few ground rules:
I’ll be using Notepad++ in the examples below but the same formulas should work for you in text editors with regular expression support that are available on Macs. A good quick reference sheet for Regular Expressions has been created by MIT and it will be the one I use throughout this tutorial.
If you want to get a large body of newspaper or scholarly article texts on a certain topic or from a certain date range to use text analysis on, you can use Proquest's batch download function to do so and save the whole thing as one file. That process is explained on a separate DAsH tutorial.
However, that file is going to contain more than just the full text of those articles. Metadata (a term meaning literally data about data) such as the title, author, subject, original publication and other information will be included both before and after each article, as well as a line separator in between them. These headings are necessary if you just want to use the text file to take notes from when you are doing a close reading of each article. Otherwise there would be no other way to identify who wrote what and where. But if all you want is the plain text of each article so you can analyze them as a unit, this extra information is something you'd want to cut out.
In this tutorial, we’ll take a file that contains plain text of articles downloaded in bulk from Proquest and edit it down so it just contains the article text without any of the information about publisher, links, author, etc that could get in the way of the text analysis you plan to do of the content of the articles.
Regular Expressions are a way of summing up the kind of words, numbers or blocks of text that we want to single out in a file based on what patterns those words, numbers or text blocks fit or not. In this case what we'll be summing up is what we want to delete. We can do this using a group of symbols that describes the pattern that these groups of text match. If we can identify the pattern of those irrelevant text blocks and figure out how to express that pattern to the program as a regular expression, the text editing program can do what would be repetitive editing work for us and take those sections out all at once.
From this look-through of the file, we have identified what the qualities are of the sections we need, but more importantly we have identified the pattern of everything else that we don't need.
Now that we know what everything that we want to cut looks like, we just need to sum it up as a regular expression so that we can tell Notepad++'s Find and Replace function how to find these sections and replace them with nothing or with something else.
We need to write an expression that will take Author: as a starting point, and match it and everything else between it and Full text: after which it will stop the selection. Let's check the cheat sheet made by MIT for regular expressions here and find some information that can help us.
The first table tells us that . means literally any character: a digit, letter, space, punctuation mark, whatever. So we can use . as the symbol to substitute for any of the characters between Author: and Full text:.
But if we just use it once, it will only look for one character.
Put between Author: and Full text:, it looks like this
Author:.Full text:
and will only match when exactly one character sits between the two. The number of lines and characters between these two points in the text varies greatly, so we need to find a way to tell it to look for any number of characters as long as it stays between that start and end point.
This is what we'll find in the "Quantifiers" section of the cheat sheet. A quantifier is added after a symbol or character to specify how many times that kind of character should appear in the match you're looking for. Here are the quantifiers that will help us write this regular expression:
So if we wanted to write a regular expression that matched Author: and whatever is between that and Full text: any time those phrases appeared in that order within the text file, it would look like this.
Author:.*?Full text:
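To see why the ? after .* matters, here is a small sketch in Python rather than Notepad++. The sample text and its field values are invented, and only the Author: and Full text: labels come from the tutorial. In Python the re.DOTALL flag is what lets . cross line breaks; Notepad++ has a comparable ". matches newline" option in its Find dialog.

```python
import re

# Made-up sample in the spirit of a batch export; the field values are invented.
sample = (
    "First article title\n"
    "Author: Doe, Jane\n"
    "Abstract: A short summary.\n"
    "Full text: Body of the first article.\n"
    "Second article title\n"
    "Author: Roe, Richard\n"
    "Full text: Body of the second article.\n"
)

# re.DOTALL lets "." match newlines too, so one match can span several lines.
lazy = re.compile(r"Author:.*?Full text:", re.DOTALL)   # stops at the nearest "Full text:"
greedy = re.compile(r"Author:.*Full text:", re.DOTALL)  # runs on to the last "Full text:"

lazy_matches = lazy.findall(sample)
greedy_matches = greedy.findall(sample)

print(len(lazy_matches))    # 2: one block of metadata per article
print(len(greedy_matches))  # 1: a single match that swallows the first article's body

# Replacing each lazy match with nothing deletes only the metadata.
cleaned = lazy.sub("", sample)
print(cleaned)
```

Without the ?, the first article's body would be deleted along with the metadata, which is exactly the kind of silent over-matching the ground rules warn about.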
Now that we have confirmed that our regular expression matches what we want to edit out, let's replace what it has matched with nothing, in other words, delete it.
Now let's take care of that one article that was the exception. Since it's just one, you could easily delete it manually, but let's say that this file had multiple places where Publication Info: was the first line between the article title and article content. The neat thing about regular expressions is that when you've found one that works, it's very easy to customize that expression for other situations.
We know that Author:.*?Full text: works, so to change the starting point of what the expression matches with from Author: to Publication Info: it's as easy as taking Author: out of the beginning of the regular expression and swapping in Publication Info: making it -
Publication Info:.*?Full text:
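Swapping the starting label can be sketched as a small helper. This is an illustration in Python, not part of the Notepad++ workflow: the function name metadata_pattern and the sample text are hypothetical.

```python
import re

def metadata_pattern(start_label, end_label="Full text:"):
    """Build a 'start .*? end' pattern (hypothetical helper for illustration).

    re.escape guards against labels that happen to contain special characters.
    """
    return re.compile(
        re.escape(start_label) + r".*?" + re.escape(end_label),
        re.DOTALL,  # let "." cross line breaks
    )

# Invented sample of the exceptional article layout.
sample = (
    "An article title\n"
    "Publication Info: Example Weekly; 02 Feb 2001\n"
    "Full text: Body of the exceptional article.\n"
)

cleaned = metadata_pattern("Publication Info:").sub("", sample)
print(cleaned)
```

Calling metadata_pattern("Author:") would give back the earlier expression, so the same helper covers both cases.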
Scroll through the document. Now the only thing left for us to get rid of is the metadata that comes after each article but before the next one. We know that the formula of Beginning Of Text to Cut .*? End of Text to Cut worked for setting up our regular expression before, so let's try it again with this text.
Now you're ready to get rid of the extra information at the bottom of the articles. In this instance, though, let's say you want to keep the line separator between articles so that it will be easier later to tell where one article ends and the next begins if you are curious after seeing your text analysis results. That will change what we put in the Replace box.
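The idea of replacing a match with a separator rather than with nothing can be sketched in Python. Everything specific here is an assumption: the Subject: label as the first line of trailing metadata, the Database: line, and a separator made of 60 underscores are stand-ins, so check what your own file actually contains.

```python
import re

SEPARATOR = "_" * 60  # assumed separator line; your file may use something else

# Invented sample: article body, assumed trailing metadata, separator, next title.
sample = (
    "Full text: Body of the first article.\n"
    "Subject: Example topic\n"
    "Database: Example database\n"
    + SEPARATOR + "\n"
    "Second article title\n"
)

# Match from the assumed first trailing label through the run of underscores.
trailing = re.compile(r"Subject:.*?_{10,}", re.DOTALL)

# Put a separator back instead of replacing the match with nothing.
cleaned = trailing.sub(SEPARATOR, sample)
print(cleaned)
```

The metadata is gone, but the line of underscores still marks where one article ends and the next begins.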
There's still some spot cleaning to be done. It looks like one article still has Share: between the content of the article and the line separator marking the beginning of the next one, and at the very end of the document, there's some information about contacting Proquest, but the first one only occurs a few times and the second only once so they will be easy to edit out by hand.
This took some time to figure out the first time we tried it, but now that you know a bit about how regular expressions work, you can adapt them to other text documents. If you have a bunch of different Proquest files that you know are set up like this, you can even record the steps and play them back on those files so you don't have to do these steps by hand more than once. This function is called a Macro.
Making a macro is a way of automating a series of repetitive tasks. Executing a macro is like giving Notepad++ a list of steps you want it to execute on a piece of text and having it do those tasks for you. If you know that you have a series of texts that you want to have the same kinds of actions performed on - like if instead of this one Proquest batch file you had a few dozen of them - you can record the steps that you want to be taken with each Proquest batch file, save that recording as a macro, and apply the macro to any number of other files.
We've now recorded a macro that automates all the steps that we figured out on how to use regular expressions to clean up a Proquest batch download file. To see what this macro does, let's run it on a file that needs to be edited.
Keep in mind when you are creating a macro, you want to be very sure that it isn't going to mess up the documents that you'll apply it to. You might want to try it on a few different examples of files you want to batch edit to make sure there aren't any unexpected consequences to the edits it applies that you don't like. In some cases, it might make sense to record a few smaller macros, rather than one large one and run them one after the other.
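For readers who prefer a script to a recorded macro, the same idea (a fixed list of find-and-replace steps applied to many documents) might look like the sketch below. The function name run_steps and the in-memory batch of texts are hypothetical; with real files you would read each file's text, pass it through the function, and write the result to a new file so the original stays untouched.

```python
import re

# Each step is (pattern, replacement), applied in order like the recorded steps of a macro.
STEPS = [
    (re.compile(r"Author:.*?Full text:", re.DOTALL), ""),
    (re.compile(r"Publication Info:.*?Full text:", re.DOTALL), ""),
]

def run_steps(text, steps=STEPS):
    """Apply every find-and-replace step to one document's text (hypothetical helper)."""
    for pattern, replacement in steps:
        text = pattern.sub(replacement, text)
    return text

# Invented stand-ins for the contents of two batch files.
batch = {
    "file_a": "Title A\nAuthor: Doe, Jane\nFull text: Body A.\n",
    "file_b": "Title B\nPublication Info: Example Weekly\nFull text: Body B.\n",
}

cleaned_batch = {name: run_steps(text) for name, text in batch.items()}
print(cleaned_batch)
```

Keeping the steps in a short list makes it easy to test them on a few sample files first, which is the same caution that applies to macros.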
Dialogue in film or television can be interesting material for text analysis. Sites such as Opensubtitles.org will let you download the subtitle (.srt) files for some movies or television series. Though an srt file doesn’t contain information about who is speaking which line of dialogue, if you want to analyze one movie, or a group of them, using just the lines of dialogue spoken within them, srt files can be a great source to draw from.
SRT files are designed to be used in conjunction with a piece of media and so along with the lines of dialogue, the file also contains the timecodes and order in which that dialogue should appear as subtitles when the video is played. While this additional numeric information is useful to your computer or other player being used to watch the movie, it isn’t useful to us if we want to analyze just the text. If this numeric information is still included in the file, our methods of analysis could be thrown off by the numbers or timecodes also present within the file.
Thankfully for us, because srt files are intended to be used by many different kinds of software, their format is standardized. This will make it easy for us to use regular expressions to find and delete the timecodes and other numeric indicators from the file, and then apply the same group of regular expressions to multiple files.
In this tutorial, we’ll take an srt file, delete the timecode and other numeric data from it using regular expressions, reformat the irregular spacing within it, and turn it into a file just containing the text of the dialogue and captions within the movie.
Regular Expressions are a way of summing up the kind of words, numbers or blocks of text that we want to single out in a file based on what patterns those words, numbers or text blocks fit or not. In this case what we'll be summing up is what we want to delete. We can do this using a group of symbols that describes the pattern that these groups of text match. SRT files have a standardized way of expressing the time and way that subtitles occur so there is a pattern that can be found in how those bits of numbers we want to take out are written. If we can identify the pattern of those irrelevant text blocks and figure out how to express that pattern to the program as a regular expression, the text editing program can do what would be repetitive editing work for us and take those sections out all at once.
Now that we've decided which parts of the SRT file we want to get rid of or formatting we'd like to change, it's time to figure out what regular expressions we can use to match those parts so we can edit them as a batch.
We've gone through and seen the pattern of the parts of this subtitle file that we want to cut out. Now it's time to figure out how to sum them up as regular expressions so we can get Notepad++ to cut them out of the document.
To translate that timecode into more basic and abstract terms (so we can sum it up as a regular expression), it’s 2 digits, followed by a : then 2 more digits, followed by another : then 2 more digits, this time followed by a , then the last 3 digits and then an arrow made up of --> after which the sequence of digits, colons and a comma repeats once more. So we’ll need a way to summarize that pattern with a regular expression.
Fortunately, by consulting the cheat sheet linked to above we find a couple of things that can help us. For starters, there is a symbol that means a digit, which is \d. That means we could just copy the code above and substitute in \d every time there is a digit, which would work just fine, but there is an even shorter way to do it.
There are also quantifiers that can be added to any symbol. They are much like the wildcards used in the exercise about the Proquest file. One way of using a quantifier is by placing a number in curly braces directly after the character. This indicates that you are trying to find exactly that many occurrences of that character next to each other. If you just search for \d it will find an individual digit. If you search for \d\d it will look for two digits next to each other, and if you search for \d{2} it will do the same thing as \d\d: look for two digits next to each other.
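A quick way to convince yourself that \d\d and \d{2} behave the same is to run both on a short string. This sketch uses Python and an invented sentence.

```python
import re

text = "Room 7, flight 42, year 1999"  # invented example text

single_digits = re.findall(r"\d", text)
pairs_repeated = re.findall(r"\d\d", text)
pairs_quantified = re.findall(r"\d{2}", text)

print(single_digits)     # ['7', '4', '2', '1', '9', '9', '9']
print(pairs_repeated)    # ['42', '19', '99']
print(pairs_quantified)  # ['42', '19', '99']
```

The lone 7 is skipped by both two-digit patterns, and 1999 is found as two separate pairs because each match picks up where the previous one ended.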
Now, we should be able to just replace all the numbers in that timecode with the regular expression marking out the number of digits that we want it to look for. But since there are other punctuation characters within the timecode (: , - and >), we should double check that none of them are considered special characters.
Special characters are characters that have a special meaning in regular expressions, though the rest of the time you'd recognize them as ordinary punctuation such as . or ? or (). If special characters exist in the phrase that you are looking for, you have to “escape” them with a backslash (\). Fortunately none of the punctuation characters in the timecode (: , - and >) are on the special characters list (though if the - were inside square brackets it would need to be escaped, according to the special characters list).
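Putting the pieces together, the timecode description above translates into one expression. The sketch below checks it in Python against a short sample whose timecodes and counters are invented (the dialogue lines are borrowed from the example file); none of the punctuation needs a backslash.

```python
import re

# 2 digits : 2 digits : 2 digits , 3 digits, then " --> ", then the same again.
TIMECODE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

# Sample srt text; the counters and timecodes are made up for illustration.
srt = (
    "1\n"
    "00:00:12,500 --> 00:00:15,000\n"
    "\"LADY WINDERMERE'S FAN\"\n"
    "\n"
    "2\n"
    "00:00:16,250 --> 00:00:20,750\n"
    "Lady Windermere faced the\n"
    "grave problem-- of seating\n"
    "her dinner guests.\n"
)

timecodes = TIMECODE.findall(srt)
print(timecodes)  # ['00:00:12,500 --> 00:00:15,000', '00:00:16,250 --> 00:00:20,750']
```

The double dash inside the dialogue is left alone because it is not surrounded by the digit pattern.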
Now that we're sure that the regular expression we've written finds what we want it to, it's time to use it to cut out what we no longer want in the file.
Your file should now have only the text of the subtitles in it, though there are a lot of empty lines and sometimes sentences are spread over multiple subtitles. We'll figure out how to change that next.
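As a rough Python counterpart to the find-and-replace step, the sketch below deletes whole timecode lines and the counter lines that number each subtitle. The function name strip_numeric_lines and the sample timecodes are hypothetical, and treating any line made only of digits as a counter is an assumption: a line of dialogue consisting solely of a number would be removed too.

```python
import re

# A full timecode line, including its line break ("\r" covers Windows line endings).
TIMECODE_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}[ \t\r]*\n",
    re.MULTILINE,  # make ^ match at the start of every line
)
# Assumption: a line containing only digits is a subtitle counter.
COUNTER_LINE = re.compile(r"^\d+[ \t\r]*\n", re.MULTILINE)

def strip_numeric_lines(srt_text):
    """Remove timecode lines first, then counter lines (hypothetical helper)."""
    without_timecodes = TIMECODE_LINE.sub("", srt_text)
    return COUNTER_LINE.sub("", without_timecodes)

# Sample with invented counters and timecodes.
srt = (
    "1\n"
    "00:00:12,500 --> 00:00:15,000\n"
    "Lord Darlington\n"
    "\n"
    "2\n"
    "00:00:16,250 --> 00:00:20,750\n"
    "\"I presume you came\n"
    "to see my husband.\"\n"
)

dialogue_only = strip_numeric_lines(srt)
print(dialogue_only)
```

The blank lines between subtitles survive this step, which is why a separate spacing cleanup is still needed afterwards.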
"LADY WINDERMERE'S FAN"
Lady Windermere faced the
grave problem-- of seating
her dinner guests.
Lord Darlington
"Lord Darlington."
"I presume you came
to see my husband."
Dear Lord Windemere--
This note from a total
stranger may surprise you--
...but is is important
that you see me at once
if you would avoid certain...
...unpleasant disclosures.
Yours very truly, Edith Erlynne.
"Lord Darlington is most
anxious to see you."
"You don't mind if
I run away?"
"LADY WINDERMERE'S FAN" Lady Windermere faced the grave problem-- of seating her dinner guests. Lord Darlington "Lord Darlington." "I presume you came to see my husband." Dear Lord Windemere-- This note from a total stranger may surprise you-- ...but is is important that you see me at once if you would avoid certain... ...unpleasant disclosures. Yours very truly, Edith Erlynne. "Lord Darlington is most anxious to see you." "You don't mind if I run away?"
"LADY WINDERMERE'S FAN"
Lady Windermere faced the grave problem-- of seating her dinner guests.
Lord Darlington
"Lord Darlington." "I presume you came to see my husband." Dear Lord Windemere-- This note from a total stranger may surprise you-- ...but is is important that you see me at once if you would avoid certain...
...unpleasant disclosures.
Yours very truly, Edith Erlynne.
"Lord Darlington is most anxious to see you." "You don't mind if I run away?"
"LADY WINDERMERE'S FAN"
Lady Windermere faced the grave problem-- of seating her dinner guests.
Lord Darlington
"Lord Darlington."
"I presume you came to see my husband."
Dear Lord Windemere-- This note from a total stranger may surprise you-- ...but is is important that you see me at once if you would avoid certain......unpleasant disclosures.
Yours very truly, Edith Erlynne.
"Lord Darlington is most anxious to see you."
"You don't mind if I run away?"
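One possible way to get from wrapped subtitle lines to one line per subtitle is sketched below in Python. These are not necessarily the expressions used in the Notepad++ steps; they are an assumption about one workable approach, run on a short excerpt of the example text with invented blank-line spacing.

```python
import re

# Excerpt of the example dialogue; the blank-line spacing is invented.
text = (
    "Lady Windermere faced the\n"
    "grave problem-- of seating\n"
    "her dinner guests.\n"
    "\n"
    "\n"
    "Lord Darlington\n"
    "\n"
    "\"I presume you came\n"
    "to see my husband.\"\n"
)

# Step 1: a line break with no blank line on either side is a wrapped line; join with a space.
joined = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

# Step 2: collapse each run of line breaks into a single one and trim the ends.
tidy = re.sub(r"\n{2,}", "\n", joined).strip()
print(tidy)
```

The order matters: collapsing the blank lines first would erase the only clue about where one subtitle ends and the next begins.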
A macro is a way of automating a series of repetitive tasks. It's like giving Notepad++ a list of steps you want it to execute on a piece of text. If you know that you have a series of texts that you want to have the same kinds of actions performed on (say, if instead of this one subtitle file you had a few dozen of them), you can record the steps that you want to be taken with each document one time, save that recording as a macro, and apply it to any number of other documents.
Keep in mind when you are creating a macro that you want to be very sure it isn't going to mess up the document, since executing it carries out all the steps at once rather than with breaks in between for you to assess how each edit looks. You might want to try the macro on a few different examples of the files you want to batch edit to make sure there aren't any bugs in the way that it is set up. In some cases, it might make sense to make a few smaller macros rather than one large one and run them one after the other.