
DAsH

Research Guide for DAsH (or digital humanities) resources and tools

What are Regular Expressions?

Regular expressions (also called regex) are a method, supported by many programming languages and text editors, that lets you search not only for exact keywords or phrases but also for patterns of characters within a text.

This means that by using combinations of characters, wildcards and other symbols, they can be used to look for the following and more:

  • Specific terms like a regular search – cat finds cat
  • Specific characters only at the start of a word or only at the end of a word – written as a simple wildcard search, *cat will find you bearcat but not category, and cat* will find you category but not bearcat 
  • Specific kinds of characters like capital letters, lowercase letters, numbers, numbers within a certain range
  • Patterns of certain kinds of characters, for instance you can look only for email addresses that fit the pattern of name01@school.edu without needing to know all the names you’d be looking for, all the school names or even all the numbers.

With ordinary search methods, you have to be more exact with what you are searching for because you can only ask it to look for an exact series of characters, rather than a pattern. Regex lets you cast a wider net.
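As a rough illustration of the kinds of patterns listed above, here is a minimal Python sketch. The sample sentence, the variable names and the exact patterns are assumptions made up for this example; \b (a word boundary) and \w (a letter, digit or underscore) are standard regex symbols that are not otherwise covered in this guide.

```python
import re

# Made-up sample text, for illustration only.
text = (
    "The bearcat is in a different category than a cat. "
    "Write to name01@school.edu or jdoe22@college.edu."
)

# Words that END in "cat" (similar to the wildcard search *cat).
ends_with_cat = re.findall(r"\b\w+cat\b", text)

# Words that START with "cat" (similar to the wildcard search cat*).
starts_with_cat = re.findall(r"\bcat\w+\b", text)

# Addresses shaped like name01@school.edu:
# lowercase letters, two digits, @, lowercase letters, then .edu
emails = re.findall(r"\b[a-z]+\d{2}@[a-z]+\.edu\b", text)

print(ends_with_cat)
print(starts_with_cat)
print(emails)
```

Note that the email pattern never names a person or a school; it only describes the shape of the address.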

For instance, let's say you had a transcription of someone’s diary and you wanted to find every mention of a phone number. However, while the author always wrote out the full phone number, they seldom mentioned the word phone or call beforehand. 

Without that text clue at the front, you’d have a difficult time searching for phone numbers using typical search methods. With those methods, you’d have to go through all manner of three digit number configurations to try and make sure you’d tracked down every phone number by searching from 001 to 999 and checking each result to make sure that you didn’t just find an instance of the person writing down a 3 digit number for another reason.

With regular expressions, you can look at the kind of phone number that you want to find, say 212-555-2261, and take note of the pattern within it: 3 digits followed by a dash, 3 more digits followed by another dash, and finally 4 digits. Written out as a regex search, that is \d{3}-\d{3}-\d{4}

That may look strange, but all it literally says is that you want to find digit characters \d repeated a set number of times { }, in this case 3 {3}, followed by a - and then the same thing again \d{3}- followed by another run of digits, but this time you want there to be 4 of them, so you change the number in the curly braces to 4 \d{4}
\d{3}-\d{3}-\d{4}
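To see the phone number pattern at work outside a text editor, here is a small Python sketch. The diary sentence and the second phone number are invented for illustration; only the pattern itself comes from the text above.

```python
import re

# Invented diary text; "300" is a 3 digit number that is not a phone number.
diary = (
    "Called Ann at 212-555-2261 after walking 300 steps. "
    "Her office is 646-555-0199."
)

# 3 digits, a dash, 3 digits, a dash, 4 digits.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

phone_numbers = phone_pattern.findall(diary)
print(phone_numbers)
```

The stray 300 is skipped because it is not followed by a dash and more digits, which is the check you would otherwise have to make by hand.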

Regular Expressions are a powerful method to find broad amounts of text that match a given pattern and so are often used for data validation or for find-and-replace operations in a document. The latter is what I’ll be using it for in the tutorials below, to show you how to use regex to find certain blocks of text that you want to cut from your document to make it easier to analyze, and then cut those sections as a batch rather than individually.

A word of caution though - it is a pretty easy mistake to write a regex that applies to text you don’t actually want cut and not even know you’ve gotten rid of it until a much later step of the process. So a few ground rules:

  • Always do this editing and experimenting in a new version of your file, and make sure a copy of the raw text file you're trying to clean up still exists. That way you can always go back to the original if your selection turned out to be too broad and you need to try again.
  • Always run the regular expression search that you’ve created 4-5 times in Find mode in the text editor to make sure it isn’t selecting anything that you don’t intend it to, before replacing the found text with something else. If that search reveals that your formula is selecting too broadly, you'll need to change the regular expression, maybe by making it more specific or by separating it into two or more formulas that you can run one after the other.
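If you ever do this kind of cleanup in a script rather than a text editor, the same two ground rules apply. The sketch below is one possible way to follow them in Python; the function names and the _Edit suffix are assumptions for this example, not part of any tool.

```python
import re
import shutil
from pathlib import Path


def make_working_copy(original_path):
    """Copy the raw file to <name>_Edit<ext> so the original stays untouched."""
    original = Path(original_path)
    copy_path = original.with_name(original.stem + "_Edit" + original.suffix)
    shutil.copyfile(original, copy_path)
    return copy_path


def preview_matches(text, pattern, context=20):
    """Return each match with a little surrounding text, changing nothing."""
    previews = []
    for match in re.finditer(pattern, text):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        previews.append(text[start:end])
    return previews
```

Reading through the output of preview_matches before running any replacement plays the same role as clicking through the Find results in the editor.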

I’ll be using Notepad++ in the examples below but the same formulas should work for you in text editors with regular expression support that are available on Macs. A good quick reference sheet for Regular Expressions has been created by MIT and it will be the one I use throughout this tutorial.

Using Regular Expressions in Notepad++

Learning Goals

If you want to get a large body of newspaper or scholarly article texts on a certain topic or from a certain date range to use text analysis on, you can use Proquest's batch download function to do so and save the whole thing as one file. That process is explained on a separate DAsH tutorial.

However, that file is going to contain more than just the full text of those articles. Metadata (a term meaning literally data about data) such as the title, author, subject, original publication and other information will be included both before and after each article, as well as a line separator in between them. These headings are necessary if you just want to use the text file to take notes from when you are doing a close reading of each article. Otherwise there would be no other way to identify who wrote what and where. But if all you want is the plain text of each article so you can analyze them as a unit, this extra information is something you'd want to cut out.

In this tutorial, we’ll take a file that contains plain text of articles downloaded in bulk from Proquest and edit it down so it just contains the article text without any of the information about publisher, links, author, etc that could get in the way of the text analysis you plan to do of the content of the articles. 

Getting Started

  • Download Notepad++ if you do not have it already. It is a free text editor that has a lot of helpful tricks and can be used to write normal text as well as a variety of scripting formats. It also has lots of ways you can batch edit your document, as you'll see now. Unfortunately, it's not available for Mac, but Atom or Sublime have similar functions.
  • Locate the Proquest full text file that you want to alter, if you have one. For information on how to get one, see our exercise in how to get plain text from Proquest. 
  • If you did not get here from that previous exercise and don't have a batch file from Proquest that you want to edit, you can download a sample file I've created below.
    This file isn't really a list of articles with plain text from Proquest, but it does contain all the same extra information that one of those documents would have. The only thing I changed was the content. The full text came from a text generator online, and the titles, names and subjects, etc are just things I made up. The regular expressions formulas you learn for this, however, will be applicable to a real full text document from Proquest.

Identifying Patterns

Regular Expressions are a way of summing up the kind of words, numbers or blocks of text that we want to single out in a file based on what patterns those words, numbers or text blocks fit or not. In this case what we'll be summing up is what we want to delete. We can do this using a group of symbols that describes the pattern that these groups of text match. If we can identify the pattern of those irrelevant text blocks and figure out how to express that pattern to the program as a regular expression, the text editing program can do what would be repetitive editing work for us and take those sections out all at once.

  • Open your Proquest batch file that you want to work with or ProQuest_SampleFile in Notepad++.
  • Let's scroll through the text and see if there's a pattern to what the text before and after the full text of the article consists of. On my sample sheet, I have 10 articles which isn't a huge amount, so I'll have time to look through all of it. For larger sheets you may decide to only look at a portion of it before trying out some editing with regular expressions. 
  • The articles seem to uniformly follow this pattern. 
    • Line separator
    • The title of the article, but without any heading indicating it's the title
    • Multiple lines of metadata about the article below the title. Each has a heading saying what it is. In the sample file the first line after the title is almost always the one for Author: except for one entry, where the first line after the title of the article is Publication Info: The number of metadata lines between the title and the full article varies from article to article. 
    • The heading Full Text: right before the text of the article.
    • The full text of the article. This is the bit that you want to make sure is preserved in its entirety since it is the content that you want to analyze. 
    • Another 10 to 20-something lines of metadata about the article above them. In my sample file the first of these is always Subject:
    • The line separator that starts the next article and it all begins over again.

From this look-through of the file, we have identified what the qualities are of the sections we need, but more importantly we have identified the pattern of everything else that we don't need.

  • Between the title and the full text of the article is a series of metadata lines that we want to cut out, usually starting with Author: and the line where the article starts begins with Full text: 
  • Right after the article there is a series of lines of metadata starting with one that says Subject: and everything right up to the line separator that marks the start of the next article can go 

Now that we know what everything that we want to cut looks like, we just need to sum it up as a regular expression so that we can tell Notepad++'s Find and Replace function how to find these sections and replace them with nothing or with something else. 

Turning Identified Patterns Into Regular Expressions

  • Before you start editing, save the file as something different by going to File -> Save As. I'll go with the same file name but will add _Edit to the file name, so it'll be ProQuest_SampleFile_Edit. Getting careless with how loosely a regular expression is defined can result in making huge changes to a file, so if experimenting with an expression winds up altering more of a file than you meant to, you'll have the original file to go back to.
  • Let's begin with the first section that we wanted to get rid of - everything after the article title and before the start of the article text. We are looking for a way to select only this section within each article. We know that this group of metadata lines between the title and the article usually begins with Author: and ends with Full text: Just to be sure that's the only place they are, let's do a search for each of those phrases to make sure they show up whenever the section to be cut begins and ends and nowhere else in the file. 
  • Go to Search on the menu bar and choose Find. You can also use Ctrl+F to do the same thing. 
  • In the Find window there are some options, you'll see that there are tabs for Find, the default when you are just looking for a term, Replace when you have something you want to replace it with, Find in Files which lets you search through files you don't currently have open in Notepad++ and Mark which will bookmark lines where the item you are searching for is found. We're just going to be using the first two tabs. 
  • On this first Find screen at the bottom there are some different options for Search Mode.
    • Normal: Most of the time up until now, you would have been searching with this mode in a text editor. The program takes what you type in literally, and doesn't take any of the input to be special characters or regular expressions
    • Extended: This means it will search for certain kinds of formatting characters in addition to whatever text you are searching for, so you can, say, find a phrase only if it comes at the end of a line or right after a tab. \r means a carriage return, \n means a line feed (line break), and \t means a tab. 
    • Regular expression: This is what you'll be learning to use today. You can use Regular expression mode to search for exact phrases just like Normal search, but certain characters or combinations like \d or $ or . or ? will be read as part of a regular expression.
  • Let's leave the mode as Normal for now since we are just trying to figure out if Author: is going to work as our search term to find the start of the section before the article that we want to cut.
  • Type in Author: as what you want to find and click Find All in Current Document
  • A new box will appear at the bottom of your document with your Find result. Promisingly, each hit is at the start of its own line, meaning that none of them falls in the middle of a sentence in an article. Less promisingly, if you're using the sample file, it's saying that there are 9 hits in this file, which, considering you have 10 articles, means one of your articles won't get targeted with a regular expression that starts with Author:, and you'll want to know what different phrase that article has on the line just below the title of the article. 
  • Since it looks like the biggest gap between two hits is between the first and second hits (Line 5 and Line 155), double-click on Author: in the first hit and scroll down.
  • For the second article Publication info: is the first metadata line underneath the title of the article. No problem, we'll just target this article separately with a second regular expression after we've written one for the phrasing used with most of these articles.
  • This is the same procedure you'd use if you have a larger text file than this. If the number of times that you expect to see a heading in the Find results doesn't match the number of times that you do, look to see where there's a large line gap between results, double-click to go to the start of that gap, and scroll down from there to see what different start point you can use to capture that different section. 
  • Now we know that with the exception of one article, all of the others start off the beginning of the section between title and article with Author: so we'll be able to make that the starting point for our regular expression. To set an ending point, we need to find out if all of them lead into the article by using Full text:
  • Type Full text: into the search bar and choose Find All in Current Document
  • The Find results window pops up below the text and we can see that this appears 10 times and all at the start of a line, so we can be sure of our guess that this is the way to find the beginning of an article and end of the selection that we are cutting.
  • Double-click on any of the hits to make sure that they are in the place you want and when this is confirmed, it's time to move on to figuring out how to turn these starting and ending points of the section to be cut into a regular expression. 
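The checks above (does each heading start its own line, how many hits are there, and where are the gaps?) can also be sketched in Python. The tiny sample below is invented and much shorter than a real ProQuest export, and heading_line_numbers is a hypothetical helper written for this example.

```python
# Invented miniature of the file layout described above.
sample = (
    "Some Title\n"
    "Author: A. Writer\n"
    "Full text: The author of the letter said little.\n"
    "Subject: Letters\n"
    "__________\n"
    "Another Title\n"
    "Publication info: Daily Paper\n"
    "Full text: More article text.\n"
)


def heading_line_numbers(text, heading):
    """Line numbers (starting at 1) of lines that begin with the heading."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(heading)
    ]


author_lines = heading_line_numbers(sample, "Author:")
full_text_lines = heading_line_numbers(sample, "Full text:")
print(author_lines, full_text_lines)
```

Here Full text: turns up once per article while Author: turns up only once, which is the same kind of mismatch that pointed to the Publication info: article in the sample file.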

We need to write an expression that will take Author: as a starting point, and match it and everything else between it and Full text: after which it will stop the selection. Let's check the cheat sheet made by MIT for regular expressions here and find some information that can help us.

The first table tells us that . means literally any character: number, letter, space, punctuation, whatever. So we can use . as the symbol to substitute for any of the characters between Author: and Full text:

But if we just use it once, it will only look for one character.

Put between Author: and Full text:, it looks like this

Author:.Full text: 

and will only match when there is exactly one character between these two. The number of lines and characters between these two points in the text varies greatly, so we need to find a way to tell it to look for any number of characters as long as it stays between that start and end point.

This is what we'll find in the "Quantifiers" section of the cheat sheet. A quantifier is added right after the symbol or character you want the regular expression to match, and it specifies how many times that character should appear in the match. Here are the quantifiers that will help us write this regular expression:

  • * stands for 0 or more repetitions of a symbol, which in our case we'll want to apply to the . between the phrases we are using as a starting and ending point. This means that no matter how many characters are between our start and end point, it will keep matching until it reaches the ending phrase.
  • ? normally means 0 or 1 instances of whatever comes before it, but placed directly after another quantifier such as * it makes that quantifier match as little as possible. This means the match will stop the first time it reaches the ending point of the expression, instead of running on to the last Full text: in the file.

So if we wanted to write a regular expression that matched Author: and whatever is between that and Full text: any time those phrases appeared in that order within the text file, it would look like this. 

Author:.*?Full text:
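The difference the ? makes after .* is easiest to see side by side. This is a small Python sketch on an invented one-line sample; in most regex engines, including Python's, .* on its own grabs as much as it can, while .*? stops at the first possible ending.

```python
import re

# Invented one-line sample with two "articles" in it.
text = "Author: A Full text: one. Author: B Full text: two."

# Without ?, the match runs from the first Author: to the LAST Full text:
greedy = re.findall(r"Author:.*Full text:", text)

# With ?, each match stops at the FIRST Full text: it reaches.
lazy = re.findall(r"Author:.*?Full text:", text)

print(greedy)
print(lazy)
```

The version without ? would swallow the whole first article along with the metadata, which is exactly the kind of too-broad match the ground rules warn about.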

  • Let's make sure this regular expression works and go to the Find window.
  • Select Regular expression as the Search Mode and make sure that Wrap around, above it, is checked and that the check box for . matches newline is checked too, since our match will go across multiple lines.
  • Type in Author:.*?Full text: in the search bar and choose Find All in Current Document. It gives 9 results as expected
  • Let's double-click through an assortment of those hits to make sure that what's highlighted as a match is what we eventually want to edit out of the file. In this case, we'll see that what's highlighted as a match for our regular expression is in fact what we want to cut out and no more or less, so yay!
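The ". matches newline" check box has a counterpart in most programming languages. In Python it is the re.DOTALL flag; the sketch below, on an invented three-line metadata block, shows that the same expression finds nothing without it.

```python
import re

# Invented sample: the metadata runs across several lines.
sample = (
    "Title One\n"
    "Author: A. Writer\n"
    "Publication info: Daily Paper\n"
    "Full text: Article one.\n"
)

pattern = r"Author:.*?Full text:"

# By default . does not match a line break, so nothing is found.
without_flag = re.findall(pattern, sample)

# re.DOTALL lets . match line breaks too, like ". matches newline".
with_flag = re.findall(pattern, sample, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```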

Now that we have confirmed that our regular expression matches what we want to edit out, let's replace what it has matched with blank space, in other words, deleting it.

  • Click the Replace tab to move to that window. It should automatically move over the last search that you did into the box after Find what:
  • In the box after Replace with: since you want that text to be replaced with nothing, leave that box blank. Then choose Replace All
  • There will be text at the bottom of the Replace window letting you know that it has replaced 9 occurrences of what you asked it to find. 
  • Scroll through, and you'll see that with the exception of the second article (the one where the first line after the title is Publication info: not Author:)  all of the articles now have the extra text between the article title and the article content deleted. 
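For comparison, replacing with nothing and getting a count of replacements can be sketched in Python with re.subn, which returns both the new text and the number of substitutions. The two-article sample is invented for illustration.

```python
import re

# Invented sample: one article starts with Author:, the other does not.
sample = (
    "Title One\nAuthor: A. Writer\nLinks: example\nFull text: Article one.\n"
    "Title Two\nPublication info: Daily Paper\nFull text: Article two.\n"
)

# An empty replacement string deletes whatever the pattern matched.
cleaned, count = re.subn(r"Author:.*?Full text:", "", sample, flags=re.DOTALL)

print(count)
print(cleaned)
```

As in the editor, the article that opens with Publication info: is left alone, and the space that followed Full text: stays behind at the start of the article line.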

Now let's take care of that one article that was the exception. Since it's just one, you could easily just delete it yourself manually, but let's say that this file had multiple times where Publication Info:  was the first line between the article title and article content. The neat thing about regular expressions, is when you've found one that works, it's very easy to customize that expression for other situations. 

We know that Author:.*?Full text: works, so to change the starting point of what the expression matches with from Author: to Publication Info: it's as easy as taking Author: out of the beginning of the regular expression and swapping in Publication Info: making it -
Publication Info:.*?Full text:

  • In the Find window, put Publication Info:.*?Full text: in the Find what: box and choose Find All in Current Document
  • The only result will be that second article and it'll highlight that text between the article title and article content. 
  • Now that you've verified that the expression is finding what you want it to, go to the Replace window and do the same thing you did above.
  • Tell it to Find:  Publication Info:.*?Full text: and Replace with no text. Now, that section is deleted from this article too.  
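Because only the start and end phrases change from one expression to the next, the "start, then .*?, then end" recipe can be wrapped in a small helper. This Python sketch is one way to do it; the function name between is an assumption, and re.IGNORECASE is used here on the assumption that Match case is unchecked in the editor.

```python
import re


def between(start, end):
    """Build a pattern matching start, everything up to the first end, and end."""
    # re.escape keeps any special characters in the phrases literal.
    return re.compile(
        re.escape(start) + r".*?" + re.escape(end),
        flags=re.DOTALL | re.IGNORECASE,
    )


# Invented sample for the article that has no Author: line.
sample = "Title Two\nPublication info: Daily Paper\nFull text: Article two.\n"

cleaned = between("Publication Info:", "Full text:").sub("", sample)
print(cleaned)
```

Swapping in a different start phrase is then a one-word change, which mirrors how the expression was adapted above.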

Scroll through the document, now there is only the metadata after the article but before the next article left for us to get rid of. We know that the formula of Beginning Of Text to Cut .*? End of Text to Cut to set up our regular expression worked before, so let's try it again with this text. 

  • From a scroll through the document, it looks like after the article content ends, the next line starts with Subject: This makes it seem like that's a good starting point for our regular expression.
  • Let's test whether that's true by doing a search for Subject: Make sure that Match case is checked. 
  • Go to the Find window and do a search for Subject: choosing Find All in Current Document. If you are using the sample file, you'll see that this time there are 11 hits, which is troubling because you know that you only have 10 articles.
  • Click to see where these results are placed and you'll see that the extra use of Subject: is just a second use of it within a section you want to cut, so it will wind up getting deleted anyway if we make a cut that starts with Subject: and ends right before the title of the next article.
  • Speaking of where to set the end point, that will be the line separator that occurs between the last line of metadata from one article and the title of the next one, so:  ____________________________________________________________ 
  • We just saw how to set up a regular expression so it does an inclusive match of everything between two phrases, so let's reuse the one from before and swap in the new start and end points around the magic bit that matches everything between them: .*?
  • Open the Find window again. Make sure it's still set to use Regular expression as the search mode, but deselect Match case
  • Your new expression you'll put in is your new start point - Subject: - the regular expression symbols for matching anything between your start and end point - .*? and your line separator which you'll copy directly from the file. So in Find What put in the below and choose Find All in Current Document
    Subject:.*?____________________________________________________________ 
  • In our sample file the results should be 10, click through them to make sure that the highlighted section matches what we want to delete. 

Now you're ready to get rid of the extra information at the bottom of the articles but in this instance, let's say you want to keep the line separator between articles so it'll be easier later to tell the divisions between the articles if you are curious after seeing your text analysis results. That'll change what we put in the Replace box.

  • Switch to the Replace window. For the Find what: box, paste in the regular expression you used to find the results you liked - Subject:.*?____________________________________________________________ 
  • In the Replace with: field, paste in a copy of the line separator. Now instead of replacing the text you didn't want with a blank space, it will replace it with a line separator.

  • Now choose Replace All and the document will shift to just contain the title and content of an article. 
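Here is how the same "replace with the separator" step might look in Python. The sample text is invented, and the separator is assumed to be a run of 60 underscores; in practice you would copy the exact separator from your own file, as described above.

```python
import re

# Assumption: the separator is 60 underscores. Copy the real one from your file.
SEPARATOR = "_" * 60

# Invented sample, including a second Subject: inside the section to be cut.
sample = (
    "Full text: Article one.\n"
    "Subject: Letters\n"
    "Location: Somewhere\n"
    "Subject: Again\n"
    + SEPARATOR + "\n"
    "Title Two\n"
)

pattern = re.compile(r"Subject:.*?" + re.escape(SEPARATOR), flags=re.DOTALL)

# Replacing with the separator (not with nothing) keeps the article divisions.
cleaned = pattern.sub(SEPARATOR, sample)
print(cleaned)
```

The second Subject: line disappears along with the rest of the block, since the match that starts at the first Subject: runs through to the separator.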

There's still some spot cleaning to be done. It looks like one article still has Share: between the content of the article and the line separator marking the beginning of the next one, and at the very end of the document, there's some information about contacting Proquest, but the first one only occurs a few times and the second only once so they will be easy to edit out by hand.

This took some time to figure out the first time we tried it, but now that you know a bit about how regular expressions work, you can adapt them to other text documents. If you have a bunch of different Proquest files that you know are set up like this, you can even record the steps and play them back on those files, so you only have to work through them by hand once. This function is called a Macro.

Recording Your Steps as a Macro in Notepad++

Making a macro is a way of automating a series of repetitive tasks. Executing a macro is like giving Notepad++ a list of steps you want it to execute on a piece of text and having it do those tasks for you. If you know that you have a series of texts that you want to have the same kinds of actions performed on - like if instead of this one Proquest batch file you had a few dozen of them - you can record the steps that you want to be taken with each Proquest batch file, save that recording as a macro, and apply the macro to any number of other files. 

  • Go back to the original document, the one that still contained all the text you wanted to edit out. Save it as the file name plus _ForMacro at the end. In this case, mine will be ProQuest_SampleFile_ForMacro
  • We'll be setting up a macro recording and then going through each of the steps we figured out would clean this text file into something suitable for use in text analysis. This way Notepad++ will be able to save these steps, and we can apply them to other Proquest documents that we want to clean up. 
  • Go to the Macro item on the menu bar and choose Start Recording. Run through the steps we took to clean this text file while it is recording
    • Do a Find and Replace with Regular expression selected as the search mode. Put Author:.*?Full text: in the Find what: box and nothing in the Replace with: box. Choose Replace All
    • Still with that window open in the Regular expression search mode, put Publication Info:.*?Full text: in the Find what: box and leave Replace with: blank. Choose Replace All
    • Lastly, search for Subject:.*?____________________________________________________________  and replace it with the line separator ____________________________________________________________
  • If you scroll through the document, you'll see that all of the edits that you automated with the regular expressions have been done. There are still a few that need to be done by hand, but where those edits are in a document will vary from file to file so you wouldn't want to record those deletions as part of the macro. 
  • Go back up to Macro and choose Stop Recording
  • Since you want to be using this macro again, from the Macro menu choose Save Currently Recorded Macro
  • A window will pop up asking you to give this macro a name and if you wanted you could even assign a shortcut to it. If you don't assign a shortcut, it'll still remain available to you in the Macro menu, which is good enough for me, so I'll just call this macro ProquestToTextOnly and select OK

We've now recorded a macro that automates all the steps that we figured out on how to use regular expressions to clean up a Proquest batch download file. To see what this macro does, let's run it on a file that needs to be edited.

  • Open up the original file that hasn't had any changes made to it.
  • Go to the Macro menu and select ProquestToTextOnly from the list of macros. You'll see that after you do so, it applies all those batch edits that we recorded to this file. 
  • This macro will remain in your Notepad++ program, and you can access it whenever you want. So once you've found something that works to edit a kind of file, you just need to create a macro out of it, and you can apply it over and over. 

Keep in mind when you are creating a macro, you want to be very sure that it isn't going to mess up the documents that you'll apply it to. You might want to try it on a few different examples of files you want to batch edit to make sure there aren't any unexpected consequences to the edits it applies that you don't like. In some cases, it might make sense to record a few smaller macros, rather than one large one and run them one after the other.
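Outside Notepad++, the same idea of recording steps once and replaying them on many files could be sketched as a short Python script. This is not how Notepad++ macros work internally; it is only a rough equivalent. The function names, the 60-underscore separator, the _Edit suffix and the use of re.IGNORECASE (standing in for an unchecked Match case box) are all assumptions for this example.

```python
import re
from pathlib import Path

# Assumption: the separator is 60 underscores. Copy the real one from your file.
SEPARATOR = "_" * 60

# The three find-and-replace steps from the tutorial, in order.
STEPS = [
    (r"Author:.*?Full text:", ""),
    (r"Publication Info:.*?Full text:", ""),
    (r"Subject:.*?" + SEPARATOR, SEPARATOR),
]


def clean_proquest_text(text):
    """Apply every step in turn and return the cleaned text."""
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text, flags=re.DOTALL | re.IGNORECASE)
    return text


def clean_file(path):
    """Write the cleaned text to <name>_Edit<ext>, leaving the original alone."""
    source = Path(path)
    target = source.with_name(source.stem + "_Edit" + source.suffix)
    cleaned = clean_proquest_text(source.read_text(encoding="utf-8"))
    target.write_text(cleaned, encoding="utf-8")
    return target
```

Calling clean_file on each file in a folder would then do what running the macro on each open file does in the editor. As with the macro, it is worth trying it on a few sample files first.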

Learning Goals

Dialogue in film or television can be interesting material for text analysis. Sites such as Opensubtitles.org will let you download the subtitles (.srt) files for some movies or television series. Though an srt file doesn’t contain information about who is speaking which line of dialogue, if you want to analyze one movie, or a group of them using just the lines of dialogue spoken within them, the srt files can be a great source to draw from.

SRT files are designed to be used in conjunction with a piece of media and so along with the lines of dialogue, the file also contains the timecodes and order in which that dialogue should appear as subtitles when the video is played. While this additional numeric information is useful to your computer or other player being used to watch the movie, it isn’t useful to us if we want to analyze just the text. If this numeric information is still included in the file, our methods of analysis could be thrown off by the numbers or timecodes also present within the file.

Thankfully for us, because srt files are intended to be used by multiple different kinds of software, their format is standardized. This will make it easy for us to use regular expressions to find and delete the timecodes and other numeric indicators from the file, and then to apply the same group of regular expressions to multiple files.

In this tutorial, we’ll take an srt file, delete the timecode and other numeric data from it using regular expressions, reformat the irregular spacing within it, and turn it into a file just containing the text of the dialogue and captions within the movie.

Getting Started

  • Download Notepad++ if you do not have it already. It is a free text editor that has a lot of helpful tricks and can be used to write normal text as well as a variety of scripting formats. It also has lots of ways you can batch edit your document, as you'll see now. Unfortunately, it's not available for Mac, but Atom or Sublime have similar functions.
  • If you went to OpenSubtitles.org already to look for an SRT file that you want to analyze, please open that file now in Notepad++. Otherwise, download the file below from a 1920s movie and open it in Notepad++ (Note, that my version ends in .txt instead of .srt because otherwise it can't get uploaded to this guide. Yours that you get from OpenSubtitles.org will end in .srt)

Identifying Patterns

Regular Expressions are a way of summing up the kind of words, numbers or blocks of text that we want to single out in a file based on what patterns those words, numbers or text blocks fit or not. In this case what we'll be summing up is what we want to delete. We can do this using a group of symbols that describes the pattern that these groups of text match. SRT files have a standardized way of expressing the time and way that subtitles occur so there is a pattern that can be found in how those bits of numbers we want to take out are written. If we can identify the pattern of those irrelevant text blocks and figure out how to express that pattern to the program as a regular expression, the text editing program can do what would be repetitive editing work for us and take those sections out all at once.

  • Scroll through the LadyWindermeresFan_1925.txt file in Notepad++ to see what jumps out as material that isn't just the text of the dialogue that you'd like to take out. Also be on the lookout for formatting issues to be fixed. 
  • Here are the portions of the subtitles that I want to take out in order to make this document more understandable by text analysis programs
    • The running numbering system before each subtitle. It starts at 1, goes on the line right before the timecode and goes up one for every new subtitle block.
    • The timecode before each subtitle saying when in the movie it occurs and when it disappears off the screen.
    • The tags of <i> and </i> that go before and after certain sections to make them appear as italics on screen

[Image: a marking of where the text to be edited out appears]

  • Here are some formatting issues that I'd like to solve
    • The empty lines between each subtitle
    • Sentences that are divided up by line breaks within one subtitle, or in some cases into separate subtitle blocks. Since a commonly tracked item within text analysis is a thing called n-grams (words that are used next to each other) if words that would normally be separated by a space are instead separated by a line break, that will eliminate some of the results we might otherwise get of common two, three or more word phrases. 

Now that we've decided which parts of the SRT file we want to get rid of or formatting we'd like to change, it's time to figure out what regular expressions we can use to match those parts so we can edit them as a batch. 

Writing Regular Expressions To Match Parts To Be Deleted

We've gone through and seen the pattern of the parts of this subtitle file that we want to cut out. Now it's time to figure out how to sum them up as regular expressions so we can get Notepad++ to cut them out of the document. 

  • Before we do anything to change our file, save the file we are working with as a new version of the original file. This way we have the old version to go back to if anything goes wrong. I'm going to save it as LadyWindermeresFan_1925_Edit.
  • First, let’s take a look at the timecode. 
    00:00:16,357 --> 00:00:29,324 is the first example, and by scrolling through the text file, we'll see that all the time codes are in the same format. The first group of numbers is the cue for the exact millisecond the subtitle should show up on screen when the video is played, and the second group of numbers is the cue for the millisecond it should disappear from the screen.

To translate that timecode into more basic and abstract terms (so we can sum it up as a regular expression),  it’s 2 digits, followed by a : then 2 more digits, followed by another : then 2 more digits, this time followed by a , then the last 3 digits and then an arrow made up of --> after which the sequence of digits, colons and a comma repeats once more. So we’ll need a way to summarize that pattern with a regular expression.

Fortunately, by consulting the cheatsheet linked to above we find a couple of things that can help us. For starters, there is a symbol that means a digit, which is \d. That means we could just copy the code above and substitute in \d every time there is a digit. That would work just fine, but there is an even shorter way to do it.

There are also quantifiers that can be added to any symbol. They are much like the wildcards used in the exercise about the Proquest file. One way of using a quantifier is by placing a number in curly braces directly after the character. This indicates that you are trying to find exactly that many occurrences of that character next to each other. If you just search for \d it will find an individual digit. If you search for \d\d it will look for two digits next to each other, and if you search for \d{2} it will do the same thing as \d\d: look for two digits next to each other.
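If you'd like to try the quantifier outside of Notepad++, here is a small sketch using Python's built-in re module, which supports the same \d and {2} syntax. The sample sentence is made up for illustration.

```python
import re

sample = "Room 7, seat 42, year 1925"

# \d on its own matches one digit at a time
single_digits = re.findall(r"\d", sample)

# \d\d and \d{2} both mean "exactly two digits in a row"
pairs_written_out = re.findall(r"\d\d", sample)
pairs_with_quantifier = re.findall(r"\d{2}", sample)

print(single_digits)
print(pairs_written_out)
print(pairs_with_quantifier)
```

Both two-digit searches return the same list, and the lone 7 is skipped because it has no second digit beside it.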

Now, we should be able to just replace all the numbers in that time code with the regular expression marking out the number of digits that we want it to look for. But since there are other punctuation characters within the time code (: , - and >), we should double check that none of them are considered special characters.

Special characters are characters that have a special meaning in regular expressions, though the rest of the time you'd recognize them as ordinary punctuation, like . or ? or (). If special characters exist in the phrase that you are looking for, you have to “escape” them with a backslash (\). Fortunately, none of the punctuation characters in the time code (: , - and >) are on the special characters list (though if the - were inside square brackets, it would need to be escaped according to the special characters list).
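Here is a short sketch of what escaping changes, again using Python's re module as a stand-in for Notepad++'s Regular Expression mode. The sample sentence is invented for the example.

```python
import re

line = "Yours very truly, Edith Erlynne. Is that so?"

# Unescaped, . is a special character meaning "any character",
# so it matches every character in the line
any_char_hits = re.findall(r".", line)

# Escaped with a backslash, \. and \? match only the literal punctuation
period_hits = re.findall(r"\.", line)
question_hits = re.findall(r"\?", line)

# The arrow from the time code needs no escaping at all
arrow_hits = re.findall(r"-->", "00:00:16,357 --> 00:00:29,324")

print(len(any_char_hits), "matches for an unescaped .")
print(period_hits, question_hits, arrow_hits)
```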

  • So, just take the time code example above:
    00:00:16,357 --> 00:00:29,324
  • Wherever there is a series of digits, replace it with the code for a digit, \d, and a number in curly braces indicating how many digits you want it to find in a row. Leave the rest of the characters around it as they are and you’ll wind up with:
    \d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}
  • Now, let's see if this actually finds what we want it to. Open the Find window in Notepad++, make sure that you have Regular Expressions selected as the Search Mode, and then in the Find what: field paste in 
    \d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}
  • Choose Find All in Current Document and scroll through the results. You should see that all of the results are formatted like the time code and that when you click from one hit to the next, you are skipping down to the next instance of time code. This tells us that we've written the right regular expression, since it (a) matches everything we want it to match and (b) doesn't match anything we don't want it to match.
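The same check can be sketched in Python. The regular expression is the one built above. The sample text imitates the start of an SRT file, and the second time code in it is invented for illustration rather than copied from the real file.

```python
import re

# The time code pattern from the steps above
TIMECODE = r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}"

# A small stand-in for the subtitle file (second time code is made up)
sample = (
    "1\n"
    "00:00:16,357 --> 00:00:29,324\n"
    "\"LADY WINDERMERE'S FAN\"\n"
    "\n"
    "2\n"
    "00:00:31,105 --> 00:00:35,210\n"
    "Lady Windermere faced the\n"
    "grave problem-- of seating\n"
    "her dinner guests.\n"
)

# Equivalent of Find All in Current Document
matches = re.findall(TIMECODE, sample)
for match in matches:
    print(match)
```

Only the two time code lines are returned. The sequence numbers and the dialogue are left alone, which is the same thing we confirmed by clicking through the results in Notepad++.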

Now that we're sure that the regular expression we've written finds what we want it to, it's time to use it to cut out what we no longer want in the file.

  • Move over to the Replace tab of the window. Under Find what: make sure it has the regular expression that you've been using \d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3} and where it says Replace with: leave the box blank and choose Replace All 
  • This will eliminate the time codes from the document
  • Next, let's get rid of the running numbering for what number each subtitle is. To be sure of how we can do this, let's take a look at the formatting of the text. By clicking on the show all characters symbol at the top of the window, the one that looks like a backwards P, we'll see all the hidden aspects of the document, like what kind of line breaks and spacing are present.
  • At the end of each line there is a CR LF, which stands for carriage return and line feed and together marks a line break. More on that in the formatting section, but the main reason that we turned this on now is to see what the pattern is for the numbers. One of the things that Show All Characters makes visible is where there are and are not spaces, by putting a little orange dot wherever there is a space between words. One place where that orange dot isn't... is between the number marking the beginning of a subtitle and the end of the line.
  • Now, we already know that we can just look for digits, but if we just get rid of all the digits in the document in general, then it will also get rid of them where they might appear for totally legitimate reasons within the dialogue of the movie. But by turning on the other characters we can see that digits that are part of the numbering of the subtitles occur at the beginning of a line and then have no spaces between them and the end of the line. This would differentiate them from ones that may occur within a sentence of dialogue.
  • So from an inspection, it looks like the pattern that would find the sequence number would be to look for digits that occur at the beginning of a line, with no text or spaces between them and the end of the line
  • Going back to the MIT cheatsheet, we find there are special characters that can be used to indicate the beginning and ending of lines
    • ^ for the beginning of the line
    • $ for the ending of one
  • Similar to before we’ll create a phrase that sums up the pattern of what we want to find which is ^ the beginning of a line,  \d+  which will match with one or more repetitions of a digit, and $ for the ending of a line.
  • It will look like this
    ^\d+$
  • Let’s make sure this works by going to the Find window and plugging ^\d+$ into Find what: and then choosing Find All in Current Document. Make sure that Regular Expressions is still selected as the Search Mode.
  • By scrolling through the list of what was found, we confirm that our regular expression does match with each of the numbers that starts off a subtitle, and only those instances of numbers in the document. 
  • Now that we've verified that this will find us what we want to get rid of and nothing that we don't, let's go to Replace and put in our regular expression for the number at the beginning of a subtitle ^\d+$ where it says Find What and put in what we want to be in its place, which is nothing, in the Replace with box and choose Replace All
  • After this our document will just contain the text of the subtitles, with no timecodes or numeric data, but there's one more thing we'll want to eliminate from this text: the tags that mark some of the subtitles as italics, <i> and </i>. This batch edit is actually pretty straightforward and you won't need any kind of regular expression skills for it.
  • Open the Find window and go to the Replace tab. Make sure the Search Mode is set to Normal as you're going back to searching simply for the literal text you want to find. In this case, put <i> in the Find what field and leave Replace with blank. Choose Replace all and watch the beginnings of the italics tags disappear
  • Follow the same procedure for the other half of an italics tag. Find </i> and replace it with nothing
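The three deletions above (time codes, sequence numbers and italics tags) can be sketched in Python as follows. This is an illustration under a few assumptions, not a replacement for the Notepad++ steps: the sample dialogue is invented so that it contains a number in mid-sentence, and the text is assumed to use plain \n line endings, which is what Python gives you when it reads a text file in the usual way.

```python
import re

# Invented sample: a sequence number, a time code, an italics tag,
# and a legitimate number (12) inside the dialogue
sample = (
    "1\n"
    "00:00:16,357 --> 00:00:29,324\n"
    "<i>Lady Windermere faced the</i>\n"
    "grave problem of seating 12 guests.\n"
)

TIMECODE = r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}"

# Step 1: replace every time code with nothing
text = re.sub(TIMECODE, "", sample)

# Step 2: remove lines made up only of digits. re.MULTILINE makes
# ^ and $ match at the start and end of every line, not just the
# start and end of the whole text.
text = re.sub(r"^\d+$", "", text, flags=re.MULTILINE)

# Step 3: the italics tags are literal text, so no regex is needed
text = text.replace("<i>", "").replace("</i>", "")

print(text)
```

The 12 in the dialogue survives because it does not sit alone on a line, which is the whole point of anchoring the pattern with ^ and $. The empty lines left behind are dealt with in the formatting steps.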

Your file should now have only the text of the subtitles in it, though there are a lot of empty lines and sometimes sentences are spread over multiple subtitles. We'll figure out how to change that next.

Batch Find-And-Replace to Fix Formatting Problems

  • If you don't still have Show All Characters turned on, please make sure it is turned on now. It will help you find the patterns of what line breaks are where you want them to be, and which you want to eliminate.
  • Speaking of unnecessary line breaks, Notepad++ actually has a function that will get rid of empty lines for you, which will make the formatting work that we will do next easier. Go to Edit on the menu bar, choose Line Operations and then select Remove Empty Lines
  • So let's look at the first group of lines of this text file and think about where we'd like the line breaks to go away. 

"LADY WINDERMERE'S FAN"
Lady Windermere faced the
grave problem-- of seating
her dinner guests.
Lord Darlington
"Lord Darlington."
"I presume you came
to see my husband."
Dear Lord Windemere--
This note from a total
stranger may surprise you--
...but is is important
that you see me at once
if you would avoid certain...
...unpleasant disclosures.
Yours very truly, Edith Erlynne.
"Lord Darlington is most
anxious to see you."
"You don't mind if
I run away?"

  • For this one it looks like there are some line breaks that occur in the middle of sentences. So we could decide to replace all line breaks with spaces instead. By experimenting on that smaller section of text, we'd get something like the below where the line breaks that were put in seemingly on purpose to differentiate between different speakers or different thoughts are also eliminated.

"LADY WINDERMERE'S FAN" Lady Windermere faced the grave problem-- of seating her dinner guests. Lord Darlington "Lord Darlington." "I presume you came to see my husband." Dear Lord Windemere-- This note from a total stranger may surprise you-- ...but is is important that you see me at once if you would avoid certain... ...unpleasant disclosures. Yours very truly, Edith Erlynne. "Lord Darlington is most anxious to see you." "You don't mind if I run away?"

  • Let's see what it looks like to keep just the line breaks that occur after a period. That gets us a little closer. However, items in quotes remain on the same line, and ellipses at the end of a line still have a line break after them, so the result below doesn't look like the best way to organize the lines of dialogue.

"LADY WINDERMERE'S FAN"
Lady Windermere faced the grave problem-- of seating her dinner guests.
Lord Darlington
"Lord Darlington." "I presume you came to see my husband." Dear Lord Windemere-- This note from a total stranger may surprise you-- ...but is is important that you see me at once if you would avoid certain...
...unpleasant disclosures.
Yours very truly, Edith Erlynne.
"Lord Darlington is most anxious to see you." "You don't mind if I run away?"

  • Let's see what happens when we get rid of line breaks after ellipses but keep line breaks that occur after a single period and ones that occur after quotation marks. 

"LADY WINDERMERE'S FAN"
Lady Windermere faced the grave problem-- of seating her dinner guests.
Lord Darlington
"Lord Darlington." 
"I presume you came to see my husband."
Dear Lord Windemere-- This note from a total stranger may surprise you-- ...but is is important that you see me at once if you would avoid certain......unpleasant disclosures.
Yours very truly, Edith Erlynne.
"Lord Darlington is most anxious to see you." 
"You don't mind if I run away?"

  • This looks much closer to what we want: sentences from different subtitles are rejoined onto one line, but separate statements stay separate. So let's use the find and replace tools to
    • Delete line breaks after an ...
    • Preserve line breaks after a . or "
    • Delete all other line breaks
  • The good news is that line breaks can be found using Notepad++'s Extended search mode rather than its Regular Expression one, so we can use that as the search mode and don't have to worry about how to escape all these special characters.
  • The little icons that say CR LF at the end of each line mark a carriage return and a line feed. You can search for these within the text file using \r (carriage return) and \n (line feed) in the Extended Search Mode. Together they are just how the program represents the equivalent of pressing Enter and moving on to a new line.
  • Give it a shot. Since the first thing we want to do is get rid of any line breaks that occur after an ellipsis (...), let's find the matches for that. Go to the Find window, and make sure that Search Mode is set to Extended. In Find what: write ...\r\n and choose Find All in Current Document
  • From the search results, we can be sure that this is the targeting we want. Since we want to get rid of the line break from that match (the \r\n part), let's go over to the Replace tab and for Find what: enter ...\r\n and for Replace with: put in ... (just the three periods). Then choose Replace All
  • Some of the line breaks have now disappeared. There is the unfortunate side effect that, in some places, getting rid of a line break after an ellipsis meant that two ellipses joined together
    "I am sorry my maid forgot
    to empty the ash tray......but it happens even
    in society."
  • You can fix this pretty easily. Just go back to that Replace window and where it says Find what: put in ...... and select that you want to replace it with ...
  • We are part of the way to the reduced number of line breaks that we wanted. Next we want to preserve the line breaks that occur after a . or after a " but get rid of the others. The simplest thing to do is to mark the line breaks we'd like to preserve, rather than figure out every other letter or character configuration that might exist before a line break you don't want. That way we can get rid of all line breaks, but still be able to put back in the ones we want.
  • You'll do this by marking the periods and quotation marks that occur before a line break with a pipe or |. This makes it so when you get rid of all the line breaks the | will still mark where you want line breaks to go back in. I picked that mark | because it very rarely appears in a document. Just to be sure, let's do a search for it before we use it. Go to the Find window and under Find what: put in |  and do a search. It will tell you it cannot find that text. So we're good to use it as a placeholder without messing up anything original to the text.
  • Go back to the Find window and look for all the instances where a period is followed by a line break by putting in .\r\n in the Find what:  section. Click through a sampling of the results to make sure they match with areas that we'd like to have a line break remain
  • Next, mark it by going to the Replace window and adding .\r\n in the Find what: window and replacing it with .|\r\n
  • We'll follow the same pattern for the " that comes before a line break. First, go to the Find window and search for all the instances of "\r\n to make sure it finds what you want it to, and only what you want it to. Once you've confirmed this, go to the Replace window, enter "\r\n for Find what: and "|\r\n for Replace with and you'll see that now throughout the text the line breaks that you want to keep are now marked with a |
  • A few last steps on the line breaks. Now that we've put in the | as a bookmark for where we want to put certain line breaks back in later, let's go to the find and replace window and replace all line breaks with a space. Put in \r\n where it says Find what: and a space where it says Replace with: then choose Replace All. Now your whole document is contained within a single line. Don't worry, this is only temporary.
  • You'll notice that periodically throughout that single line there is the | symbol. This is where you marked where the line breaks should go. Put them back in with Find what: | and Replace with: \r\n
  • You'll see that the document is now sorted in a much more orderly way. You may still have to do some spot editing, but overall, a document that had lots of unnecessary text and formatting is now much more readable.
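For comparison, here is the same sequence of find-and-replace steps sketched in Python. Because Extended mode treats everything except \r and \n literally, plain string replacement is the closest equivalent. The function name rejoin_lines is made up for this example, and the sketch assumes the text uses \r\n line breaks, has already had its empty lines removed, and does not contain a | anywhere.

```python
def rejoin_lines(text):
    """Apply the line break steps from this section, in the same order."""
    text = text.replace("...\r\n", "...")    # delete line breaks after an ellipsis
    text = text.replace("......", "...")     # tidy ellipses that ran together
    text = text.replace(".\r\n", ".|\r\n")   # mark line breaks after a period
    text = text.replace('"\r\n', '"|\r\n')   # mark line breaks after a quotation mark
    text = text.replace("\r\n", " ")         # turn every line break into a space
    text = text.replace("|", "\r\n")         # restore the marked line breaks
    return text


sample = (
    '"I presume you came\r\n'
    'to see my husband."\r\n'
    "if you would avoid certain...\r\n"
    "...unpleasant disclosures.\r\n"
    "Yours very truly, Edith Erlynne.\r\n"
)

result = rejoin_lines(sample)
print(result)
```

One thing this makes easy to spot: each restored line begins with the space that replaced its original line break. That is the kind of leftover that spot editing would catch. Restoring from "| " (the pipe plus the following space) instead of just | is one possible way to avoid it.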

Recording Your Steps as a Macro in Notepad++

A macro is a way of automating a series of repetitive tasks. It's like giving Notepad++ a list of steps you want it to execute on a piece of text. If you have a series of texts that need the same actions performed on them (say, a few dozen subtitle files instead of this one), you can record the steps once, save that recording as a macro, and apply it to any number of other documents.

  • Go back to the original document, the one that still contained all the text you wanted to edit out. Save it as the file name plus _ForMacro at the end. In this case, mine will be LadyWindermeresFan_1925_ForMacro
  • We'll be setting up a macro recording and then going through each one of the steps we figured out on how to clean this text file into something suitable for use in text analysis. This way Notepad++ will be able to save these steps, and we can apply them to other subtitle documents that we want to clean up. 
  • Go to the Macro item on the menubar and choose Start Recording
  • The first step you took to clean this file was to get rid of all the time code by using a regular expression to find and replace with nothing, so go to the Replace window,  select Regular Expression as a search mode and choose to Find \d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3} and Replace it with nothing
  • With Regular Expression still selected as the search mode, choose to Find ^\d+$ and Replace with nothing
  • Lastly for the text edits, switch the Search Mode to Normal and Find <i> and replace it with nothing. Repeat this with </i>.
  • Next come the formatting changes. The first thing we did was get rid of all the empty lines by going to Edit, selecting Line Operations and choosing Remove Empty Lines
  • Get rid of line breaks after an ellipsis by switching the Search Mode to Extended and searching to Find ...\r\n and Replace with ...
  • Clean up the double ellipses that sometimes result by Finding ...... and Replacing with ... (three periods)
  • Mark the line breaks you want to preserve after a period by searching for .\r\n and Replacing with .|\r\n
  • Mark the line breaks you want to preserve after a quotation by searching for "\r\n and Replacing with "|\r\n
  • Get rid of all line breaks by searching for \r\n and replacing it with a space.
  • Finally, put back in the line breaks you wanted to preserve by Finding | and Replacing with \r\n. Your subtitle file should now look just like the one that we created previously
  • Go up to Macro and choose Stop Recording
  • In order to be able to use this on later occasions, you'll have to save it. Go to Macro and choose Save Currently Recorded Macro
  • A Shortcut window will pop up. Give the macro the name Cleaning a Subtitle File and choose OK. If you think this is something you'll use a lot, you can even assign it a keyboard shortcut. I won't be doing that, however.
  • To prove it works, go to the original subtitle file again of LadyWindermeresFan_1925 and go to the Macro menu and choose Cleaning a Subtitle File from the list of macros. You'll see that the file gets cleaned up just like the version of it you already created. That means any file you apply the macro to that is set up like this one (as most subtitle files will be) will function the same way. 
  • Exit this document without saving it.  Your macro will remain available in Notepad++ for you to use on later subtitle files. 

Keep in mind when you are creating a macro that you want to be very sure it isn't going to mess up the document, since executing it carries out all the steps at once rather than with breaks in between for you to assess how each edit looks. You might want to try the macro on a few different examples of the files you want to batch edit to make sure there aren't any bugs in the way that it is set up. In some cases, it might make sense to make a few smaller macros rather than one large one and run them one after the other.
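If you ever need to clean more files than is comfortable to open one by one, the same recorded steps could also be written as a short script. The following is a rough Python sketch of that idea, not a feature of Notepad++. The function names clean_subtitles and clean_folder, the .srt extension and the _Edit.txt suffix are all assumptions chosen for this example, and the text is handled with \n line breaks because that is how Python presents text files when reading them.

```python
import re
from pathlib import Path

TIMECODE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
SEQUENCE_NUMBER = re.compile(r"^\d+$", flags=re.MULTILINE)


def clean_subtitles(text):
    """Run the macro's steps, in order, on the text of one subtitle file."""
    text = TIMECODE.sub("", text)                        # remove time codes
    text = SEQUENCE_NUMBER.sub("", text)                 # remove subtitle numbering
    text = text.replace("<i>", "").replace("</i>", "")   # remove italics tags
    # Remove empty lines, keeping one line break at the very end
    text = "\n".join(line for line in text.split("\n") if line != "") + "\n"
    text = text.replace("...\n", "...")                  # join after an ellipsis
    text = text.replace("......", "...")                 # tidy doubled ellipses
    text = text.replace(".\n", ".|\n")                   # mark breaks after a period
    text = text.replace('"\n', '"|\n')                   # mark breaks after a quote
    text = text.replace("\n", " ")                       # collapse to a single line
    text = text.replace("|", "\n")                       # restore the marked breaks
    return text


def clean_folder(folder):
    """Clean every .srt file in a folder, writing a new _Edit.txt copy of each."""
    written = []
    for source in sorted(Path(folder).glob("*.srt")):
        cleaned = clean_subtitles(source.read_text(encoding="utf-8"))
        target = source.with_name(source.stem + "_Edit.txt")
        target.write_text(cleaned, encoding="utf-8")
        written.append(target)
    return written
```

Calling clean_folder on a folder of subtitle files would leave the originals untouched and write a cleaned copy beside each one, which mirrors the advice above about keeping the old version to go back to. As with a macro, it is worth trying this on a few sample files first and checking the results by eye.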