Last modified: Friday, 23-Feb-2024 19:59:22 UTC. Authored by: David J. Birnbaum (djbpitt at gmail.com). Edited and maintained by: Elisa E. Beshero-Bondar (eeb4 at psu.edu). Powered by firebellies.

Regex Exercise: Convert the text of Shakespeare’s sonnets into XML

Consult the following resources as you work with Regular Expressions:

Our newtFire tutorial on Autotagging with Regular Expressions (Regex)
Regular-Expressions.info Tutorial: a mine of helpful detail on regular expression matching.

Get the source text ready in <oXygen/>

The text we’ll be using as input for the first regex homework assignment is a plain-text version of Shakespeare’s sonnets produced by Project Gutenberg, which you can download from our site here: shakeSonnets.txt.

To download the file, go to File and Save as in your web browser, and choose a useful name and location on your computer to save the file. (We typically keep the .txt extension, and you might rename this as YourName_Regex1_sonnets.txt.)
Then open <oXygen/>, and open the file you saved.
Delete the lengthy Project Gutenberg publishing information from the beginning and end of the sonnets file, so that what you’re left with is just the sonnets in order, with roman numerals before each one.

Prepare a Step File

Next, open a new, separate text file, in which you will record each step you take in up-converting this document to XML. This needs to be a plain text (*.txt) or markdown (*.md) file and not something you write in a word processor (not a Microsoft Word document) so you do not have to struggle with autocorrections of the regex patterns you are recording.
Save this file as your main homework submission for this assignment, following our standard homework file naming conventions for upload to Canvas. We will duplicate the steps you record to make sure they work to up-convert the text file to XML. Suggestions: You can open a new plain text file in <oXygen/> by clicking New document (the folded piece of paper icon) and typing "text" in the search bar. On Windows, you can instead open Notepad and record your steps in plain text there, outside of <oXygen/>; this may be convenient because you won’t accidentally run your find-and-replace operations on your step file instead of the main text. On Mac, you might try TextEdit, or stick with <oXygen/> and open your window in Tile View as we did with your Relax NG Schema files.

The task

Your goal is to produce an XML version of the Shakespeare Sonnets file by using the search-and-replace techniques we discussed in class, and to record each step you take in a plain text or markdown file so others can reproduce exactly what you did. (In a real-life project, you may need to share the steps you take in up-converting plain text documents to XML on your GitHub repo, written in GitHub’s markdown, the same that we write on the GitHub Issues board. In that case you would save the file with a .md extension.)

Your up-converted XML output should look something like http://dh.obdurodon.org/shakespeare-sonnets.xml. That is, each sonnet should be its own element, each line should be tagged separately, and the roman numerals should be encoded in a useful way (we’ve used attributes, but you could also put them in a child element).

        <xml>
           <sonnet number="I">
               <line>From fairest creatures we desire increase,</line>
               <line>That thereby beauty's rose might never die,</line>
               <line>But as the riper should by time decease,</line>
               <line>His tender heir might bear his memory:</line>
               <line>But thou contracted to thine own bright eyes,</line>
               <line>Feed'st thy light's flame with self-substantial fuel,</line>
               <line>Making a famine where abundance lies,</line>
               <line>Thy self thy foe, to thy sweet self too cruel:</line>
               <line>Thou that art now the world's fresh ornament,</line>
               <line>And only herald to the gaudy spring,</line>
               <line>Within thine own bud buriest thy content,</line>
               <line>And tender churl mak'st waste in niggarding:</line>
               <line>Pity the world, or else this glutton be,</line>
               <line>To eat the world's due, by the grave and thee.</line>
           </sonnet>
           <sonnet number="II">
               <line>When forty winters shall besiege thy brow,</line>
               <line>And dig deep trenches in thy beauty's field,</line>
               <line>Thy youth's proud livery so gazed on now,</line>
               <line>Will be a tatter'd weed of small worth held:</line>
               <line>Then being asked, where all thy beauty lies,</line>
               <line>Where all the treasure of thy lusty days;</line>
               <line>To say, within thine own deep sunken eyes,</line>
               <line>Were an all-eating shame, and thriftless praise.</line>
               <line>How much more praise deserv'd thy beauty's use,</line>
               <line>If thou couldst answer 'This fair child of mine</line>
               <line>Shall sum my count, and make my old excuse,'</line>
               <line>Proving his beauty by succession thine!</line>
               <line>This were to be new made when thou art old,</line>
               <line>And see thy blood warm when thou feel'st it cold.</line>
           </sonnet>
           ...
       </xml>

Your Steps file needs to be detailed enough to indicate each step of your process: what regular expression patterns you attempted to find, and what expressions you used to replace them. You might record the number of finds you get and even how you fine-tuned your steps when you were not finding everything you wanted to at first. Note: we strongly recommend copying and pasting your find and replace expressions into your Steps file instead of retyping them, since retyping makes it easy to introduce errors.

How to proceed

There are several ways to get to the target output, but the starting points are standard:

Starting work:

First of all, for any up-conversion of plain text, you must check for the special reserved characters: the ampersand & and the angle brackets < and >. You need to search for those and, if they turn up, replace them with their corresponding XML entities, so that they will not interfere with well-formed XML markup.

Search for:	Replace with:
`&`	`&amp;`
`<`	`&lt;`
`>`	`&gt;`

Note that you need to process the special XML reserved characters in the correct order. Why is it important that you search and replace the & first?
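Outside of <oXygen/>, the same three replacements can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample string is hypothetical (the real sonnets file may contain none of these characters), and `&amp;`, `&lt;`, and `&gt;` are the standard XML entities for the three characters.

```python
# Hypothetical sample line; not taken from the sonnets file.
sample = "Tom & Jerry <3 x > y"

# Apply the replacements in the order the table lists them: ampersand first.
escaped = sample.replace("&", "&amp;")
escaped = escaped.replace("<", "&lt;")
escaped = escaped.replace(">", "&gt;")

print(escaped)  # Tom &amp; Jerry &lt;3 x &gt; y
```

Try moving the ampersand replacement to the end and compare the output; that experiment should suggest an answer to the question above.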

Don’t worry about the title and author at the top of the file just yet. You will eventually tag them by hand, and we recommend just doing that at the end of the up-conversion process. You’ll be using <oXygen/>’s global Find-and-Replace tool to tag the sonnets, and if you leave the title and author in place while you do that, you’ll wind up tagging them incorrectly. That isn’t a problem as long as you remember to fix them manually at the end. Or you could remove them now to another file to paste them back in at the end of the regex autotagging process.

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet. Be sure that Dot matches all is unchecked, though; we’ll explain why below.

Leading space characters

The non-blank lines all begin with space characters: there are two spaces before most lines (the Roman numerals and the first twelve lines of each sonnet) and four spaces before the last two lines of every sonnet. Those spaces are presentational formatting, and not part of the content of the text, and since we don’t need them in order to tag the text, we’ll start by deleting them. The regex to match a space character is just a space character, and you can match one or more space characters by using the plus sign repetition indicator. To match one or more instances of the letter X, you would use a regex like X+. To match one or more instances of a space character, just replace the X with a space.

You don’t want to remove all space characters, though; you just want to remove the ones at the beginning of a line. You can do that by using the caret metacharacter, which anchors a match so that it succeeds only at the beginning of a line. For example, if the regex X+ matches one or more instances of X, the regex ^X+ matches one or more instances of X only at the beginning of line. You can use this information to match one or more space characters at the beginning of a line and replace them with nothing, that is, delete them.
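To see the difference the caret makes, here is a minimal Python sketch that uses the letter X, as in the explanation above, rather than the space character. The three-line sample is hypothetical. One assumption to note: Python's `re` module needs the `re.MULTILINE` flag for `^` to anchor at the start of every line, which is a detail of Python and not something you set in the <oXygen/> dialog.

```python
import re

# Hypothetical three-line sample built around the letter X.
sample = "XXabcXX\nabcXX\nXabc"

# X+ matches runs of X anywhere in the text.
anywhere = re.findall(r"X+", sample)

# ^X+ matches runs of X only at the beginning of a line.
# re.MULTILINE makes ^ anchor at every line start in Python.
at_line_start = re.findall(r"^X+", sample, flags=re.MULTILINE)

# Replacing the anchored match with the empty string deletes it.
trimmed = re.sub(r"^X+", "", sample, flags=re.MULTILINE)

print(anywhere)       # ['XX', 'XX', 'XX', 'X']
print(at_line_start)  # ['XX', 'X']
print(trimmed)
```

Swapping the X for a space gives you the pattern this step of the exercise calls for.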

We can always choose whether to work with blank lines or not. For our purposes in this exercise, we do not need them, so you can delete them if you’d like, or you can leave them in place to enhance the legibility. To delete them, you need to match a blank line, and the easiest way to do that is to match two new line characters in a row and replace them with a single new line character. The regex for a new line character is \n. Try it.
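As a rough illustration of collapsing blank lines, the following Python sketch replaces two consecutive new line characters with one. The sample text is a hypothetical fragment with the leading spaces already removed.

```python
import re

# Hypothetical fragment: a numeral, a blank line, two lines of verse,
# another blank line, and the next numeral.
sample = (
    "I\n"
    "\n"
    "From fairest creatures we desire increase,\n"
    "That thereby beauty's rose might never die,\n"
    "\n"
    "II\n"
)

# Two new lines in a row mean a blank line; replace them with one new line.
collapsed = re.sub(r"\n\n", "\n", sample)

print(collapsed)
```

In this Python sketch, a run of three or more new line characters would still leave a blank line behind after a single pass, so the replacement may need to be repeated until no match remains.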

Inside out or outside in?

We can create our markup either from the outside in (document, then sonnet, then divide the sonnet into Roman numeral and lines) or from the inside out (lines and Roman numeral, then wrap those in a sonnet, then wrap all of the sonnets in a document). Either strategy can be made to work, but we generally find it easier to work from the inside out. (When we work from outside in, it’s easy to wind up incorrectly wrapping <line> tags around the <sonnet> start and end tags, etc.)

Lines

We’ll start by tagging every line as a <line>. This will erroneously tag the Roman numerals as if they were lines of poetry, which they aren’t, but since we're using the inside-out method, we are just planning to correct those Roman numeral lines later.

We don’t want to tag any blank lines (if we left them in), though, so we need a regex that matches only lines that have characters in them. Check your <oXygen/> Find / Replace setup: make sure that Dot matches all is unchecked! In this mode only, the dot (.) matches any character except a new line, which means that we can use the plus sign repetition indicator to match one or more instances of any character except a new line (that is, .+). By default regex selects the longest possible match, so even though just two characters on a line will match the pattern, when we run it it will always match the entire line. Since the dot matches any character except a new line, the regex will match each line individually, that is, it won’t run over a new line and continue the same match. Try it and examine the results. Now check Dot matches all, run Find all, and look at those results. Notice that the match no longer stops at the end of the line, and since you want to tag each line individually, you need to uncheck that box to revert to the normal, default behavior.
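The effect of Dot matches all can also be sketched in Python, where the corresponding switch is the `re.DOTALL` flag. The two-line sample with a blank line in between is hypothetical, and the flag name is specific to Python.

```python
import re

# Hypothetical sample: two lines of verse separated by a blank line.
sample = (
    "When forty winters shall besiege thy brow,\n"
    "\n"
    "And dig deep trenches in thy beauty's field,"
)

# Default behavior: the dot stops at a new line, so each non-blank line
# is matched separately and the blank line is skipped.
per_line = re.findall(r".+", sample)

# With re.DOTALL the dot also matches new lines, so the greedy .+
# runs through the whole text in one match.
across_lines = re.findall(r".+", sample, flags=re.DOTALL)

print(len(per_line))      # 2
print(len(across_lines))  # 1
```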

A human might think of our task as wrap every line in <line> tags, but regex has a find-and-replace view of the world, so a regex way to think about it would be match every line, delete it, and replace it with itself wrapped in <line> tags. That is, regex doesn’t think about leaving the line in place and inserting something before and after it; it thinks about matching the line, deleting it, and then putting it back, but with the addition of the desired tags. The regex selects and matches each full line, but how do we write what we selected into the replacement string? The answer is that the sequence \0 in the replacement pattern means the entire regex match, and you can use that to write the matched line back into the replacement, but wrapped in <line> tags. Try it.
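The match-delete-and-put-back idea can be sketched in Python as well. Note the assumption about syntax: Python's `re` module writes the whole-match reference as `\g<0>` in the replacement string, so that is what this sketch uses in place of the `\0` described above. The sample text is hypothetical.

```python
import re

# Hypothetical sample: two lines of verse separated by a blank line.
sample = (
    "Proving his beauty by succession thine!\n"
    "\n"
    "This were to be new made when thou art old,"
)

# Match each non-blank line and write it back wrapped in <line> tags.
# \g<0> is Python's spelling of "the entire match".
tagged = re.sub(r".+", r"<line>\g<0></line>", sample)

print(tagged)
```

Because `.+` requires at least one character, the blank line is never matched and so is left untagged.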

Roman numerals

The Roman numerals are now erroneously tagged as if they were lines of poetry, and in our sample output at http://dh.obdurodon.org/shakespeare-sonnets.xml we want them to be attribute values. To start that process we need to think about how to distinguish a Roman numeral line from a real line of poetry. Since there are 154 sonnets, a Roman numeral line is a line that contains one or more instances of I, V, X, L, and C in any order and nothing else, and no real line of poetry matches that pattern. That means that we can match that pattern by using a regex character class, which you can read about at http://www.regular-expressions.info/charclass.html. This approach will match sequences that aren’t Roman numerals, like XVX, but those don’t occur, so we don’t have to worry about them. This illustrates a useful strategy: a simple regex that overgeneralizes vacuously may be more useful than a complex one that avoids matching things that won’t occur anyway. You can use the character class (wrapped in square brackets) followed by a plus sign (meaning one or more) to complete your regex so that it matches only <line> elements that contain a Roman numeral and nothing but a Roman numeral. Try it.
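To get a feel for what the character class accepts and rejects, here is a small Python sketch that tests bare strings (without the surrounding <line> tags, which are left for you to add to your own pattern). The candidate strings are hypothetical, and `fullmatch` is used here to stand in for the "and nothing else" requirement.

```python
import re

# One or more of the letters I, V, X, L, C, in any order.
roman = re.compile(r"[IVXLC]+")

# Hypothetical candidates: real numerals, a non-numeral made of the same
# letters, and two strings that contain other characters.
candidates = ["I", "XVIII", "CLIV", "XVX", "I love thee", "Increase"]

# fullmatch succeeds only if the whole string fits the pattern.
results = {c: bool(roman.fullmatch(c)) for c in candidates}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

As the paragraph above predicts, XVX is accepted even though it is not a Roman numeral, which is harmless because no such line occurs in the file.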

In this case you want to write the Roman numeral into the replacement string, but you want to get rid of the spurious <line> tags and replace them with other markup. \0 will write the entire match into the replacement, but that would include the original <line> tags that you want to remove. To capture part of a regex match, you wrap it in parentheses; this doesn’t match parenthesis characters, but it does make the part of the regex that’s between the parentheses available for reuse in the replacement string. For example, a(b)c would match the sequence abc and capture the b in the middle, so that it could be written into the replacement. Capturing a single literal character value isn’t very useful because you could have just written the b into the replacement literally, but you can also capture wildcard matches. For example, a(.)c matches a sequence of a literal a character followed by any single character except a new line followed by a literal c character, and you can use that information to capture everything between the <line> tags in the matched string. To write a captured pattern into the replacement, use a backslash followed by a digit, where \1 means the first capture group, \2 means the second, etc. (and in this case you’re capturing only one group). We’d build a replacement string that starts with a </sonnet> end tag, then a new line, and then a <sonnet> start tag, including the @number attribute and using the captured string as its value, etc. Try it.
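The a(b)c and a(.)c examples above can be tried out in Python, which also uses `\1` for the first capture group in a replacement string. The sample string and the square brackets in the replacement are hypothetical, chosen only to make the captured part visible.

```python
import re

# Hypothetical sample containing three a?c sequences.
sample = "abc axc a-c"

# a(b)c: the group captures the literal b, so only "abc" is affected.
literal = re.sub(r"a(b)c", r"[\1]", sample)

# a(.)c: the group captures whatever single character sits between a and c.
wildcard = re.sub(r"a(.)c", r"[\1]", sample)

# For comparison, \g<0> (Python's whole-match reference) keeps the a and c.
whole = re.sub(r"a(.)c", r"[\g<0>]", sample)

print(literal)   # [b] axc a-c
print(wildcard)  # [b] [x] [-]
print(whole)     # [abc] [axc] [a-c]
```

The same contrast between the whole match and the first group is what lets you drop the spurious <line> tags while keeping the numeral.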

Clean up

You may have to clean up the beginning and end of the document manually, including the title and author, and you’ll also need to add a root element.

Checking your results

Although you’ve added XML markup to the document, <oXygen/> remembers that you opened it as plain text, which means that you can’t check it for well-formedness. To fix that, save it as XML with File → Save as and give it the extension .xml. Even that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML, so you can verify that it’s well formed in the usual way: Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose Check well-formedness.
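If you would like a second opinion outside of <oXygen/>, a well-formedness check can be sketched with Python's standard `xml.etree.ElementTree` parser. This is an optional extra and not part of the assignment; the helper name `is_well_formed` and the two sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: the second is missing its </line> end tag.
good = '<xml><sonnet number="I"><line>From fairest creatures we desire increase,</line></sonnet></xml>'
bad = '<xml><sonnet number="I"><line>From fairest creatures</sonnet></xml>'

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

To check your own results file, you could read its contents into a string and pass that to the helper.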

General

As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, and the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is a more comprehensive description in the regex section of Michael Kay’s book and more detailed tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.

What to submit

the original source text file you started with
a step file as a markdown (.md) or plain text (.txt) document (a step-by-step description of what you did), and
your results file (the XML document as .xml)

If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As you are working on this, post any questions on Slack or our class GitHub Issues board!