 Last modified:
        Friday, 23-Feb-2024 19:59:22 UTC. Authored by:  David J. Birnbaum (djbpitt at gmail.com). Edited and maintained by: Elisa E. Beshero-Bondar
        (eeb4 at psu.edu). Powered by firebellies.
Consult the following resources as you work with Regular Expressions:
The text we’ll be using as input for the first regex homework assignment is a plain-text version of Shakespeare’s sonnets produced by Project Gutenberg, which you can download from our site here: shakeSonnets.txt.
(Make sure the file keeps its .txt extension; you might rename it YourName_Regex1_sonnets.txt.)
Your Steps file should be a plain text (*.txt) or markdown (*.md) file, not something you write in a word processor (not a Microsoft Word document), so you do not have to struggle with autocorrections of the regex patterns you are recording.
Your goal is to produce an XML version of the Shakespeare Sonnets file by using the search-and-replace techniques we discussed in class, and to record each step you take in a plain text or markdown file so others can reproduce exactly what you did. (In a real-life project situation, you may need to share the steps you take in up-converting plain text documents to XML on your GitHub repo in GitHub’s markdown, the same that we write on the GitHub Issues board, and in that case you would save the file with a .md extension.)
Your up-converted XML output should look something like http://dh.obdurodon.org/shakespeare-sonnets.xml. That is, each sonnet should be its own element, each line should be tagged separately, and the roman numerals should be encoded in a useful way (we’ve used attributes, but you could also put them in a child element).
        <xml>
           <sonnet number="I">
               <line>From fairest creatures we desire increase,</line>
               <line>That thereby beauty's rose might never die,</line>
               <line>But as the riper should by time decease,</line>
               <line>His tender heir might bear his memory:</line>
               <line>But thou contracted to thine own bright eyes,</line>
               <line>Feed'st thy light's flame with self-substantial fuel,</line>
               <line>Making a famine where abundance lies,</line>
               <line>Thy self thy foe, to thy sweet self too cruel:</line>
               <line>Thou that art now the world's fresh ornament,</line>
               <line>And only herald to the gaudy spring,</line>
               <line>Within thine own bud buriest thy content,</line>
               <line>And tender churl mak'st waste in niggarding:</line>
               <line>Pity the world, or else this glutton be,</line>
               <line>To eat the world's due, by the grave and thee.</line>
           </sonnet>
           <sonnet number="II">
               <line>When forty winters shall besiege thy brow,</line>
               <line>And dig deep trenches in thy beauty's field,</line>
               <line>Thy youth's proud livery so gazed on now,</line>
               <line>Will be a tatter'd weed of small worth held:</line>
               <line>Then being asked, where all thy beauty lies,</line>
               <line>Where all the treasure of thy lusty days;</line>
               <line>To say, within thine own deep sunken eyes,</line>
               <line>Were an all-eating shame, and thriftless praise.</line>
               <line>How much more praise deserv'd thy beauty's use,</line>
               <line>If thou couldst answer 'This fair child of mine</line>
               <line>Shall sum my count, and make my old excuse,'</line>
               <line>Proving his beauty by succession thine!</line>
               <line>This were to be new made when thou art old,</line>
               <line>And see thy blood warm when thou feel'st it cold.</line>
           </sonnet>
           ...
       </xml>
       
Your Steps file needs to be detailed enough to indicate each step of your process: what regular expression patterns you attempted to find, and what expressions you used to replace them. You might record the number of finds you get and even how you fine-tuned your steps when you were not finding everything you wanted to at first. Note: we strongly recommend copying and pasting your find and replace expressions into your Steps file instead of retyping them, since it is easy to introduce errors when retyping.
There are several ways to get to the target output, but the starting points are standard:
First of all, for any up-conversion of plain text, you must check for the special reserved characters: the ampersand & and the angle brackets < and >. You need to search for those and, if they turn up, replace them with their corresponding XML entities, so that they will not interfere with well-formed XML markup.
| Search for: | Replace with: | 
|---|---|
| & | &amp;amp; | 
| < | &amp;lt; | 
| > | &amp;gt; | 
Note that you need to process the special XML reserved characters in the correct order. Why is it important that you search and replace the & first?
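One way to explore the ordering question outside <oXygen/> is a small Python sketch. It is only an illustration: the function names and the sample string are made up for this example and are not part of the assignment files.

```python
def escape_reserved(text: str) -> str:
    """Replace XML reserved characters with entities, ampersand first."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_wrong_order(text: str) -> str:
    """Same replacements, but with the ampersand handled last."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # This step also rewrites the ampersands that the two steps above
    # just introduced as part of the new entities.
    text = text.replace("&", "&amp;")
    return text


sample = "Venus & Adonis <draft>"  # hypothetical input, not from the sonnets file
good = escape_reserved(sample)
bad = escape_wrong_order(sample)
print(good)
print(bad)
```

Comparing the two printed lines shows what happens to the entities for the angle brackets when the ampersand is processed after them.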
Don’t worry about the title and author at the top of the file just yet. You will eventually tag them by hand, and we recommend just doing that at the end of the up-conversion process. You’ll be using <oXygen/>’s global Find-and-Replace tool to tag the sonnets, and if you leave the title and author in place while you do that, you’ll wind up tagging them incorrectly. That isn’t a problem as long as you remember to fix them manually at the end. Or you could remove them now to another file to paste them back in at the end of the regex autotagging process.
To perform regex searching, you need to check the box labeled “Regular expression” at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet. Be sure that “Dot matches all” is unchecked, though; we’ll explain why below.
The non-blank lines all begin with space characters: there are two spaces before most lines (the Roman numerals and the first twelve lines of each sonnet) and four spaces before the last two lines of every sonnet. Those spaces are presentational formatting, and not part of the content of the text, and since we don’t need them in order to tag the text, we’ll start by deleting them. The regex to match a space character is just a space character, and you can match one or more space characters by using the plus sign repetition indicator. To match one or more instances of the letter “X”, you would use a regex like X+. To match one or more instances of a space character, just replace the “X” with a space.
You don’t want to remove all space characters, though; you just want to remove the ones at the beginning of a line. You can do that by using the caret metacharacter, which anchors a match so that it succeeds only at the beginning of a line. For example, if the regex X+ matches one or more instances of “X”, the regex ^X+ matches one or more instances of “X” only at the beginning of a line. You can use this information to match one or more space characters at the beginning of a line and replace them with nothing, that is, delete them.
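The same idea can be tried in Python’s re module, as a rough sketch rather than a record of what <oXygen/> does internally. One assumption to note: in Python the caret matches at the start of every line only when the re.MULTILINE flag is set. The sample text is a short excerpt typed in by hand to imitate the two-space and four-space indentation described above.

```python
import re

# Hand-typed excerpt imitating the indentation in the Gutenberg file.
sample = (
    "  I\n"
    "\n"
    "  From fairest creatures we desire increase,\n"
    "    To eat the world's due, by the grave and thee.\n"
)

# "^ +" means: one or more spaces, but only at the beginning of a line.
# Replacing the match with an empty string deletes it.
stripped = re.sub(r"^ +", "", sample, flags=re.MULTILINE)
print(stripped)
```

Spaces between words are untouched, because the caret only lets the match succeed at a line start.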
We can always choose whether to work with blank lines or not. For our purposes in this exercise, we do not need them, so you can delete them if you’d like, or you can leave them in place to enhance the legibility. To delete them, you need to match a blank line, and the easiest way to do that is to match two new line characters in a row and replace them with a single new line character. The regex for a new line character is \n. Try it.
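As a hedged illustration in Python (not <oXygen/> itself), the two-newlines-to-one replacement looks like this. The sample string is a hand-made excerpt, and it assumes the leading spaces have already been removed so that a blank line really is two newline characters in a row.

```python
import re

# Hand-made excerpt with blank lines after each Roman numeral and between sonnets.
sample = (
    "I\n"
    "\n"
    "From fairest creatures we desire increase,\n"
    "That thereby beauty's rose might never die,\n"
    "\n"
    "II\n"
    "\n"
    "When forty winters shall besiege thy brow,\n"
)

# Two new line characters in a row become a single new line character.
collapsed = re.sub(r"\n\n", "\n", sample)
print(collapsed)
```

If a file has several blank lines in a row, a single pass of this replacement leaves some of them behind, so you may need to run it more than once.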
We can create our markup either from the outside in (document, then sonnet, then divide
           the sonnet into Roman numeral and lines) or from the inside out (lines and Roman
           numeral, then wrap those in a sonnet, then wrap all of the sonnets in a document).
           Either strategy can be made to work, but we generally find it easier to work from the
           inside out. (When we work from outside in, it’s easy to wind up incorrectly
           wrapping <line> tags around the <sonnet> start and
           end tags, etc.)
We’ll start by tagging every line as a <line>. This will erroneously
           tag the Roman numerals as if they were lines of poetry, which they aren’t, but since we're using the inside-out method, we are just planning to correct those Roman numeral lines later.
We don’t want to tag any blank lines (if we left them in), though, so we need a regex that matches only lines that have characters in them. Check your <oXygen/> Find / Replace setup: make sure that “Dot matches all” is unchecked! In this mode, the dot (.) matches any character except a new line, which means that we can use the plus sign repetition indicator to match one or more instances of any character except a new line (that is, .+). By default regex selects the longest possible match, so even though a single character on a line is enough to satisfy the pattern, when we run it, it will always match the entire line. Since the dot matches any character except a new line, the regex will match each line individually, that is, it won’t run over a new line and continue the same match. Try it and examine the results. Now check “Dot matches all”, run Find all, and look at those results. Notice that the match no longer stops at the end of the line, and since you want to tag each line individually, you need to uncheck that box to revert to the normal, default behavior.
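A minimal Python sketch can imitate the two settings. It assumes that Python’s re.DOTALL flag plays roughly the role of the “Dot matches all” checkbox; that is an analogy for illustration, not a statement about how <oXygen/> is implemented.

```python
import re

sample = (
    "From fairest creatures we desire increase,\n"
    "That thereby beauty's rose might never die,"
)

# Default behavior: the dot stops at a new line, so each line is its own match.
per_line = re.findall(r".+", sample)

# With DOTALL the dot also matches new lines, so one match swallows everything.
whole = re.findall(r".+", sample, flags=re.DOTALL)

print(len(per_line), len(whole))
```

The first search finds one match per line, while the second finds a single match that runs across the line break.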
A human might think of our task as “wrap every line in <line> tags”, but regex has a find-and-replace view of the world, so a regex way to think about it would be “match every line, delete it, and replace it with itself wrapped in <line> tags”. That is, regex doesn’t think about leaving the line in place and inserting something before and after it; it thinks about matching the line, deleting it, and then putting it back, but with the addition of the desired tags. The regex selects and matches each full line, but how do we write what we selected into the replacement string? The answer is that the sequence \0 in the replacement pattern means “the entire regex match”, and you can use that to write the matched line back into the replacement, but wrapped in <line> tags. Try it.
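Here is a hedged Python sketch of the same match-and-put-back idea. Note the assumption about syntax: Python’s re module writes “the entire match” in a replacement as \g<0>, not as the \0 used in the find-and-replace dialog described above. The sample is a hand-made excerpt.

```python
import re

sample = (
    "I\n"
    "\n"
    "From fairest creatures we desire increase,\n"
    "That thereby beauty's rose might never die,"
)

# Match every non-empty line and write it back wrapped in <line> tags.
# \g<0> is Python's spelling of "the entire regex match".
tagged = re.sub(r".+", r"<line>\g<0></line>", sample)
print(tagged)
```

The blank line is left alone because .+ needs at least one character, and the Roman numeral is tagged as a line, as the inside-out strategy expects at this stage.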
The Roman numerals are now erroneously tagged as if they were lines of poetry, and in our sample output at http://dh.obdurodon.org/shakespeare-sonnets.xml we want them to be attribute values. To start that process we need to think about how to distinguish a Roman numeral line from a real line of poetry. Since there are 154 sonnets, a Roman numeral line is a line that contains one or more instances of “I”, “V”, “X”, “L”, and “C” in any order and nothing else, and no real line of poetry matches that pattern. That means that we can match that pattern by using a regex character class, which you can read about at http://www.regular-expressions.info/charclass.html. This approach will match sequences that aren’t Roman numerals, like “XVX”, but those don’t occur, so we don’t have to worry about them. This illustrates a useful strategy: a simple regex that overgeneralizes vacuously may be more useful than a complex one that avoids matching things that won’t occur anyway. You can use the character class (wrapped in square brackets) followed by a plus sign (meaning one or more) to complete your regex so that it matches only <line> elements that contain a Roman numeral and nothing but a Roman numeral. Try it.
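As one possible sketch in Python (there are other ways to write it), a character class built from the five letters named above picks out the numeral lines. The sample is a hand-made excerpt of already-tagged text.

```python
import re

tagged = (
    "<line>I</line>\n"
    "<line>From fairest creatures we desire increase,</line>\n"
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>"
)

# [IVXLC]+ means: one or more characters drawn from the set I, V, X, L, C.
# Requiring the tags on both sides means nothing else may be inside the element.
numeral_lines = re.findall(r"<line>[IVXLC]+</line>", tagged)
print(numeral_lines)

# The class overgeneralizes: it also accepts strings that are not real numerals.
accepts_xvx = re.fullmatch(r"[IVXLC]+", "XVX") is not None
print(accepts_xvx)
```

Lines of poetry are not matched, because the end tag has to follow the run of numeral letters immediately.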
In this case you want to write the Roman numeral into the replacement string, but you want to get rid of the spurious <line> tags and replace them with other markup. \0 will write the entire match into the replacement, but that would include the original <line> tags that you want to remove. To capture part of a regex match, you wrap it in parentheses; this doesn’t match parenthesis characters, but it does make the part of the regex that’s between the parentheses available for reuse in the replacement string. For example, a(b)c would match the sequence “abc” and capture the “b” in the middle, so that it could be written into the replacement. Capturing a single literal character value isn’t very useful because you could have just written the “b” into the replacement literally, but you can also capture wildcard matches. For example, a(.)c matches a sequence of a literal “a” character followed by any single character except a new line followed by a literal “c” character, and you can use that information to capture everything between the <line> tags in the matched string. To write a captured pattern into the replacement, use a backslash followed by a digit, where \1 means the first capture group, \2 means the second, etc. (and in this case you’re capturing only one group). We’d build a replacement string that starts with a </sonnet> end tag, then a new line, and then a <sonnet> start tag, including the @number attribute and using the captured string as its value, etc. Try it.
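A hedged Python sketch of that replacement follows. It is one way to realize the description above, not the only acceptable answer, and the sample text is a hand-made excerpt. Python’s re.sub happens to accept \1 for the first capture group and \n for a new line in the replacement.

```python
import re

tagged = (
    "<line>I</line>\n"
    "<line>From fairest creatures we desire increase,</line>\n"
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>"
)

# The parentheses capture only the numeral, so the spurious <line> tags
# are matched (and therefore removed) but not written back.
converted = re.sub(
    r"<line>([IVXLC]+)</line>",
    r'</sonnet>\n<sonnet number="\1">',
    tagged,
)
print(converted)
```

The output begins with a stray </sonnet> end tag and lacks one after the last sonnet, which is the kind of edge that has to be cleaned up by hand at the start and end of the document.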
You may have to clean up the beginning and end of the document manually, including the title and author, and you’ll also need to add a root element.
Although you’ve added XML markup to the document, <oXygen/> remembers that you opened it as plain text, which means that you can’t check it for well-formedness. To fix that, save it as XML with File → Save as and give it the extension .xml. Even that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML, so you can verify that it’s well formed in the usual way: Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose “Check well-formedness”.
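If you want a second opinion outside <oXygen/>, a well-formedness check can be sketched with Python’s standard library. This is an optional illustration; the helper name is_well_formed and the two sample strings are invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A tiny document with a root element and balanced tags.
cleaned = (
    '<xml><sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
    "</sonnet></xml>"
)

# Raw autotagging output before manual cleanup: stray end tag, no root element.
uncleaned = (
    '</sonnet>\n<sonnet number="I">\n'
    "<line>From fairest creatures we desire increase,</line>"
)

print(is_well_formed(cleaned), is_well_formed(uncleaned))
```

A parser check like this only reports whether the markup is well formed; it says nothing about whether the tagging matches the target output.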
As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, and the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is a more comprehensive description in the regex section of Michael Kay’s book and more detailed tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.
Upload your Steps file, a markdown (.md) or plain text (.txt) document (a step-by-step description of what you did), and your XML output (.xml).
If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As you are working on this, post any questions on Slack or our class GitHub Issues board!