Regular Expression Basics With Egrep
In this post we will introduce ourselves to regular expressions.
Regular expressions are composed of two kinds of characters:
- Literal characters (normal text characters)
- Special characters (metacharacters)
Regular expressions can contain other regular expressions, i.e. subexpressions. Subexpressions can be arbitrarily nested (i.e. also contain subexpressions) and complex.
Metacharacters can have a different meaning depending on the context.
Further, a character can sometimes be a metacharacter and sometimes a normal text character, depending on the context.
To use regular expressions, a host is required. The host can be anything from a command line utility (like egrep) to a full-blown programming language, e.g. Python.
There are different flavours of regular expressions; the syntax and the exact meaning of the characters vary between flavours, sometimes slightly and sometimes considerably.
Egrep
egrep is a command line utility for searching plain text data sets for lines that match a regular expression. All examples in this post use egrep and refer to the flavour of regular expression that egrep uses.
The user gives egrep a regular expression (regex) and some files to search; egrep attempts to match the regex to each line of each file, displaying only those lines in which a match is found.
First examples
Create file.txt with the following contents
Hello.
How are you?
I'm fine thanks. How are you?
I'm well, thank you.
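We are going to search for the lines containing How are you? (the -n option prints the line number of each match and -o prints only the matched text rather than the whole line)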
egrep -no 'How are you\?' file.txt
which displays in the shell (usually with some highlighting, depending on your configuration)
2:How are you?
3:How are you?
Alternatively, the same example can be done without creating a file
TEXT="Hello.\nHow are you?\nI'm fine thanks. How are you?\nI'm well, thank you."
printf "$TEXT" | egrep -no 'How are you\?'
Here is another example,
egrep '^(From|Subject): ' mailbox-file
mailbox-file is the filename, and the single quotes '' wrap the regex so that the shell does not interpret any of the characters inside them in a special way, i.e. the text between them constitutes a single argument to the command egrep.
The regex metacharacters in the above example are ^, (, |, and ).
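To try the mailbox example without a real mailbox, the following is a minimal sketch. The sample text is made up for illustration, and the sketch uses grep -E, which is assumed to behave like egrep here.

```shell
# Hypothetical sample data standing in for mailbox-file.
mailbox='From: alice@example.com
To: bob@example.com
Subject: Lunch on Friday?
The subject line above starts with Subject: as well.'

# Select only the lines that begin with "From: " or "Subject: ".
headers=$(printf '%s\n' "$mailbox" | grep -E '^(From|Subject): ')
printf '%s\n' "$headers"
```

Only the first and third lines are printed; the last line contains Subject: but not at the start of the line.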
Metacharacters
Without metacharacters, regex is not very interesting, e.g. if your regex is abc, then all you will get as matches are lines containing the character sequence abc.
The real usefulness of regex comes from metacharacters. We already saw in the email example the metacharacters ^, (, |, and ).
We will take a closer look at metacharacters, but first we will take a quick look at character classes.
Character classes
A character class matches any one of several characters at that point in the match.
A character class is denoted by square brackets, e.g. [ea] is a character class.
The list of characters available for a match is given between the square brackets.
For example, in the character class [ea], either e or a can match.
Thus if you wanted to match grey or gray, you could use the regex gr[ea]y.
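As a quick check of the gr[ea]y regex, here is a small sketch using a made-up word list (grep -E is used in place of egrep, on the assumption that the two behave the same):

```shell
# Hypothetical word list; only the first two lines contain grey or gray.
matches=$(printf 'grey\ngray\ngruy\ngreay\n' | grep -E -n 'gr[ea]y')
printf '%s\n' "$matches"
```

The character class consumes exactly one character, so gruy (wrong character) and greay (two characters between r and y) are not matched.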
Character classes have their own metacharacters and should be considered as their own language.
^ (caret) and $ (dollar)
^ matches the start of a line, $ matches the end of a line.
As such, ^cat matches if you have ^ (start of line) followed by c, followed by a, followed by t, e.g.
printf "cat" | egrep -no '^cat'
matches but
printf "the cat" | egrep -no '^cat'
does not.
^cat$ matches if the line only contains cat, e.g.
printf "cat" | egrep -no '^cat$'
matches but
printf "cat is fat" | egrep -no '^cat$'
does not.
Similarly, ^$ matches only empty lines (lines without any characters) and ^ matches every line, as every line has a start of line.
^ and $ (and some other metacharacters) are special because they match a position in a line rather than an actual text character.
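The behaviour of ^$ and ^ on their own can be seen by counting matching lines. This is a small sketch with made-up input; it relies on the -c option of grep, which prints the number of matching lines instead of the lines themselves.

```shell
# Three lines of input, the second of which is empty.
sample='first line\n\nthird line\n'

# ^$ matches only the empty line.
empty_count=$(printf '%b' "$sample" | grep -E -c '^$')
# ^ matches every line, since every line has a start.
all_count=$(printf '%b' "$sample" | grep -E -c '^')

printf 'empty lines: %s, all lines: %s\n' "$empty_count" "$all_count"
```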
- (dash) inside a character class
Inside a character class, - indicates a range of characters, e.g. [0-9] is equivalent to [0123456789]. Other common ranges are [a-z] and [A-Z].
Ranges can be combined, e.g. [0-9a-fA-F] for hexadecimal digits.
Outside of a character class, - is not a metacharacter.
Also, if - is the first character in a character class, it is not a metacharacter.
Thus if you wanted to match a or -, you could use the character class [-a].
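The two roles of the dash can be sketched as follows, with made-up input strings. The LC_ALL=C setting is an assumption added here so that ranges follow plain ASCII order; the -o option prints each match on its own line.

```shell
# Range use: pick out the hexadecimal digits in a short string.
hex_chars=$(printf 'x1F-g\n' | LC_ALL=C grep -E -o '[0-9a-fA-F]')
printf '%s\n' "$hex_chars"

# Literal use: a leading dash in the class matches the dash character itself.
dash_or_a=$(printf 'a-b\n' | LC_ALL=C grep -E -o '[-a]')
printf '%s\n' "$dash_or_a"
```

The first command prints 1 and F; the second prints a and -.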
^ (caret) inside a character class
When ^ is the first character inside a character class, it negates the character class, i.e. the character class matches any character not listed in it.
For example, [^0-9] matches any character that is not in {0,1,2,3,4,5,6,7,8,9}.
Thus ^ is a metacharacter both inside and outside a character class (recall that outside of a character class it matches the start of a line), and its meaning as a metacharacter depends on the context.
Further, if ^ is in a character class but not the first character, it is a normal text character.
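Both uses of the caret inside a character class can be tried with a short sketch (the input strings are made up for illustration):

```shell
# Negated class: select lines containing at least one non-digit character.
non_digit_lines=$(printf 'abc123\n456\n' | grep -E -n '[^0-9]')
printf '%s\n' "$non_digit_lines"

# Caret not in first position: it is a literal ^ alongside the digits.
literal_caret=$(printf '2^8\n' | grep -E -o '[0-9^]')
printf '%s\n' "$literal_caret"
```

Note that [^0-9] still has to match a character, so the line 456 is not selected: it contains nothing but digits.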
. (dot)
Suppose you wanted a character class that matched every possible character.
You might start off with something like [A-Za-z0-9], which matches alphanumeric characters.
You might then add some punctuation characters, e.g. [A-Za-z0-9,.?;].
But what about other symbols like *?
Eventually you might try to list all ASCII characters in the character class.
Needless to say, such an exercise is tedious and error-prone.
Unsurprisingly, there is a shorthand for this.
The dot . is shorthand for a character class that matches any character.
However, . is a normal text character inside a character class.
Suppose you wanted a regex to match the date 19th March 1976.
The regex 03.19.76 matches "03/19/76", "03-19-76", and "03.19.76".
However, it also matches "lottery numbers: 19 203319 7639".
This sort of problem is typical when using regex to extract information from textual data.
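One way to tighten the date regex is to replace each dot with a character class listing only the separators you expect. The following sketch compares the two; the choice of separators [-./] is an assumption made for illustration, not the only reasonable one.

```shell
dates='03/19/76
03-19-76
03.19.76
lottery numbers: 19 203319 7639'

# The dot accepts any character, so all four lines match.
loose=$(printf '%s\n' "$dates" | grep -E -c '03.19.76')
# A class of allowed separators rejects the lottery line.
# Inside the class, the leading - and the . are both literal characters.
strict=$(printf '%s\n' "$dates" | grep -E -c '03[-./]19[-./]76')

printf 'loose: %s, strict: %s\n' "$loose" "$strict"
```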
| (pipe)
| allows you to combine subexpressions into an overall expression. It is known as alternation.
Its syntax is subexpr1|subexpr2|...|subexprN, which matches whenever any one of subexpr1, ..., subexprN matches, e.g.
printf "That Bob is a great guy.\nHe and Robert are friends." | egrep -n 'Bob|Robert'
matches both lines.
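Parentheses limit how far an alternation reaches, which is why the earlier mailbox regex was written as ^(From|Subject): rather than ^From|Subject: . The sketch below, using made-up input lines, counts the matches for each version.

```shell
lines='From: alice@example.com
Re: Subject: hello
Fromage is French for cheese'

# Without parentheses the alternatives are "^From" and "Subject: ",
# so all three lines match.
unscoped=$(printf '%s\n' "$lines" | grep -E -c '^From|Subject: ')
# With parentheses the ^ and the ": " apply to both alternatives,
# so only the first line matches.
scoped=$(printf '%s\n' "$lines" | grep -E -c '^(From|Subject): ')

printf 'unscoped: %s, scoped: %s\n' "$unscoped" "$scoped"
```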
\< (backslash-less than) and \> (backslash-greater than)
\< matches the position at the start of a “word” and \> matches the position at the end of a “word” (these can also be referred to as metasequences as they consist of more than one character).
\< and \> are word boundary metacharacters.
egrep considers a “word” to be a sequence of alphanumeric characters (GNU grep also counts the underscore as a word character), e.g.
printf "The cat sat on the mat\nWhat does concatenation mean?" | egrep -n '\<cat\>'
outputs
1:The cat sat on the mat
but
printf "The cat sat on the mat\nWhat does concatenation mean?" | egrep -n 'cat'
outputs
1:The cat sat on the mat
2:What does concatenation mean?
? (question mark), + (plus), * (star), and {min,max} (curly braces)
These metacharacters are quantifiers.
A quantifier attaches itself to the immediately preceding item (which can be anything from a single normal text character to an arbitrarily complicated subexpression contained in parentheses).
Each quantifier has a minimum and maximum number.
For the match to be successful, the preceding item has to be matched at least the minimum number of times.
Once a match is successful, the regex will attempt to carry on matching the preceding item as many times as possible, up until the maximum number (or until no more matches are found).
The quantifiers’ minimum and maximum numbers are
- $(0,1)$ for ?
- $(1,+\infty)$ for +
- $(0,+\infty)$ for *
- $(\min,\max)$ for {min,max}
For example, if
TEXT="color\ncolour\ncolouur\ncolouuuuuuuur"
then
printf "$TEXT" | egrep -n 'colou?r'
which is equivalent to
printf "$TEXT" | egrep -n 'colou{0,1}r'
both output
1:color
2:colour
whereas
printf "$TEXT" | egrep -n 'colou{1,1}r'
outputs
2:colour
and
printf "$TEXT" | egrep -n 'colou{1,5}r'
outputs
2:colour
3:colouur
and
printf "$TEXT" | egrep -n 'colou+r'
outputs
2:colour
3:colouur
4:colouuuuuuuur
and
printf "$TEXT" | egrep -n 'colou*r'
outputs
1:color
2:colour
3:colouur
4:colouuuuuuuur
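The examples above all quantify a single character. Since a quantifier attaches to the immediately preceding item, it can also apply to a whole parenthesised subexpression. A small sketch of the difference, with made-up input:

```shell
sample='ababab\nabbb\n'

# (ab)+ repeats the two-character group, so only ababab fits the whole line.
group_repeat=$(printf '%b' "$sample" | grep -E '^(ab)+$')
# ab+ repeats only the b, so only abbb fits the whole line.
char_repeat=$(printf '%b' "$sample" | grep -E '^ab+$')

printf 'group: %s, char: %s\n' "$group_repeat" "$char_repeat"
```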
Backreferencing
Not all versions of egrep support backreferencing, but for those that do, it allows you to match text that was matched earlier in the regex.
This is achieved by wrapping subexpressions in parentheses. If (subexpr) matches, the matched text can be referred to later on in the regex as \1.
For example, if subexpr1 and subexpr3 in (subexpr1)subexpr2(subexpr3) both match, the text matched by subexpr1 can be referred to later on as \1 and the text matched by subexpr3 as \2, e.g. (subexpr1)subexpr2(subexpr3)\1\2
Basically, pairs of parentheses are numbered by counting opening parentheses from the left.
This is quite powerful as it allows you to construct regexes dynamically and to match generic patterns rather than specific instances of those patterns, e.g. the regex
\<([A-Za-z]+) +\1\>
matches whenever a word is repeated
printf "The cat sat on the the mat" | egrep -no '\<([A-Za-z]+) +\1\>'
outputs
1:the the
and likewise
printf "Whatever the weather weather it rain or shine" | egrep -no '\<([A-Za-z]+) +\1\>'
outputs
1:weather weather
Escaping
Escaping a metacharacter makes it match as a normal text character. This is generally done using \ and we saw an example of it earlier
How are you\?
In the regex How are you?, the ? attaches itself to u and acts as a quantifier.
But in the above example the desired behaviour was to match a literal ? at the end, which is why the \ was required.
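The same idea applies to other metacharacters. As a sketch, escaping the dots in the earlier date regex makes them match only a literal dot (the two input lines are made up for illustration):

```shell
# Unescaped dots match any separator, so both lines match.
unescaped=$(printf '03.19.76\n03/19/76\n' | grep -E -c '03.19.76')
# Escaped dots match only a literal ".", so one line matches.
escaped=$(printf '03.19.76\n03/19/76\n' | grep -E -c '03\.19\.76')

printf 'unescaped: %s, escaped: %s\n' "$unescaped" "$escaped"
```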
Host features
Some features that have very common usage, e.g. case-insensitive matching, are not actually provided out-of-the-box by the regular expression language.
However, such features are usually supported by the host, e.g. for case-insensitive matching, egrep provides the -i option
printf "Whatever the weather WeaTher it rain or shine" | egrep -no '\<([A-Za-z]+) +\1\>'
does not match but
printf "Whatever the weather WeaTher it rain or shine" | egrep -ino '\<([A-Za-z]+) +\1\>'
does
1:weather WeaTher
Disclaimer: In no way, shape, or form do I claim all the content in this post to be my own work / not copied, paraphrased, or derived in any other way from an external source.
To the best of my knowledge, all sources used are referenced. If you feel strongly about any of the content in this post from a plagiarism, copyright, etc. point of view, please do not hesitate to get in touch to discuss and resolve the situation.
References
Mastering Regular Expressions, 3rd Edition - Jeffrey Friedl (highly recommended)