Regular Expressions: A brief introduction

A regular expression, or regex for short, is a string of characters that specifies a pattern. It is a powerful tool used in a wide variety of applications, from search (and replace) and string validation to lexical analysis (the first step in a compiler, where source code is converted into a stream of tokens).

The class of languages described by regular expressions coincides with the class of languages recognized by finite state automata, which means that for every regular expression there exists an automaton that accepts exactly the strings the expression matches.
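
To make this correspondence concrete, here is a minimal sketch in Python of a hand-built finite automaton for the pattern zz* (one z followed by zero or more z), compared against Python's re module. The state names and the helper dfa_accepts are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical hand-built automaton for the regular expression zz*.
# States: "start" (nothing read yet) and "seen_z" (one or more z read).
TRANSITIONS = {
    ("start", "z"): "seen_z",
    ("seen_z", "z"): "seen_z",
}
ACCEPTING = {"seen_z"}


def dfa_accepts(text):
    """Return True if the automaton ends in an accepting state after reading text."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined for this character: reject
            return False
    return state in ACCEPTING


for sample in ["", "z", "zzz", "za", "az"]:
    # The automaton and the regex should agree on every input.
    assert dfa_accepts(sample) == bool(re.fullmatch(r"zz*", sample))
```

The loop at the end only checks a handful of inputs, so it illustrates the equivalence rather than proving it.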

Most programming languages support regex, either in the language itself or through a standard or widely available library. Examples include Python, C, C++, Java, JavaScript and Dart.


Basics of regex

A single character is itself a regular expression that matches that character and only that character.

Let’s consider the following string: ‘I am excited to learn regular expressions!’

The regular expression r = ‘a’ matches each occurrence of the letter a in that string: the a in ‘am’, in ‘learn’ and in ‘regular’.

We can use the alternation (or) operator | to match either r1 or r2: r = r1|r2

r = ‘am|to’ matches the two substrings ‘am’ and ‘to’ in ‘I am excited to learn regular expressions!’
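
As a quick check of the two patterns above, here is a small sketch using Python's standard re module. The variable names are only illustrative.

```python
import re

text = "I am excited to learn regular expressions!"

# A single character matches each occurrence of that character.
a_matches = re.findall(r"a", text)
print(a_matches)    # ['a', 'a', 'a']

# Alternation: match either "am" or "to".
alt_matches = re.findall(r"am|to", text)
print(alt_matches)  # ['am', 'to']

# Match objects also report where each match starts in the string.
positions = [(m.group(), m.start()) for m in re.finditer(r"am|to", text)]
print(positions)    # [('am', 2), ('to', 13)]
```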

Operators can be applied recursively: the result of combining regular expressions is itself a regular expression, which can be combined further.

Let’s look at what we can do with regex:

Regex            Matched set of strings
hi               {hi}
hi|hello         {hi, hello}
zz*              a mandatory z followed by zero or more z: {z, zz, zzz, zzzz, zzzzz, …}
(haha)+          at least one occurrence of haha: {haha, hahahaha, hahahahahaha, …}
analy(s|z)e      {analyse, analyze}
analog(ue)?      ? means optional: {analog, analogue}
o{2}             exactly two occurrences of the letter o: {oo}
[a-z]            any lowercase letter in the range from a to z: {a, b, c, d, e, f, g, …, z}
^football        any string that begins with football
football$        any string that ends with football
\d               any digit: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
[0-9]            any digit: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
.                any single character (in most engines a newline is excluded by default)

Table: a non-exhaustive list of regex operators
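
The operators listed above can be tried out with a short script. The following is a sketch in Python, assuming the standard re module; the CASES list and the check helper are hypothetical names. re.fullmatch is used so that each pattern has to describe the whole test string, while the two anchor examples use re.search.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not).
CASES = [
    (r"hi|hello", ["hi", "hello"], ["hey"]),
    (r"zz*", ["z", "zzzz"], ["", "a"]),
    (r"(haha)+", ["haha", "hahahaha"], ["ha", "hahaha"]),
    (r"analy(s|z)e", ["analyse", "analyze"], ["analyxe"]),
    (r"analog(ue)?", ["analog", "analogue"], ["analogu"]),
    (r"o{2}", ["oo"], ["o", "ooo"]),
    (r"[a-z]", ["q"], ["Q", "7"]),
    (r"\d", ["5"], ["x"]),
    (r"[0-9]", ["5"], ["x"]),
]


def check(pattern, good, bad):
    """True if pattern matches every string in good and none in bad."""
    matches_good = all(re.fullmatch(pattern, s) for s in good)
    matches_bad = any(re.fullmatch(pattern, s) for s in bad)
    return matches_good and not matches_bad


results = {pattern: check(pattern, good, bad) for pattern, good, bad in CASES}

# Anchors constrain where a match may occur inside a longer string.
starts = bool(re.search(r"^football", "football season"))
ends = bool(re.search(r"football$", "I love football"))
```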

Regex for validation

Let’s say we are building an application for a specific university or workplace, and we would like to allow users to register only with email addresses that belong to the domain of the university or company. The application also restricts which special characters may be used: only - or _.

Allowed emails:

user@mycompany.com

firstname_lastname@mycompany.com

firstname-lastname@mycompany.com

The string we are trying to match here is:

[any alphanumeric string that may also contain - or _]@mycompany.com

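We can use the following regex: [A-Za-z0-9_-]+@mycompany\.com. The dot has to be escaped as \. because an unescaped . has a different meaning: it matches any character. For validation, the pattern should be matched against the whole string, for example by anchoring it as ^[A-Za-z0-9_-]+@mycompany\.com$, so that extra text before or after the address is rejected.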

There are three parts we can focus on here:

The character class: [A-Za-z0-9_-] matches a single character that is either A-Z or a-z or 0-9 or _ or - (a - placed last inside the brackets is a literal hyphen, not a range)

The quantifier: + matches at least one occurrence of the character class that precedes it

The fixed string: @mycompany\.com, which must follow
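
A minimal sketch of this validation in Python could look like the following. The function name is_allowed_email is hypothetical, and mycompany.com is the placeholder domain from the example. re.fullmatch is used so that the entire input has to match the pattern, not just a part of it.

```python
import re

# Pattern from the example: one or more allowed characters, then the fixed domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9_-]+@mycompany\.com")


def is_allowed_email(address):
    """Return True if the whole address matches the allowed pattern."""
    return EMAIL_RE.fullmatch(address) is not None


for address in [
    "user@mycompany.com",
    "firstname_lastname@mycompany.com",
    "firstname-lastname@mycompany.com",
    "first.last@mycompany.com",     # rejected: . is not an allowed character
    "user@mycompanyXcom",           # rejected: the escaped dot must be a literal dot
    "user@mycompany.com.evil.org",  # rejected: trailing text after the domain
]:
    print(address, is_allowed_email(address))
```

With re.search instead of re.fullmatch, the last address would be accepted because it contains a matching substring, which is why matching the whole string matters for validation.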


Thanks for reading 🙂

Enjoy regexing!