How to write regular expressions to match text patterns

Regular expressions are a powerful way of searching for patterns in text.
Most UNIX-based text editors and search tools use regular expressions for search and/or replace actions.
A regular expression is a combination of regular characters, special characters, wildcards with or without length restrictions, and symbols that anchor the pattern to the beginning or end of a line. With a regexp you define a text pattern; the regexp engine then tries to match it against the target text and either prints the matching text or returns a value denoting a successful match. A regexp usually operates on one line of text at a time.
Now let’s see how pattern matching happens with an example. The regexp below checks whether an email address is in the correct format.
^[a-zA-Z]+[0-9a-zA-Z_.]*[0-9a-zA-Z]+@[0-9a-zA-Z]+[0-9a-zA-Z.-]*[a-zA-Z]+$
The above expression looks for text that begins with a letter, can have letters, digits, underscores (_) or dots (.) in between, and has a digit or letter just before the @ sign. After the @ sign, the text should start with a digit or letter, can have dots (.) or dashes (-) in between, and must end with a letter.
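To try the pattern out, here is a minimal sketch in Python using the standard `re` module. The helper name `looks_like_email` and the sample addresses are made up for illustration; only the pattern itself comes from the text above.

```python
import re

# The email pattern from the text, compiled once so it can be reused.
EMAIL_RE = re.compile(
    r"^[a-zA-Z]+[0-9a-zA-Z_.]*[0-9a-zA-Z]+@[0-9a-zA-Z]+[0-9a-zA-Z.-]*[a-zA-Z]+$"
)

def looks_like_email(text):
    """Return True if the text matches the pattern above (hypothetical helper)."""
    return EMAIL_RE.match(text) is not None

# Hypothetical sample addresses, chosen only for illustration.
samples = [
    "john.doe_99@example.com",  # fits the pattern
    "9lives@example.com",       # starts with a digit
    "john.@example.com",        # dot right before the @ sign
    "john@example-",            # does not end with a letter
]

for sample in samples:
    print(sample, looks_like_email(sample))
```

Only the first sample is accepted. Note that, as written, the pattern needs at least two characters on each side of the @ sign, so an address like a@example.com is rejected as well.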
Now let me explain what the individual pieces of the email regular expression do.
^ - denotes that the target text line should begin with the pattern that follows ^ in the regular expression. If ^ is omitted, the engine searches for the pattern at any position in the line, including the beginning.
[] - encloses a single character or a list of characters, exactly one of which must appear at that position in the pattern. For example, [a] means the character 'a' should be present, and [abc] means any one of the characters 'a', 'b' or 'c' can be present (not a combination of them). A continuous range of characters can be written as [a-z], which means any one lowercase letter.
+ - is a greedy quantifier that matches one or more repetitions of the preceding character or character set. For example, [a]+ matches one or more repetitions of 'a', and as many as possible.
$ - says that the target text line should end with the preceding pattern.
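The four building blocks above can be tried one at a time. The following is a small sketch in Python with the standard `re` module; the sample strings and variable names are invented for illustration.

```python
import re

line = "aaa bbb"

# ^ anchors the pattern to the beginning of the line.
starts_with_aaa = re.search(r"^aaa", line) is not None  # True
starts_with_bbb = re.search(r"^bbb", line) is not None  # False: "bbb" is not at the start

# Without ^ the pattern may match at any position in the line.
contains_bbb = re.search(r"bbb", line) is not None      # True

# $ anchors the pattern to the end of the line.
ends_with_bbb = re.search(r"bbb$", line) is not None    # True

# [abc] matches exactly one character out of a, b or c.
single_chars = re.findall(r"[abc]", "cab")              # ['c', 'a', 'b']

# + is greedy: [a]+ takes the longest run of 'a' it can find.
longest_run = re.search(r"[a]+", "caaat").group()       # 'aaa'

print(starts_with_aaa, starts_with_bbb, contains_bbb, ends_with_bbb)
print(single_chars, longest_run)
```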
The following characters have a special meaning in regexp and should be preceded by '\' to search for the actual character.
\ ^ $ . | ? * + ( ) [ {
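As a quick illustration of why the backslash matters, here is a sketch in Python. The sample text is made up; `re.escape` is a standard-library helper that adds the backslashes for you when the whole string should be taken literally.

```python
import re

text = "price: $5.00 (approx.)"  # hypothetical sample text

# Unescaped, "." matches any character, so the pattern also accepts "5x00".
dot_unescaped = re.search(r"5.00", "5x00") is not None   # True
# Escaped, "\." only matches a real dot.
dot_escaped = re.search(r"5\.00", "5x00") is not None    # False

# "$" and "." both need a backslash to be searched for literally.
literal_price = re.search(r"\$5\.00", text) is not None  # True

# re.escape backslash-escapes every special character in a string.
escaped_pattern = re.escape("$5.00 (approx.)")
literal_all = re.search(escaped_pattern, text) is not None  # True

print(dot_unescaped, dot_escaped, literal_price, literal_all)
```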
Let's see what each of these special characters means.
Special Character | Significance | Expression | Match |
---|---|---|---|
\ | Removes the special meaning of the character that follows it. For example, \* makes the regexp engine look for a literal * character instead of repetitions of the preceding character. | a \* in the sky | a * in the sky |
^ | Matches the pattern that follows it only at the beginning of the text line. When placed right after an opening square bracket, as in [^ ], it instead matches any character that is not listed inside the brackets. | ^# | #this is a comment line |
$ | Matches the preceding pattern only at the end of the text line | ;$ | break; |
. | Matches any single character except the newline character | .o | to, go |
\| | Matches either the pattern before or the pattern after this character | yes\|no | yes, no |
? | Matches zero or one occurrence of the preceding character | songs? | songs, song |
* | Matches zero or more repetitions of the preceding character | mis* | mi, mis, miss, misss |
+ | Matches one or more occurrences of the preceding character | blis+ | blis, bliss |
() | Captures the part of the match inside the parentheses as a group | name: ([^ ]+) | name: Sam (the group captures Sam) |
[ | Matches any one of the characters listed inside [] | an [aeiou].* | an owl |
[^ | Matches any one character that is not listed inside [] | a [^aeiou].* | a tree |
{ | With one number, matches exactly that many repetitions of the preceding character | te{2} | tee |
{ | With two numbers, matches a range of repetitions of the preceding character | te{1,3} | te, tee, teee |
{ | With the upper bound left out, matches that many repetitions or more | te{2,} | tee, teee, and longer |
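The rows of the table can be checked with a short script. This is only a sketch in Python: the helper names `matching_words` and `first_match` are made up for illustration, and a few extra sample words (such as "bli" and "teeee") are added to show what does not match.

```python
import re

def matching_words(pattern, text):
    """Return the whitespace-separated words that the pattern matches in full."""
    return [word for word in text.split() if re.fullmatch(pattern, word)]

def first_match(pattern, line):
    """Return the first piece of the line matched by the pattern, or None."""
    found = re.search(pattern, line)
    return found.group() if found else None

# Escaping, anchors and character sets, tried on whole lines.
print(first_match(r"a \* in the sky", "a * in the sky"))
print(first_match(r"^#", "#this is a comment line"))
print(first_match(r";$", "break;"))
print(first_match(r"an [aeiou].*", "an owl"))
print(first_match(r"a [^aeiou].*", "a tree"))
print(first_match(r"a [^aeiou].*", "a owl"))  # None: 'o' is excluded by [^aeiou]

# Wildcard, alternation and quantifiers, tried word by word.
print(matching_words(r".o", "to go"))
print(matching_words(r"yes|no", "yes no maybe"))
print(matching_words(r"songs?", "songs song"))
print(matching_words(r"mis*", "mi mis miss misss"))
print(matching_words(r"blis+", "bli blis bliss"))
print(matching_words(r"te{2}", "te tee teee teeee"))
print(matching_words(r"te{1,3}", "te tee teee teeee"))
print(matching_words(r"te{2,}", "te tee teee teeee"))

# Parentheses capture a group, which can be read back with group(1).
captured = re.search(r"name: ([^ ]+)", "name: Sam").group(1)
print(captured)
```

The word-by-word helper uses `re.fullmatch`, so a word counts only when the pattern covers all of it. That is why te{2} reports tee alone, even though the pattern could also be found inside teee.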
Photo credits - https://store.xkcd.com/products/i-know-regular-expressions