In-Class Lab 11: Basic Regular Expressions

The data files used in these exercises are in the directory /pub/cs/grwoo/cs160a/samples/Data on hills. Make sure you examine the data file, run your command, and examine your output carefully to determine if your command works as expected.

All parts of this exercise set require basic regular expressions (BREs), and do not require 'turning on' the extended regular expression operators using the -E option.
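To see what the -E option changes, here is a minimal sketch using two made-up input lines (they are not taken from the lab's data files). It relies on one fact about BREs: an unescaped + is an ordinary character, while in extended regular expressions it is a repetition operator.

```shell
#!/bin/sh
# Two test lines: one with a literal plus sign, one with repeated a's.
lines=$(printf 'a+b\naab\n')

# BRE (plain grep): + has no special meaning, so this matches the
# three literal characters a, +, b.
bre=$(printf '%s\n' "$lines" | grep 'a+b')

# ERE (grep -E): + means "one or more of the preceding character",
# so this matches one or more a's followed immediately by b.
ere=$(printf '%s\n' "$lines" | grep -E 'a+b')

echo "BRE matched: $bre"
echo "ERE matched: $ere"
```

The same pattern selects a different line in each mode, which is why these exercises stay with plain grep.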

Begin by reviewing the Basic Regular Expressions below:

Notes

BREs are understood by every Unix command that understands regular expressions, particularly grep, sed, more, and vi.

Always quote your regular expressions. For our class, use single-quotes.
Regular expressions can match any part of the line. If you want to control this, use anchors.
Don't confuse regular expressions with shell wildcards. Regular expressions are used by one of the commands above to match text. Shell wildcards are used by the shell to match filenames. If you quote your regular expressions, the shell will not confuse them with a wildcard.
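The difference between a quoted regular expression and an unquoted shell wildcard can be sketched as follows. The scratch directory, the file names ab and abc, and the contents of data.txt are all made up for this illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)

# Two empty files whose names the shell wildcard a* will match.
touch "$demo_dir/ab" "$demo_dir/abc"

# A small data file (hypothetical contents, for illustration only).
printf 'ab\nabc\nxyz\n' > "$demo_dir/data.txt"

# Quoted: grep receives the regular expression a* unchanged.
# As a BRE, a* means "zero or more a's", so every line matches.
quoted=$(cd "$demo_dir" && grep -c 'a*' data.txt)
echo "quoted a*: $quoted lines match"

# Unquoted: the shell first expands a* to the filenames ab and abc,
# so grep actually runs as:  grep ab abc data.txt
# That searches for the text "ab" in the files abc and data.txt.
unquoted=$(cd "$demo_dir" && grep a* data.txt)
echo "unquoted a*:"
echo "$unquoted"

rm -rf "$demo_dir"
```

Because the unquoted command searches two files, grep prefixes each matching line with its file name, which is a quick sign that the shell expanded the pattern before grep saw it.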

Consider the file t below:

$ cat t
abc
bc
abc1d
abcd12

Operator	Matches	Examples using the file t above
. (period)	any single character	grep '...' matches all but the second line of t
*	0 or more of the preceding character (the character to the left of the *). If * is the first character in the RE, it matches a literal *.	* is a repetition operator: it repeats the character before it. grep 'c*d' matches any line with a d (0 or more c's followed by a d); grep 'c.*d' matches the last two lines (c followed by 0 or more of any character, followed by a d)
[[:class:]]	one character that is a member of class. Commonly used classes are alpha, digit, space, upper, lower, alnum, punct	grep '[[:digit:]]' matches the last two lines; grep '[[:digit:]][[:digit:]]' matches the last line; grep '[[:digit:]][[:alpha:]]' matches the third line
[^abc]	any one character except a, b, or c	grep '[^d]' matches every line (since each line has a character that is not d); grep '[^d]$' matches all except the third line; grep '[^[:alpha:]]' matches the last two lines (lines that have a non-alphabetic character)
^ $	anchors. ^ matches the beginning of the line. $ matches the end of the line	grep '^a' matches all but the second line; grep 'c$' matches the first two lines; grep '[[:digit:]]$' matches the last line
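The examples in the table can be checked with a short script. This is a sketch: it recreates the four-line file t in a temporary location, and the helper name matches is made up for this illustration.

```shell
#!/bin/sh
# Recreate the four-line file t in a temporary location.
t=$(mktemp)
trap 'rm -f "$t"' EXIT
printf 'abc\nbc\nabc1d\nabcd12\n' > "$t"

# matches RE: print the line numbers of t that the BRE matches,
# joined with commas (grep -n prefixes each match with "number:").
matches() {
    grep -n "$1" "$t" | cut -d: -f1 | paste -s -d, -
}

# Try several of the patterns from the table. Each one is single-quoted
# so the shell passes it to grep unchanged.
for re in '...' 'c*d' 'c.*d' '[[:digit:]][[:alpha:]]' '[^d]$' '^a' 'c$'
do
    printf '%-24s matches line(s) %s\n' "$re" "$(matches "$re")"
done
```

The line numbers printed should agree with the descriptions in the table; for example, '...' should report every line except line 2.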

Part One

Using the file input1, write commands to output only the lines with the following characteristics:

that contain the word hello anywhere on the line
that start with the word hello
that start with any number (any digit)
that end with the word hello
that end with any alphabetic letter (upper- or lowercase) or a question mark
that end with a period (be careful here)
that contain only the word hello (it's the only thing on the line)
that contain only numbers
that contain only numbers, dashes and space characters
that contain more than 9 characters (that is, at least 10; a character can be anything)
that start with any whitespace character
that contain a string. This is anything within double quotes. Allow empty strings like ""
repeat the last command, but do not allow empty strings.
that contain a phone number. This is three digits followed by a dash followed by four digits. Notice that this outputs phone numbers with area codes as well.
This time your phone number should not have an area code - only the three-digit, dash, four-digit local phone number. (You can assume that your phone number is preceded by a whitespace character.)
Last, allow your phone number to be seven consecutive digits as well as the three-digit, dash, four-digit type.

Part Two

In this part we use a delimited file named Depts. It is in the samples directory discussed above. Look at the file Depts. Its format is DeptID:DeptName:EmpID:EmpName. The EmpID is an integer.

Write commands to output only the lines with the following characteristics:

The DeptID begins with an E
The DeptID has exactly two digits
The DeptName starts with M
The DeptName is more than one [alphabetic] word. The words can be separated by multiple spaces.
The EmpID is three digits

Part Three

In this part we will practice matching lines from other delimited files. The first file, named sorttest, uses the '#' character as the delimiter and has five fields. Start by examining the sorttest file in the samples directory. Notice that each field has a different format. This, coupled with which field we are interested in, enables us to make simplifying assumptions when working with problems. (We will assume the sorttest file is much larger, and this is just a representative sample, so we must be conservative about our assumptions.)

Example:

Output the lines whose last field is Administrator (exactly).

Solution:

Since we are interested in the last field, we know that the last field is preceded by # and followed by the end of the line. We can use these facts to write a simple RE:

grep '#Administrator$' sorttest
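To see why the delimiter and the anchor both matter, here is a sketch that runs the same RE against a small made-up file. The three records below are hypothetical and are NOT the contents of the real sorttest file; they only share its '#' delimiter and five-field layout.

```shell
#!/bin/sh
# Hypothetical five-field, '#'-delimited sample (not the real sorttest).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1#alpha#X1#plain#Administrator
2#beta#X2#Administrator#Clerk
3#gamma#X3#plain#Senior Administrator
EOF

# Unanchored: matches the word in any field, or as part of a longer field.
loose=$(grep -c 'Administrator' "$sample")

# Anchored: the leading # marks the start of the last field and $ marks
# the end of the line, so the last field must be exactly Administrator.
exact=$(grep '#Administrator$' "$sample")

echo "unanchored pattern matched $loose lines"
echo "anchored pattern matched:  $exact"

rm -f "$sample"
```

Only the first record survives the anchored pattern: the second has Administrator in the wrong field, and the third has extra text between the # and the word.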

Output lines whose third field is D14.
Output lines whose first field is a three-digit number.
Output lines whose next-to-last field has at least one uppercase letter in it.

Next we will use a standard system file, the /etc/passwd file, to do a few more interesting problems. Take a look at this file using tail /etc/passwd. You will see lines that look like this:

gboyd:x:3496:208:Unix/Linux Guy:/users/gboyd:/bin/bash

where the fields are username, pass, userid, groupid, gecos, homedir, shell.

We are going to combine our regular expressions with other tools to extract fields from records we specify.
Output the shell field of the user gboyd
Output the homedir field of the user cmetzler
Output the username field of the account with the userid 10025
Output all the usernames whose groupid field is 554
Output all the usernames whose gecos field is empty
Output the username field of all users whose userid is five digits and whose shell is not /bin/bash
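One way to combine a regular expression with another tool is to let grep select the record and cut extract the field. The sketch below assumes cut as the second tool and uses a made-up passwd-format file; the users alice, alice2 and bob are hypothetical and do not come from the real /etc/passwd.

```shell
#!/bin/sh
# Hypothetical passwd-format records (made-up users, not real accounts).
pw=$(mktemp)
cat > "$pw" <<'EOF'
alice:x:1001:100:Alice Example:/home/alice:/bin/bash
alice2:x:1002:100:Second Alice:/home/alice2:/bin/sh
bob:x:1003:200::/home/bob:/bin/bash
EOF

# Step 1: grep selects the record. ^ pins the match to the first field
# and the trailing : ends that field, so alice2 is not selected.
# Step 2: cut splits the selected line on : and prints field 5 (gecos).
gecos=$(grep '^alice:' "$pw" | cut -d: -f5)
echo "$gecos"

rm -f "$pw"
```

The same two-step shape applies to the other fields: change the RE to pick different records and the -f number to pick a different field.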

Turning in your exercise

For the original version of this exercise, as well as solutions, refer to Greg Boyd's handout.

Submit your answers as an ordinary text file on Canvas.