In-Class Lab 11: Basic Regular Expressions

The data files used in these exercises are in the directory /pub/cs/grwoo/cs160a/samples/Data on hills. Make sure you examine the data file, run your command, and examine your output carefully to determine if your command works as expected.

All parts of this exercise set require basic regular expressions (BREs), and do not require 'turning on' the extended regular expression operators using the -E option.

Begin by reviewing the Basic Regular Expressions below:

Notes

BREs are understood by every Unix command that understands regular expressions, particularly grep, sed, more and vi.

Consider the file t below:

$ cat t
abc
bc
abc1d
abcd12
      
Each entry below gives the operator, what it matches, and examples using the file t above.

. (period)

any single character

grep '...' matches all but the second line of t

*

0 or more of the preceding character (the character to the left of the *). If * is the first character in the RE, it matches a literal *

* is a repetition operator. It repeats the character before it

grep 'c*d' matches any line with a d (0 or more c's followed by a d)

grep 'c.*d' matches the last two lines. (c followed by 0 or more of any character followed by a d)

[[:class:]]

one character that is a member of class. Commonly-used classes are alpha, digit, space, upper, lower, alnum, punct

grep '[[:digit:]]' matches the last two lines

grep '[[:digit:]][[:digit:]]' matches the last line

grep '[[:digit:]][[:alpha:]]' matches the third line

[^abc]

one character that is any except a, b or c

grep '[^d]' matches every line (since each line has a character that is not d)

grep '[^d]$' matches all except the third line

grep '[^[:alpha:]]' matches the last two lines. (Lines that have a non-alphabetic character.)

^ $

anchors. ^ matches the beginning-of-line. $ matches the end-of-line

grep '^a' matches all but the second line

grep 'c$' matches the first two lines

grep '[[:digit:]]$' matches the last line.
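The examples above can be checked directly. The following is a minimal sketch that recreates the file t in a scratch directory and runs a few of the commands. The directory and the variable names are illustrative and not part of the lab; grep -c is used where only the number of matching lines matters.

```shell
# Recreate the four-line file t from the notes in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' abc bc abc1d abcd12 > "$tmpdir/t"

# '...' needs three characters in a row, so only the line bc fails to match.
three_chars=$(grep -c '...' "$tmpdir/t")

# 'c.*d' : a c, then 0 or more of any character, then a d.
c_then_d=$(grep 'c.*d' "$tmpdir/t")

# '[^d]$' : the last character on the line is anything except d.
not_d_at_end=$(grep -c '[^d]$' "$tmpdir/t")

# '[[:digit:]]$' : the line ends with a digit.
ends_in_digit=$(grep '[[:digit:]]$' "$tmpdir/t")

echo "lines with at least three characters: $three_chars"
echo "lines matching c.*d:"
echo "$c_then_d"
echo "lines not ending in d: $not_d_at_end"
echo "lines ending in a digit: $ends_in_digit"

rm -r "$tmpdir"
```

Comparing each result with the line of t you expected to match is the same habit the exercises below ask for.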

Part One

Using the file input1, write commands to output only the lines with the following characteristics:

  1. that contain the word hello anywhere on the line

  2. that start with the word hello

  3. that start with any number (any digit)

  4. that end with the word hello

  5. that end with any alphabetic letter (upper- or lowercase) or a question mark

  6. that end with a period (be careful here)

  7. that contain only the word hello (it's the only thing on the line)

  8. that contain only numbers

  9. that contain only numbers, dashes and space characters

  10. that contain more than 9 characters (that is, at least 10 characters; a character can be anything)

  11. that start with any whitespace character

  12. that contain a string. This is anything within double quotes. Allow empty strings like ""

  13. repeat the last command, but do not allow empty strings.

  14. that contain a phone number. This is three digits followed by a dash followed by four digits. Notice that this outputs phone numbers with area codes as well.

  15. This time your phone number should not have an area code - only the three digit, dash, four digit local phone number. (You can assume that your phone number is preceded by a whitespace character.)

  16. Last, allow your phone number to be seven consecutive digits as well as the three digit dash four digit type.

Part Two

In this part we use a delimited file named Depts. It is in the samples directory discussed above. Look at the file Depts. Its format is DeptID:DeptName:EmpID:EmpName. The EmpID is an integer.

Write commands to output only the lines with the following characteristics:

  1. The DeptID begins with an E

  2. The DeptID has exactly two digits

  3. The DeptName starts with M

  4. The DeptName is more than one [alphabetic] word. The words can be separated by multiple spaces.

  5. The EmpID is three digits

Part Three

In this part we will practice matching lines from other delimited files. The first file, named sorttest, uses the '#' character as the delimiter and has five fields. Start by examining the sorttest file in the samples directory. Notice that each field has a different format. This, coupled with which field we are interested in, enables us to make simplifying assumptions when working with problems. (We will assume the sorttest file is much larger, and this is just a representative sample, so we must be conservative about our assumptions.)

Example:

Output the lines whose last field is Administrator (exactly).

Solution:

Since we are interested in the last field, we know that the last field is preceded by # and followed by the end of the line. We can use these facts to write a simple RE:

grep '#Administrator$' sorttest
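To see why both the # and the $ matter, here is a small sketch using made-up records in a hypothetical file named sample. These records are invented for illustration and are not the contents of sorttest. Without the anchors, the pattern also matches a longer last field and an Administrator that appears in an earlier field.

```shell
tmpdir=$(mktemp -d)

# Made-up five-field records delimited by '#'; NOT the real sorttest data.
cat > "$tmpdir/sample" <<'EOF'
101#Ann#X1#NY#Administrator
102#Bob#X2#LA#Database Administrator
103#Cy#X1#SF#Administrator II
104#Administrator#X3#NY#Clerk
EOF

# Unanchored: every line contains the text somewhere, so all four match.
loose=$(grep -c 'Administrator' "$tmpdir/sample")

# Anchored to the delimiter and the end of line: the last field must be
# exactly Administrator.
exact=$(grep '#Administrator$' "$tmpdir/sample")

echo "unanchored matches: $loose"
echo "exact last-field match: $exact"

rm -r "$tmpdir"
```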

  1. Output lines whose third field is D14

  2. Output lines whose first field is a three digit number.

  3. Output lines whose next-to-last field has at least one uppercase letter in it

    Next we will use a standard system file, the /etc/passwd file, to do a few more interesting problems. Take a look at this file using tail /etc/passwd. You will see lines that look like this:

    gboyd:x:3496:208:Unix/Linux Guy:/users/gboyd:/bin/bash

    where the fields are username, pass, userid, groupid, gecos, homedir, shell

    We are going to combine our regular expressions with other tools to extract fields from the records we specify.

  4. Output the shell field of the user gboyd

  5. Output the homedir field of the user cmetzler

  6. Output the username field of the account with the userid 10025

  7. Output all the usernames whose groupid field is 554

  8. Output all the usernames whose gecos field is empty

  9. Output the username field of all users whose userid is five digits and whose shell is not /bin/bash
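One way to combine a regular expression with another tool, as introduced before problem 4, is to let grep select the record and let cut extract the field. The sketch below is an assumption about the approach, not the handout's solution. It uses a made-up file in the /etc/passwd format (the file name passwd.sample and the second record are invented) and prints the gecos field of gboyd.

```shell
tmpdir=$(mktemp -d)

# Sample records in /etc/passwd format. Only the gboyd line comes from the
# lab text; the gboyd2 line is invented to show why the anchors matter.
cat > "$tmpdir/passwd.sample" <<'EOF'
gboyd:x:3496:208:Unix/Linux Guy:/users/gboyd:/bin/bash
gboyd2:x:3497:208:Another User:/users/gboyd2:/bin/sh
EOF

# '^gboyd:' ties the username to the start of the line and to the first
# delimiter, so gboyd2 is not selected. cut then prints field 5 (gecos).
gecos=$(grep '^gboyd:' "$tmpdir/passwd.sample" | cut -d: -f5)

echo "$gecos"

rm -r "$tmpdir"
```

Changing the field number given to cut -f selects a different column from the same record.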

Turning in your exercise

For the original version of this exercise, as well as solutions, refer to Greg Boyd's handout.

Submit your answers as an ordinary text file on Canvas