Regular Expression
A regular expression is a sequence of characters that specifies a pattern. It is used to search for lines of text that contain that pattern.
Sample.txt
Basic regular expression
vim, sed, grep, more
Anchors are used to specify the position of the pattern in relation to a line of text
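^ - matches the beginning of the line
^A, "A" at the beginning of a line
$ - matches the end of the line
A$, "A" at the end of a line
The "^" is only an anchor if it is the first character of the pattern, and the "$" is only an anchor if it is the last character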
If you need to match a "^" at the beginning of the line, or a "$" at the end of a line, you must escape the special characters with a backslash
\< and \>, represent the start and end of a word
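The anchors above can be tried out with grep. This is a minimal sketch: the file contents are hypothetical (the original sample.txt is not reproduced in these notes), and \< and \> are assumed to be available, as they are in GNU grep.

```shell
# Hypothetical data standing in for sample.txt.
sample=$(mktemp)
printf '%s\n' 'Fred apple 20' 'Susy mellon 12' 'SusySusy grape 7' 'Anna pear 2A' > "$sample"

# ^ anchors the match to the start of the line.
starts_with_s=$(grep -c '^S' "$sample")

# $ anchors the match to the end of the line.
ends_with_a=$(grep 'A$' "$sample")

# \< and \> restrict the match to a whole word, so "SusySusy" is skipped.
whole_word=$(grep '\<Susy\>' "$sample")

# A backslash makes "$" literal, so this looks for a real dollar sign.
literal_dollar=$(printf '%s\n' 'cost $5' 'cost 5' | grep -c '\$5')

echo "$starts_with_s"
echo "$ends_with_a"
echo "$whole_word"
echo "$literal_dollar"
rm -f "$sample"
```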
Character Sets match one or more characters in a single position
- . (dot) - a single character
- Specifying a Range of Characters with [...]
- [0-9], a single digit
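- [a-zA-Z], a single letter
- [^agd] - a single character that is not "a", "g", or "d"
- [0-9-z], ambiguous: POSIX leaves a range that begins where another range ends undefined, so avoid this form
- []0-9a-], any digit, or a "]", an "a", or a "-". In grep and sed a backslash is literal inside brackets, so put "]" first and "-" last instead of escaping them
- \( and \), remember the enclosed pattern so that it can be referred to later as \1, \2, and so on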
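A short sketch of the character sets listed above, run through grep. The word list is made up for illustration, and the last pattern uses the POSIX convention of placing "]" first and "-" last inside the brackets.

```shell
# Hypothetical two-character words, one per line.
words=$(mktemp)
printf '%s\n' 'a1' 'b-' 'c]' 'dg' 'e5' > "$words"

# [0-9] matches a single digit.
digits=$(grep -c '[0-9]' "$words")

# [^agd] matches a single character that is not a, g, or d.
not_agd=$(grep -c '^[^agd]' "$words")

# . matches any single character.
any_then_one=$(grep -c '^.1$' "$words")

# A literal "]" goes first and a literal "-" goes last.
special=$(grep '[]-]' "$words")

echo "$digits $not_agd $any_then_one"
echo "$special"
rm -f "$words"
```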
Modifiers specify how many times the previous character set is repeated
- * - the preceding character matches 0 or more times
- \{n,m\} - the preceding character matches at least n times and not more than m times. Any number between 0 and 255 can be used. If the second number is omitted, as in \{n,\}, there is no upper limit
- modifiers like "*" and "\{1,5\}" only act as modifiers if they follow a character set
^*, any line starting with an asterisk
^\{4,8\}, any line starting with "{4,8}"
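The modifiers above can be checked with a few basic regular expressions. The input lines are invented for this sketch.

```shell
# Hypothetical test strings.
data=$(mktemp)
printf '%s\n' 'ac' 'abc' 'abbbc' '*star' > "$data"

# b* allows zero or more "b".
zero_or_more=$(grep -c '^ab*c$' "$data")

# b\{1,2\} allows one or two "b".
one_or_two=$(grep '^ab\{1,2\}c$' "$data")

# b\{3,\} allows three or more "b".
three_plus=$(grep '^ab\{3,\}c$' "$data")

# "*" directly after "^" has nothing to repeat, so it is a literal asterisk.
literal_star=$(grep '^*' "$data")

echo "$zero_or_more"
echo "$one_or_two"
echo "$three_plus"
echo "$literal_star"
rm -f "$data"
```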
Extended regular expression
awk, egrep
( | ), match a choice of patterns
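'^(From|Subject)', lines starting with "From" or "Subject". POSIX basic regular expressions have no alternation, so the closest pattern is "^[FS][ru][ob][mj]e*c*t*", which also matches unintended strings such as "Fubj"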
? - the preceding character matches 0 or 1 times only
+ - the preceding character matches 1 or more times
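A sketch of alternation, "?" and "+" using grep -E. The input lines are hypothetical, and the "Fubj" line is included only to show that the basic-regular-expression approximation matches more than the alternation does.

```shell
# Hypothetical input lines.
mail=$(mktemp)
printf '%s\n' 'From: fred' 'Subject: hi' 'Fubj: odd' 'color' 'colour' 'colouur' > "$mail"

# ( | ) picks one of several patterns.
alternation=$(grep -E -c '^(From|Subject)' "$mail")

# The BRE approximation also accepts "Fubj".
bre_approx=$(grep -c '^[FS][ru][ob][mj]e*c*t*' "$mail")

# ? allows zero or one "u".
optional_u=$(grep -E -c '^colou?r$' "$mail")

# + requires one or more "u".
one_plus_u=$(grep -E -c '^colou+r$' "$mail")

echo "$alternation $bre_approx $optional_u $one_plus_u"
rm -f "$mail"
```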
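\w, matches word characters
\W, matches nonword characters
\b, word boundary
\B, nonword boundary
\s, whitespace
\S, nonwhitespace
\d, digit
\D, nondigit
\A, beginning of a string
The backslash shorthands above are extensions rather than POSIX, so support varies by tool. In particular \d, \D, and \A come from Perl-style syntax and are not understood by grep -E or awk
POSIX character sets
[:alnum:], alphanumeric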
[:cntrl:], control character
[:lower:], lower case character
[:space:], whitespace
[:alpha:], alphabetic
[:digit:], digit
[:print:], printable character
[:upper:], upper case character
[:blank:], space and tab
[:graph:], printable and visible characters
[:punct:], punctuation
[:xdigit:], hexadecimal digit
grep "[[:digit:]]" sample.txt, a class name is only valid inside a bracket expression, hence the double brackets
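The POSIX classes can be combined with other bracket contents. A minimal sketch with made-up input:

```shell
# Hypothetical input lines.
f=$(mktemp)
printf '%s\n' 'Fred 20' 'susy' 'MARK' 'a b' > "$f"

# Lines containing at least one digit.
has_digit=$(grep -c '[[:digit:]]' "$f")

# Lines made up of upper case letters only.
all_upper=$(grep '^[[:upper:]][[:upper:]]*$' "$f")

# Lines containing whitespace.
has_space=$(grep -c '[[:space:]]' "$f")

# Two classes inside one bracket expression: lower case letters or digits only.
lower_or_digit=$(grep -c '^[[:lower:][:digit:]]*$' "$f")

echo "$has_digit $has_space $lower_or_digit"
echo "$all_upper"
rm -f "$f"
```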
awk
pattern {action}
AWK is line oriented
The default pattern is something that matches every line
awk '/Fred/ {print $3}' sample.txt
BEGIN, specify actions to be taken before any lines are read
END, specify actions to be taken after the last line is read
BEGIN { print "START" }
{ print }
END { print "STOP" }
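The BEGIN, main, and END rules above can be run as one inline program. The two input lines are hypothetical stand-ins for sample.txt.

```shell
# BEGIN runs before any input, the bare { print } runs for every line,
# and END runs after the last line.
framed=$(printf '%s\n' 'Fred apple 20' 'Susy mellon 12' |
  awk 'BEGIN { print "START" }
       { print }
       END { print "STOP" }')

# A pattern limits the action to matching lines.
third_field=$(printf '%s\n' 'Fred apple 20' 'Susy mellon 12' |
  awk '/Fred/ { print $3 }')

echo "$framed"
echo "$third_field"
```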
awk Variable | Meaning |
$0 | Whole input record (line) |
$1 | First field of the input record |
FILENAME | Name of the current input file |
RS | Input record separator |
OFS | Output field separator string |
ORS | Output record separator string |
NF | Number of fields in the current record |
NR | Number of records read so far |
OFMT | Output format for numbers |
FS | Input field separator |
awk '{print "# of field: " NF " # of records: " NR}' sample.txt
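A small sketch of how the separator and counter variables interact. The colon-separated records are invented for the example.

```shell
# Hypothetical colon-separated records.
users=$(mktemp)
printf '%s\n' 'fred:x:1001' 'susy:x:1002' > "$users"

# FS splits the input on ":", OFS joins the printed fields with "-".
swapped=$(awk 'BEGIN { FS = ":"; OFS = "-" } { print $3, $1 }' "$users")

# NR is the record number, NF the number of fields in that record.
counts=$(awk -F: '{ print NR, NF }' "$users")

# $NF is the last field; in END, NR holds the total number of records.
last_fields=$(awk -F: '{ print $NF }' "$users")
total_records=$(awk 'END { print NR }' "$users")

echo "$swapped"
echo "$counts"
echo "$last_fields"
echo "$total_records"
rm -f "$users"
```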
Commands
- if ( conditional ) statement [ else statement ]
- while ( conditional ) statement
- for ( expression ; conditional ; expression ) statement
- for ( variable in array ) statement
- break
- continue
- { [ statement ] ...}
- variable=expression
- print [ expression-list ] [ > expression ]
- printf format [ , expression-list ] [ > expression ]
- next
- exit
Arithmetic
awk '{print $3, $3*10}' sample.txt
awk '{a=$3; b=$3*10; print a, b}' sample.txt
awk '{a=$3; total=total+a; print "Total:", $3, total}' sample.txt
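The running total above is often reported once, in an END rule. A sketch assuming a hypothetical three-column file whose third field is numeric:

```shell
# Hypothetical data: name, item, amount.
sales=$(mktemp)
printf '%s\n' 'Fred apple 20' 'Susy mellon 12' 'Mark orange 4' > "$sales"

# Arithmetic on a field.
scaled=$(awk 'NR == 1 { print $1, $3 * 10 }' "$sales")

# Accumulate in the main rule, report in END.
summary=$(awk '{ total += $3 }
               END { printf "total=%d avg=%.2f\n", total, total / NR }' "$sales")

echo "$scaled"
echo "$summary"
rm -f "$sales"
```

Uninitialized awk variables start as 0 in numeric context, so total needs no explicit setup.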
Regular expression
~, match
!~, not match
# f.awk
{
if ($1 ~ /Fred/)
print $1, $3
else
print $0
}
awk -f f.awk sample.txt
# a.awk
/Susy/ {print $1, $3}
awk -f a.awk sample.txt, run the awk commands stored in the script a.awk
# b.awk
BEGIN {
print "--------------------------"
print "-------Sample.txt---------"
print "--------------------------"
}
{
total = total + $3
}
END {
printf "Total: %10d\n", total
}
awk -f b.awk sample.txt
Flow control
# c.awk
BEGIN {
print "Input an arithmetic expression: "
}
{
    if ($2 == "+")
        result = $1 + $3
    else if ($2 == "*")
        result = $1 * $3
    else {
        print "Operator is illegal ..."
        failed = 1
        exit 1
    }
}
END {
    # exit still runs the END rule, so skip the result after an error
    if (!failed)
        printf "Result: %10d\n", result
}
awk -f c.awk
1 + 2
Ctrl + D
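The same if/else logic can be driven from a pipe instead of the keyboard. This is a sketch, not the c.awk script itself: the shell function name calc is hypothetical and the result is printed directly in the main rule.

```shell
# calc is a hypothetical helper wrapping an inline awk program.
calc() {
  awk '{
    if ($2 == "+")
      print $1 + $3
    else if ($2 == "*")
      print $1 * $3
    else {
      print "Operator is illegal ..."
      exit 1
    }
  }'
}

sum=$(echo '1 + 2' | calc)
product=$(echo '4 * 5' | calc)

# An unsupported operator makes awk exit with status 1.
bad_status=0
echo '4 / 5' | calc > /dev/null || bad_status=$?

echo "$sum $product $bad_status"
```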
Loop
# d.awk
BEGIN {
print "==========Loop==========="
}
{
sum = 0
for( i = 0; i < 10; i++)
{
sum += i
}
printf "Total: %10d\n", sum
exit  # stop after the first input line; a nonzero status would signal an error
}
# e.awk
BEGIN {
print "==================="
}
{
for(j = 1; j <= NF; j++)
printf "%10s", $j
printf "\n"
}
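A loop over the fields of each record, in the style of e.awk, can also accumulate a per-line sum. The numeric input is made up for this sketch.

```shell
# Sum every field of each record; NF is the upper bound of the loop.
row_sums=$(printf '%s\n' '1 2 3' '10 20' |
  awk '{
    sum = 0
    for (i = 1; i <= NF; i++)
      sum += $i
    print NR ": " sum
  }')

echo "$row_sums"
```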
Associative array
# g.awk
BEGIN {
print "===========User List==========="
idx = 0
}
{
userName[idx] = $1
idx++
}
END {
for(i = 0; i < idx; i++)
print userName[i];
}
# h.awk
BEGIN {
print "===========User List==========="
}
{
userName[$1] = $3
}
END {
for(n in userName)
print n, userName[n];
}
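An associative array indexed by a field is a common way to total values per key. A sketch with hypothetical data; the array name spent is an arbitrary choice. The order of a "for (key in array)" loop is not specified, so the output is piped through sort.

```shell
# Hypothetical data: name, item, amount.
orders=$(mktemp)
printf '%s\n' 'Fred apple 20' 'Susy mellon 12' 'Fred pear 5' > "$orders"

# spent[name] accumulates the third field for each distinct first field.
totals=$(awk '{ spent[$1] += $3 }
              END { for (name in spent) print name, spent[name] }' "$orders" | sort)

echo "$totals"
rm -f "$orders"
```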
Numerical Functions
# i.awk
BEGIN {
print "Arithmetic functions"
print "===================="
}
{
printf "%10s%10f\n", $1, cos($3)
}
# j.awk
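BEGIN {
srand()  # seed the generator; without this, many awk implementations repeat the same sequence on every run
print "Random Number"
print "===================="
}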
{
printf "%10s%10f\n", $1, rand()
}
String Functions
index(string,search)
length(string)
split(string,array,separator)
{
n = split($0, array, " ")
for (i = 1; i <= n; i++)
printf "%10s", array[i]
printf "\n"
}
substr(string,position[,length])
sub(regex,replacement,string), substitute the first match
gsub(regex,replacement,string), substitute every match
{
if(gsub("[aeiou]", "-", $0))
print $0
}
match(string,regex)
{
if (match($1, /Fred/))
printf "%10s%10f\n", $1, rand()
}
system(command), run a shell command and return its exit status
{
if(system("cat n.awk") != 0)
print "Command does not work ..."
}
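The string functions listed above can be exercised in a single BEGIN rule, which needs no input file. The test string is arbitrary.

```shell
result=$(awk 'BEGIN {
  s = "Fred apple 20"
  print index(s, "apple")            # position of the substring: 6
  print length(s)                    # 13
  print substr(s, 6, 5)              # apple
  n = split(s, parts, " ")
  print n, parts[2]                  # 3 apple
  sub(/e/, "E", s); print s          # first match only: FrEd apple 20
  gsub(/[aeiou]/, "-", s); print s   # every match: FrEd -ppl- 20
  if (match(s, /[0-9]+/))
    print RSTART, RLENGTH            # start and length of the match: 12 2
}')

echo "$result"
```

match() also sets the built-in variables RSTART and RLENGTH, which is why they are printed after the call.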
sed
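/g, global replacement (flag of the s command)
/p, print (flag of the s command, also a command on its own)
/w file, write to a file
/I, ignore case (GNU sed)
d, delete the line (a command, not a flag)
!, negate the address, so the command applies to the lines that do not match
-n, do not print anything unless an explicit request to print is found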
Substitution
sed 's/Fred/Lin/g' sample.txt > temp.txt, replace "Fred" by "Lin"
sed 's/\(Susy\)\{1,\}/Lin/g' sample.txt, substitute one or more "Susy" with one "Lin"
sed 's/Susy/(&)/g' sample.txt, use & to represent the found string
sed -E 's/[0-9]+/(& &)/g' sample.txt, use extended regular expressions with "-E" (GNU sed also accepts "-r")
sed 's/^\([a-zA-Z]\{4\}\) .*\([0-9][0-9]*\)/\2 \1/g' sample.txt, remember patterns 1 and 2 and replace the match with pattern 2 followed by pattern 1. Because ".*" is greedy, pattern 2 captures only the last digit
sed 's/fred/lin/Ig' sample.txt, substitute "Fred", "FRED", etc. by "lin" (the I flag is a GNU extension)
sed -e 's/a/A/' -e 's/b/B/' sample.txt, multiple commands in one line
sed '2,8 s/Susy/Lin/g' sample.txt, substitute "Susy" from line 2 to line 8 by "Lin"
sed '/Fred/s/20/10/g' sample.txt, substitute "20" by "10" in the line containing "Fred"
sed '/Fred/s//Lin/g' sample.txt, substitute "Fred" by "Lin" in the line containing "Fred" (an empty pattern reuses the last regular expression)
sed '/^[a-zA-Z]\{4\}/s//Lin/g' sample.txt, replace the first four letters of each line that starts with four letters by "Lin"
sed '/^$/d', delete blank lines
who | sed -n '/lchen/p', search 'lchen' in the output of command who
sed -n '/Susy/p', search the lines containing "Susy" and print them out
sed -n '/Fred/!p' sample.txt, print the line which does not contain "Fred"
sed '10q' sample.txt, quit after printing line 10
sed '/Susy/ i\ Add this line before every line with Susy' sample.txt, insert a line before each line containing "Susy" (one-line form accepted by GNU sed)
sed -n '/Susy/=' sample.txt, print the line numbers of the lines containing "Susy"
sed 'y/abcd/ABCD/' sample.txt, translate "a" to "A", "b" to "B", and so on
sed -f s.sed sample.txt, run the sed commands stored in the script s.sed
1i\
Substitute the price in the line containing "Fred"
/Fred/s/20/10/g
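A few of the sed forms above, run against hypothetical data so that the results can be inspected. Only POSIX features are used here.

```shell
# Hypothetical data with one blank line.
f=$(mktemp)
printf '%s\n' 'Fred apple 20' 'Susy mellon 12' '' 'Fred pear 20' > "$f"

# & stands for the matched text.
wrapped=$(sed -n 's/Susy/(&)/p' "$f")

# \( \) remember parts of the match; \1 and \2 refer back to them.
swapped=$(sed -n '1s/^\([A-Za-z]*\) .* \([0-9][0-9]*\)$/\2 \1/p' "$f")

# An address restricts the substitution to matching lines.
addressed=$(sed -n '/pear/s/20/10/p' "$f")

# d deletes the addressed lines; = prints line numbers.
non_blank=$(sed '/^$/d' "$f" | wc -l | tr -d ' ')
line_no=$(sed -n '/Susy/=' "$f")

echo "$wrapped"
echo "$swapped"
echo "$addressed"
echo "$non_blank $line_no"
rm -f "$f"
```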
vim
/pattern, search for text matching a pattern
/Fred, find "Fred"
/\<Susy\>, search for the whole word "Susy", not "SusySusy"
/\s\d$, search for a single digit preceded by whitespace at the end of the line
/[aeiou]\{2\}, search for two consecutive vowels
/\<1\d\>, search for a two-digit number starting with "1"
/".\{-\}", non-greedy search for the content between two double quotation marks
:range s[ubstitute]/pattern/string/cgiI
range
- %, the whole file
- number, an absolute line number
- ., the current line
- $, the last line in the file
- 't, position of mark "t"
- /pattern/, the next line where text "pattern" matches
- ?pattern?, the previous line where text "pattern" matches
cgiI
- c, confirm each substitution
- g, replace all occurrences in the line
- i, ignore case for the pattern
- I, don't ignore case for the pattern
:/me/ s/me/lin/g, substitute "me" by "lin" in the next line where the pattern matches
:10,15 s/me/lin/g, substitute "me" from line 10 to line 15
:10+1,15 s/me/lin/g, substitute "me" from line 11 to line 15
:/me/ y, yank (copy) the next line where the pattern matches
:// normal p, go to the next line matching the last search pattern and put (paste) the yanked text after it
:%s/me/lin/g, substitute "me" in the whole file by "lin"
:%s/[aeiou]\{2\}/VOWEL/g, replace two consecutive vowels with "VOWEL"
:%s/\<Susy\>/TEMP/g, substitute the whole word "Susy" with "TEMP"
:%s/\d\{2,\}$/100/g, replace a number of two or more digits at the end of a line with 100
:%s/\(Susy\)\{2,\}/Susy/g, replace repeated "Susy" with a single "Susy"
grep
grep -n "mellon" sample.txt, match "mellon" in sample.txt and display the line numbers
grep -c "mellon" sample.txt, display how many lines match the pattern
grep -i "fred" sample.txt, make the search case insensitive
grep -v "mellon" sample.txt, take the complement of the regular expression
grep -l "mellon" *, print the filenames of files with lines which match the expression
grep --color=auto "^[A-K]" sample.txt, highlight the matched text
grep '[aeiou]\{2,\}' sample.txt, search for two or more consecutive vowels
grep "\<Susy\>" sample.txt, search for the whole word "Susy", not "SusySusy"
grep "\<2[0-9]\>" sample.txt, search for a two-digit number starting with "2"
grep "\(Susy\)\{2,\}" sample.txt, search for two or more consecutive "Susy"
grep "^[a-zA-Z]\{4\}\>" sample.txt, search for lines starting with a four-letter word
grep --color=auto "\s[[:digit:]]$" sample.txt, search for a single digit preceded by whitespace at the end of the line
cat /etc/passwd | grep root
dmesg | grep -n --color=auto 'eth'
grep -r 'energywise' *, search for the pattern in the current directory and its subdirectories
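The most common grep options above, applied to hypothetical data. The word-boundary pattern assumes GNU grep.

```shell
# Hypothetical data standing in for sample.txt.
f=$(mktemp)
printf '%s\n' 'Fred apple 20' 'Susy mellon 12' 'fred mellon 3' > "$f"

numbered=$(grep -n 'mellon' "$f")       # prefix each match with its line number
count=$(grep -c 'mellon' "$f")          # count matching lines
insensitive=$(grep -ic 'fred' "$f")     # ignore case while counting
inverted=$(grep -v 'mellon' "$f")       # lines that do not match

# A whole two-digit number: "3" is not matched.
two_digit=$(grep -c '\<[0-9][0-9]\>' "$f")

echo "$numbered"
echo "$count $insensitive $two_digit"
echo "$inverted"
rm -f "$f"
```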
egrep
egrep -n "mellon" sample.txt, match "mellon" in sample.txt and display the line numbers
egrep -c "mellon" sample.txt, display how many lines match the pattern
egrep -i "fred" sample.txt, make the search case insensitive
egrep -v "mellon" sample.txt, take the complement of the regular expression
egrep -l "mellon" *, print the filenames of files with lines which match the expression
egrep --color=auto "^[A-K]" sample.txt, highlight the matched text
egrep '[aeiou]{2,}' sample.txt, search for two or more consecutive vowels
egrep "\<Susy\>" sample.txt, search for the whole word "Susy", not "SusySusy"
egrep "\<2[0-9]\>" sample.txt, search for a two-digit number starting with "2"
egrep "(Susy){2,}" sample.txt, search for two or more consecutive "Susy"
egrep "^[a-zA-Z]{4}\>" sample.txt, search for lines starting with a four-letter word
egrep "(or|is|go)" sample.txt, search for lines containing "or", "is", or "go"
egrep "2$" sample.txt, search for lines ending with "2"
egrep '^[A-K]' sample.txt, search for lines starting with a letter from "A" to "K"
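The egrep patterns above can also be written with grep -E, which is the same matcher under a different name and is used in this sketch. The data is hypothetical, and the first two commands show that the basic and extended spellings of an interval select the same lines.

```shell
# Hypothetical data.
f=$(mktemp)
printf '%s\n' 'Fred apple 20' 'Susy mellon 12' 'SusySusy grape 7' 'Joan good 32' > "$f"

# Same interval, basic versus extended syntax.
bre=$(grep '[aeiou]\{2,\}' "$f")
ere=$(grep -E '[aeiou]{2,}' "$f")

# Grouping with a repeat count.
repeated=$(grep -E '(Susy){2,}' "$f")

# Alternation and an end-of-line anchor.
alternation=$(grep -E -c '(or|is|go)' "$f")
ends_with_2=$(grep -E -c '2$' "$f")

echo "$bre"
echo "$ere"
echo "$repeated"
echo "$alternation $ends_with_2"
rm -f "$f"
```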