8.1 Linux Advanced Text Processing Tools

/usr/games/banner -w79 "Happy Birthday, Marie" > marie.txt

Create an ASCII "banner" with a width of 79 characters. The output is sent to the file marie.txt. A funny, old-fashioned tool. Another utility for ASCII text art is figlet. E.g., figlet "Funny!" produces this in my terminal (I always use a fixed-width font to display ASCII art):

 _____                        _
|  ___|   _ _ __  _ __  _   _| |
| |_ | | | | '_ \| '_ \| | | | |
|  _|| |_| | | | | | | | |_| |_|
|_|    \__,_|_| |_|_| |_|\__, (_)
                         |___/



script

Start logging my current session in the text terminal into a text file named typescript (this is the default filename). The logging finishes when I type exit or press <Ctrl>d. Then, I can rename or e-mail (or do whatever I want with) the file typescript.


emacs

(in X terminal) The emacs text editor. An advanced and sophisticated text editor, seemingly for gurus only: "emacs is not just an editor, it is a way of living". Emacs surely seems rich or bloated, depending on your point of view. There are likely 3 versions of emacs installed on your system: (1) text-only: type emacs in a text (not X-windows) terminal (I avoid this like fire); (2) graphical mode: type emacs in an X-windows terminal (fairly usable even for a newbie if you take some time to learn it); and (3) X-windows mode: type xemacs in an X-windows terminal.


vi

The famous (notorious?) "vi" text editor (definitely not recommended for newbies). To exit "vi" (no changes saved) use these five characters: <ESC>:q!<Enter> I use the "kate &" (under X), "pico", or "nano" (command line) text editors and almost never need vi (well, unless I have to unmount the /usr subsystem and modify/edit some configuration files; then vi is the only editor available). To be fair, modern Linux distributions use vim (="Vi IMproved") in place of vi, and vim is somewhat better than the original vi. A GUI version of vi is also available (type gvim in an X terminal). Here is one response I have seen to the criticism that the vi interface is not "intuitive": "The only intuitive interface is the nipple. The rest must be learned." (Well, so much for MS Windows being an "intuitive" interface.)

Experts do like vi, but vi is definitely difficult unless you use it very often. Here is a non-newbie opinion on vi (http://linuxtoday.com/stories/16620.html):

"I was first introduced to vi in 1988 and I hated it. I was a freshman in college... VI seemed archaic, complicated and unforgiving... It is now 12 years later and I love vi, in fact it is almost the only editor I use. Why the change? I actually learned to use vi... Now I see vi for what it really is, a powerful, full featured, and flexible editor..."

For your entertainment, you might want to try the even more ancient-looking line editor ed (just type ed on the command line). Tools like these, however "inconvenient" in interactive use, can be very useful for automating file manipulation from within another program.

Brief introduction to vim (="Vi IMproved"), the modern Linux version of vi. The main reason a newbie like myself ever needs vi is for rescue situations--sometimes it is the only editor available. The most important thing to understand about vi is that it is a "modal" editor, i.e., it has a few modes of operation between which the user must switch. A quick reference is below; the four essential commands are <ESC>, i, :w, and :q!.

The commands to switch modes:

The key Enters the mode Remarks

<ESC> command mode (get back to the command mode from any editing mode)

i "insert" editing mode (start inserting before the current position of the cursor)


Copying, cutting and pasting (in the command mode):

v start highlighting text; then move the cursor to extend the highlighted selection

y copy highlighted text

x cut highlighted text

p paste text that has been cut/copied

Saving and quitting (from the command mode):

:w write (=save)

:w filename write the contents to the file "filename"

:x save and exit

:q quit (it will not let you quit if changes have not been saved)

:q! quit, discarding any unsaved changes (you will not be prompted)


nano

This is a brand new (March 2001) GNU replacement for pico. It works and looks like pico, but it is smaller, better, and licensed as expected for a decent piece of Linux software (i.e., under the General Public License, GPL).


ghex

(in X terminal) A simple hexadecimal editor. Another hexadecimal editor is hexedit (text-based, less user-friendly). Hex editors are used for editing binary (non-ASCII) files.

diff file1 file2 > patchfile

Compare contents of two files and list any differences. Save the output to the file patchfile.

sdiff file1 file2

Side-by-side comparison of two text files. Output goes to the "standard output" which normally is the screen.

patch file_to_patch patchfile

Apply the patch (a file produced by diff, which lists differences between two files) called patchfile to the file file_to_patch. If the patch was created using the previous command, I would use: patch file1 patchfile to change file1 to file2.
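A minimal round trip can be sketched with throwaway files (the names f1, f2, and patchfile are arbitrary):

```shell
# Two throwaway files that differ in one line
printf 'one\ntwo\n' > f1
printf 'one\nTWO\n' > f2

# Record the differences, then apply them to the first file
diff f1 f2 > patchfile
patch f1 patchfile

# f1 is now identical to f2
cat f1
```

After patching, cat f1 prints "one" and "TWO".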

grep filter

Search content of text files for matching patterns. It is definitely worth learning at least the basics of this command.

A simple example. The command:

cat * | grep my_word | more

will search all the files in the current working directory (except files starting with a dot) and print the lines which contain the string "my_word".

A shorter form to achieve the same may be:

grep my_word * |more

The patterns are specified using a powerful and standard notation called "regular expressions".
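A few grep options worth knowing from the start, shown on a throwaway file (fruits.txt is an arbitrary name):

```shell
# Build a small sample file
printf 'Apple\nbanana\nAPPLE pie\n' > fruits.txt

grep -ic 'apple' fruits.txt   # -i ignore case, -c count matching lines: prints 2
grep -vin 'apple' fruits.txt  # -v invert match, -n show line numbers: prints 2:banana
```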

There is also a "recursive" search (grep -r; many systems also provide it as the separate command rgrep). This will search all the files in the current directory and all its subdirectories for my_word and print the names of the files together with the matching lines:

grep -r my_word . | more

Regular expressions (regexpr)

Regular expressions are used for "pattern" matching in search, replace, etc. They are often used with utilities (e.g., grep, sed) and programming languages (e.g., perl). The shell command dir uses a slightly modified flavour of pattern matching (the two main differences are noted below). This brief writeup includes almost all the features of standard regular expressions--regular expressions are not as complicated as they might seem at first. Definitely worth a closer look.

In regular expressions, most characters just match themselves. So to search for string "peter", I would just use a searchstring "peter". The exceptions are so-called "special characters" ("metacharacters"), which have special meaning.

The regexpr special characters are: "\" (backslash), "." (dot), "*" (asterisk), "[" (bracket), "^" (caret, special only at the beginning of a string), "$" (dollar sign, special only at the end of a string). A character terminating a pattern string is also special for this string.

The backslash, "\" is used as an "escape" character, i.e., to quote a subsequent special character.

Thus, "\\" searches for a backslash, "\." searches for a dot, "\*" searches for the asterisk, "\[" searches for the bracket, "\^" searches for the caret even at the beginning of the string, "\$" searches for the dollar sign even at the end of the string.

Backslash followed by a regular (non-special) character may gain a special meaning. Thus, the symbols \< and \> match an empty string at the beginning and the end of a word, respectively. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word.
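For example, with GNU grep the word anchors behave like this ("cat" matched as a whole word only):

```shell
# -o prints only the matching part; \< and \> anchor the match to word edges
echo "cat concatenate catalog" | grep -o '\<cat\>'   # prints just: cat
```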

The dot, ".", matches any single character. [The dir command uses "?" in this place.] Thus, "m.a" matches "mpa" and "mea" but not "ma" or "mppa".

Any string is matched by ".*" (dot and asterisk). [The dir command uses "*" instead.] In general, any pattern followed by "*" matches zero or more occurrences of this pattern. Thus, "m*" matches zero or more occurrences of "m". To search for one or more "m", I could use "mm*".

The * is a repetition operator. Other repetition operators are used less often--here is the full list:

* the preceding item is to be matched zero or more times;

\+ the preceding item is to be matched one or more times;

\? the preceding item is optional and matched at most once;

\{n} the preceding item is to be matched exactly n times;

\{n,} the preceding item is to be matched n or more times;

\{n,m} the preceding item is to be matched at least n times, but not more than m times.
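The repetition operators above can be tried directly with grep:

```shell
# \{3\} matches exactly three of the preceding item (here, a digit)
printf 'a1\nab123\n12\n' | grep '[0-9]\{3\}'   # prints ab123

# mm* means one or more m (equivalent to m\+)
printf 'car\ncomma\n' | grep 'mm*'             # prints comma
```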

The caret, "^", means "the beginning of the line". So "^a" means "find a line starting with an a".

The dollar sign, "$", means "the end of the line". So "a$" means "find a line ending with an a".

Example. This command searches the file myfile for lines starting with an "s" and ending with an "n", and prints them to the standard output (screen):

cat myfile | grep '^s.*n$'

Any character terminating the pattern string is special, precede it with a backslash if you want to use it within this string.

The bracket, "[" introduces a set. Thus [abD] means: either a or b or D. [a-zA-C] means any character from a to z or from A to C.

Watch out for certain characters inside sets. Within a set, the only special characters are "[", "]", "-", and "^", and the combinations "[:", "[=", and "[.". The backslash is not special within a set.
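A quick illustration of sets, including the negating caret:

```shell
printf 'cat\ncot\ncut\n' | grep 'c[ao]t'    # prints cat and cot
printf 'cat\ncot\ncut\n' | grep 'c[^ao]t'   # ^ inside a set negates it: prints cut
```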

Useful categories of characters are (as defined by the POSIX standard): [:upper:] =upper-case letters, [:lower:] =lower-case letters, [:alpha:] =alphabetic (letters) meaning upper+lower, [:digit:] =0 to 9, [:alnum:] =alphanumeric meaning alpha+digits, [:space:] =whitespace meaning <Space>+<Tab>+<Newline> and similar, [:graph:] =graphically printable characters except space, [:print:] =printable characters including space, [:punct:] =punctuation characters meaning graphical characters minus alpha and digits, [:cntrl:] =control characters meaning non-printable characters, [:xdigit:] = characters that are hexadecimal digits.

Example. This command scans the output of the dir command, and prints lines containing a capital letter followed by a digit:

dir -l | grep '[[:upper:]][[:digit:]]'


tr

(=translation) A filter useful for replacing all instances of characters in a text file or for "squeezing" white space.

Example:

cat my_file | tr 1 2 > new_file

This command takes the content of the file my_file, pipes it to the translation utility tr, the tr utility replaces all instances of the character "1" with "2", the output from the process is directed to the file new_file.
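The "squeezing" mentioned above, plus two other common tr uses:

```shell
echo "too   many   spaces" | tr -s ' '   # -s squeezes repeated characters: too many spaces
echo "hello world" | tr 'a-z' 'A-Z'      # translate a range: HELLO WORLD
echo "good-bye" | tr -d '-'              # -d deletes characters: goodbye
```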


sed

(=stream editor) I use sed to filter text files. The pattern to match is typically included between a pair of slashes // and quoted.

For example, to print lines containing the string "1024", I may use:

cat filename | sed -n '/1024/p'

Here, sed filters the output from the cat command. The option "-n" tells sed to suppress all the incoming lines except those explicitly matching my expression. The sed action on a match is "p" = print.

Another example, this time for deleting selected lines:

cat filename | sed '/.*o$/d' > new_file

In this example, lines ending with an "o" will be deleted. I used a regular expression matching any string followed by an "o" and the end of the line. The output (i.e., all lines except those ending with an "o") is directed to new_file.

Another example. To search and replace, I use the sed "s" action, which is followed by two expressions separated by slashes:

cat filename | sed 's/string_old/string_new/' > newfile

A shorter form for the last command is:

sed 's/string_old/string_new/' filename > newfile
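Note that a plain "s" action replaces only the first match on each line; append the "g" flag to replace all matches:

```shell
echo "aaa" | sed 's/a/b/'    # first occurrence only: baa
echo "aaa" | sed 's/a/b/g'   # g = global, all occurrences on the line: bbb
```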

To insert text from a text file into an HTML file called "index_master_file.html", I may use a script containing:

sed '/text_which_is_a_placeholder_in__my_html_file/r text_file_to_insert.txt' index_master_file.html > index.html


gawk

(=GNU awk. The awk command is a traditional UNIX tool.) A tool for processing text files, in many respects similar to sed, but more powerful. Perl can do everything that gawk can, and more, so I don't bother with gawk too much. For simple tasks I use sed; for more complicated tasks I use perl. In some instances, however, awk scripts can be much shorter, easier to understand and maintain, and faster than an equivalent perl program.

gawk is particularly suitable for processing text-based tables. A table consists of records (each line is normally one record). The records contain fields separated by a delimiter. Often-used delimiters are whitespace (the gawk default), the comma, or the colon. All gawk expressions have the form: gawk 'pattern {action}' my_file. You can omit either the pattern or the action: the default pattern is "match everything" and the default action is "print the line". gawk can also be used as a filter (to process the output from another command, as in our examples).

Example. To print lines containing the string "1024", I may use:

cat filename | gawk '/1024/ {print}'

As in sed, the patterns to match are enclosed in a pair of "/ /".

What makes gawk more powerful than sed is the operations on fields. $1 means "the first field", $2 means "the second field", etc. $0 means "the entire line". The next example extracts fields 3 and 2 from lines containing "1024" and prints them with added labels "Name" and "ID". The printing goes to a file called "newfile":

cat filename | gawk '/1024/ {print "Name: " $3 " ID: " $2}' > newfile

The third example finds and prints lines whose third field is equal to "peter" or contains the string "marie":

cat filename | gawk '$3 == "peter" || $3 ~ /marie/ '

To understand the last command, here is the list of logical tests in gawk: == equal, != not equal, < less than, > greater than, <= less than or equal to, >= greater than or equal to, ~ matching a regular expression, !~ not matching a regular expression, || logical OR, && logical AND, ! logical NOT.
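These tests can be combined on fields (on most Linux systems awk is gawk; the two-column input here is made up for illustration):

```shell
# Print the second field of records whose first field exceeds 10,
# skipping the record whose second field is "eve"
printf '5 ann\n25 bob\n30 eve\n' | awk '$1 > 10 && $2 != "eve" {print $2}'   # prints bob
```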


cvs

Concurrent Versions System. Try info cvs for more information. Useful for maintaining a "source code repository" when several programmers are working on the same computer program.


(in X-terminal). A GUI front-end to the cvs versioning system.

file -z filename

Determine the type of the file filename. The option -z makes file look also inside compressed files to determine what the compressed file is (instead of just telling you that this is a compressed file).

To determine the type of content, file looks inside the file for particular patterns in the contents ("magic numbers")--it does not just look at the filename extension like MS Windows does. The "magic numbers" are stored in the text file /usr/share/magic--a really impressive database of filetypes.

touch filename

Change the date/time stamp of the file filename to the current time. Create an empty file if the file does not exist. You can change the stamp to any date using touch -t 200201311759.30 (year 2002 January day 31 time 17:59:30).

There are three date/time values associated with every file on an ext2 filesystem:

- the time of last access to the file (atime)

- the time of last modification to the file (mtime)

- the time of last change to the file's inode (ctime).

Touch will change the first two to the value specified, and the last one always to the current system time. They can all be read using the stat command (see the next entry).
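A way to see this in action, using GNU date's -r option, which prints a file's mtime (/tmp/stampdemo is a throwaway name):

```shell
# Set the stamp to 2002-01-31 17:59:30, then read the mtime back
touch -t 200201311759.30 /tmp/stampdemo
date -r /tmp/stampdemo +%Y%m%d%H%M.%S   # prints 200201311759.30
```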

stat filename

Print general info about a file (the contents of the so-called inode).

strings filename | more

Display the strings contained in the binary file called filename. "strings" could, for example, be a useful first step to a close examination of an unknown executable.


od

(=octal dump) Display file contents as octal numbers. This can be useful when the output contains non-printable characters. For example, a filename may contain non-printable characters and be a real pain. It can also be handy for viewing binary files.


dir | od -c | more

(I would probably rather do: ls -b to see any non-printable characters in filenames).

cat my_file | od -c |more

od my_file |more

Comparison of different outputs:

Show the first 16 bytes of a binary (/bin/sh) as ASCII characters or backslash escapes (octal):

od -N 16 -c /bin/sh


0000000 177 E L F 001 001 001 \0 \0 \0 \0 \0 \0 \0 \0 \0

Show the same binary as named ASCII characters:

od -N 16 -a /bin/sh


0000000 del E L F soh soh soh nul nul nul nul nul nul nul nul nul

Show the same binary as single-byte hexadecimals:

od -N 16 -t x1 /bin/sh


0000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00

Show the same binary as octal numbers:

od -N 16 /bin/sh


0000000 042577 043114 000401 000001 000000 000000 000000 000000


wc

(=word count) Print the number of lines, words, and bytes in a file.


dir | wc

cat my_file | wc

wc myfile

cksum filename

Compute the CRC (="cyclic redundancy check") for file filename to verify its integrity.

md5sum filename

Compute an MD5 checksum (128-bit) for the file filename to verify its integrity.
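A tiny demonstration of checksumming a stream; any change to the input changes the checksum completely:

```shell
echo "hello" | md5sum    # b1946ac92492d2347c6235b4d2611184  -
echo "hello." | md5sum   # an entirely different checksum
```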

mkpasswd -l 10

Make a hard-to-guess random password, 10 characters long.

sort -f filename

Arrange the lines in filename in ASCII order. The option -f tells sort to ignore upper/lower character case. The ASCII character set is (see man ascii):

Dec Hex Char Dec Hex Char Dec Hex Char Dec Hex Char


0 00 NUL '\0' 32 20 SPACE 64 40 @ 96 60 `

1 01 SOH 33 21 ! 65 41 A 97 61 a

2 02 STX 34 22 " 66 42 B 98 62 b

3 03 ETX 35 23 # 67 43 C 99 63 c

4 04 EOT 36 24 $ 68 44 D 100 64 d

5 05 ENQ 37 25 % 69 45 E 101 65 e

6 06 ACK 38 26 & 70 46 F 102 66 f

7 07 BEL '\a' 39 27 ' 71 47 G 103 67 g

8 08 BS '\b' 40 28 ( 72 48 H 104 68 h

9 09 HT '\t' 41 29 ) 73 49 I 105 69 i

10 0A LF '\n' 42 2A * 74 4A J 106 6A j

11 0B VT '\v' 43 2B + 75 4B K 107 6B k

12 0C FF '\f' 44 2C , 76 4C L 108 6C l

13 0D CR '\r' 45 2D - 77 4D M 109 6D m

14 0E SO 46 2E . 78 4E N 110 6E n

15 0F SI 47 2F / 79 4F O 111 6F o

16 10 DLE 48 30 0 80 50 P 112 70 p

17 11 DC1 49 31 1 81 51 Q 113 71 q

18 12 DC2 50 32 2 82 52 R 114 72 r

19 13 DC3 51 33 3 83 53 S 115 73 s

20 14 DC4 52 34 4 84 54 T 116 74 t

21 15 NAK 53 35 5 85 55 U 117 75 u

22 16 SYN 54 36 6 86 56 V 118 76 v

23 17 ETB 55 37 7 87 57 W 119 77 w

24 18 CAN 56 38 8 88 58 X 120 78 x

25 19 EM 57 39 9 89 59 Y 121 79 y

26 1A SUB 58 3A : 90 5A Z 122 7A z

27 1B ESC 59 3B ; 91 5B [ 123 7B {

28 1C FS 60 3C < 92 5C \ '\\' 124 7C |

29 1D GS 61 3D = 93 5D ] 125 7D }

30 1E RS 62 3E > 94 5E ^ 126 7E ~

31 1F US 63 3F ? 95 5F _ 127 7F DEL

If you wondered about the control characters, here is the meaning of some of them on the console (Source: man console_codes). Each line below gives the code mnemonics, its ASCII decimal number, the key combination to produce the code on the console, and a short description:

BEL (7, <Ctrl>G) bell (=alarm, beep).

BS (8, <Ctrl>H) backspaces one column (but not past the beginning of the line).

HT (9, <Ctrl>I) horizontal tab, goes to the next tab stop or to the end of the line if there is no earlier tab stop.

LF (10, <Ctrl>J), VT (11, <Ctrl>K) and FF (12, <Ctrl>L) all three give a linefeed.

CR (13, <Ctrl>M) gives a carriage return.

SO (14, <Ctrl>N) activates the G1 character set, and if LF/NL (new line mode) is set also a carriage return.

SI (15, <Ctrl>O) activates the G0 character set.

CAN (24, <Ctrl>X) and SUB (26, <Ctrl>Z) interrupt escape sequences.

ESC (27, <Ctrl>[) starts an escape sequence.

DEL (127) is ignored.

CSI (155) control sequence introducer.


uniq

(=unique) Eliminate duplicate adjacent lines; the input normally needs to be sorted first. Example: sort myfile | uniq
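uniq with the -c option also counts the repeats, which makes a handy frequency table:

```shell
# Sort first, since uniq only collapses adjacent duplicates
printf 'pear\napple\npear\n' | sort | uniq -c
```

The counts are printed with leading spaces, e.g. "2 pear" for the duplicated line.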

fold -w 30 -s my_file.txt > new_file.txt

Wrap the lines in the text file my_file.txt so that there are 30 characters per line. Break the lines on spaces. Output goes to new_file.txt.
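For example, hard-wrapping a 12-character string at 5 columns:

```shell
echo "abcdefghijkl" | fold -w 5   # prints abcde, fghij, kl on three lines
```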

fmt -w 75 my_file.txt > new_file.txt

Format the lines in the text file to the width of 75 characters. Break long lines and join short lines as required, but don't remove empty lines.

nl myfile > myfile_lines_numbered

Number the lines in the file myfile. Put the output into the file myfile_lines_numbered.

indent -kr -i8 -ts8 -sob -l80 -ss -bs -psl *.c

Change the appearance of "C" source code by inserting or deleting white space. The formatting options in the above example conform to the style used in the Linux kernel source code (script /usr/src/linux/scripts/Lindent). See man indent for the description of the meaning of the options. The existing files are backed up and then replaced with the formatted ones.

rev filename > filename1

Print the file filename with the characters of each line reversed. In the example above, the output is directed to the file filename1.

shred filename

Repeatedly overwrite the contents of the file filename with garbage, so that nobody will ever be able to read its original contents again.

paste file1 file2 > file3

Merge lines of two or more text files using <Tab> as the delimiter (use the option -d to specify your own delimiter(s)).

Example. If the content of file1 was:

1

2

3

and file2 was:

a

b

c

the resulting file3 would be:

1 a

2 b

3 c
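The -d option in action (ids and names are throwaway file names):

```shell
printf '1\n2\n' > ids
printf 'ann\nbob\n' > names
paste -d: ids names   # prints 1:ann and 2:bob
```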


join file1 file2 > file3

Join lines of two files on a common field. join parallels the database operation "join tables", but works on text tables. The default is to join on the first field of each file, and the default delimiter is whitespace. To adjust the defaults, I use options which I find using man join.

Example. if the content of file1 was:

1 Barbara

2 Peter

3 Stan

4 Marie

and file2 was:

2 Dog

4 Car

7 Cat

the resulting file3 would be:

2 Peter Dog

4 Marie Car

des -e plain_file encrypted_file

(="Data Encryption Standard") Encrypt plain_file. You will be asked for a key that the program will use for encryption. Output goes to encrypted_file. To decrypt, use:

des -d encrypted_file decrypted_file.


gpg

(="GNU Privacy Guard") A free equivalent of PGP ("Pretty Good Privacy"). gpg is more secure than PGP and does not use any patented algorithms. gpg is mostly used for signing e-mail messages and checking the signatures of others. You can also use it to encrypt/decrypt messages. http://www.gnupg.org/ contains all the details, including a legible, detailed manual.

To start, I needed a pair of keys: private and public. The private key is used for signing my messages. The public key I give away so that others can use it to verify my signatures. [One can also use a public key to encrypt a message so it can only be read using my private key.] I generated my keypair using this command:

gpg --gen-key

My keys are stored in the directory ~/.gnupg (encrypted using a passphrase I supplied during key generation). To export my public key to a plain text file, I use:

gpg --armor --export my_email_address > public_key_stan.gpg

which created a file public_key_stan.gpg containing something like this:


-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.0.1 (GNU/Linux)
Comment: For info see http://www.gnupg.org

[base64-encoded key data omitted]
-----END PGP PUBLIC KEY BLOCK-----

Now, I can e-mail my public key to the people with whom I want to communicate securely. They can add it to their gpg keyring using:

gpg --import public_key_stan.gpg

Even better, I can submit my public key to a public key server. To find a server near me, I used:

host -l pgp.net | grep wwwkeys

and to submit the key, I did (this can take a couple of minutes while I am connected to the Internet):

gpg --keyserver wwwkeys.pgp.net --send-keys linux_nag@canada.com

The "wwwkeys.pgp.net" is the key server I selected, and "linux_nag@canada.com" is my email address that identifies me on my local key ring. I need to submit myself only to one public key server (they all synchronize).

Now, I can start using gpg. To manually sign a plain text file my_message, I could use:

gpg --clearsign my_message

This created file my_message.asc which may contain something like:


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello World!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.1 (GNU/Linux)
Comment: For info see http://www.gnupg.org

[base64-encoded signature omitted]
-----END PGP SIGNATURE-----

To verify a signed message, I could do:

gpg --verify my_message.asc

If the contents of the signed section in my_message.asc are modified even slightly, the signature will not verify.

Manual signing can be awkward, but kmail, for example, can apply my digital signature automatically.

"docbook" tools

Docbook is the emerging standard for document depositories. The docbook tools are included with RH6.2 (and later) in the package "jade" and include the following converters: db2ps, db2pdf, db2dvi, db2html, and db2rtf, which convert docbook files to: PostScript (*.ps), Adobe Portable Document Format (*.pdf), device-independent file format (*.dvi), HyperText Markup Language (*.html), and Rich Text Format (*.rtf), respectively.

"Document depository" means the document is in a format that can be automatically translated into other useful formats. For example, consider a document (under development) which may, in the future, need to be published as a report, a journal paper, a newspaper article, a webpage, perhaps a book--I (the author) am still uncertain. Formatting the document using "hard codes" (fonts, font sizes, page breaks, line centering, etc.) is rather a waste of time--styles vary very much between the particular document types and are publisher-dependent. The solution is to mark up the document using "logical" layout elements, which may include the document title, chapter titles, subchapters, emphasis style, picture filenames, caption text, tables, etc. That's what "docbook" does--it is a set of such logical elements (defined using SGML or XML, close relatives of HTML). The logical layout is rendered to a physical appearance (via a so-called stylesheet) only when the document is published.

This section will be expanded in the future as we learn to use docbook.
