Summary:

We will help you become a more unreasonable person by using the Unix shell and various command-line tools to fetch bits of data that make up your online world and recombine them the way you want them. Along the way, we'll review how the Unix command line works, including some of the "gotchas" that often trip up new users.

Founding Principles

- menu vs. command line; language vs. phrase book
- GUI users develop such tolerance for repetition and boredom!
- Unix tools designed around text manipulation (not XML or .DOC)
- "Be liberal in what you accept, and conservative in what you send" - Jon Postel
  - the output of any program may end up as input
  - be wary of headers and footers
  - be wary of text with variable columns (or overlapping columns)

The basics of a shell script:

- Aside: scripts that execute the wrong programs
- Aside: scripts that create files they can't read

      $ umask 0777
      $ rm -f out
      $ date >out
      $ wc out

- start the script correctly

      #!/bin/sh -u
      PATH=/bin:/usr/bin ; export PATH
      umask 022
      date

- you can only redirect what you can see
  - but silly ls command changes what you see
  - and you only redirect stdout (not stderr) by default

Data mining

Data mining is easy, if you build up the Unix pipeline slowly, adding one command at a time and watching the output each time. Some Unix commands select lines from a text stream, others select fields, and some can do both:

Select lines from text streams:
    grep, awk, sed, head, tail, look, uniq, comm, diff
Select fields in lines or parts of lines:
    awk, sed, cut, expr
Transform text (change characters or words in lines):
    awk, sed, tr, Perl, etc.

The "sort" command is also useful for putting lines of text in order.
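
As a rough illustration of building a pipeline one command at a time, here is a minimal sketch. The sample data and the function names "sample" and "sh_users" are invented for this example and are not part of the talk; any colon-separated text would do.

```shell
#!/bin/sh -u

# Hypothetical sample data: login name, shell, home directory.
sample() {
    printf '%s\n' \
        'alice:/bin/sh:/home/alice' \
        'bob:/bin/csh:/home/bob' \
        'carol:/bin/sh:/home/carol'
}

# Step 1: select lines (only entries whose shell field is /bin/sh).
sample | grep ':/bin/sh:'

# Step 2: add field selection (keep only the login name).
sample | grep ':/bin/sh:' | cut -d: -f1

# Step 3: add ordering (reverse alphabetical, so the effect of sort is visible).
sh_users() {
    sample | grep ':/bin/sh:' | cut -d: -f1 | sort -r
}
sh_users
```

Running each step on its own and reading the output before adding the next command makes it easy to see which stage is responsible when the result looks wrong.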
Details:

- running programs with a preset list of options

      cp -p

  - set the options and call the program; but be careful not to recurse
- count how many times words appear in a document

      tr -cs 'a-zA-Z0-9' '\n' < /etc/termcap \
      | sort | uniq -c | sort -nr | less

- count the lengths of words in a document

      tr -cs 'a-zA-Z0-9' '\n' < /etc/termcap \
      | tr -c '\n' '.' | sort | uniq -c | sort -nr | less

  try also: sort -k 2 (written "sort +1" in older versions of sort)
- renaming digital photo files to be their date and time
- grabbing the current weather
- creating a SPAM whitelist from a mail alias file and uploading it hourly to a POPmail server
- on-the-fly conversion of Berkeley mail aliases to mutt format
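
The first item above, running a program with a preset list of options without recursing, might be sketched like this. It is written as a shell function so it can be pasted into a session; that form is an assumption made for illustration, since the outline does not say how the wrapper is built.

```shell
#!/bin/sh -u

# Wrapper that always runs cp with -p (preserve modes and timestamps).
cp() {
    # "command" skips shell functions and runs the real cp.
    # A bare "cp" here would call this function again and recurse forever.
    command cp -p "$@"
}

# If the wrapper is instead a script named "cp" placed early in PATH,
# "command" does not help: the PATH search finds the script itself.
# In that case call the real program by its full path (often /bin/cp,
# but the location varies by system).
```

After the function is defined, "cp old new" behaves like "cp -p old new".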