diff options
author | Michael Shalayeff <mickey@cvs.openbsd.org> | 2003-06-26 16:20:05 +0000 |
---|---|---|
committer | Michael Shalayeff <mickey@cvs.openbsd.org> | 2003-06-26 16:20:05 +0000 |
commit | 5c04224736dfe4d6d8a8945af0ad161adfa2f695 (patch) | |
tree | 4312ff29915af7d0ae5e3eefa84fbbdcf6cef523 /usr.bin/awk/USD.doc | |
parent | 5b659308c5566ff41f558ef1c87f42abd8ebb52c (diff) |
caldera-licensed doc
Diffstat (limited to 'usr.bin/awk/USD.doc')
-rw-r--r-- | usr.bin/awk/USD.doc/Makefile | 11 | ||||
-rw-r--r-- | usr.bin/awk/USD.doc/awk | 1445 |
2 files changed, 1456 insertions, 0 deletions
diff --git a/usr.bin/awk/USD.doc/Makefile b/usr.bin/awk/USD.doc/Makefile new file mode 100644 index 00000000000..10d161be008 --- /dev/null +++ b/usr.bin/awk/USD.doc/Makefile @@ -0,0 +1,11 @@ +# $OpenBSD: Makefile,v 1.1 2003/06/26 16:20:04 mickey Exp $ + +DIR= usd/16.awk +SRCS= awk +MACROS= -ms +REFER= refer -e -p /usr/dict/papers/Ind + +paper.ps: ${SRCS} + ${REFER} ${SRCS} | ${TBL} | ${ROFF} > ${.TARGET} + +.include <bsd.doc.mk> diff --git a/usr.bin/awk/USD.doc/awk b/usr.bin/awk/USD.doc/awk new file mode 100644 index 00000000000..36bc03eca69 --- /dev/null +++ b/usr.bin/awk/USD.doc/awk @@ -0,0 +1,1445 @@ +.\" $OpenBSD: awk,v 1.1 2003/06/26 16:20:04 mickey Exp $ +.\" +.\" Copyright (C) Caldera International Inc. 2001-2002. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code and documentation must retain the above +.\" copyright notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed or owned by Caldera +.\" International, Inc. +.\" 4. Neither the name of Caldera International, Inc. nor the names of other +.\" contributors may be used to endorse or promote products derived from +.\" this software without specific prior written permission. +.\" +.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA +.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR +.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES +.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. +.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT, +.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, +.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING +.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +.\" POSSIBILITY OF SUCH DAMAGE. +.\" +.\" @(#)awk 8.2 (Berkeley) 6/1/94 +.\" +.EH 'USD:16-%''Awk \(em A Pattern Scanning and Processing Language' +.OH 'Awk \(em A Pattern Scanning and Processing Language''USD:16-%' +.\" .fp 3 G no G on APS (use gb) or Dandelion Printer (use CW) +.\" the .T is only a ditroff feature... +.if '\*.T'dp' .fp 3 El +.if '\*.T'aps' .fp 3 gB +....TM "78-1271-12, 78-1273-6" 39199 39199-11 +.ND "September 1, 1978" +....TR 68 +.\".RP +. \" macros here +.tr _\(em +.if t .tr ~\(ap +.tr |\(or +.tr *\(** +.de UC +\&\\$3\s-1\\$1\\s0\&\\$2 +.. +.de IT +.if n .ul +\&\\$3\f2\\$1\fP\|\\$2 +.. +.de UL +.if n .ul +\&\\$3\f3\\$1\fP\&\\$2 +.. +.de P1 +.DS I 3n +.nf +.if n .ta 5 10 15 20 25 30 35 40 45 50 55 60 +.if t .ta .3i .6i .9i 1.2i +.if t .tr -\-'\(fm*\(** +.if t .tr _\(ul +.ft 3 +.lg 0 +.ss 18 +. \"use first argument as indent if present +.. +.de P2 +.ps \\n(PS +.vs \\n(VSp +.ft R +.ss 12 +.if n .ls 2 +.tr --''``^^!! +.if t .tr _\(em +.fi +.lg +.DE +.. +.hw semi-colon +.hy 14 +. \"2=not last lines; 4= no -xx; 8=no xx- +. \"special chars in programs +.de WS +.sp \\$1 +.. +. \" end of macros +.TL +Awk \(em A Pattern Scanning and Processing Language +.br +(Second Edition) +.AU "MH 2C-522" 4862 +Alfred V. Aho +.AU "MH 2C-518" 6021 +Brian W. Kernighan +.AU "MH 2C-514" 7214 +Peter J. Weinberger +.AI +.MH +.AB +.IT Awk +is a programming language whose +basic operation +is to search a set of files +for patterns, and to perform specified actions upon lines or fields of lines which +contain instances of those patterns. +.IT Awk +makes certain data selection and transformation operations easy to express; +for example, the +.IT awk +program +.sp +.ce +.ft 3 +length > 72 +.ft +.sp +prints all input lines whose length exceeds 72 characters; +the program +.ce +.sp +.ft 3 +NF % 2 == 0 +.ft R +.sp +prints all lines with an even number of fields; +and the program +.ce +.sp +.ft 3 +{ $1 = log($1); print } +.ft R +.sp +replaces the first field of each line by its logarithm. +.PP +.IT Awk +patterns may include arbitrary boolean combinations of regular expressions +and of relational operators on strings, numbers, fields, variables, and array elements. +Actions may include the same pattern-matching constructions as in patterns, +as well as +arithmetic and string expressions and assignments, +.UL if-else , +.UL while , +.UL for +statements, +and multiple output streams. +.PP +This report contains a user's guide, a discussion of the design and implementation of +.IT awk , +and some timing statistics. +....It supersedes TM-77-1271-5, dated September 8, 1977. +.AE +.CS 6 1 7 0 1 4 +.if n .ls 2 +.nr PS 9 +.nr VS 11 +.NH +Introduction +.if t .2C +.PP +.IT Awk +is a programming language designed to make +many common +information retrieval and text manipulation tasks +easy to state and to perform. +.PP +The basic operation of +.IT awk +is to scan a set of input lines in order, +searching for lines which match any of a set of patterns +which the user has specified. +For each pattern, an action can be specified; +this action will be performed on each line that matches the pattern. +.PP +Readers familiar with the +.UX +program +.IT grep\| +.[ +unix program manual +.] +will recognize +the approach, although in +.IT awk +the patterns may be more +general than in +.IT grep , +and the actions allowed are more involved than merely +printing the matching line. +For example, the +.IT awk +program +.P1 +{print $3, $2} +.P2 +prints the third and second columns of a table +in that order. +The program +.P1 +$2 ~ /A\||B\||C/ +.P2 +prints all input lines with an A, B, or C in the second field. +.ne 1i +The program +.P1 +$1 != prev { print; prev = $1 } +.P2 +prints all lines in which the first field is different +from the previous first field. +.NH 2 +Usage +.PP +The command +.P1 +awk program [files] +.P2 +executes the +.IT awk +commands in +the string +.UL program +on the set of named files, +or on the standard input if there are no files. +The statements can also be placed in a file +.UL pfile , +and executed by the command +.P1 +awk -f pfile [files] +.P2 +.NH 2 +Program Structure +.PP +An +.IT awk +program is a sequence of statements of the form: +.P1 +.ft I + pattern { action } + pattern { action } + ... +.ft 3 +.P2 +Each line of input +is matched against +each of the patterns in turn. +For each pattern that matches, the associated action +is executed. +When all the patterns have been tested, the next line +is fetched and the matching starts over. +.PP +Either the pattern or the action may be left out, +but not both. +If there is no action for a pattern, +the matching line is simply +copied to the output. +(Thus a line which matches several patterns can be printed several times.) +If there is no pattern for an action, +then the action is performed for every input line. +A line which matches no pattern is ignored. +.PP +Since patterns and actions are both optional, +actions must be enclosed in braces +to distinguish them from patterns. +.NH 2 +Records and Fields +.PP +.IT Awk +input is divided into +``records'' terminated by a record separator. +The default record separator is a newline, +so by default +.IT awk +processes its input a line at a time. +The number of the current record is available in a variable +named +.UL NR . +.PP +Each input record +is considered to be divided into ``fields.'' +Fields are normally separated by +white space \(em blanks or tabs \(em +but the input field separator may be changed, as described below. +Fields are referred to as +.UL "$1, $2," +and so forth, +where +.UL $1 +is the first field, +and +.UL $0 +is the whole input record itself. +Fields may be assigned to. +The number of fields in the current record +is available in a variable named +.UL NF . +.PP +The variables +.UL FS +and +.UL RS +refer to the input field and record separators; +they may be changed at any time to any single character. +The optional command-line argument +\f3\-F\fIc\fR +may also be used to set +.UL FS +to the character +.IT c . +.PP +If the record separator is empty, +an empty input line is taken as the record separator, +and blanks, tabs and newlines are treated as field separators. +.PP +The variable +.UL FILENAME +contains the name of the current input file. +.NH 2 +Printing +.PP +An action may have no pattern, +in which case the action is executed for +all +lines. +The simplest action is to print some or all of a record; +this is accomplished by the +.IT awk +command +.UL print . +The +.IT awk +program +.P1 +{ print } +.P2 +prints each record, thus copying the input to the output intact. +More useful is to print a field or fields from each record. +For instance, +.P1 +print $2, $1 +.P2 +prints the first two fields in reverse order. +Items separated by a comma in the print statement will be separated by the current output field separator +when output. +Items not separated by commas will be concatenated, +so +.P1 +print $1 $2 +.P2 +runs the first and second fields together. +.PP +The predefined variables +.UL NF +and +.UL NR +can be used; +for example +.P1 +{ print NR, NF, $0 } +.P2 +prints each record preceded by the record number and the number of fields. +.PP +Output may be diverted to multiple files; +the program +.P1 +{ print $1 >"foo1"; print $2 >"foo2" } +.P2 +writes the first field, +.UL $1 , +on the file +.UL foo1 , +and the second field on file +.UL foo2 . +The +.UL >> +notation can also be used: +.P1 +print $1 >>"foo" +.P2 +appends the output to the file +.UL foo . +(In each case, +the output files are +created if necessary.) +The file name can be a variable or a field as well as a constant; +for example, +.P1 +print $1 >$2 +.P2 +uses the contents of field 2 as a file name. +.PP +Naturally there is a limit on the number of output files; +currently it is 10. +.PP +Similarly, output can be piped into another process +(on +.UC UNIX +only); for instance, +.P1 +print | "mail bwk" +.P2 +mails the output to +.UL bwk . +.PP +The variables +.UL OFS +and +.UL ORS +may be used to change the current +output field separator and output +record separator. +The output record separator is +appended to the output of the +.UL print +statement. +.PP +.IT Awk +also provides the +.UL printf +statement for output formatting: +.P1 +printf format expr, expr, ... +.P2 +formats the expressions in the list +according to the specification +in +.UL format +and prints them. +For example, +.P1 +printf "%8.2f %10ld\en", $1, $2 +.P2 +prints +.UL $1 +as a floating point number 8 digits wide, +with two after the decimal point, +and +.UL $2 +as a 10-digit long decimal number, +followed by a newline. +No output separators are produced automatically; +you must add them yourself, +as in this example. +The version of +.UL printf +is identical to that used with C. +.[ +C programm language prentice hall 1978 +.] +.NH 1 +Patterns +.PP +A pattern in front of an action acts as a selector +that determines whether the action is to be executed. +A variety of expressions may be used as patterns: +regular expressions, +arithmetic relational expressions, +string-valued expressions, +and arbitrary boolean +combinations of these. +.NH 2 +BEGIN and END +.PP +The special pattern +.UL BEGIN +matches the beginning of the input, +before the first record is read. +The pattern +.UL END +matches the end of the input, +after the last record has been processed. +.UL BEGIN +and +.UL END +thus provide a way to gain control before and after processing, +for initialization and wrapup. +.PP +As an example, the field separator +can be set to a colon by +.P1 +BEGIN { FS = ":" } +.ft I +\&... rest of program ... +.ft 3 +.P2 +Or the input lines may be counted by +.P1 +END { print NR } +.P2 +If +.UL BEGIN +is present, it must be the first pattern; +.UL END +must be the last if used. +.NH 2 +Regular Expressions +.PP +The simplest regular expression is a literal string of characters +enclosed in slashes, +like +.P1 +/smith/ +.P2 +This +is actually a complete +.IT awk +program which +will print all lines which contain any occurrence +of the name ``smith''. +If a line contains ``smith'' +as part of a larger word, +it will also be printed, as in +.P1 +blacksmithing +.P2 +.PP +.IT Awk +regular expressions include the regular expression +forms found in +the +.UC UNIX +text editor +.IT ed\| +.[ +unix program manual +.] +and +.IT grep +(without back-referencing). +In addition, +.IT awk +allows +parentheses for grouping, | for alternatives, +.UL + +for ``one or more'', and +.UL ? +for ``zero or one'', +all as in +.IT lex . +Character classes +may be abbreviated: +.UL [a\-zA\-Z0\-9] +is the set of all letters and digits. +As an example, +the +.IT awk +program +.P1 +/[Aa]ho\||[Ww]einberger\||[Kk]ernighan/ +.P2 +will print all lines which contain any of the names +``Aho,'' ``Weinberger'' or ``Kernighan,'' +whether capitalized or not. +.PP +Regular expressions +(with the extensions listed above) +must be enclosed in slashes, +just as in +.IT ed +and +.IT sed . +Within a regular expression, +blanks and the regular expression +metacharacters are significant. +To turn of the magic meaning +of one of the regular expression characters, +precede it with a backslash. +An example is the pattern +.P1 +/\|\e/\^.\^*\e// +.P2 +which matches any string of characters +enclosed in slashes. +.PP +One can also specify that any field or variable +matches +a regular expression (or does not match it) with the operators +.UL ~ +and +.UL !~ . +The program +.P1 +$1 ~ /[jJ]ohn/ +.P2 +prints all lines where the first field matches ``john'' or ``John.'' +Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on. +To restrict it to exactly +.UL [jJ]ohn , +use +.P1 +$1 ~ /^[jJ]ohn$/ +.P2 +The caret ^ refers to the beginning +of a line or field; +the dollar sign +.UL $ +refers to the end. +.NH 2 +Relational Expressions +.PP +An +.IT awk +pattern can be a relational expression +involving the usual relational operators +.UL < , +.UL <= , +.UL == , +.UL != , +.UL >= , +and +.UL > . +An example is +.P1 +$2 > $1 + 100 +.P2 +which selects lines where the second field +is at least 100 greater than the first field. +Similarly, +.P1 +NF % 2 == 0 +.P2 +prints lines with an even number of fields. +.PP +In relational tests, if neither operand is numeric, +a string comparison is made; +otherwise it is numeric. +Thus, +.P1 +$1 >= "s" +.P2 +selects lines that begin with an +.UL s , +.UL t , +.UL u , +etc. +In the absence of any other information, +fields are treated as strings, so +the program +.P1 +$1 > $2 +.P2 +will perform a string comparison. +.NH 2 +Combinations of Patterns +.PP +A pattern can be any boolean combination of patterns, +using the operators +.UL \||\|| +(or), +.UL && +(and), and +.UL ! +(not). +For example, +.P1 +$1 >= "s" && $1 < "t" && $1 != "smith" +.P2 +selects lines where the first field begins with ``s'', but is not ``smith''. +.UL && +and +.UL \||\|| +guarantee that their operands +will be evaluated +from left to right; +evaluation stops as soon as the truth or falsehood +is determined. +.NH 2 +Pattern Ranges +.PP +The ``pattern'' that selects an action may also +consist of two patterns separated by a comma, as in +.P1 +pat1, pat2 { ... } +.P2 +In this case, the action is performed for each line between +an occurrence of +.UL pat1 +and the next occurrence of +.UL pat2 +(inclusive). +For example, +.P1 +/start/, /stop/ +.P2 +prints all lines between +.UL start +and +.UL stop , +while +.P1 +NR == 100, NR == 200 { ... } +.P2 +does the action for lines 100 through 200 +of the input. +.NH 1 +Actions +.PP +An +.IT awk +action is a sequence of action statements +terminated by newlines or semicolons. +These action statements can be used to do a variety of +bookkeeping and string manipulating tasks. +.NH 2 +Built-in Functions +.PP +.IT Awk +provides a ``length'' function +to compute the length of a string of characters. +This program prints each record, +preceded by its length: +.P1 +{print length, $0} +.P2 +.UL length +by itself is a ``pseudo-variable'' which +yields the length of the current record; +.UL length(argument) +is a function which yields the length of its argument, +as in +the equivalent +.P1 +{print length($0), $0} +.P2 +The argument may be any expression. +.PP +.IT Awk +also +provides the arithmetic functions +.UL sqrt , +.UL log , +.UL exp , +and +.UL int , +for +square root, +base +.IT e +logarithm, +exponential, +and integer part of their respective arguments. +.PP +The name of one of these built-in functions, +without argument or parentheses, +stands for the value of the function on the +whole record. +The program +.P1 +length < 10 || length > 20 +.P2 +prints lines whose length +is less than 10 or greater +than 20. +.PP +The function +.UL substr(s,\ m,\ n) +produces the substring of +.UL s +that begins at position +.UL m +(origin 1) +and is at most +.UL n +characters long. +If +.UL n +is omitted, the substring goes to the end of +.UL s . +The function +.UL index(s1,\ s2) +returns the position where the string +.UL s2 +occurs in +.UL s1 , +or zero if it does not. +.PP +The function +.UL sprintf(f,\ e1,\ e2,\ ...) +produces the value of the expressions +.UL e1 , +.UL e2 , +etc., +in the +.UL printf +format specified by +.UL f . +Thus, for example, +.P1 +x = sprintf("%8.2f %10ld", $1, $2) +.P2 +sets +.UL x +to the string produced by formatting +the values of +.UL $1 +and +.UL $2 . +.NH 2 +Variables, Expressions, and Assignments +.PP +.IT Awk +variables take on numeric (floating point) +or string values according to context. +For example, in +.P1 +x = 1 +.P2 +.UL x +is clearly a number, while in +.P1 +x = "smith" +.P2 +it is clearly a string. +Strings are converted to numbers and +vice versa whenever context demands it. +For instance, +.P1 +x = "3" + "4" +.P2 +assigns 7 to +.UL x . +Strings which cannot be interpreted +as numbers in a numerical context +will generally have numeric value zero, +but it is unwise to count on this behavior. +.PP +By default, variables (other than built-ins) are initialized to the null string, +which has numerical value zero; +this eliminates the need for most +.UL BEGIN +sections. +For example, the sums of the first two fields can be computed by +.P1 + { s1 += $1; s2 += $2 } +END { print s1, s2 } +.P2 +.PP +Arithmetic is done internally in floating point. +The arithmetic operators are +.UL + , +.UL \- , +.UL \(** , +.UL / , +and +.UL % +(mod). +The C increment +.UL ++ +and +decrement +.UL \-\- +operators are also available, +and so are the assignment operators +.UL += , +.UL \-= , +.UL *= , +.UL /= , +and +.UL %= . +These operators may all be used in expressions. +.NH 2 +Field Variables +.PP +Fields in +.IT awk +share essentially all of the properties of variables _ +they may be used in arithmetic or string operations, +and may be assigned to. +Thus one can +replace the first field with a sequence number like this: +.P1 +{ $1 = NR; print } +.P2 +or +accumulate two fields into a third, like this: +.P1 +{ $1 = $2 + $3; print $0 } +.P2 +or assign a string to a field: +.P1 +{ if ($3 > 1000) + $3 = "too big" + print +} +.P2 +which replaces the third field by ``too big'' when it is, +and in any case prints the record. +.PP +Field references may be numerical expressions, +as in +.P1 +{ print $i, $(i+1), $(i+n) } +.P2 +Whether a field is deemed numeric or string depends on context; +in ambiguous cases like +.P1 +if ($1 == $2) ... +.P2 +fields are treated as strings. +.PP +Each input line is split into fields automatically as necessary. +.br +.ne 1i +It is also possible to split any variable or string +into fields: +.P1 +n = split(s, array, sep) +.P2 +splits the +the string +.UL s +into +.UL array[1] , +\&..., +.UL array[n] . +The number of elements found is returned. +If the +.UL sep +argument is provided, it is used as the field separator; +otherwise +.UL FS +is used as the separator. +.NH 2 +String Concatenation +.PP +Strings may be concatenated. +For example +.P1 +length($1 $2 $3) +.P2 +returns the length of the first three fields. +Or in a +.UL print +statement, +.P1 +print $1 " is " $2 +.P2 +prints +the two fields separated by `` is ''. +Variables and numeric expressions may also appear in concatenations. +.NH 2 +Arrays +.PP +Array elements are not declared; +they spring into existence by being mentioned. +Subscripts may have +.ul +any +non-null +value, including non-numeric strings. +As an example of a conventional numeric subscript, +the statement +.P1 +x[NR] = $0 +.P2 +assigns the current input record to +the +.UL NR -th +element of the array +.UL x . +In fact, it is possible in principle (though perhaps slow) +to process the entire input in a random order with the +.IT awk +program +.P1 + { x[NR] = $0 } +END { \fI... program ...\fP } +.P2 +The first action merely records each input line in +the array +.UL x . +.PP +Array elements may be named by non-numeric values, +which gives +.IT awk +a capability rather like the associative memory of +Snobol tables. +Suppose the input contains fields with values like +.UL apple , +.UL orange , +etc. +Then the program +.P1 +/apple/ { x["apple"]++ } +/orange/ { x["orange"]++ } +END { print x["apple"], x["orange"] } +.P2 +increments counts for the named array elements, +and prints them at the end of the input. +.NH 2 +Flow-of-Control Statements +.PP +.IT Awk +provides the basic flow-of-control statements +.UL if-else , +.UL while , +.UL for , +and statement grouping with braces, as in C. +We showed the +.UL if +statement in section 3.3 without describing it. +The condition in parentheses is evaluated; +if it is true, the statement following the +.UL if +is done. +The +.UL else +part is optional. +.PP +The +.UL while +statement is exactly like that of C. +For example, to print all input fields one per line, +.P1 +i = 1 +while (i <= NF) { + print $i + ++i +} +.P2 +.PP +The +.UL for +statement is also exactly that of C: +.P1 +for (i = 1; i <= NF; i++) + print $i +.P2 +does the same job as the +.UL while +statement above. +.PP +There is an alternate form of the +.UL for +statement which is suited for accessing the +elements of an associative array: +.P1 +for (i in array) + \fIstatement\f3 +.P2 +does +.ul +statement +with +.UL i +set in turn to each element of +.UL array . +The elements are accessed in an apparently random order. +Chaos will ensue if +.UL i +is altered, or if any new elements are +accessed during the loop. +.PP +The expression in the condition part of an +.UL if , +.UL while +or +.UL for +can include relational operators like +.UL < , +.UL <= , +.UL > , +.UL >= , +.UL == +(``is equal to''), +and +.UL != +(``not equal to''); +regular expression matches with the match operators +.UL ~ +and +.UL !~ ; +the logical operators +.UL \||\|| , +.UL && , +and +.UL ! ; +and of course parentheses for grouping. +.PP +The +.UL break +statement causes an immediate exit +from an enclosing +.UL while +or +.UL for ; +the +.UL continue +statement +causes the next iteration to begin. +.PP +The statement +.UL next +causes +.IT awk +to skip immediately to +the next record and begin scanning the patterns from the top. +The statement +.UL exit +causes the program to behave as if the end of the input +had occurred. +.PP +Comments may be placed in +.IT awk +programs: +they begin with the character +.UL # +and end with the end of the line, +as in +.P1 +print x, y # this is a comment +.P2 +.NH +Design +.PP +The +.UX +system +already provides several programs that +operate by passing input through a +selection mechanism. +.IT Grep , +the first and simplest, merely prints all lines which +match a single specified pattern. +.IT Egrep +provides more general patterns, i.e., regular expressions +in full generality; +.IT fgrep +searches for a set of keywords with a particularly fast algorithm. +.IT Sed\| +.[ +unix programm manual +.] +provides most of the editing facilities of +the editor +.IT ed , +applied to a stream of input. +None of these programs provides +numeric capabilities, +logical relations, +or variables. +.PP +.IT Lex\| +.[ +lesk lexical analyzer cstr +.] +provides general regular expression recognition capabilities, +and, by serving as a C program generator, +is essentially open-ended in its capabilities. +The use of +.IT lex , +however, requires a knowledge of C programming, +and a +.IT lex +program must be compiled and loaded before use, +which discourages its use for one-shot applications. +.PP +.IT Awk +is an attempt +to fill in another part of the matrix of possibilities. +It +provides general regular expression capabilities +and an implicit input/output loop. +But it also provides convenient numeric processing, +variables, +more general selection, +and control flow in the actions. +It +does not require compilation or a knowledge of C. +Finally, +.IT awk +provides +a convenient way to access fields within lines; +it is unique in this respect. +.PP +.IT Awk +also tries to integrate strings and numbers +completely, +by treating all quantities as both string and numeric, +deciding which representation is appropriate +as late as possible. +In most cases the user can simply ignore the differences. +.PP +Most of the effort in developing +.I awk +went into deciding what +.I awk +should or should not do +(for instance, it doesn't do string substitution) +and what the syntax should be +(no explicit operator for concatenation) +rather +than on writing or debugging the code. +We have tried +to make the syntax powerful +but easy to use and well adapted +to scanning files. +For example, +the absence of declarations and implicit initializations, +while probably a bad idea for a general-purpose programming language, +is desirable in a language +that is meant to be used for tiny programs +that may even be composed on the command line. +.PP +In practice, +.IT awk +usage seems to fall into two broad categories. +One is what might be called ``report generation'' \(em +processing an input to extract counts, +sums, sub-totals, etc. +This also includes the writing of trivial +data validation programs, +such as verifying that a field contains only numeric information +or that certain delimiters are properly balanced. +The combination of textual and numeric processing is invaluable here. +.PP +A second area of use is as a data transformer, +converting data from the form produced by one program +into that expected by another. +The simplest examples merely select fields, perhaps with rearrangements. +.NH +Implementation +.PP +The actual implementation of +.IT awk +uses the language development tools available +on the +.UC UNIX +operating system. +The grammar is specified with +.IT yacc ; +.[ +yacc johnson cstr +.] +the lexical analysis is done by +.IT lex ; +the regular expression recognizers are +deterministic finite automata +constructed directly from the expressions. +An +.IT awk +program is translated into a +parse tree which is then directly executed +by a simple interpreter. +.PP +.IT Awk +was designed for ease of use rather than processing speed; +the delayed evaluation of variable types +and the necessity to break input +into fields makes high speed difficult to achieve in any case. +Nonetheless, +the program has not proven to be unworkably slow. +.PP +Table I below shows the execution (user + system) time +on a PDP-11/70 of +the +.UC UNIX +programs +.IT wc , +.IT grep , +.IT egrep , +.IT fgrep , +.IT sed , +.IT lex , +and +.IT awk +on the following simple tasks: +.IP "\ \ 1." +count the number of lines. +.IP "\ \ 2." +print all lines containing ``doug''. +.IP "\ \ 3." +print all lines containing ``doug'', ``ken'' or ``dmr''. +.IP "\ \ 4." +print the third field of each line. +.IP "\ \ 5." +print the third and second fields of each line, in that order. +.IP "\ \ 6." +append all lines containing ``doug'', ``ken'', and ``dmr'' +to files ``jdoug'', ``jken'', and ``jdmr'', respectively. +.IP "\ \ 7." +print each line prefixed by ``line-number\ :\ ''. +.IP "\ \ 8." +sum the fourth column of a table. +.LP +The program +.IT wc +merely counts words, lines and characters in its input; +we have already mentioned the others. +In all cases the input was a file containing +10,000 lines +as created by the +command +.IT "ls \-l" ; +each line has the form +.P1 +-rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx +.P2 +The total length of this input is +452,960 characters. +Times for +.IT lex +do not include compile or load. +.PP +As might be expected, +.IT awk +is not as fast as the specialized tools +.IT wc , +.IT sed , +or the programs in the +.IT grep +family, +but +is faster than the more general tool +.IT lex . +In all cases, the tasks were +about as easy to express as +.IT awk +programs +as programs in these other languages; +tasks involving fields were +considerably easier to express as +.IT awk +programs. +Some of the test programs are shown in +.IT awk , +.IT sed +and +.IT lex . +.[ +$LIST$ +.] +.1C +.TS +center; +c c c c c c c c c +c c c c c c c c c +c|n|n|n|n|n|n|n|n|. + Task +Program 1 2 3 4 5 6 7 8 +_ +\fIwc\fR 8.6 +\fIgrep\fR 11.7 13.1 +\fIegrep\fR 6.2 11.5 11.6 +\fIfgrep\fR 7.7 13.8 16.1 +\fIsed\fR 10.2 11.6 15.8 29.0 30.5 16.1 +\fIlex\fR 65.1 150.1 144.2 67.7 70.3 104.0 81.7 92.8 +\fIawk\fR 15.0 25.6 29.9 33.3 38.9 46.4 71.4 31.1 +_ +.TE +.sp +.ce +\fBTable I.\fR Execution Times of Programs. (Times are in sec.) +.sp 2 +.2C +.PP +The programs for some of these jobs are shown below. +The +.IT lex +programs are generally too long to show. +.LP +AWK: +.LP +.P1 +1. END {print NR} +.P2 +.P1 +2. /doug/ +.P2 +.P1 +3. /ken|doug|dmr/ +.P2 +.P1 +4. {print $3} +.P2 +.P1 +5. {print $3, $2} +.P2 +.P1 +6. /ken/ {print >"jken"} + /doug/ {print >"jdoug"} + /dmr/ {print >"jdmr"} +.P2 +.P1 +7. {print NR ": " $0} +.P2 +.P1 +8. {sum = sum + $4} + END {print sum} +.P2 +.LP +SED: +.LP +.P1 +1. $= +.P2 +.P1 +2. /doug/p +.P2 +.P1 +3. /doug/p + /doug/d + /ken/p + /ken/d + /dmr/p + /dmr/d +.P2 +.P1 +4. /[^ ]* [ ]*[^ ]* [ ]*\e([^ ]*\e) .*/s//\e1/p +.P2 +.P1 +5. /[^ ]* [ ]*\e([^ ]*\e) [ ]*\e([^ ]*\e) .*/s//\e2 \e1/p +.P2 +.P1 +6. /ken/w jken + /doug/w jdoug + /dmr/w jdmr +.P2 +.LP +LEX: +.LP +.P1 +1. %{ + int i; + %} + %% + \en i++; + . ; + %% + yywrap() { + printf("%d\en", i); + } +.P2 +.P1 +2. %% + ^.*doug.*$ printf("%s\en", yytext); + . ; + \en ; +.P2 |