diff options
author | Theo de Raadt <deraadt@cvs.openbsd.org> | 1995-10-18 08:53:40 +0000 |
---|---|---|
committer | Theo de Raadt <deraadt@cvs.openbsd.org> | 1995-10-18 08:53:40 +0000 |
commit | d6583bb2a13f329cf0332ef2570eb8bb8fc0e39c (patch) | |
tree | ece253b876159b39c620e62b6c9b1174642e070e /share/doc/smm |
initial import of NetBSD tree
Diffstat (limited to 'share/doc/smm')
45 files changed, 13336 insertions, 0 deletions
diff --git a/share/doc/smm/00.contents b/share/doc/smm/00.contents new file mode 100644 index 00000000000..ed03c7a6990 --- /dev/null +++ b/share/doc/smm/00.contents @@ -0,0 +1,161 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)00.contents 8.1 (Berkeley) 7/5/93 +.\" +.OH '''SMM Contents' +.EH 'SMM Contents''' +.TL +UNIX System Manager's Manual (SMM) +.sp +\s-24.4 Berkeley Software Distribution\s+2 +.sp +\fRJune, 1993\fR +.PP +This volume contains manual pages and supplementary documents useful to system +administrators. +The information in these documents applies to +the 4.4BSD system as distributed by U.C. Berkeley. +.SH +Reference Manual \- Section 8 +.tl '''(8)' +.IP +Section 8 of the UNIX Programmer's Manual contains information related to +system operation, administration, and maintenance. +.SH +System Installation and Administration +.IP +.tl 'Installing and Operating 4.4BSD''SMM:1' +.QP +The definitive reference document for those occasions when +you find you need to start over again. + +.IP +.tl 'Building 4.4BSD Kernels with \fIConfig\fP''SMM:2' +.QP +In-depth discussions of the use and operation of the \fIconfig\fP +program, and how to build your very own Unix kernel. + +.IP +.tl 'Fsck \- The UNIX File System Check Program''SMM:3' +.QP +A reference document for using the \fIfsck\fP program during +times of file system distress. + +.IP +.tl 'Disc Quotas in a UNIX Environment''SMM:4' +.QP +A light introduction to the techniques +for limiting the use of disc resources. + +.IP +.tl 'A Fast File System for UNIX''SMM:5' +.QP +A description of the 4.4BSD file system organization, +design and implementation. + +.IP +.tl 'The 4.4BSD NFS Implementation''SMM:6' +.QP +An overview of the design, implementation, and use of NFS on 4.4BSD. + +.IP +.tl 'Line Printer Spooler Manual''SMM:7' +.QP +This document describes the structure and installation procedure +for the line printer spooling system. + +.IP +.tl 'Sendmail Installation and Operation Guide''SMM:8' +.QP +The last word in installing and operating the \fIsendmail\fP program. + +.ne 3 +.IP +.tl 'Sendmail \- An Internetwork Mail Router''SMM:9' +.QP +An overview document on the design and implementation of \fIsendmail\fP. + +.IP +.tl 'Name Server Operations Guide for BIND''SMM:10' +.QP +Setting up and operating the name to Internet addressing software. +If you have a network this will be of interest. + +.IP +.tl 'Timed Installation and Operation Guide''SMM:11' +.QP +Describes how to maintain time synchronization between machines +in a local network. + +.IP +.tl 'The Berkeley UNIX Time Synchronization Protocol''SMM:12' +.QP +The protocols and algorithms used by timed, +the network time synchronization daemon. + +.IP +.tl 'AMD \- The 4.4BSD Automounter''SMM:13' +.QP +Automatically mounting file systems on demand. + +.IP +.tl 'Installation and Operation of UUCP''SMM:14' +.QP +Describes the implementation of uucp; for the installer and administrator. + +.IP +.tl 'A Dial\-Up Network of UNIX Systems''SMM:15' +.QP +Describes UUCP, a program for communicating files between UNIX systems. + +.IP +.tl 'On the Security of UNIX''SMM:16' +.QP +Hints on how to break UNIX, and how to avoid your system being broken. + +.IP +.tl 'Password Security \- A Case History''SMM:17' +.QP +How the bad guys used to be able to break the password algorithm, and why +they cannot now (at least not so easily). + +.IP +.tl 'Networking Implementation Notes, 4.4BSD Edition''SMM:18' +.QP +A concise description of the system interfaces used within the +networking subsystem. + +.IP +.tl 'The PERL Programming Language''SMM:19' +.QP +The Practical Extraction and Report Language is ideal for +writing those pesky adminitration scripts. diff --git a/share/doc/smm/01.setup/0.t b/share/doc/smm/01.setup/0.t new file mode 100644 index 00000000000..2f880eb2e70 --- /dev/null +++ b/share/doc/smm/01.setup/0.t @@ -0,0 +1,131 @@ +.\" Copyright (c) 1988, 1993 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)0.t 8.1 (Berkeley) 7/27/93 +.\" +.ds Ux \s-1UNIX\s0 +.ds Bs \s-1BSD\s0 +.\" Current version: +.ds 4B 4.4\*(Bs +.ds Ps 4.3\*(Bs +.\" tape and disk naming +.ds Mt mt +.ds Dk sd +.ds Dn disk +.ds Pa c +.\" block size used on the tape +.ds Bb 10240 +.ds Bz 20 +.\" document date +.ds Dy July 27, 1993 +.de Sm +\s-1\\$1\s0\\$2 +.. +.de Pn \" pathname +\f(CW\\$1\fP\\$2 +.. +.de Li \" literal +\f(CW\\$1\fP\\$2 +.. +.de I \" italicize first arg +\fI\\$1\fP\^\\$2 +.. +.de Xr \" manual reference +\fI\\$1\fP\^\\$2 +.. +.de Fn \" function +\fI\\$1\fP\^()\\$2 +.. +.bd S B 3 +.EH 'SMM:1-%''Installing and Operating \*(4B UNIX' +.OH 'Installing and Operating \*(4B UNIX''SMM:1-%' +.de Sh +.NH \\$1 +\\$2 +.nr PD .1v +.XS \\n% +.ta 0.6i +\\*(SN \\$2 +.XE +.nr PD .3v +.. +.TL +Installing and Operating \*(4B UNIX +.br +\*(Dy +.AU +Marshall Kirk McKusick +.AU +Keith Bostic +.AU +Michael J. Karels +.AU +Samuel J. Leffler +.AI +Computer Systems Research Group +Department of Electrical Engineering and Computer Science +University of California, Berkeley +Berkeley, California 94720 +(415) 642-7780 +.AU +Mike Hibler +.AI +Center for Software Science +Department of Computer Science +University of Utah +Salt Lake City, Utah 84112 +(801) 581-5017 +.AB +.PP +This document contains instructions for the +installation and operation of the +\*(4B release of UNIX\** +as distributed by The University of California at Berkeley. +.FS +UNIX is a registered trademark of USL in the USA and some other countries. +.FE +.PP +It discusses procedures for installing UNIX on a new machine, +and for upgrading an existing \*(Ps UNIX system to the new release. +An explanation of how to lay out filesystems on available disks +and the space requirements for various parts of the system are given. +A brief overview of the major changes to +the system between \*(Ps and \*(4B are outlined. +An explanation of how to set up terminal lines and user accounts, +and how to do system-specific tailoring is provided. +A description of how to install and configure the \*(4B networking +facilities is included. +Finally, the document details system operation procedures: +shutdown and startup, filesystem backup procedures, +resource control, performance monitoring, and procedures for recompiling +and reinstalling system software. +.AE +.bp +3 diff --git a/share/doc/smm/01.setup/1.t b/share/doc/smm/01.setup/1.t new file mode 100644 index 00000000000..2f71b772eab --- /dev/null +++ b/share/doc/smm/01.setup/1.t @@ -0,0 +1,172 @@ +.\" Copyright (c) 1988, 1993 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)1.t 8.1 (Berkeley) 7/27/93 +.\" +.ds lq `` +.ds rq '' +.ds LH "Installing/Operating \*(4B +.ds RH Introduction +.ds CF \*(Dy +.LP +.bp +.Sh 1 "Introduction" +.PP +This document explains how to install the \*(4B Berkeley +version of UNIX on your system. +The filesystem format is compatible with \*(Ps +and it will only be necessary for you to do a full bootstrap +procedure if you are installing the release on a new machine. +The object file formats are completely different from the System +V release, so the most straightforward procedure for upgrading +a System V system is to do a full bootstrap. +.PP +The full bootstrap procedure +is outlined in section 2; the process starts with copying a filesystem +image onto a new disk. +This filesystem is then booted and used to extract the remainder of the +system binaries and sources from the archives on the tape(s). +.PP +The technique for upgrading a \*(Ps system is described +in section 3 of this document. +The upgrade procedure involves extracting system binaries +onto new root and +.Pn /usr +filesystems and merging local +configuration files into the new system. +User filesystems may be upgraded in place. +Most \*(Ps binaries may be used with \*(4B in the course +of the conversion. +It is desirable to recompile local sources after the conversion, +as the new compiler (GCC) provides superior code optimization. +Consult section 3.5 for a description of some of the differences +between \*(Ps and \*(4B. +.Sh 2 "Distribution format" +.PP +The distribution comes in two formats: +.DS +(3)\0\0 6250bpi 2400' 9-track magnetic tapes, or +(1)\0\0 8mm Exabyte tape +.DE +.PP +If you have the facilities, we \fBstrongly\fP recommend copying the +magnetic tape(s) in the distribution kit to guard against disaster. +The tapes contain \*(Bb-byte records. +There are interspersed tape marks; +end-of-tape is signaled by a double end-of-file. +The first file on the tape is architecture dependent. +Additional files on the tape(s) +contain tape archive images of the system binaries and sources (see +.Xr tar (1)\**). +.FS +References of the form \fIX\fP(Y) mean the entry named +\fIX\fP in section Y of the ``UNIX Programmer's Manual''. +.FE +See the tape label for a description of the contents +and format of each individual tape. +.Sh 2 "UNIX device naming" +.PP +Device names have a different syntax depending on whether you are talking +to the standalone system or a running UNIX kernel. +The standalone system syntax is currently architecture dependent and is +described in the various architecture specific sections as applicable. +When not running standalone, devices are available via files in the +.Pn /dev/ +directory. +The file name typically encodes the device type, its logical unit and +a partition within that unit. +For example, +.Pn /dev/sd2b +refers to the second partition (``b'') of +SCSI (``sd'') drive number ``2'', while +.Pn /dev/rmt0 +refers to the raw (``r'') interface of 9-track tape (``mt'') unit ``0''. +.PP +The mapping of physical addressing information (e.g. controller, target) +to a logical unit number is dependent on the system configuration. +In all simple cases, where only a single controller is present, a drive +with physical unit number 0 (e.g., as determined by its unit +specification, either unit plug or other selection mechanism) +will be called unit 0 in its UNIX file name. +This is not, however, strictly +necessary, since the system has a level of indirection in this naming. +If there are multiple controllers, the disk unit numbers will normally +be counted sequentially across controllers. This can be taken +advantage of to make the system less dependent on the interconnect +topology, and to make reconfiguration after hardware failure easier. +.PP +Each UNIX physical disk is divided into at most 8 logical disk partitions, +each of which may occupy any consecutive cylinder range on the physical +device. The cylinders occupied by the 8 partitions for each drive type +are specified initially in the disk description file +.Pn /etc/disktab +(c.f. +.Xr disktab (5)). +The partition information and description of the +drive geometry are written in one of the first sectors of each disk with the +.Xr disklabel (8) +program. Each partition may be used for either a +raw data area such as a paging area or to store a UNIX filesystem. +It is conventional for the first partition on a disk to be used +to store a root filesystem, from which UNIX may be bootstrapped. +The second partition is traditionally used as a paging area, and the +rest of the disk is divided into spaces for additional ``mounted +filesystems'' by use of one or more additional partitions. +.Sh 2 "UNIX devices: block and raw" +.PP +UNIX makes a distinction between ``block'' and ``raw'' (character) +devices. Each disk has a block device interface where +the system makes the device byte addressable and you can write +a single byte in the middle of the disk. The system will read +out the data from the disk sector, insert the byte you gave it +and put the modified data back. The disks with the names +.Pn /dev/xx0[a-h] , +etc., are block devices. +There are also raw devices available. +These have names like +.Pn /dev/rxx0[a-h] , +the ``r'' here standing for ``raw''. +Raw devices bypass the buffer cache and use DMA directly to/from +the program's I/O buffers; +they are normally restricted to full-sector transfers. +In the bootstrap procedures we +will often suggest using the raw devices, because these tend +to work faster. +Raw devices are used when making new filesystems, +when checking unmounted filesystems, +or for copying quiescent filesystems. +The block devices are used to mount filesystems. +.PP +You should be aware that it is sometimes important whether to use +the character device (for efficiency) or not (because it would not +work, e.g. to write a single byte in the middle of a sector). +Do not change the instructions by using the wrong type of device +indiscriminately. diff --git a/share/doc/smm/01.setup/2.t b/share/doc/smm/01.setup/2.t new file mode 100644 index 00000000000..ddc4f7edb29 --- /dev/null +++ b/share/doc/smm/01.setup/2.t @@ -0,0 +1,1658 @@ +.\" Copyright (c) 1988, 1993 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)2.t 8.1 (Berkeley) 7/27/93 +.\" +.ds lq `` +.ds rq '' +.ds LH "Installing/Operating \*(4B +.ds RH Bootstrapping +.ds CF \*(Dy +.Sh 1 "Bootstrap procedure" +.PP +This section explains the bootstrap procedure that can be used +to get the kernel supplied with this distribution running on your machine. +If you are not currently running \*(Ps you will +have to do a full bootstrap. +Section 3 describes how to upgrade a \*(Ps system. +An understanding of the operations used in a full bootstrap +is helpful in doing an upgrade as well. +In either case, it is highly desirable to read and understand +the remainder of this document before proceeding. +.PP +The distribution supports a somewhat wider set of machines than +those for which we have built binaries. +The architectures that are supported only in source form include: +.IP \(bu +Intel 386/486-based machines (ISA/AT or EISA bus only) +.IP \(bu +Sony News MIPS-based workstations +.IP \(bu +Omron Luna 68000-based workstations +.LP +If you wish to run one of these architectures, +you will have to build a cross compilation environment. +Note that the distribution does +.B not +include the machine support for the Tahoe and VAX architectures +found in previous BSD distributions. +Our primary development environment is the HP9000/300 series machines. +The other architectures are developed and supported by +people outside the university. +Consequently, we are not able to directly test or maintain these +other architectures, so cannot comment on their robustness, +reliability, or completeness. +.Sh 2 "Bootstrapping from the tape" +.LP +The set of files on the distribution tape are as follows: +.IP 1) +A +.Xr dd (1) +(HP300), +.Xr tar (1) +(DECstation), or +.Xr dump (8) +(SPARC) image of the root filesystem +.IP 2) +A +.Xr tar +image of the +.Pn /var +filesystem +.IP 3) +A +.Xr tar +image of the +.Pn /usr +filesystem +.IP 4) +A +.Xr tar +image of +.Pn /usr/src/sys +.IP 5) +A +.Xr tar +image of +.Pn /usr/src +except sys and contrib +.IP 6) +A +.Xr tar +image of +.Pn /usr/src/contrib +.IP 7) +(8mm Exabyte tape distributions only) +A +.Xr tar +image of +.Pn /usr/src/X11R5 +.LP +The tape bootstrap procedure used to create a +working system involves the following major steps: +.IP 1) +Transfer a bootable root filesystem from the tape to a disk +and get it booted and running. +.IP 2) +Build and restore the +.Pn /var +and +.Pn /usr +filesystems from tape with +.Xr tar (1). +.IP 3) +Extract the system and utility source files as desired. +.PP +The following sections describe the above steps in detail. +The details of the first step vary between architectures. +The specific steps for the HP300, SPARC, and DECstation are +given in the next three sections respectively. +You should follow the instructions for your particular architecture. +In all sections, +commands you are expected to type are shown in italics, while that +information printed by the system is shown emboldened. +.Sh 2 "Booting the HP300" +.Sh 3 "Supported hardware" +.LP +The hardware supported by \*(4B for the HP300/400 is as follows: +.TS +center box; +lw(1i) lw(4i). +CPU's T{ +68020 based (318, 319, 320, 330 and 350), +68030 based (340, 345, 360, 370, 375, 400) and +68040 based (380, 425, 433). +T} +_ +DISK's T{ +HP-IB/CS80 (7912, 7914, 7933, 7936, 7945, 7957, 7958, 7959, 2200, 2203) +and SCSI-I (including magneto-optical). +T} +_ +TAPE's T{ +Low-density CS80 cartridge (7914, 7946, 9144), +high-density CS80 cartridge (9145), +HP SCSI DAT and +SCSI Exabyte. +T} +_ +RS232 T{ +98644 built-in single-port, 98642 4-port and 98638 8-port interfaces. +T} +_ +NETWORK T{ +98643 internal and external LAN cards. +T} +_ +GRAPHICS T{ +Terminal emulation and raw frame buffer support for +98544 / 98545 / 98547 (Topcat color & monochrome), +98548 / 98549 / 98550 (Catseye color & monochrome), +98700 / 98710 (Gatorbox), +98720 / 98721 (Renaissance), +98730 / 98731 (DaVinci) and +A1096A (Hyperion monochrome). +T} +_ +INPUT T{ +General interface supporting all HIL devices. +(e.g. keyboard, 2 and 3 button mice, ID module, ...) +T} +_ +MISC T{ +Battery-backed real time clock, +builtin and 98625A/B HP-IB interfaces, +builtin and 98658A SCSI interfaces, +serial printers and plotters on HP-IB, +and SCSI autochanger device. +T} +.TE +.LP +Major items that are not supported +include the 310 and 332 CPU's, 400 series machines +configured for Domain/OS, EISA and VME bus adaptors, audio, the centronics +port, 1/2" tape drives (7980), CD-ROM, and the PVRX/TVRX 3D graphics displays. +.Sh 3 "Standalone device file naming" +.LP +The standalone system device name syntax on the HP300 is of the form: +.DS +xx(a,c,u,p) +.DE +where +\fIxx\fP is the device type, +\fIa\fP specifies the adaptor to use, +\fIc\fP the controller, +\fIu\fP the unit, and +\fIp\fP a partition. +The \fIdevice type\fP differentiates the various disks and tapes and is one of: +``rd'' for HP-IB CS80 disks, +``ct'' for HP-IB CS80 cartridge tapes, or +``sd'' for SCSI-I disks +(SCSI-I tapes are currently not supported). +The \fIadaptor\fP field is a logical HP-IB or SCSI bus adaptor card number. +This will typically be +0 for SCSI disks, +0 for devices on the ``slow'' HP-IB interface (usually tapes) and +1 for devices on the ``fast'' HP-IB interface (usually disks). +To get a complete mapping of physical (select-code) to logical card numbers +just type a ^C at the standalone prompt. +The \fIcontroller\fP field is the disk or tape's target number on the +HP-IB or SCSI bus. +For SCSI the range is 0 to 6 (7 is the adaptor address) and +for HP-IB the range is 0 to 7. +The \fIunit\fP field is unused and should be 0. +The \fIpartition\fP field is interpreted differently for tapes +and disks: for disks it is a disk partition (in the range 0-7), +and for tapes it is a file number offset on the tape. +Thus, partition 2 of a SCSI disk drive at target 3 on SCSI bus 1 +would be ``sd(1,3,0,2)''. +If you have only one of any type bus adaptor, you may omit the adaptor +and controller numbers; +e.g. ``sd(0,2)'' could be used instead of ``sd(0,0,0,2)''. +The following examples always use the full syntax for clarity. +.Sh 3 "The procedure" +.LP +The basic steps involved in bringing up the HP300 are as follows: +.IP 1) +Obtain a second disk and format it, if necessary. +.IP 2) +Copy a root filesystem from the +tape onto the beginning of the disk. +.IP 3) +Boot the UNIX system on the new disk. +.IP 4) +(Optional) Build a root filesystem optimized for your disk. +.IP 5) +Label the disks with the +.Xr disklabel (8) +program. +.Sh 4 "Step 1: selecting and formatting a disk" +.PP +For your first system you will have to obtain a formatted disk +of a type given in the ``supported hardware'' list above. +If you want to load an entire binary system +(i.e., everything except +.Pn /usr/src ), +on the single disk you will need a minimum of 290MB, +ruling out anything smaller than a 7959B/S disk. +The disklabel included in the bootstrap root image is laid out +to accommodate this scenario. +Note that an HP SCSI magneto-optical disk will work fine for this case. +\*(4B will boot and run (albeit slowly) using one. +If you want to load source on a single disk system, +you will need at least 640MB (at least a 2213A SCSI or 2203A HP-IB disk). +A disk as small as the 7945A (54MB) can be used for the bootstrap +procedure but will hold only the root and primary swap partitions. +If you plan to use multiple disks, +refer to section 2.5 for suggestions on partitioning. +.PP +After selecting a disk, you may need to format it. +Since most HP disk drives come pre-formatted +(except optical media) +you probably will not, but if necessary, +you can format a disk under HP-UX using the +.Xr mediainit (1m) +program. +Once you have \*(4B up and running on one machine you can use the +.Xr scsiformat (8) +program to format additional SCSI disks. +Any additional HP-IB disks will have to be formatted using HP-UX. +.Sh 4 "Step 2: copying the root filesystem from tape to disk" +.PP +Once you have a formatted second disk you can use the +.Xr dd (1) +command under HP-UX to copy the root filesystem image from +the tape to the beginning of the second disk. +For HP's, the root filesystem image is the first file on the tape. +It includes a disklabel and bootblock along with the root filesystem. +An example command to copy the image from tape to the beginning of a disk is: +.DS +.ft CW +dd if=/dev/rmt/0m of=/dev/rdsk/1s0 bs=\*(Bzb +.DE +The actual special file syntax may vary depending on unit numbers and +the version of HP-UX that is running. +Consult the HP-UX +.Xr mt (7) +and +.Xr disk (7) +man pages for details. +.PP +Note that if you have a SCSI disk, you don't necessarily have to use +HP-UX (or an HP) to create the boot disk. +Any machine and operating system that will allow you to copy the +raw disk image out to block 0 of the disk will do. +.PP +If you have only a single machine with a single disk, +you may still be able to install and boot \*(4B if you have an +HP-IB cartridge tape drive. +If so, you can use a more difficult approach of booting a +standalone copy program from the tape, and using that to copy the +root filesystem image from the tape to the disk. +To do this, you need to extract the first file of the distribution tape +(the root image), copy it over to a machine with a cartridge drive +and then copy the image onto tape. +For example: +.DS +.ft CW +dd if=/dev/rst0 of=bootimage bs=\*(Bzb +rcp bootimage foo:/tmp/bootimage +<login to foo> +dd if=/tmp/bootimage of=/dev/rct/0m bs=\*(Bzb +.DE +Once this tape is created you can boot and run the standalone tape +copy program from it. +The copy program is loaded just as any other program would be loaded +by the bootrom in ``attended'' mode: +reset the CPU, +hold down the space bar until the word ``Keyboard'' appears in the +installed interface list, and +enter the menu selection for SYS_TCOPY. +Once loaded and running: +.DS +.TS +lw(2i) l. +\fBFrom:\fP \fI^C\fP (control-C to see logical adaptor assignments) +\fBhpib0 at sc7\fP +\fBscsi0 at sc14\fP +\fBFrom:\fP \fIct(0,7,0,0)\fP (HP-IB tape, target 7, first tape file) +\fBTo:\fP \fIsd(0,0,0,2)\fP (SCSI disk, target 0, third partition) +\fBCopy completed: 1728 records copied\fP +.TE +.DE +.LP +This copy will likely take 30 minutes or more. +.Sh 4 "Step 3: booting the root filesystem" +.PP +You now have a bootable root filesystem on the disk. +If you were previously running with two disks, +it would be best if you shut down the machine and turn off power on +the HP-UX drive. +It will be less confusing and it will eliminate any chance of accidentally +destroying the HP-UX disk. +If you used a cartridge tape for booting you should also unload the tape +at this point. +Whether you booted from tape or copied from disk you should now reboot +the machine and do another attended boot (see previous section), +this time with SYS_TBOOT. +Once loaded and running the boot program will display the CPU type and +prompt for a kernel file to boot: +.DS +.B +HP433 CPU +Boot +.R +\fB:\fP \fI/vmunix\fP +.DE +.LP +After providing the kernel name, the machine will boot \*(4B with +output that looks about like this: +.DS +.B +597480+34120+139288 start 0xfe8019ec +Copyright (c) 1982, 1986, 1989, 1991, 1993 + The Regents of the University of California. +Copyright (c) 1992 Hewlett-Packard Company +Copyright (c) 1992 Motorola Inc. +All rights reserved. + +4.4BSD UNIX #1: Tue Jul 20 11:40:36 PDT 1993 + mckusick@vangogh.CS.Berkeley.EDU:/usr/obj/sys/compile/GENERIC.hp300 +HP9000/433 (33MHz MC68040 CPU+MMU+FPU, 4k on-chip physical I/D caches) +real mem = xxx +avail mem = ### +using ### buffers containing ### bytes of memory +(... information about available devices ...) +root device? +.R +.DE +.PP +The first three numbers are printed out by the bootstrap program and +are the sizes of different parts of the system (text, initialized and +uninitialized data). The system also allocates several system data +structures after it starts running. The sizes of these structures are +based on the amount of available memory and the maximum count of active +users expected, as declared in a system configuration description. This +will be discussed later. +.PP +UNIX itself then runs for the first time and begins by printing out a banner +identifying the release and +version of the system that is in use and the date that it was compiled. +.PP +Next the +.I mem +messages give the +amount of real (physical) memory and the +memory available to user programs +in bytes. +For example, if your machine has 16Mb bytes of memory, then +\fBxxx\fP will be 16777216. +.PP +The messages that come out next show what devices were found on +the current processor. These messages are described in +.Xr autoconf (4). +The distributed system may not have +found all the communications devices you have +or all the mass storage peripherals you have, especially +if you have more than +two of anything. You will correct this when you create +a description of your machine from which to configure a site-dependent +version of UNIX. +The messages printed at boot here contain much of the information +that will be used in creating the configuration. +In a correctly configured system most of the information +present in the configuration description +is printed out at boot time as the system verifies that each device +is present. +.PP +The \*(lqroot device?\*(rq prompt was printed by the system +to ask you for the name of the root filesystem to use. +This happens because the distribution system is a \fIgeneric\fP +system, i.e., it can be bootstrapped on a cpu with its root device +and paging area on any available disk drive. +You will most likely respond to the root device question with ``sd0'' +if you are booting from a SCSI disk, +or with ``rd0'' if you are booting from an HP-IB disk. +This response shows that the disk it is running +on is drive 0 of type ``sd'' or ``rd'' respectively. +If you have other disks attached to the system, +it is possible that the drive you are using will not be configured +as logical drive 0. +Check the autoconfiguration messages printed out by the kernel to +make sure. +These messages will show the type of every logical drive +and their associated controller and slave addresses. +You will later build a system tailored to your configuration +that will not prompt you for a root device when it is bootstrapped. +.DS +\fBroot device?\fP \fI\*(Dk0\fP +\fBWARNING: preposterous time in filesystem \-\- CHECK AND RESET THE DATE!\fP +\fBerase ^?, kill ^U, intr ^C\fP +\fB#\fP +.DE +.PP +The \*(lqerase ...\*(rq message is part of the +.Pn /.profile +that was executed by the root shell when it started. This message +tells you about the settings of the character erase, +line erase, and interrupt characters. +.PP +UNIX is now running, +and the \fIUNIX Programmer's Manual\fP applies. The ``#'' is the prompt +from the Bourne shell, and lets you know that you are the super-user, +whose login name is \*(lqroot\*(rq. +.PP +At this point, the root filesystem is mounted read-only. +Before continuing the installation, the filesystem needs to be ``updated'' +to allow writing and device special files for the following steps need +to be created. +This is done as follows: +.DS +.TS +lw(2i) l. +\fB#\fP \fImount_mfs -s 1000 -T type /dev/null /tmp\fP (create a writable filesystem) +(\fItype\fP is the disk type as determined from /etc/disktab) +\fB#\fP \fIcd /tmp\fP (connect to that directory) +\fB#\fP \fI../dev/MAKEDEV \*(Dk#\fP (create special files for root disk) +(\fI\*(Dk\fP is the disk type, \fI#\fP is the unit number) +(ignore warning from ``sh'') +\fB#\fP \fImount \-uw /tmp/\*(Dk#a /\fP (read-write mount root filesystem) +\fB#\fP \fIcd /dev\fP (go to device directory) +\fB#\fP \fI./MAKEDEV \*(Dk#\fP (create permanent special files for root disk) +(again, ignore warning from ``sh'') +.TE +.DE +.Sh 4 "Step 4: (optional) restoring the root filesystem" +.PP +The root filesystem that you are currently running on is complete, +however it probably is not optimally laid out for the disk on +which you are running. +If you will be cloning copies of the system onto multiple disks for +other machines, you are advised to connect one of these disks to +this machine, and build and restore a properly laid out root filesystem +onto it. +If this is the only machine on which you will be running \*(4B +or peak performance is not an issue, you can skip this step and +proceed directly to step 5. +.PP +Connect a second disk to your machine. +If you bootstrapped using the two disk method, you can +overwrite your initial HP-UX disk, as it will no longer +be needed (assuming you have no plans to run HP-UX again). +.PP +To really create the root filesystem on drive 1 +you should first label the disk as described in step 5 below. +Then run the following commands: +.DS +\fB#\fP \fIcd /dev\fP +\fB#\fP \fI./MAKEDEV \*(Dk1a\fP +\fB#\fP\|\fInewfs /dev/r\*(Dk1a\fP +\fB#\fP\|\fImount /dev/\*(Dk1a /mnt\fP +\fB#\fP\|\fIcd /mnt\fP +\fB#\fP\|\fIdump 0f \- /dev/r\*(Dk0a | restore xf \-\fP +(Note: restore will ask if you want to ``set owner/mode for '.''' +to which you should reply ``yes''.) +.DE +.PP +When this completes, +you should then shut down the system, and boot on the disk that +you just created following the procedure in step (3) above. +.Sh 4 "Step 5: placing labels on the disks" +.PP +For each disk on the HP300, \*(4B places information about the geometry +of the drive and the partition layout at byte offset 1024. +This information is written with +.Xr disklabel (8). +.PP +The root image just loaded includes a ``generic'' label intended to allow +easy installation of the root and +.Pn /usr +and may not be suitable for the actual +disk on which it was installed. +In particular, +it may make your disk appear larger or smaller than its real size. +In the former case, you lose some capacity. +In the latter, some of the partitions may map non-existent sectors +leading to errors if those partitions are used. +It is also possible that the defined geometry will interact poorly with +the filesystem code resulting in reduced performance. +However, as long as you are willing to give up a little space, +not use certain partitions or suffer minor performance degradation, +you might want to avoid this step; +especially if you do not know how to use +.Xr ed (1). +.PP +If you choose to edit this label, +you can fill in correct geometry information from +.Pn /etc/disktab . +You may also want to rework the ``e'' and ``f'' partitions used for loading +.Pn /usr +and +.Pn /var . +You should not attempt to, and +.Xr disklabel +will not let you, modify the ``a'', ``b'' and ``d'' partitions. +To edit a label: +.DS +\fB#\fP \fIEDITOR=ed\fP +\fB#\fP \fIexport EDITOR\fP +\fB#\fP \fIdisklabel -r -e /dev/r\fBXX#\fPd +.DE +where \fBXX\fP is the type and \fB#\fP is the logical drive number; e.g. +.Pn /dev/rsd0d +or +.Pn /dev/rrd0d . +Note the explicit use of the ``d'' partition. +This partition includes the bootblock as does ``c'' +and using it allows you to change the size of ``c''. +.PP +If you wish to label any additional disks, run the following command for each: +.DS +\fB#\|\fP\fIdisklabel -rw \fBXX# type\fP \fI"optional_pack_name"\fP +.DE +where \fBXX#\fP is the same as in the previous command +and \fBtype\fP is the HP300 disk device name as listed in +.Pn /etc/disktab . +The optional information may contain any descriptive name for the +contents of a disk, and may be up to 16 characters long. This procedure +will place the label on the disk using the information found in +.Pn /etc/disktab +for the disk type named. +If you have changed the disk partition sizes, +you may wish to add entries for the modified configuration in +.Pn /etc/disktab +before labeling the affected disks. +.PP +You have now completed the HP300 specific part of the installation. +Now proceed to the generic part of the installation +described starting in section 2.5 below. +Note that where the disk name ``sd'' is used throughout section 2.5, +you should substitute the name ``rd'' if you are running on an HP-IB disk. +Also, if you are loading on a single disk with the default disklabel, +.Pn /var +should be restored to the ``f'' partition and +.Pn /usr +to the ``e'' partition. +.Sh 2 "Booting the SPARC" +.Sh 3 "Supported hardware" +.LP +The hardware supported by \*(4B for the SPARC is as follows: +.TS +center box; +lw(1i) lw(4i). +CPU's T{ +SPARCstation 1 series (1, 1+, SLC, IPC) and +SPARCstation 2 series (2, IPX). +T} +_ +DISK's T{ +SCSI. +T} +_ +TAPE's T{ +none. +T} +_ +NETWORK T{ +SPARCstation Lance (le). +T} +_ +GRAPHICS T{ +bwtwo and cgthree. +T} +_ +INPUT T{ +Keyboard and mouse. +T} +_ +MISC T{ +Battery-backed real time clock, +built-in serial devices, +Sbus SCSI controller, +and audio device. +T} +.TE +.LP +Major items that are not supported include +anything VME-based, +the GX (cgsix) display, +the floppy disk, and SCSI tapes. +.Sh 3 "Limitations" +.LP +There are several important limitations on the \*(4B distribution +for the SPARC: +.IP 1) +You +.B must +have SunOS 4.1.x or Solaris to bring up \*(4B. +There is no SPARCstation bootstrap code in this distribution. The +Sun-supplied boot loader will be used to boot \*(4B; you must copy +this from your SunOS distribution. This imposes several +restrictions on the system, as detailed below. +.IP 2) +The \*(4B SPARC kernel does not remap SCSI IDs. A SCSI disk at +target 0 will become ``sd0'', where in SunOS the same disk will +normally be called ``sd3''. If your existing SunOS system is +diskful, it will be least painful to have SunOS running on the disk +on target 0 lun 0 and put \*(4B on the disk on target 3 lun 0. Both +systems will then think they are running on ``sd0'', and you can +boot either system as needed simply by changing the EEPROM's boot +device. +.IP 3) +There is no SCSI tape driver. +You must have another system for tape reading and backups. +.IP 4) +Although the \*(4B SPARC kernel will handle existing SunOS shared +libraries, it does not use or create them itself, and therefore +requires much more disk space than SunOS does. +.IP 5) +It is currently difficult (though not completely impossible) to +run \*(4B diskless. These instructions assume you will have a local +boot, swap, and root filesystem. +.IP 6) +When using a serial port rather than a graphics display as the console, +only port +.Pn ttya +can be used. +Attempts to use port +.Pn ttyb +will fail when the kernel tries +to print the boot up messages to the console. +.Sh 3 "The procedure" +.PP +You must have a spare disk on which to place \*(4B. +The steps involved in bootstrapping this tape are as follows: +.IP 1) +Bring up SunOS (preferably SunOS 4.1.x or Solaris 1.x, although +Solaris 2 may work \(em this is untested). +.IP 2) +Attach auxiliary SCSI disk(s). Format and label using the +SunOS formatting and labeling programs as needed. +Note that the root filesystem currently requires at least 10 MB; 16 MB +or more is recommended. The b partition will be used for swap; +this should be at least 32 MB. +.IP 3) +Use the SunOS +.Xr newfs +to build the root filesystem. You may also +want to build other filesystems at the same time. (By default, the +\*(4B +.Xr newfs +builds a filesystem that SunOS will not handle; if you +plan to switch OSes back and forth you may want to sacrifice the +performance gain from the new filesystem format for compatibility.) +You can build an old-format filesystem on \*(4B by giving the \-O +option to +.Xr newfs (8). +.Xr Fsck (8) +can convert old format filesystems to new format +filesystems, but not vice versa, +so you may want to initially build old format filesystems so that they +can be mounted under SunOS, +and then later convert them to new format filesystems when you are +satisfied that \*(4B is running properly. +In any case, +.B +you must build an old-style root filesystem +.R +so that the SunOS boot program will work. +.IP 4) +Mount the new root, then copy the SunOS +.Pn /boot +into place and use the SunOS ``installboot'' program +to enable disk-based booting. +Note that the filesystem must be mounted when you do the ``installboot'': +.DS +.ft CW +# mount /dev/sd3a /mnt +# cp /boot /mnt/boot +# cd /usr/kvm/mdec +# installboot /mnt/boot bootsd /dev/rsd3a +.DE +The SunOS +.Pn /boot +will load \*(4B kernels; there is no SPARCstation +bootstrap code on the distribution. Note that the SunOS +.Pn /boot +does not handle the new \*(4B filesystem format. +.IP 5) +Restore the contents of the \*(4B root filesystem. +.DS +.ft CW +# cd /mnt +# rrestore xf tapehost:/dev/nrst0 +.DE +.IP 6) +Boot the supplied kernel: +.DS +.ft CW +# halt +ok boot sd(0,3)vmunix -s [for old proms] OR +ok boot disk3 -s [for new proms] +\&... [\*(4B boot messages] +.DE +.LP +To install the remaining filesystems, use the procedure described +starting in section 2.5. +In these instructions, +.Pn /usr +should be loaded into the ``e'' partition and +.Pn /var +in the ``f'' partition. +.LP +After completing the filesystem installation you may want +to set up \*(4B to reboot automatically: +.DS +.ft CW +# halt +ok setenv boot-from sd(0,3)vmunix [for old proms] OR +ok setenv boot-device disk3 [for new proms] +.DE +If you build backwards-compatible filesystems, either with the SunOS +newfs or with the \*(4B ``\-O'' option, you can mount these under +SunOS. The SunOS fsck will, however, always think that these filesystems +are corrupted, as there are several new (previously unused) +superblock fields that are updated in \*(4B. Running ``fsck \-b32'' +and letting it ``fix'' the superblock will take care of this. +.sp 0.5 +If you wish to run SunOS binaries that use SunOS shared libraries, you +simply need to copy all the dynamic linker files from an existing +SunOS system: +.DS +.ft CW +# rcp sunos-host:/etc/ld.so.cache /etc/ +# rcp sunos-host:'/usr/lib/*.so*' /usr/lib/ +.DE +The SunOS compiler and linker should be able to produce SunOS binaries +under \*(4B, but this has not been tested. If you plan to try it you +will need the appropriate .sa files as well. +.Sh 2 "Booting the DECstation" +.Sh 3 "Supported hardware" +.LP +The hardware supported by \*(4B for the DECstation is as follows: +.TS +center box; +lw(1i) lw(4i). +CPU's T{ +R2000 based (3100) and +R3000 based (5000/200, 5000/20, 5000/25, 5000/1xx). +T} +_ +DISK's T{ +SCSI-I (tested RZ23, RZ55, RZ57, Maxtor 8760S). +T} +_ +TAPE's T{ +SCSI-I (tested DEC TK50, Archive DAT, Emulex MT02). +T} +_ +RS232 T{ +Internal DEC dc7085 and AMD 8530 based interfaces. +T} +_ +NETWORK T{ +TURBOchannel PMAD-AA and internal LANCE based interfaces. +T} +_ +GRAPHICS T{ +Terminal emulation and raw frame buffer support for +3100 (color & monochrome), +TURBOchannel PMAG-AA, PMAG-BA, PMAG-DV. +T} +_ +INPUT T{ +Standard DEC keyboard (LK201) and mouse. +T} +_ +MISC T{ +Battery-backed real time clock, +internal and TURBOchannel PMAZ-AA SCSI interfaces. +T} +.TE +.LP +Major items that are not supported include the 5000/240 +(there is code but not compiled in or tested), +R4000 based machines, FDDI and audio interfaces. +Diskless machines are not supported but booting kernels and bootstrapping +over the network is supported on the 5000 series. +.Sh 3 "The procedure" +.PP +The first file on the distribution tape is a tar file that contains +four files. +The first step requires a running UNIX (or ULTRIX) system that can +be used to extract the tar archive from the first file on the tape. +The command: +.DS +.ft CW +tar xf /dev/rmt0 +.DE +will extract the following four files: +.DS +A) root.image: \fIdd\fP image of the root filesystem +B) vmunix.tape: \fIdd\fP image for creating boot tapes +C) vmunix.net: file for booting over the network +D) root.dump: \fIdump\fP image of the root filesystem +.DE +There are three basic ways a system can be bootstrapped corresponding to the +first three files. +You may want to read the section on bootstrapping the HP300 +since many of the steps are similar. +A spare, formatted SCSI disk is also useful. +.Sh 4 "Procedure A: copy root filesystem to disk" +.PP +This procedure is similar to the HP300. +If you have an extra disk, the easiest approach is to use \fIdd\fP\|(1) +under ULTRIX to copy the root filesystem image to the beginning +of the spare disk. +The root filesystem image includes a disklabel and bootblock along with the +root filesystem. +An example command to copy the image to the beginning of a disk is: +.DS +.ft CW +dd if=root.image of=/dev/rz1c bs=\*(Bzb +.DE +The actual special file syntax will vary depending on unit numbers and +the version of ULTRIX that is running. +This system is now ready to boot. You can boot the kernel with one of the +following PROM commands. If you are booting on a 3100, the disk must be SCSI +id zero because of a bug. +.DS +.ft CW +DEC 3100: boot \-f rz(0,0,0)vmunix +DEC 5000: boot 5/rz0/vmunix +.DE +You can then proceed to section 2.5 +to create reasonable disk partitions for your machine +and then install the rest of the system. +.Sh 4 "Procedure B: bootstrap from tape" +.PP +If you have only a single machine with a single disk, +you need to use the more difficult approach of booting a +kernel and mini-root from tape or the network, and using it to restore +the root filesystem. +.PP +First, you will need to create a boot tape. This can be done using +\fIdd\fP as in the following example. +.DS +.ft CW +dd if=vmunix.tape of=/dev/nrmt0 bs=1b +dd if=root.dump of=/dev/nrmt0 bs=\*(Bzb +.DE +The actual special file syntax for the tape drive will vary depending on +unit numbers, tape device and the version of ULTRIX that is running. +.PP +The first file on the boot tape contains a boot header, kernel, and +mini-root filesystem that the PROM can copy into memory. +Installing from tape has only been tested +on a 3100 and a 5000/200 using a TK50 tape drive. Here are two example +PROM commands to boot from tape. +.DS +.ft CW +DEC 3100: boot \-f tz(0,5,0) m # 5 is the SCSI id of the TK50 +DEC 5000: boot 5/tz6 m # 6 is the SCSI id of the TK50 +.DE +The `m' argument tells the kernel to look for a root filesystem in memory. +Next you should proceed to section 2.4.3 to build a disk-based root filesystem. +.Sh 4 "Procedure C: bootstrap over the network" +.PP +You will need a host machine that is running the \fIbootp\fP server +with the +.Pn vmunix.net +file installed in the default directory defined by the +configuration file for +.Xr bootp . +Here are two example PROM commands to boot across the net: +.DS +.ft CW +DEC 3100: boot \-f tftp()vmunix.net m +DEC 5000: boot 6/tftp/vmunix.net m +.DE +This command should load the kernel and mini-root into memory and +run the same as the tape install (procedure B). +The rest of the steps are the same except +you will need to start the network +(if you are unsure how to fill in the <name> fields below, +see sections 4.4 and 5). +Execute the following to start the networking: +.DS +.ft CW +# mount \-uw / +# echo 127.0.0.1 localhost >> /etc/hosts +# echo <your.host.inet.number> myname.my.domain myname >> /etc/hosts +# echo <friend.host.inet.number> myfriend.my.domain myfriend >> /etc/hosts +# ifconfig le0 inet myname +.DE +Next you should proceed to section 2.4.3 to build a disk-based root filesystem. +.Sh 3 "Label disk and create the root filesystem" +.LP +There are five steps to create a disk-based root filesystem. +.IP 1) +Label the disk. +.DS +.ft CW +# disklabel -W /dev/rrz?c # This enables writing the label +# disklabel -w -r -B /dev/rrz?c $DISKTYPE +# newfs /dev/rrz?a +\&... +# fsck /dev/rrz?a +\&... +.DE +Supported disk types are listed in +.Pn /etc/disktab . +.IP 2) +Restore the root filesystem. +.DS +.ft CW +# mount \-uw / +# mount /dev/rz?a /a +# cd /a +.DE +.ti +0.4i +If you are restoring locally (procedure B), run: +.DS +.ft CW +# mt \-f /dev/nrmt0 rew +# restore \-xsf 2 /dev/rmt0 +.DE +.ti +0.4i +If you are restoring across the net (procedure c), run: +.DS +.ft CW +# rrestore xf myfriend:/path/to/root.dump +.DE +.ti +0.4i +When the restore finishes, clean up with: +.DS +.ft CW +# cd / +# sync +# umount /a +# fsck /dev/rz?a +.DE +.IP 3) +Reset the system and initialize the PROM monitor to boot automatically. +.DS +.ft CW +DEC 3100: setenv bootpath boot \-f rz(0,?,0)vmunix +DEC 5000: setenv bootpath 5/rz?/vmunix -a +.DE +.IP 4) +After booting UNIX, you will need to create +.Pn /dev/mouse +to run X windows as in the following example. +.DS +.ft CW +rm /dev/mouse +ln /dev/xx /dev/mouse +.DE +The 'xx' should be one of the following: +.DS +pm0 raw interface to PMAX graphics devices +cfb0 raw interface to TURBOchannel PMAG-BA color frame buffer +xcfb0 raw interface to maxine graphics devices +mfb0 raw interface to mono graphics devices +.DE +You can then proceed to section 2.5 to install the rest of the system. +Note that where the disk name ``sd'' is used throughout section 2.5, +you should substitute the name ``rz''. +.Sh 2 "Disk configuration" +.PP +All architectures now have a root filesystem up and running and +proceed from this point to layout filesystems to make use +of the available space and to balance disk load for better system +performance. +.Sh 3 "Disk naming and divisions" +.PP +Each physical disk drive can be divided into up to 8 partitions; +UNIX typically uses only 3 or 4 partitions. +For instance, the first partition, \*(Dk0a, +is used for a root filesystem, a backup thereof, +or a small filesystem like, +.Pn /var/tmp ; +the second partition, \*(Dk0b, +is used for paging and swapping; and +a third partition, typically \*(Dk0e, +holds a user filesystem. +.PP +The space available on a disk varies per device. +Each disk typically has a paging area of 30 to 100 megabytes +and a root filesystem of about 17 megabytes. +.\" XXX check +The distributed system binaries occupy about 150 (180 with X11R5) megabytes +.\" XXX check +while the major sources occupy another 250 (340 with X11R5) megabytes. +The +.Pn /var +filesystem as delivered on the tape is only 2Mb, +however it should have at least 50Mb allocated to it just for +normal system activity. +Usually it is allocated the last partition on the disk +so that it can provide as much space as possible to the +.Pn /var/users +filesystem. +See section 2.5.4 for further details on disk layouts. +.PP +Be aware that the disks have their sizes +measured in disk sectors (usually 512 bytes), while the UNIX filesystem +blocks are variable sized. +If +.Sm BLOCKSIZE=1k +is set in the user's environment, all user programs report +disk space in kilobytes, otherwise, +disk sizes are always reported in units of 512-byte sectors\**. +.FS +You can thank System V intransigence and POSIX duplicity for +requiring that 512-byte blocks be the units that programs report. +.FE +The +.Pn /etc/disktab +file used in labelling disks and making filesystems +specifies disk partition sizes in sectors. +.Sh 3 "Layout considerations" +.PP +There are several considerations in deciding how +to adjust the arrangement of things on your disks. +The most important is making sure that there is adequate space +for what is required; secondarily, throughput should be maximized. +Paging space is an important parameter. +The system, as distributed, sizes the configured +paging areas each time the system is booted. Further, +multiple paging areas of different sizes may be interleaved. +.PP +Many common system programs (C, the editor, the assembler etc.) +create intermediate files in the +.Pn /tmp +directory, so the filesystem where this is stored also should be made +large enough to accommodate most high-water marks. +Typically, +.Pn /tmp +is constructed from a memory-based filesystem (see +.Xr mount_mfs (8)). +Programs that want their temporary files to persist +across system reboots (such as editors) should use +.Pn /var/tmp . +If you plan to use a disk-based +.Pn /tmp +filesystem to avoid loss across system reboots, it makes +sense to mount this in a ``root'' (i.e. first partition) +filesystem on another disk. +All the programs that create files in +.Pn /tmp +take care to delete them, but are not immune to rare events +and can leave dregs. +The directory should be examined every so often and the old +files deleted. +.PP +The efficiency with which UNIX is able to use the CPU +is often strongly affected by the configuration of disk controllers; +it is critical for good performance to balance disk load. +There are at least five components of the disk load that you can +divide between the available disks: +.IP 1) +The root filesystem. +.IP 2) +The +.Pn /var +and +.Pn /var/tmp +filesystems. +.IP 3) +The +.Pn /usr +filesystem. +.IP 4) +The user filesystems. +.IP 5) +The paging activity. +.LP +The following possibilities are ones we have used at times +when we had 2, 3 and 4 disks: +.TS +center doublebox; +l | c s s +l | lw(5) | lw(5) | lw(5). + disks +what 2 3 4 +_ +root 0 0 0 +var 1 2 3 +usr 1 1 1 +paging 0+1 0+2 0+2+3 +users 0 0+2 0+2 +archive x x 3 +.TE +.PP +The most important things to consider are to +even out the disk load as much as possible, and to do this by +decoupling filesystems (on separate arms) between which heavy copying occurs. +Note that a long term average balanced load is not important; it is +much more important to have an instantaneously balanced +load when the system is busy. +.PP +Intelligent experimentation with a few filesystem arrangements can +pay off in much improved performance. It is particularly easy to +move the root, the +.Pn /var +and +.Pn /var/tmp +filesystems and the paging areas. Place the +user files and the +.Pn /usr +directory as space needs dictate and experiment +with the other, more easily moved filesystems. +.Sh 3 "Filesystem parameters" +.PP +Each filesystem is parameterized according to its block size, +fragment size, and the disk geometry characteristics of the +medium on which it resides. Inaccurate specification of the disk +characteristics or haphazard choice of the filesystem parameters +can result in substantial throughput degradation or significant +waste of disk space. As distributed, +filesystems are configured according to the following table. +.DS +.TS +center; +l l l. +Filesystem Block size Fragment size +_ +root 8 kbytes 1 kbytes +usr 8 kbytes 1 kbytes +users 4 kbytes 512 bytes +.TE +.DE +.PP +The root filesystem block size is +made large to optimize bandwidth to the associated disk. +The large block size is important as many of the most +heavily used programs are demand paged out of the +.Pn /bin +directory. +The fragment size of 1 kbyte is a ``nominal'' value to use +with a filesystem. With a 1 kbyte fragment size +disk space utilization is about the same +as with the earlier versions of the filesystem. +.PP +The filesystems for users have a 4 kbyte block +size with 512 byte fragment size. These parameters +have been selected based on observations of the +performance of our user filesystems. The 4 kbyte +block size provides adequate bandwidth while the +512 byte fragment size provides acceptable space compaction +and disk fragmentation. +.PP +Other parameters may be chosen in constructing filesystems, +but the factors involved in choosing a block +size and fragment size are many and interact in complex +ways. Larger block sizes result in better +throughput to large files in the filesystem as +larger I/O requests will then be done by the +system. However, +consideration must be given to the average file sizes +found in the filesystem and the performance of the +internal system buffer cache. The system +currently provides space in the inode for +12 direct block pointers, 1 single indirect block +pointer, 1 double indirect block pointer, +and 1 triple indirect block pointer. +If a file uses only direct blocks, access time to +it will be optimized by maximizing the block size. +If a file spills over into an indirect block, +increasing the block size of the filesystem may +decrease the amount of space used +by eliminating the need to allocate an indirect block. +However, if the block size is increased and an indirect +block is still required, then more disk space will be +used by the file because indirect blocks are allocated +according to the block size of the filesystem. +.PP +In selecting a fragment size for a filesystem, at least +two considerations should be given. The major performance +tradeoffs observed are between an 8 kbyte block filesystem +and a 4 kbyte block filesystem. Because of implementation +constraints, the block size versus fragment size ratio can not +be greater than 8. This means that an 8 kbyte filesystem +will always have a fragment size of at least 1 kbytes. If +a filesystem is created with a 4 kbyte block size and a +1 kbyte fragment size, then upgraded to an 8 kbyte block size +and 1 kbyte fragment size, identical space compaction will be +observed. However, if a filesystem has a 4 kbyte block size +and 512 byte fragment size, converting it to an 8K/1K +filesystem will result in 4-8% more space being +used. This implies that 4 kbyte block filesystems that +might be upgraded to 8 kbyte blocks for higher performance should +use fragment sizes of at least 1 kbytes to minimize the amount +of work required in conversion. +.PP +A second, more important, consideration when selecting the +fragment size for a filesystem is the level of fragmentation +on the disk. With an 8:1 fragment to block ratio, storage fragmentation +occurs much sooner, particularly with a busy filesystem running +near full capacity. By comparison, the level of fragmentation in a +4:1 fragment to block ratio filesystem is one tenth as severe. This +means that on filesystems where many files are created and +deleted, the 512 byte fragment size is more likely to result in apparent +space exhaustion because of fragmentation. That is, when the filesystem +is nearly full, file expansion that requires locating a +contiguous area of disk space is more likely to fail on a 512 +byte filesystem than on a 1 kbyte filesystem. To minimize +fragmentation problems of this sort, a parameter in the super +block specifies a minimum acceptable free space threshold. When +normal users (i.e. anyone but the super-user) attempt to allocate +disk space and the free space threshold is exceeded, the user is +returned an error as if the filesystem were really full. This +parameter is nominally set to 5%; it may be changed by supplying +a parameter to +.Xr newfs (8), +or by updating the super block of an existing filesystem using +.Xr tunefs (8). +.PP +Finally, a third, less common consideration is the attributes of +the disk itself. The fragment size should not be smaller than the +physical sector size of the disk. As an example, the HP magneto-optical +disks have 1024 byte physical sectors. Using a 512 byte fragment size +on such disks will work but is extremely inefficient. +.PP +Note that the above discussion considers block sizes of up to only 8k. +As of the 4.4 release, the maximum block size has been increased to 64k. +This allows an entirely new set of block/fragment combinations for which +there is little experience to date. +In general though, unless a filesystem is to be used +for a special purpose application (for example, storing +image processing data), we recommend using the +values supplied above. +Remember that the current +implementation limits the block size to at most 64 kbytes +and the ratio of block size versus fragment size must be 1, 2, 4, or 8. +.PP +The disk geometry information used by the filesystem +affects the block layout policies employed. The file +.Pn /etc/disktab , +as supplied, contains the data for most +all drives supported by the system. Before constructing +a filesystem with +.Xr newfs (8) +you should label the disk (if it has not yet been labeled, +and the driver supports labels). +If labels cannot be used, you must instead +specify the type of disk on which the filesystem resides; +.Xr newfs +then reads +.Pn /etc/disktab +instead of the pack label. +This file also contains the default +filesystem partition +sizes, and default block and fragment sizes. To +override any of the default values you can modify the file, +edit the disk label, +or use an option to +.Xr newfs . +.Sh 3 "Implementing a layout" +.PP +To put a chosen disk layout into effect, you should use the +.Xr newfs (8) +command to create each new filesystem. +Each filesystem must also be added to the file +.Pn /etc/fstab +so that it will be checked and mounted when the system is bootstrapped. +.PP +First we will consider a system with a single disk. +There is little real choice on how to do the layout; +the root filesystem goes in the ``a'' partition, +.Pn /usr +goes in the ``e'' partition, and +.Pn /var +fills out the remainder of the disk in the ``f'' partition. +This is the organization used if you loaded the disk-image root filesystem. +With the addition of a memory-based +.Pn /tmp +filesystem, its fstab entry would be as follows: +.TS +center; +lfC lfC l l n n. +/dev/\*(Dk0a / ufs rw 1 1 +/dev/\*(Dk0b none swap sw 0 0 +/dev/\*(Dk0b /tmp mfs rw,-s=14000,-b=8192,-f=1024,-T=sd660 0 0 +/dev/\*(Dk0e /usr ufs ro 1 2 +/dev/\*(Dk0f /var ufs rw 1 2 +.TE +.PP +If we had a second disk, we would split the load between the drives. +On the second disk, we place the +.Pn /usr +and +.Pn /var +filesystems in their usual \*(Dk1e and \*(Dk1f +partitions respectively. +The \*(Dk1b partition would be used as a second paging area, +and the \*(Dk1a partition left as a spare root filesystem +(alternatively \*(Dk1a could be used for +.Pn /var/tmp ). +The first disk still holds the +the root filesystem in \*(Dk0a, and the primary swap area in \*(Dk0b. +The \*(Dk0e partition is used to hold home directories in +.Pn /var/users . +The \*(Dk0f partition can be used for +.Pn /usr/src +or alternately the \*(Dk0e partition can be extended to cover +the rest of the disk with +.Xr disklabel (8). +As before, the +.Pn /tmp +directory is a memory-based filesystem. +Note that to interleave the paging between the two disks +you must build a system configuration that specifies: +.DS +config vmunix root on \*(Dk0 swap on \*(Dk0 and \*(Dk1 +.DE +The +.Pn /etc/fstab +file would then contain +.TS +center; +lfC lfC l l n n. +/dev/\*(Dk0a / ufs rw 1 1 +/dev/\*(Dk0b none swap sw 0 0 +/dev/\*(Dk1b none swap sw 0 0 +/dev/\*(Dk0b /tmp mfs rw,-s=14000,-b=8192,-f=1024,-T=sd660 0 0 +/dev/\*(Dk1e /usr ufs ro 1 2 +/dev/\*(Dk0f /usr/src ufs rw 1 2 +/dev/\*(Dk1f /var ufs rw 1 2 +/dev/\*(Dk0e /var/users ufs rw 1 2 +.TE +.PP +To make the +.Pn /var +filesystem we would do: +.DS +\fB#\fP \fIcd /dev\fP +\fB#\fP \fIMAKEDEV \*(Dk1\fP +\fB#\fP \fIdisklabel -wr \*(Dk1 "disk type" "disk name"\fP +\fB#\fP \fInewfs \*(Dk1f\fP +(information about filesystem prints out) +\fB#\fP \fImkdir /var\fP +\fB#\fP \fImount /dev/\*(Dk1f /var\fP +.DE +.Sh 2 "Installing the rest of the system" +.PP +At this point you should have your disks partitioned. +The next step is to extract the rest of the data from the tape. +At a minimum you need to set up the +.Pn /var +and +.Pn /usr +filesystems. +You may also want to extract some or all the program sources. +Since not all architectures support tape drives or don't support the +correct ones, you may need to extract the files indirectly using +.Xr rsh (1). +For example, for a directly connected tape drive you might do: +.DS +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP +\fB#\fP \fItar xbpf \*(Bz /dev/nr\*(Mt0\fP +.DE +The equivalent indirect procedure (where the tape drive is on machine ``foo'') +is: +.DS +\fB#\fP \fIrsh foo mt -f /dev/nr\*(Mt0 fsf\fP +\fB#\fP \fIrsh foo dd if=/dev/nr\*(Mt0 bs=\*(Bzb | tar xbpf \*(Bz -\fP +.DE +Obviously, the target machine must be connected to the local network +for this to work. +To do this: +.DS +\fB#\fP \fIecho 127.0.0.1 localhost >> /etc/hosts\fP +\fB#\fP \fIecho \fPyour.host.inet.number myname.my.domain myname\fI >> /etc/hosts\fP +\fB#\fP \fIecho \fPfriend.host.inet.number myfriend.my.domain myfriend\fI >> /etc/hosts\fP +\fB#\fP \fIifconfig le0 inet \fPmyname +.DE +where the ``host.inet.number'' fields are the IP addresses for your host and +the host with the tape drive +and the ``my.domain'' fields are the names of your machine and the tape-hosting +machine. +See sections 4.4 and 5 for more information on setting up the network. +.PP +Assuming a directly connected tape drive, here is how to extract and +install +.Pn /var +and +.Pn /usr : +.br +.ne 5 +.TS +lw(2i) l. +\fB#\fP \fImount \-uw /dev/\*(Dk#a /\fP (read-write mount root filesystem) +\fB#\fP \fIdate yymmddhhmm\fP (set date, see \fIdate\fP\|(1)) +\&.... +\fB#\fP \fIpasswd -l root\fP (set password for super-user) +\fBNew password:\fP (password will not echo) +\fBRetype new password:\fP +\fB#\fP \fIpasswd -l toor\fP (set password for super-user) +\fBNew password:\fP (password will not echo) +\fBRetype new password:\fP +\fB#\fP \fIhostname mysitename\fP (set your hostname) +\fB#\fP \fInewfs r\*(Dk#p\fP (create empty user filesystem) +(\fI\*(Dk\fP is the disk type, \fI#\fP is the unit number, +\fIp\fP is the partition; this takes a few minutes) +\fB#\fP \fImount /dev/\*(Dk#p /var\fP (mount the var filesystem) +\fB#\fP \fIcd /var\fP (make /var the current directory) +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP (space to end of previous tape file) +\fB#\fP \fItar xbpf \*(Bz /dev/nr\*(Mt0\fP (extract all of var) +(this takes a few minutes) +\fB#\fP \fInewfs r\*(Dk#p\fP (create empty user filesystem) +(as before \fI\*(Dk\fP is the disk type, \fI#\fP is the unit number, +\fIp\fP is the partition) +\fB#\fP \fImount /dev/\*(Dk#p /mnt\fP (mount the new /usr in temporary location) +\fB#\fP \fIcd /mnt\fP (make /mnt the current directory) +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP (space to end of previous tape file) +\fB#\fP \fItar xbpf \*(Bz /dev/nr\*(Mt0\fP (extract all of usr except usr/src) +(this takes about 15-20 minutes) +\fB#\fP \fIcd /\fP (make / the current directory) +\fB#\fP \fIumount /mnt\fP (unmount from temporary mount point) +\fB#\fP \fIrm -r /usr/*\fP (remove excess bootstrap binaries) +\fB#\fP \fImount /dev/\*(Dk#p /usr\fP (remount /usr) +.TE +If no disk label has been installed on the disk, the +.Xr newfs +command will require a third argument to specify the disk type, +using one of the names in +.Pn /etc/disktab . +If the tape had been rewound or positioned incorrectly before the +.Xr tar , +to extract +.Pn /var +it may be repositioned by the following commands. +.DS +\fB#\fP \fImt -f /dev/nr\*(Mt0 rew\fP +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf 1\fP +.DE +The data on the second and third tape files has now been extracted. +If you are using 6250bpi tapes, the first reel of the +distribution is no longer needed; you should now mount the second +reel instead. The installation procedure continues from this +point on the 8mm tape. +The next step is to extract the sources. +As previously noted, +.Pn /usr/src +.\" XXX Check +requires about 250-340Mb of space. +Ideally sources should be in a separate filesystem; +if you plan to put them into your +.Pn /usr +filesystem, it will need at least 500Mb of space. +Assuming that you will be using a separate filesystem on \*(Dk0f for +.Pn /usr/src , +you will start by creating and mounting it: +.DS +\fB#\fP \fInewfs \*(Dk0f\fP +(information about filesystem prints out) +\fB#\fP \fImkdir /usr/src\fP +\fB#\fP \fImount /dev/\*(Dk0f /usr/src\fP +.DE +.LP +First you will extract the kernel source: +.DS +.TS +lw(2i) l. +\fB#\fP \fIcd /usr/src\fP +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP (space to end of previous tape file) +(this should only be done on Exabyte distributions) +\fB#\fP \fItar xpbf \*(Bz /dev/nr\*(Mt0\fP (extract the kernel sources) +(this takes about 15-30 minutes) +.TE +.DE +.LP +The next tar file contains the sources for the utilities. +It is extracted as follows: +.DS +.TS +lw(2i) l. +\fB#\fP \fIcd /usr/src\fP +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP (space to end of previous tape file) +\fB#\fP \fItar xpbf \*(Bz /dev/rmt12\fP (extract the utility source) +(this takes about 30-60 minutes) +.TE +.DE +.PP +If you are using 6250bpi tapes, the second reel of the +distribution is no longer needed; you should now mount the third +reel instead. The installation procedure continues from this +point on the 8mm tape. +.PP +The next tar file contains the sources for the contributed software. +It is extracted as follows: +.DS +.TS +lw(2i) l. +\fB#\fP \fIcd /usr/src\fP +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP (space to end of previous tape file) +(this should only be done on Exabyte distributions) +\fB#\fP \fItar xpbf \*(Bz /dev/rmt12\fP (extract the contributed software source) +(this takes about 30-60 minutes) +.TE +.DE +.PP +If you received a distribution on 8mm Exabyte tape, +there is one additional tape file on the distribution tape +that has not been installed to this point; it contains the +sources for X11R5 in +.Xr tar (1) +format. As distributed, X11R5 should be placed in +.Pn /usr/src/X11R5 . +.DS +.TS +lw(2i) l. +\fB#\fP \fIcd /usr/src\fP +\fB#\fP \fImt -f /dev/nr\*(Mt0 fsf\fP (space to end of previous tape file) +\fB#\fP \fItar xpbf \*(Bz /dev/nr\*(Mt0\fP (extract the X11R5 source) +(this takes about 30-60 minutes) +.TE +.DE +Many of the X11 utilities search using the path +.Pn /usr/X11 , +so be sure that you have a symbolic link that points at +the location of your X11 binaries (here, X11R5). +.PP +Having now completed the extraction of the sources, +you may want to verify that your +.Pn /usr/src +filesystem is consistent. +To do so, you must unmount it, and run +.Xr fsck (8); +assuming that you used \*(Dk0f you would proceed as follows: +.DS +.TS +lw(2i) l. +\fB#\fP \fIcd /\fP (change directory, back to the root) +\fB#\fP \fIumount /usr/src\fP (unmount /usr/src) +\fB#\fP \fIfsck /dev/r\*(Dk0f\fP +.TE +.DE +The output from +.Xr fsck +should look something like: +.DS +.B +** /dev/r\*(Dk0f +** Last Mounted on /usr/src +** Phase 1 - Check Blocks and Sizes +** Phase 2 - Check Pathnames +** Phase 3 - Check Connectivity +** Phase 4 - Check Reference Counts +** Phase 5 - Check Cyl groups +23000 files, 261000 used, 39000 free (2200 frags, 4600 blocks) +.R +.DE +.PP +If there are inconsistencies in the filesystem, you may be prompted +to apply corrective action; see the +.Xr fsck (8) +or \fIFsck \(en The UNIX File System Check Program\fP (SMM:3) for more details. +.PP +To use the +.Pn /usr/src +filesystem, you should now remount it with: +.DS +\fB#\fP \fImount /dev/\*(Dk0f /usr/src\fP +.DE +or if you have made an entry for it in +.Pn /etc/fstab +you can remount it with: +.DS +\fB#\fP \fImount /usr/src\fP +.DE +.Sh 2 "Additional conversion information" +.PP +After setting up the new \*(4B filesystems, you may restore the user +files that were saved on tape before beginning the conversion. +Note that the \*(4B +.Xr restore +program does its work on a mounted filesystem using normal system operations. +This means that filesystem dumps may be restored even +if the characteristics of the filesystem changed. +To restore a dump tape for, say, the +.Pn /a +filesystem something like the following would be used: +.DS +\fB#\fP \fImkdir /a\fP +\fB#\fP \fInewfs \*(Dk#p\fI +\fB#\fP \fImount /dev/\*(Dk#p /a\fP +\fB#\fP \fIcd /a\fP +\fB#\fP \fIrestore x\fP +.DE +.PP +If +.Xr tar +images were written instead of doing a dump, you should +be sure to use its `\-p' option when reading the files back. No matter +how you restore a filesystem, be sure to unmount it and and check its +integrity with +.Xr fsck (8) +when the job is complete. diff --git a/share/doc/smm/01.setup/3.t b/share/doc/smm/01.setup/3.t new file mode 100644 index 00000000000..b0f0c2b9e7e --- /dev/null +++ b/share/doc/smm/01.setup/3.t @@ -0,0 +1,2001 @@ +.\" Copyright (c) 1980, 1986, 1988, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)3.t 8.1 (Berkeley) 7/27/93 +.\" +.ds lq `` +.ds rq '' +.ds RH "Upgrading a \*(Ps System +.ds CF \*(Dy +.Sh 1 "Upgrading a \*(Ps system" +.PP +This section describes the procedure for upgrading a \*(Ps +system to \*(4B. This procedure may vary according to the version of +the system running before conversion. +If you are converting from a +System V system, some of this section will still apply (in particular, +the filesystem conversion). However, many of the system configuration +files are different, and the executable file formats are completely +incompatible. +.PP +In particular be wary when using this information to upgrade +a \*(Ps HP300 system. +There are at least four different versions of ``\*(Ps'' out there: +.IP 1) +HPBSD 1.x from Utah. +.br +This was the original version of \*(Ps for HP300s from which the +other variants (and \*(4B) are derived. +It is largely a \*(Ps system with Sun's NFS 3.0 filesystem code and +some \*(Ps-Tahoe features (e.g. networking code). +Since the filesystem code is 4.2/4.3 vintage and the filesystem +hierarchy is largely \*(Ps, most of this section should apply. +.IP 2) +MORE/bsd from Mt. Xinu. +.br +This is a \*(Ps-Tahoe vintage system with Sun's NFS 4.0 filesystem code +upgraded with Tahoe UFS features. +The instructions for \*(Ps-Tahoe should largely apply. +.IP 3) +\*(Ps-Reno from CSRG. +.br +At least one site bootstrapped HP300 support from the Reno distribution. +The Reno filesystem code was somewhere between \*(Ps and \*(4B: the VFS switch +had been added but many of the UFS features (e.g. ``inline'' symlinks) +were missing. +The filesystem hierarchy reorganization first appeared in this release. +Be extremely careful following these instructions if you are +upgrading from the Reno distribution. +.IP 4) +HPBSD 2.0 from Utah. +.br +As if things were not bad enough already, +this release has the \*(4B filesystem and networking code +as well as some utilities, but still has a \*(Ps hierarchy. +No filesystem conversions are necessary for this upgrade, +but files will still need to be moved around. +.Sh 2 "Installation overview" +.PP +If you are running \*(Ps, upgrading your system +involves replacing your kernel and system utilities. +In general, there are three possible ways to install a new \*(Bs distribution: +(1) boot directly from the distribution tape, use it to load new binaries +onto empty disks, and then merge or restore any existing configuration files +and filesystems; +(2) use an existing \*(Ps or later system to extract the root and +.Pn /usr +filesystems from the distribution tape, +boot from the new system, then merge or restore existing +configuration files and filesystems; or +(3) extract the sources from the distribution tape onto an existing system, +and use that system to cross-compile and install \*(4B. +For this release, the second alternative is strongly advised, +with the third alternative reserved as a last resort. +In general, older binaries will continue to run under \*(4B, +but there are many exceptions that are on the critical path +for getting the system running. +Ideally, the new system binaries (root and +.Pn /usr +filesystems) should be installed on spare disk partitions, +then site-specific files should be merged into them. +Once the new system is up and fully merged, the previous root and +.Pn /usr +filesystems can be reused. +Other existing filesystems can be retained and used, +except that (as usual) the new +.Xr fsck +should be run before they are mounted. +.PP +It is \fBSTRONGLY\fP advised that you make full dumps of each filesystem +before beginning, especially any that you intend to modify in place +during the merge. +It is also desirable to run filesystem checks +of all filesystems to be converted to \*(4B before shutting down. +This is an excellent time to review your disk configuration +for possible tuning of the layout. +Most systems will need to provide a new filesystem for system use +mounted on +.Pn /var +(see below). +However, the +.Pn /tmp +filesystem can be an MFS virtual-memory-resident filesystem, +potentially freeing an existing disk partition. +(Additional swap space may be desirable as a consequence.) +See +.Xr mount_mfs (8). +.PP +The recommended installation procedure includes the following steps. +The order of these steps will probably vary according to local needs. +.IP \(bu +Extract root and +.Pn /usr +filesystems from the distribution tapes. +.IP \(bu +Extract kernel and/or user-level sources from the distribution tape +if space permits. +This can serve as the backup documentation as needed. +.IP \(bu +Configure and boot a kernel for the local system. +This can be delayed if the generic kernel from the distribution +supports enough hardware to proceed. +.IP \(bu +Build a skeletal +.Pn /var +filesystem (see +.Xr mtree (8)). +.IP \(bu +Merge site-dependent configuration files from +.Pn /etc +and +.Pn /usr/lib +into the new +.Pn /etc +directory. +Note that many file formats and contents have changed; see section 3.4 +of this document. +.IP \(bu +Copy or merge files from +.Pn /usr/adm , +.Pn /usr/spool , +.Pn /usr/preserve , +.Pn /usr/lib , +and other locations into +.Pn /var . +.IP \(bu +Merge local macros, dictionaries, etc. into +.Pn /usr/share . +.IP \(bu +Merge and update local software to reflect the system changes. +.IP \(bu +Take off the rest of the morning, you've earned it! +.PP +Section 3.2 lists the files to be saved as part of the conversion process. +Section 3.3 describes the bootstrap process. +Section 3.4 discusses the merger of the saved files back into the new system. +Section 3.5 gives an overview of the major +bug fixes and changes between \*(Ps and \*(4B. +Section 3.6 provides general hints on possible problems to be +aware of when converting from \*(Ps to \*(4B. +.Sh 2 "Files to save" +.PP +The following list enumerates the standard set of files you will want to +save and suggests directories in which site-specific files should be present. +This list will likely be augmented with non-standard files you +have added to your system. +If you do not have enough space to create parallel +filesystems, you should create a +.Xr tar +image of the following files before the new filesystems are created. +The rest of this subsection describes where theses files +have moved and how they have changed. +.TS +lfC c l. +/.cshrc \(dg root csh startup script (moves to \f(CW/root/.cshrc\fP) +/.login \(dg root csh login script (moves to \f(CW/root/.login\fP) +/.profile \(dg root sh startup script (moves to \f(CW/root/.profile\fP) +/.rhosts \(dg for trusted machines and users (moves to \f(CW/root/.rhosts\fP) +/etc/disktab \(dd in case you changed disk partition sizes +/etc/fstab * disk configuration data +/etc/ftpusers \(dg for local additions +/etc/gettytab \(dd getty database +/etc/group * group data base +/etc/hosts \(dg for local host information +/etc/hosts.equiv \(dg for local host equivalence information +/etc/hosts.lpd \(dg printer access file +/etc/inetd.conf * Internet services configuration data +/etc/named* \(dg named configuration files +/etc/netstart \(dg network initialization +/etc/networks \(dg for local network information +/etc/passwd * user data base +/etc/printcap * line printer database +/etc/protocols \(dd in case you added any local protocols +/etc/rc * for any local additions +/etc/rc.local * site specific system startup commands +/etc/remote \(dg auto-dialer configuration +/etc/services \(dd for local additions +/etc/shells \(dd list of valid shells +/etc/syslog.conf * system logger configuration +/etc/securettys * merged into ttys +/etc/ttys * terminal line configuration data +/etc/ttytype * merged into ttys +/etc/termcap \(dd for any local entries that may have been added +/lib \(dd for any locally developed language processors +/usr/dict/* \(dd for local additions to words and papers +/usr/include/* \(dd for local additions +/usr/lib/aliases * mail forwarding data base (moves to \f(CW/etc/aliases\fP) +/usr/lib/crontab * cron daemon data base (moves to \f(CW/etc/crontab\fP) +/usr/lib/crontab.local * local cron daemon data base (moves to \f(CW/etc/crontab.local\fP) +/usr/lib/lib*.a \(dg for local libraries +/usr/lib/mail.rc \(dg system-wide mail(1) initialization (moves to \f(CW/etc/mail.rc\fP) +/usr/lib/sendmail.cf * sendmail configuration (moves to \f(CW/etc/sendmail.cf\fP) +/usr/lib/tmac/* \(dd for locally developed troff/nroff macros (moves to \f(CW/usr/share/tmac/*\fP) +/usr/lib/uucp/* \(dg for local uucp configuration files +/usr/man/manl * for manual pages for locally developed programs (moves to \f(CW/usr/local/man\fP) +/usr/spool/* \(dg for current mail, news, uucp files, etc. (moves to \f(CW/var/spool\fP) +/usr/src/local \(dg for source for locally developed programs +/sys/conf/HOST \(dg configuration file for your machine (moves to \f(CW/sys/<arch>/conf\fP) +/sys/conf/files.HOST \(dg list of special files in your kernel (moves to \f(CW/sys/<arch>/conf\fP) +/*/quotas * filesystem quota files (moves to \f(CW/*/quotas.user\fP) +.TE +.DS +\(dg\|Files that can be used from \*(Ps without change. +\(dd\|Files that need local changes merged into \*(4B files. +*\|Files that require special work to merge and are discussed in section 3.4. +.DE +.Sh 2 "Installing \*(4B" +.PP +The next step is to build a working \*(4B system. +This can be done by following the steps in section 2 of +this document for extracting the root and +.Pn /usr +filesystems from the distribution tape onto unused disk partitions. +For the SPARC, the root filesystem dump on the tape could also be +extracted directly. +For the HP300 and DECstation, the raw disk image can be copied +into an unused partition and this partition can then be dumped +to create an image that can be restored. +The exact procedure chosen will depend on the disk configuration +and the number of suitable disk partitions that may be used. +It is also desirable to run filesystem checks +of all filesystems to be converted to \*(4B before shutting down. +In any case, this is an excellent time to review your disk configuration +for possible tuning of the layout. +Section 2.5 and +.Xr config (8) +are required reading. +.LP +The filesystem in \*(4B has been reorganized in an effort to +meet several goals: +.IP 1) +The root filesystem should be small. +.IP 2) +There should be a per-architecture centrally-shareable read-only +.Pn /usr +filesystem. +.IP 3) +Variable per-machine directories should be concentrated below +a single mount point named +.Pn /var . +.IP 4) +Site-wide machine independent shareable text files should be separated +from architecture specific binary files and should be concentrated below +a single mount point named +.Pn /usr/share . +.LP +These goals are realized with the following general layouts. +The reorganized root filesystem has the following directories: +.TS +lfC l. +/etc (config files) +/bin (user binaries needed when single-user) +/sbin (root binaries needed when single-user) +/local (locally added binaries used only by this machine) +/tmp (mount point for memory based filesystem) +/dev (local devices) +/home (mount point for AMD) +/var (mount point for per-machine variable directories) +/usr (mount point for multiuser binaries and files) +.TE +.LP +The reorganized +.Pn /usr +filesystem has the following directories: +.TS +lfC l. +/usr/bin (user binaries) +/usr/contrib (software contributed to \*(4B) +/usr/games (binaries for games, score files in \f(CW/var\fP) +/usr/include (standard include files) +/usr/lib (lib*.a from old \f(CW/usr/lib\fP) +/usr/libdata (databases from old \f(CW/usr/lib\fP) +/usr/libexec (executables from old \f(CW/usr/lib\fP) +/usr/local (locally added binaries used site-wide) +/usr/old (deprecated binaries) +/usr/sbin (root binaries) +/usr/share (mount point for site-wide shared text) +/usr/src (mount point for sources) +.TE +.LP +The reorganized +.Pn /usr/share +filesystem has the following directories: +.TS +lfC l. +/usr/share/calendar (various useful calendar files) +/usr/share/dict (dictionaries) +/usr/share/doc (\*(4B manual sources) +/usr/share/games (games text files) +/usr/share/groff_font (groff font information) +/usr/share/man (typeset manual pages) +/usr/share/misc (dumping ground for random text files) +/usr/share/mk (templates for \*(4B makefiles) +/usr/share/skel (template user home directory files) +/usr/share/tmac (various groff macro packages) +/usr/share/zoneinfo (information on time zones) +.TE +.LP +The reorganized +.Pn /var +filesystem has the following directories: +.TS +lfC l. +/var/account (accounting files, formerly \f(CW/usr/adm\fP) +/var/at (\fIat\fP\|(1) spooling area) +/var/backups (backups of system files) +/var/crash (crash dumps) +/var/db (system-wide databases, e.g. tags) +/var/games (score files) +/var/log (log files) +/var/mail (users mail) +/var/obj (hierarchy to build \f(CW/usr/src\fP) +/var/preserve (preserve area for vi) +/var/quotas (directory to store quota files) +/var/run (directory to store *.pid files) +/var/rwho (rwho databases) +/var/spool/ftp (home directory for anonymous ftp) +/var/spool/mqueue (sendmail spooling directory) +/var/spool/news (news spooling area) +/var/spool/output (printer spooling area) +/var/spool/uucp (uucp spooling area) +/var/tmp (disk-based temporary directory) +/var/users (root of per-machine user home directories) +.TE +.PP +The \*(4B bootstrap routines pass the identity of the boot device +through to the kernel. +The kernel then uses that device as its root filesystem. +Thus, for example, if you boot from +.Pn /dev/\*(Dk1a , +the kernel will use +.Pn \*(Dk1a +as its root filesystem. If +.Pn /dev/\*(Dk1b +is configured as a swap partition, +it will be used as the initial swap area, +otherwise the normal primary swap area (\c +.Pn /dev/\*(Dk0b ) +will be used. +The \*(4B bootstrap is backward compatible with \*(Ps, +so you can replace your old bootstrap if you use it +to boot your first \*(4B kernel. +However, the \*(Ps bootstrap cannot access \*(4B filesystems, +so if you plan to convert your filesystems to \*(4B, +you must install a new bootstrap \fIbefore\fP doing the conversion. +Note that SPARC users cannot build a \*(4B compatible version +of the bootstrap, so must \fInot\fP convert their root filesystem +to the new \*(4B format. +.PP +Once you have extracted the \*(4B system and booted from it, +you will have to build a kernel customized for your configuration. +If you have any local device drivers, +they will have to be incorporated into the new kernel. +See section 4.1.3 and ``Building 4.3BSD UNIX Systems with Config'' (SMM:2). +.PP +If converting from \*(Ps, your old filesystems should be converted. +If you've modified the partition +sizes from the original \*(Ps ones, and are not already using the +\*(4B disk labels, you will have to modify the default disk partition +tables in the kernel. Make the necessary table changes and boot +your custom kernel \fBBEFORE\fP trying to access any of your old +filesystems! After doing this, if necessary, the remaining filesystems +may be converted in place by running the \*(4B version of +.Xr fsck (8) +on each filesystem and allowing it to make the necessary corrections. +The new version of +.Xr fsck +is more strict about the size of directories than +the version supplied with \*(Ps. +Thus the first time that it is run on a \*(Ps filesystem, +it will produce messages of the form: +.DS +\fBDIRECTORY ...: LENGTH\fP xx \fBNOT MULTIPLE OF 512 (ADJUSTED)\fP +.DE +Length ``xx'' will be the size of the directory; +it will be expanded to the next multiple of 512 bytes. +The new +.Xr fsck +will also set default \fIinterleave\fP and +\fInpsect\fP (number of physical sectors per track) values on older +filesystems, in which these fields were unused spares; this correction +will produce messages of the form: +.DS +\fBIMPOSSIBLE INTERLEAVE=0 IN SUPERBLOCK (SET TO DEFAULT)\fP\** +\fBIMPOSSIBLE NPSECT=0 IN SUPERBLOCK (SET TO DEFAULT)\fP +.DE +.FS +The defaults are to set \fIinterleave\fP to 1 and +\fInpsect\fP to \fInsect\fP. +This is correct on most drives; +it affects only performance (usually virtually unmeasurably). +.FE +Filesystems that have had their interleave and npsect values +set will be diagnosed by the old +.Xr fsck +as having a bad superblock; the old +.Xr fsck +will run only if given an alternate superblock +(\fIfsck \-b32\fP), +in which case it will re-zero these fields. +The \*(4B kernel will internally set these fields to their defaults +if fsck has not done so; again, the \fI\-b32\fP option may be +necessary for running the old +.Xr fsck . +.PP +In addition, \*(4B removes several limits on filesystem sizes +that were present in \*(Ps. +The limited filesystems +continue to work in \*(4B, but should be converted +as soon as it is convenient +by running +.Xr fsck +with the \fI\-c 2\fP option. +The sequence \fIfsck \-p \-c 2\fP will update them all, +fix the interleave and npsect fields, +fix any incorrect directory lengths, +expand maximum uid's and gid's to 32-bits, +place symbolic links less than 60 bytes into their inode, +and fill in directory type fields all at once. +The new filesystem formats are incompatible with older systems. +If you wish to continue using these filesystems with the older +systems you should make only the compatible changes using +\fIfsck \-c 1\fP. +.Sh 2 "Merging your files from \*(Ps into \*(4B" +.PP +When your system is booting reliably and you have the \*(4B root and +.Pn /usr +filesystems fully installed you will be ready +to continue with the next step in the conversion process, +merging your old files into the new system. +.PP +If you saved the files on a +.Xr tar +tape, extract them into a scratch directory, say +.Pn /usr/convert : +.DS +\fB#\fP \fImkdir /usr/convert\fP +\fB#\fP \fIcd /usr/convert\fP +\fB#\fP \fItar xp\fP +.DE +.PP +The data files marked in the previous table with a dagger (\(dg) +may be used without change from the previous system. +Those data files marked with a double dagger (\(dd) have syntax +changes or substantial enhancements. +You should start with the \*(4B version and carefully +integrate any local changes into the new file. +Usually these local changes can be incorporated +without conflict into the new file; +some exceptions are noted below. +The files marked with an asterisk (*) require +particular attention and are discussed below. +.PP +As described in section 3.3, +the most immediately obvious change in \*(4B is the reorganization +of the system filesystems. +Users of certain recent vendor releases have seen this general organization, +although \*(4B takes the reorganization a bit further. +The directories most affected are +.Pn /etc , +that now contains only system configuration files; +.Pn /var , +a new filesystem containing per-system spool and log files; and +.Pn /usr/share, +that contains most of the text files shareable across architectures +such as documentation and macros. +System administration programs formerly in +.Pn /etc +are now found in +.Pn /sbin +and +.Pn /usr/sbin . +Various programs and data files formerly in +.Pn /usr/lib +are now found in +.Pn /usr/libexec +and +.Pn /usr/libdata , +respectively. +Administrative files formerly in +.Pn /usr/adm +are in +.Pn /var/account +and, similarly, log files are now in +.Pn /var/log . +The directory +.Pn /usr/ucb +has been merged into +.Pn /usr/bin , +and the sources for programs in +.Pn /usr/bin +are in +.Pn /usr/src/usr.bin . +Other source directories parallel the destination directories; +.Pn /usr/src/etc +has been greatly expanded, and +.Pn /usr/src/share +is new. +The source for the manual pages, in general, are with the source +code for the applications they document. +Manual pages not closely corresponding to an application program +are found in +.Pn /usr/src/share/man . +The locations of all man pages is listed in +.Pn /usr/src/share/man/man0/man[1-8] . +The manual page +.Xr hier (7) +has been updated and made more detailed; +it is included in the printed documentation. +You should review it to familiarize yourself with the new layout. +.PP +A new utility, +.Xr mtree (8), +is provided to build and check filesystem hierarchies +with the proper contents, owners and permissions. +Scripts are provided in +.Pn /etc/mtree +(and +.Pn /usr/src/etc/mtree ) +for the root, +.Pn /usr +and +.Pn /var +filesystems. +Once a filesystem has been made for +.Pn /var , +.Xr mtree +can be used to create a directory hierarchy there +or you can simply use tar to extract the prototype from +the second file of the distribution tape. +.Sh 3 "Changes in the \f(CW/etc\fP directory" +.PP +The +.Pn /etc +directory now contains nearly all the host-specific configuration +files. +Note that some file formats have changed, +and those configuration files containing pathnames are nearly all affected +by the reorganization. +See the examples provided in +.Pn /etc +(installed from +.Pn /usr/src/etc ) +as a guide. +The following table lists some of the local configuration files +whose locations and/or contents have changed. +.TS +l l l +lfC lfC l. +\*(Ps and Earlier \*(4B Comments +_ _ _ +/etc/fstab /etc/fstab new format; see below +/etc/inetd.conf /etc/inetd.conf pathnames of executables changed +/etc/printcap /etc/printcap pathnames changed +/etc/syslog.conf /etc/syslog.conf pathnames of log files changed +/etc/ttys /etc/ttys pathnames of executables changed +/etc/passwd /etc/master.passwd new format; see below +/usr/lib/sendmail.cf /etc/sendmail.cf changed pathnames +/usr/lib/aliases /etc/aliases may contain changed pathnames +/etc/*.pid /var/run/*.pid + +.T& +l l l +lfC lfC l. +New in \*(Ps-Tahoe \*(4B Comments +_ _ _ +/usr/games/dm.config /etc/dm.conf configuration for games (see \fIdm\fP\|(8)) +/etc/zoneinfo/localtime /etc/localtime timezone configuration +/etc/zoneinfo /usr/share/zoneinfo timezone configuration +.TE +.ne 1.5i +.TS +l l l +lfC lfC l. + New in \*(4B Comments +_ _ _ + /etc/aliases.db database version of the aliases file + /etc/amd-home location database of home directories + /etc/amd-vol location database of exported filesystems + /etc/changelist \f(CW/etc/security\fP files to back up + /etc/csh.cshrc system-wide csh(1) initialization file + /etc/csh.login system-wide csh(1) login file + /etc/csh.logout system-wide csh(1) logout file + /etc/disklabels directory for saving disklabels + /etc/exports NFS list of export permissions + /etc/ftpwelcome message displayed for ftp users; see ftpd(8) + /etc/kerberosIV Kerberos directory; see below + /etc/man.conf lists directories searched by \fIman\fP\|(1) + /etc/mtree directory for local mtree files; see mtree(8) + /etc/netgroup NFS group list used in \f(CW/etc/exports\fP + /etc/pwd.db non-secure hashed user data base file + /etc/spwd.db secure hashed user data base file + /etc/security daily system security checker +.TE +.PP +System security changes require adding several new ``well-known'' groups to +.Pn /etc/group . +The groups that are needed by the system as distributed are: +.TS +l n l. +name number purpose +_ +wheel 0 users allowed superuser privilege +daemon 1 processes that need less than wheel privilege +kmem 2 read access to kernel memory +sys 3 access to kernel sources +tty 4 access to terminals +operator 5 read access to raw disks +bin 7 group for system binaries +news 8 group for news +wsrc 9 write access to sources +games 13 access to games +staff 20 system staff +guest 31 system guests +nobody 39 the least privileged group +utmp 45 access to utmp files +dialer 117 access to remote ports and dialers +.TE +Only users in the ``wheel'' group are permitted to +.Xr su +to ``root''. +Most programs that manage directories in +.Pn /var/spool +now run set-group-id to ``daemon'' so that users cannot +directly access the files in the spool directories. +The special files that access kernel memory, +.Pn /dev/kmem +and +.Pn /dev/mem , +are made readable only by group ``kmem''. +Standard system programs that require this access are +made set-group-id to that group. +The group ``sys'' is intended to control access to kernel sources, +and other sources belong to group ``wsrc.'' +Rather than make user terminals writable by all users, +they are now placed in group ``tty'' and made only group writable. +Programs that should legitimately have access to write on user terminals +such as +.Xr talkd +and +.Xr write +now run set-group-id to ``tty''. +The ``operator'' group controls access to disks. +By default, disks are readable by group ``operator'', +so that programs such as +.Xr dump +can access the filesystem information without being set-user-id to ``root''. +The +.Xr shutdown (8) +program is executable only by group operator +and is setuid to root so that members of group operator may shut down +the system without root access. +.PP +The ownership and modes of some directories have changed. +The +.Xr at +programs now run set-user-id ``root'' instead of ``daemon.'' +Also, the uucp directory no longer needs to be publicly writable, +as +.Xr tip +reverts to privileged status to remove its lock files. +After copying your version of +.Pn /var/spool , +you should do: +.DS +\fB#\fP \fIchown \-R root /var/spool/at\fP +\fB#\fP \fIchown \-R uucp.daemon /var/spool/uucp\fP +\fB#\fP \fIchmod \-R o\-w /var/spool/uucp\fP +.DE +.PP +The format of the cron table, +.Pn /etc/crontab , +has been changed to specify the user-id that should be used to run a process. +The userid ``nobody'' is frequently useful for non-privileged programs. +Local changes are now put in a separate file, +.Pn /etc/crontab.local . +.PP +Some of the commands previously in +.Pn /etc/rc.local +have been moved to +.Pn /etc/rc ; +several new functions are now handled by +.Pn /etc/rc , +.Pn /etc/netstart +and +.Pn /etc/rc.local . +You should look closely at the prototype version of these files +and read the manual pages for the commands contained in it +before trying to merge your local copy. +Note in particular that +.Xr ifconfig +has had many changes, +and that host names are now fully specified as domain-style names +(e.g., vangogh.CS.Berkeley.EDU) for the benefit of the name server. +.PP +Some of the commands previously in +.Pn /etc/daily +have been moved to +.Pn /etc/security , +and several new functions have been added to +.Pn /etc/security +to do nightly security checks on the system. +The script +.Pn /etc/daily +runs +.Pn /etc/security +each night, and mails the output to the super-user. +Some of the checks done by +.Pn /etc/security +are: +.DS +\(bu Syntax errors in the password and group files. +\(bu Duplicate user and group names and id's. +\(bu Dangerous search paths and umask values for the superuser. +\(bu Dangerous values in various initialization files. +\(bu Dangerous .rhosts files. +\(bu Dangerous directory and file ownership or permissions. +\(bu Globally exported filesystems. +\(bu Dangerous owners or permissions for special devices. +.DE +In addition, it reports any changes to setuid and setgid files, special +devices, or the files in +.Pn /etc/changelist +since the last run of +.Pn /etc/security . +Backup copies of the files are saved in +.Pn /var/backups . +Finally, the system binaries are checksummed and their permissions +validated against the +.Xr mtree (8) +specifications in +.Pn /etc/mtree . +.PP +The C-library and system binaries on the distribution tape +are compiled with new versions of +.Xr gethostbyname +and +.Xr gethostbyaddr +that use the name server, +.Xr named (8). +If you have only a small network and are not connected +to a large network, you can use the distributed library routines without +any problems; they use a linear scan of the host table +.Pn /etc/hosts +if the name server is not running. +If you are on the Internet or have a large local network, +it is recommend that you set up +and use the name server. +For instructions on how to set up the necessary configuration files, +refer to ``Name Server Operations Guide for BIND'' (SMM:10). +Several programs rely on the host name returned by +.Xr gethostname +to determine the local domain name. +.PP +If you are using the name server, your +.Xr sendmail +configuration file will need some updates to accommodate it. +See the ``Sendmail Installation and Operation Guide'' (SMM:8) and +the sample +.Xr sendmail +configuration files in +.Pn /usr/src/usr.sbin/sendmail/cf . +The aliases file, +.Pn /etc/aliases +has also been changed to add certain well-known addresses. +.Sh 3 "Shadow password files" +.PP +The password file format adds change and expiration fields +and its location has changed to protect +the encrypted passwords stored there. +The actual password file is now stored in +.Pn /etc/master.passwd . +The hashed dbm password files do not contain encrypted passwords, +but contain the file offset to the entry with the password in +.Pn /etc/master.passwd +(that is readable only by root). +Thus, the +.Fn getpwnam +and +.Fn getpwuid +functions will no longer return an encrypted password string to non-root +callers. +An old-style passwd file is created in +.Pn /etc/passwd +by the +.Xr vipw (8) +and +.Xr pwd_mkdb (8) +programs. +See also +.Xr passwd (5). +.PP +Several new users have also been added to the group of ``well-known'' users in +.Pn /etc/passwd . +The current list is: +.DS +.TS +l c. +name number +_ +root 0 +daemon 1 +operator 2 +bin 3 +games 7 +uucp 66 +nobody 32767 +.TE +.DE +The ``daemon'' user is used for daemon processes that +do not need root privileges. +The ``operator'' user-id is used as an account for dumpers +so that they can log in without having the root password. +By placing them in the ``operator'' group, +they can get read access to the disks. +The ``uucp'' login has existed long before \*(4B, +and is noted here just to provide a common user-id. +The password entry ``nobody'' has been added to specify +the user with least privilege. The ``games'' user is a pseudo-user +that controls access to game programs. +.PP +After installing your updated password file, you must run +.Xr pwd_mkdb (8) +to create the password database. +Note that +.Xr pwd_mkdb (8) +is run whenever +.Xr vipw (8) +is run. +.Sh 3 "The \f(CW/var\fP filesystem" +.PP +The spooling directories saved on tape may be restored in their +eventual resting places without too much concern. Be sure to +use the `\-p' option to +.Xr tar (1) +so that files are recreated with the same file modes. +The following commands provide a guide for copying spool and log files from +an existing system into a new +.Pn /var +filesystem. +At least the following directories should already exist on +.Pn /var : +.Pn output , +.Pn log , +.Pn backups +and +.Pn db . +.LP +.DS +.ft CW +SRC=/oldroot/usr + +cd $SRC; tar cf - msgs preserve | (cd /var && tar xpf -) +.DE +.DS +.ft CW +# copy $SRC/spool to /var +cd $SRC/spool +tar cf - at mail rwho | (cd /var && tar xpf -) +tar cf - ftp mqueue news secretmail uucp uucppublic | \e + (cd /var/spool && tar xpf -) +.DE +.DS +.ft CW +# everything else in spool is probably a printer area +mkdir .save +mv at ftp mail mqueue rwho secretmail uucp uucppublic .save +tar cf - * | (cd /var/spool/output && tar xpf -) +mv .save/* . +rmdir .save +.DE +.DS +.ft CW +cd /var/spool/mqueue +mv syslog.7 /var/log/maillog.7 +mv syslog.6 /var/log/maillog.6 +mv syslog.5 /var/log/maillog.5 +mv syslog.4 /var/log/maillog.4 +mv syslog.3 /var/log/maillog.3 +mv syslog.2 /var/log/maillog.2 +mv syslog.1 /var/log/maillog.1 +mv syslog.0 /var/log/maillog.0 +mv syslog /var/log/maillog +.DE +.DS +.ft CW +# move $SRC/adm to /var +cd $SRC/adm +tar cf - . | (cd /var/account && tar xpf -) +cd /var/account +rm -f msgbuf +mv messages messages.[0-9] ../log +mv wtmp wtmp.[0-9] ../log +mv lastlog ../log +.DE +.Sh 2 "Bug fixes and changes between \*(Ps and \*(4B" +.PP +The major new facilities available in the \*(4B release are +a new virtual memory system, +the addition of ISO/OSI networking support, +a new virtual filesystem interface supporting filesystem stacking, +a freely redistributable implementation of NFS, +a log-structured filesystem, +enhancement of the local filesystems to support +files and filesystems that are up to 2^63 bytes in size, +enhanced security and system management support, +and the conversion to and addition of the IEEE Std1003.1 (``POSIX'') +facilities and many of the IEEE Std1003.2 facilities. +In addition, many new utilities and additions to the C +library are present as well. +The kernel sources have been reorganized to collect all machine-dependent +files for each architecture under one directory, +and most of the machine-independent code is now free of code +conditional on specific machines. +The user structure and process structure have been reorganized +to eliminate the statically-mapped user structure and to make most +of the process resources shareable by multiple processes. +The system and include files have been converted to be compatible +with ANSI C, including function prototypes for most of the exported +functions. +There are numerous other changes throughout the system. +.Sh 3 "Changes to the kernel" +.PP +This release includes several important structural kernel changes. +The kernel uses a new internal system call convention; +the use of global (``u-dot'') variables for parameters and error returns +has been eliminated, +and interrupted system calls no longer abort using non-local goto's (longjmp's). +A new sleep interface separates signal handling from scheduling priority, +returning characteristic errors to abort or restart the current system call. +This sleep call also passes a string describing the process state, +that is used by the ps(1) program. +The old sleep interface can be used only for non-interruptible sleeps. +The sleep interface (\fItsleep\fP) can be used at any priority, +but is only interruptible if the PCATCH flag is set. +When interrupted, \fItsleep\fP returns EINTR or ERESTART. +.PP +Many data structures that were previously statically allocated +are now allocated dynamically. +These structures include mount entries, file entries, +user open file descriptors, the process entries, the vnode table, +the name cache, and the quota structures. +.PP +To protect against indiscriminate reading or writing of kernel +memory, all writing and most reading of kernel data structures +must be done using a new ``sysctl'' interface. +The information to be accessed is described through an extensible +``Management Information Base'' (MIB) style name, +described as a dotted set of components. +A new utility, +.Xr sysctl (8), +retrieves kernel state and allows processes with appropriate +privilege to set kernel state. +.Sh 3 "Security" +.PP +The kernel runs with four different levels of security. +Any superuser process can raise the security level, but only +.Fn init (8) +can lower it. +Security levels are defined as follows: +.IP \-1 +Permanently insecure mode \- always run system in level 0 mode. +.IP " 0" +Insecure mode \- immutable and append-only flags may be turned off. +All devices may be read or written subject to their permissions. +.IP " 1" +Secure mode \- immutable and append-only flags may not be cleared; +disks for mounted filesystems, +.Pn /dev/mem , +and +.Pn /dev/kmem +are read-only. +.IP " 2" +Highly secure mode \- same as secure mode, plus disks are always +read-only whether mounted or not. +This level precludes tampering with filesystems by unmounting them, +but also inhibits running +.Xr newfs (8) +while the system is multi-user. +See +.Xr chflags (1) +and the \-\fBo\fP option to +.Xr ls (1) +for information on setting and displaying the immutable and append-only +flags. +.PP +Normally, the system runs in level 0 mode while single user +and in level 1 mode while multiuser. +If the level 2 mode is desired while running multiuser, +it can be set in the startup script +.Pn /etc/rc +using +.Xr sysctl (1). +If it is desired to run the system in level 0 mode while multiuser, +the administrator must build a kernel with the variable +.Li securelevel +in the kernel source file +.Pn /sys/kern/kern_sysctl.c +initialized to \-1. +.Sh 4 "Virtual memory changes" +.PP +The new virtual memory implementation is derived from the Mach +operating system developed at Carnegie-Mellon, +and was ported to the BSD kernel at the University of Utah. +It is based on the 2.0 release of Mach +(with some bug fixes from the 2.5 and 3.0 releases) +and retains many of its essential features such as +the separation of the machine dependent and independent layers +(the ``pmap'' interface), +efficient memory utilization using copy-on-write +and other lazy-evaluation techniques, +and support for large, sparse address spaces. +It does not include the ``external pager'' interface instead using +a primitive internal pager interface. +The Mach virtual memory system call interface has been replaced with the +``mmap''-based interface described in the ``Berkeley Software +Architecture Manual'' (see UNIX Programmer's Manual, +Supplementary Documents, PSD:5). +The interface is similar to the interfaces shipped +by several commercial vendors such as Sun, USL, and Convex Computer Corp. +The integration of the new virtual memory is functionally complete, +but still has serious performance problems under heavy memory load. +The internal kernel interfaces have not yet been completed +and the memory pool and buffer cache have not been merged. +Some additional caveats: +.IP \(bu +Since the code is based on the 2.0 release of Mach, +bugs and misfeatures of the BSD version should not be considered +short-comings of the current Mach virtual memory system. +.IP \(bu +Because of the disjoint virtual memory (page) and IO (buffer) caches, +it is possible to see inconsistencies if using both the mmap and +read/write interfaces on the same file simultaneously. +.IP \(bu +Swap space is allocated on-demand rather than up front and no +allocation checks are performed so it is possible to over-commit +memory and eventually deadlock. +.IP \(bu +The semantics of the +.Xr vfork (2) +system call are slightly different. +The synchronization between parent and child is preserved, +but the memory sharing aspect is not. +In practice this has been enough for backward compatibility, +but newer code should just use +.Xr fork (2). +.Sh 4 "Networking additions and changes" +.PP +The ISO/OSI Networking consists of a kernel implementation of +transport class 4 (TP-4), +connectionless networking protocol (CLNP), +and 802.3-based link-level support (hardware-compatible with Ethernet\**). +.FS +Ethernet is a trademark of the Xerox Corporation. +.FE +We also include support for ISO Connection-Oriented Network Service, +X.25, TP-0. +The session and presentation layers are provided outside +the kernel using the ISO Development Environment by Marshall Rose, +that is available via anonymous FTP +(but is not included on the distribution tape). +Included in this development environment are file +transfer and management (FTAM), virtual terminals (VT), +a directory services implementation (X.500), +and miscellaneous other utilities. +.PP +Kernel support for the ISO OSI protocols is enabled with the ISO option +in the kernel configuration file. +The +.Xr iso (4) +manual page describes the protocols and addressing; +see also +.Xr clnp (4), +.Xr tp (4) +and +.Xr cltp (4). +The OSI equivalent to ARP is ESIS (End System to Intermediate System Routing +Protocol); running this protocol is mandatory, however one can manually add +translations for machines that do not participate by use of the +.Xr route (8) +command. +Additional information is provided in the manual page describing +.Xr esis (4). +.PP +The command +.Xr route (8) +has a new syntax and several new capabilities: +it can install routes with a specified destination and mask, +and can change route characteristics such as hop count, packet size +and window size. +.PP +Several important enhancements have been added to the TCP/IP +protocols including TCP header prediction and +serial line IP (SLIP) with header compression. +The routing implementation has been completely rewritten +to use a hierarchical routing tree with a mask per route +to support the arbitrary levels of routing found in the ISO protocols. +The routing table also stores and caches route characteristics +to speed the adaptation of the throughput and congestion avoidance +algorithms. +.PP +The format of the +.I sockaddr +structure (the structure used to describe a generic network address with an +address family and family-specific data) +has changed from previous releases, +as have the address family-specific versions of this structure. +The +.I sa_family +family field has been split into a length, +.Pn sa_len , +and a family, +.Pn sa_family . +System calls that pass a +.I sockaddr +structure into the kernel (e.g. +.Fn sendto +and +.Fn connect ) +have a separate parameter that specifies the +.I sockaddr +length, and thus it is not necessary to fill in the +.I sa_len +field for those system calls. +System calls that pass a +.I sockaddr +structure back from the kernel (e.g. +.Fn recvfrom +and +.Fn accept ) +receive a completely filled-in +.I sockaddr +structure, thus the length field is valid. +Because this would not work for old binaries, +the new library uses a different system call number. +Thus, most networking programs compiled under \*(4B are incompatible +with older systems. +.PP +Although this change is mostly source and binary compatible +with old programs, there are three exceptions. +Programs with statically initialized +.I sockaddr +structures +(usually the Internet form, a +.I sockaddr_in ) +are not compatible. +Generally, such programs should be changed to fill in the structure +at run time, as C allows no way to initialize a structure without +assuming the order and number of fields. +Also, programs with use structures to describe a network packet format +that contain embedded +.I sockaddr +structures also require change; a definition of an +.I osockaddr +structure is provided for this purpose. +Finally, programs that use the +.Sm SIOCGIFCONF +ioctl to get a complete list of interface addresses +need to check the +.I sa_len +field when iterating through the array of addresses returned, +as not all the structures returned have the same length +(this variance in length is nearly guaranteed by the presence of link-layer +address structures). +.Sh 4 "Additions and changes to filesystems" +.PP +The \*(4B distribution contains most of the interfaces +specified in the IEEE Std1003.1 system interface standard. +Filesystem additions include IEEE Std1003.1 FIFOs, +byte-range file locking, and saved user and group identifiers. +.PP +A new virtual filesystem interface has been added to the +kernel to support multiple filesystems. +In comparison with other interfaces, +the Berkeley interface has been structured for more efficient support +of filesystems that maintain state (such as the local filesystem). +The interface has been extended with support for stackable +filesystems done at UCLA. +These extensions allow for filesystems to be layered on top of each +other and allow new vnode operations to be added without requiring +changes to existing filesystem implementations. +For example, +the umap filesystem (see +.Xr mount_umap (8)) +is used to mount a sub-tree of an existing filesystem +that uses a different set of uids and gids than the local system. +Such a filesystem could be mounted from a remote site via NFS or it +could be a filesystem on removable media brought from some foreign +location that uses a different password file. +.PP +Other new filesystems that may be stacked include the loopback filesystem +.Xr mount_lofs (8), +the kernel filesystem +.Xr mount_kernfs (8), +and the portal filesystem +.Xr mount_portal (8). +.PP +The buffer cache in the kernel is now organized as a file block cache +rather than a device block cache. +As a consequence, cached blocks from a file +and from the corresponding block device would no longer be kept consistent. +The block device thus has little remaining value. +Three changes have been made for these reasons: +.IP 1) +block devices may not be opened while they are mounted, +and may not be mounted while open, so that the two versions of cached +file blocks cannot be created, +.IP 2) +filesystem checks of the root now use the raw device +to access the root filesystem, and +.IP 3) +the root filesystem is initially mounted read-only +so that nothing can be written back to disk during or after change to +the raw filesystem by +.Xr fsck . +.LP +The root filesystem may be made writable while in single-user mode +with the command: +.DS +.ft CW +mount \-uw / +.DE +The mount command has an option to update the flags on a mounted filesystem, +including the ability to upgrade a filesystem from read-only to read-write +or downgrade it from read-write to read-only. +.PP +In addition to the local ``fast filesystem'', +we have added an implementation of the network filesystem (NFS) +that fully interoperates with the NFS shipped by Sun and its licensees. +Because our NFS implementation was implemented +by Rick Macklem of the University of Guelph +using only the publicly available NFS specification, +it does not require a license from Sun to use in source or binary form. +By default it runs over UDP to be compatible with Sun's implementation. +However, it can be configured on a per-mount basis to run over TCP. +Using TCP allows it to be used quickly and efficiently through +gateways and over long-haul networks. +Using an extended protocol, it supports Leases to allow a limited +callback mechanism that greatly reduces the network traffic necessary +to maintain cache consistency between the server and its clients. +Its use will be familiar to users of other implementations of NFS. +See the manual pages +.Xr mount (8), +.Xr mountd (8), +.Xr fstab (5), +.Xr exports (5), +.Xr netgroup (5), +.Xr nfsd (8), +.Xr nfsiod (8), +and +.Xr nfssvc (8). +and the document ``The 4.4BSD NFS Implementation'' (SMM:6) +for further information. +The format of +.Pn /etc/fstab +has changed from previous \*(Bs releases +to a blank-separated format to allow colons in pathnames. +.PP +A new local filesystem, the log-structured filesystem (LFS), +has been added to the system. +It provides near disk-speed output and fast crash recovery. +This work is based, in part, on the LFS filesystem created +for the Sprite operating system at Berkeley. +While the kernel implementation is almost complete, +only some of the utilities to support the +filesystem have been written, +so we do not recommend it for production use. +See +.Xr newlfs (8), +.Xr mount_lfs (8) +and +.Xr lfs_cleanerd (8) +for more information. +For a in-depth description of the implementation and performance +characteristics of log-structured filesystems in general, +and this one in particular, see Dr. Margo Seltzer's doctoral thesis, +available from the University of California Computer Science Department. +.PP +We have also added a memory-based filesystem that runs in +pageable memory, allowing large temporary filesystems without +requiring dedicated physical memory. +.PP +The local ``fast filesystem'' has been enhanced to do +clustering that allows large pieces of files to be +allocated contiguously resulting in near doubling +of filesystem throughput. +The filesystem interface has been extended to allow +files and filesystems to grow to 2^63 bytes in size. +The quota system has been rewritten to support both +user and group quotas (simultaneously if desired). +Quota expiration is based on time rather than +the previous metric of number of logins over quota. +This change makes quotas more useful on fileservers +onto which users seldom login. +.PP +The system security has been greatly enhanced by the +addition of additional file flags that permit a file to be +marked as immutable or append only. +Once set, these flags can only be cleared by the super-user +when the system is running in insecure mode (normally, single-user). +In addition to the immutable and append-only flags, +the filesystem supports a new user-settable flag ``nodump''. +(File flags are set using the +.Xr chflags (1) +utility.) +When set on a file, +.Xr dump (8) +will omit the file from incremental backups +but retain them on full backups. +See the ``-h'' flag to +.Xr dump (8) +for details on how to change this default. +The ``nodump'' flag is usually set on core dumps, +system crash dumps, and object files generated by the compiler. +Note that the flag is not preserved when files are copied +so that installing an object file will cause it to be preserved. +.PP +The filesystem format used in \*(4B has several additions. +Directory entries have an additional field, +.Pn d_type , +that identifies the type of the entry +(normally found in the +.Pn st_mode +field of the +.Pn stat +structure). +This field is particularly useful for identifying +directories without the need to use +.Xr stat (2). +.PP +Short (less than sixty byte) symbolic links are now stored +in the inode itself rather than in a separate data block. +This saves disk space and makes access of symbolic links faster. +Short symbolic links are not given a special type, +so a user-level application is unaware of their special treatment. +Unlike pre-\*(4B systems, symbolic links do +not have an owner, group, access mode, times, etc. +Instead, these attributes are taken from the directory that contains the link. +The only attributes returned from an +.Xr lstat (2) +that refer to the symbolic link itself are the file type (S_IFLNK), +size, blocks, and link count (always 1). +.PP +An implementation of an auto-mounter daemon, +.Xr amd , +was contributed by Jan-Simon Pendry of the +Imperial College of Science, Technology & Medicine. +See the document ``AMD \- The 4.4BSD Automounter'' (SMM:13) +for further information. +.PP +The directory +.Pn /dev/fd +contains special files +.Pn 0 +through +.Pn 63 +that, when opened, duplicate the corresponding file descriptor. +The names +.Pn /dev/stdin , +.Pn /dev/stdout +and +.Pn /dev/stderr +refer to file descriptors 0, 1 and 2. +See +.Xr fd (4) +and +.Xr mount_fdesc (8) +for more information. +.Sh 4 "POSIX terminal driver changes" +.PP +The \*(4B system uses the IEEE P1003.1 (POSIX.1) terminal interface +rather than the previous \*(Bs terminal interface. +The terminal driver is similar to the System V terminal driver +with the addition of the necessary extensions to get the +functionality previously available in the \*(Ps terminal driver. +Both the old +.Xr ioctl +calls and old options to +.Xr stty (1) +are emulated. +This emulation is expected to be unavailable in many vendors releases, +so conversion to the new interface is encouraged. +.PP +\*(4B also adds the IEEE Std1003.1 job control interface, +that is similar to the \*(Ps job control interface, +but adds a security model that was missing in the +\*(Ps job control implementation. +A new system call, +.Fn setsid , +creates a job-control session consisting of a single process +group with one member, the caller, that becomes a session leader. +Only a session leader may acquire a controlling terminal. +This is done explicitly via a +.Sm TIOCSCTTY +.Fn ioctl +call, not implicitly by an +.Fn open +call. +The call fails if the terminal is in use. +Programs that allocate controlling terminals (or pseudo-terminals) +require change to work in this environment. +The versions of +.Xr xterm +provided in the X11R5 release includes the necessary changes. +New library routines are available for allocating and initializing +pseudo-terminals and other terminals as controlling terminal; see +.Pn /usr/src/lib/libutil/pty.c +and +.Pn /usr/src/lib/libutil/login_tty.c . +.PP +The POSIX job control model formalizes the previous conventions +used in setting up a process group. +Unfortunately, this requires that changes be made in a defined order +and with some synchronization that were not necessary in the past. +Older job control shells (csh, ksh) will generally not operate correctly +with the new system. +.PP +Most of the other kernel interfaces have been changed to correspond +with the POSIX.1 interface, although that work is not complete. +See the relevant manual pages and the IEEE POSIX standard. +.Sh 4 "Native operating system compatibility" +.PP +Both the HP300 and SPARC ports feature the ability to run binaries +built for the native operating system (HP-UX or SunOS) by emulating +their system calls. +Building an HP300 kernel with the HPUXCOMPAT and COMPAT_OHPUX options +or a SPARC kernel with the COMPAT_SUNOS option will enable this feature +(on by default in the generic kernel provided in the root filesystem image). +Though this native operating system compatibility was provided by the +developers as needed for their purposes and is by no means complete, +it is complete enough to run several non-trivial applications including +those that require HP-UX or SunOS shared libraries. +For example, the vendor supplied X11 server and windowing environment +can be used on both the HP300 and SPARC. +.PP +It is important to remember that merely copying over a native binary +and executing it (or executing it directly across NFS) does not imply +that it will run. +All but the most trivial of applications are likely to require access +to auxiliary files that do not exist under \*(4B (e.g. +.Pn /etc/ld.so.cache ) +or have a slightly different format (e.g. +.Pn /etc/passwd ). +However, by using system call tracing and +through creative use of symlinks, +many problems can be tracked down and corrected. +.PP +The DECstation port also has code for ULTRIX emulation +(kernel option ULTRIXCOMPAT, not compiled into the generic kernel) +but it was used primarily for initially bootstrapping the port and +has not been used since. +Hence, some work may be required to make it generally useful. +.Sh 3 "Changes to the utilities" +.PP +We have been tracking the IEEE Std1003.2 shell and utility work +and have included prototypes of many of the proposed utilities +based on draft 12 of the POSIX.2 Shell and Utilities document. +Because most of the traditional utilities have been replaced +with implementations conformant to the POSIX standards, +you should realize that the utility software may not be as stable, +reliable or well documented as in traditional Berkeley releases. +In particular, almost the entire manual suite has been rewritten to +reflect the POSIX defined interfaces, and in some instances +it does not correctly reflect the current state of the software. +It is also worth noting that, in rewriting this software, we have generally +been rewarded with significant performance improvements. +Most of the libraries and header files have been converted +to be compliant with ANSI C. +The shipped compiler (gcc) is a superset of ANSI C, +but supports traditional C as a command-line option. +The system libraries and utilities all compile +with either ANSI or traditional C. +.Sh 4 "Make and Makefiles" +.PP +This release uses a completely new version of the +.Xr make +program derived from the +.Xr pmake +program developed by the Sprite project at Berkeley. +It supports existing makefiles, although certain incorrect makefiles +may fail. +The makefiles for the \*(4B sources make extensive use of the new +facilities, especially conditionals and file inclusion, and are thus +completely incompatible with older versions of +.Xr make +(but nearly all the makefiles are now trivial!). +The standard include files for +.Xr make +are in +.Pn /usr/share/mk . +There is a +.Pn bsd.README +file in +.Pn /usr/src/share/mk . +.PP +Another global change supported by the new +.Xr make +is designed to allow multiple architectures to share a copy of the sources. +If a subdirectory named +.Pn obj +is present in the current directory, +.Xr make +descends into that directory and creates all object and other files there. +We use this by building a directory hierarchy in +.Pn /var/obj +that parallels +.Pn /usr/src . +We then create the +.Pn obj +subdirectories in +.Pn /usr/src +as symbolic links to the corresponding directories in +.Pn /var/obj . +(This step is automated. +The command ``make obj'' in +.Pn /usr/src +builds both the local symlink and the shadow directory, +using +.Pn /usr/obj , +that may be a symbolic link, as the root of the shadow tree. +The use of +.Pn /usr/obj +is for historic reasons only, and the system make configuration files in +.Pn /usr/share/mk +can trivially be modified to use +.Pn /var/obj +instead.) +We have one +.Pn /var/obj +hierarchy on the local system, and another on each +system that shares the source filesystem. +All the sources in +.Pn /usr/src +except for +.Pn /usr/src/contrib +and portions of +.Pn /usr/src/old +have been converted to use the new make and +.Pn obj +subdirectories; +this change allows compilation for multiple +architectures from the same source tree +(that may be mounted read-only). +.Sh 4 "Kerberos" +.PP +The Kerberos authentication server from MIT (version 4) +is included in this release. +See +.Xr kerberos (1) +for a general, if MIT-specific, introduction. +If it is configured, +.Xr login (1), +.Xr passwd (1), +.Xr rlogin (1) +and +.Xr rsh (1) +will all begin to use it automatically. +The file +.Pn /etc/kerberosIV/README +describes the configuration. +Each system needs the file +.Pn /etc/kerberosIV/krb.conf +to set its realm and local servers, +and a private key stored in +.Pn /etc/kerberosIV/srvtab +(see +.Xr ext_srvtab (8)). +The Kerberos server should be set up on a single, physically secure, +server machine. +Users and hosts may be added to the server database manually with +.Xr kdb_edit (8), +or users on authorized hosts can add themselves and a Kerberos +password after verification of their ``local'' (passwd-file) password +using the +.Xr register (1) +program. +.PP +Note that by default the password-changing program +.Xr passwd (1) +changes the Kerberos password, that must exist. +The +.Li \-l +option to +.Xr passwd (1) +changes the ``local'' password if one exists. +.PP +Note that Version 5 of Kerberos will be released soon; +Version 4 should probably be replaced at that time. +.Sh 4 "Timezone support" +.PP +The timezone conversion code in the C library uses data files installed in +.Pn /usr/share/zoneinfo +to convert from ``GMT'' to various timezones. The data file for the default +timezone for the system should be copied to +.Pn /etc/localtime . +Other timezones can be selected by setting the TZ environment variable. +.PP +The data files initially installed in +.Pn /usr/share/zoneinfo +include corrections for leap seconds since the beginning of 1970. +Thus, they assume that the +kernel will increment the time at a constant rate during a leap second; +that is, time just keeps on ticking. The conversion routines will then +name a leap second 23:59:60. For purists, this effectively means that +the kernel maintains TAI (International Atomic Time) rather than UTC +(Coordinated Universal Time, aka GMT). +.PP +For systems that run current NTP (Network Time Protocol) implementations +or that wish to conform to the letter of the POSIX.1 law, it is possible +to rebuild the timezone data files so that leap seconds are not counted. +(NTP causes the time to jump over a leap second, and POSIX effectively +requires the clock to be reset by hand when a leap second occurs. +In this mode, the kernel effectively runs UTC rather than TAI.) +.PP +The data files without leap second information +are constructed from the source directory, +.Pn /usr/src/share/zoneinfo . +Change the variable REDO in Makefile +from ``right'' to ``posix'', and then do +.DS +make obj (if necessary) +make +make install +.DE +.PP +You will then need to copy the correct default zone file to +.Pn /etc/localtime , +as the old one would still have used leap seconds, and because the Makefile +installs a default +.Pn /etc/localtime +each time ``make install'' is done. +.PP +It is possible to install both sets of timezone data files. This results +in subdirectories +.Pn /usr/share/zoneinfo/right +and +.Pn /usr/share/zoneinfo/posix . +Each contain a complete set of zone files. +See +.Pn /usr/src/share/zoneinfo/Makefile +for details. +.Sh 4 "Additions and changes to the libraries" +.PP +Notable additions to the libraries include functions to traverse a +filesystem hierarchy, database interfaces to btree and hashing functions, +a new, faster implementation of stdio and a radix and merge sort +functions. +.PP +The +.Xr fts (3) +functions will do either physical or logical traversal of +a file hierarchy as well as handle essentially infinite depth +filesystems and filesystems with cycles. +All the utilities in \*(4B which traverse file hierarchies +have been converted to use +.Xr fts (3). +The conversion has always resulted in a significant performance +gain, often of four or five to one in system time. +.PP +The +.Xr dbopen (3) +functions are intended to be a family of database access methods. +Currently, they consist of +.Xr hash (3), +an extensible, dynamic hashing scheme, +.Xr btree (3), +a sorted, balanced tree structure (B+tree's), and +.Xr recno (3), +a flat-file interface for fixed or variable length records +referenced by logical record number. +Each of the access methods stores associated key/data pairs and +uses the same record oriented interface for access. +.PP +The +.Xr qsort (3) +function has been rewritten for additional performance. +In addition, three new types of sorting functions, +.Xr heapsort (3), +.Xr mergesort (3) +and +.Xr radixsort (3) +have been added to the system. +The +.Xr mergesort +function is optimized for data with pre-existing order, +in which case it usually significantly outperforms +.Xr qsort . +The +.Xr radixsort (3) +functions are variants of most-significant-byte radix sorting. +They take time linear to the number of bytes to be +sorted, usually significantly outperforming +.Xr qsort +on data that can be sorted in this fashion. +An implementation of the POSIX 1003.2 standard +.Xr sort (1), +based on +.Xr radixsort , +is included in +.Pn /usr/src/contrib/sort . +.PP +Some additional comments about the \*(4B C library: +.IP \(bu +The floating point support in the C library has been replaced +and is now accurate. +.IP \(bu +The C functions specified by both ANSI C, POSIX 1003.1 and +1003.2 are now part of the C library. +This includes support for file name matching, shell globbing +and both basic and extended regular expressions. +.IP \(bu +ANSI C multibyte and wide character support has been integrated. +The rune functionality from the Bell Labs' Plan 9 system is provided +as well. +.IP \(bu +The +.Xr termcap (3) +functions have been generalized and replaced with a general +purpose interface named +.Xr getcap (3). +.IP \(bu +The +.Xr stdio (3) +routines have been replaced, and are usually much faster. +In addition, the +.Xr funopen (3) +interface permits applications to provide their own I/O stream +function support. +.PP +The +.Xr curses (3) +library has been largely rewritten. +Important additional features include support for scrolling and +.Xr termios (3). +.PP +An application front-end editing library, named libedit, has been +added to the system. +.PP +A superset implementation of the SunOS kernel memory interface library, +libkvm, has been integrated into the system. +.PP +.Sh 4 "Additions and changes to other utilities" +.PP +There are many new utilities, offering many new capabilities, +in \*(4B. +Skimming through the section 1 and section 8 manual pages is sure +to be useful. +The additions to the utility suite include greatly enhanced versions of +programs that display system status information, implementations of +various traditional tools described in the IEEE Std1003.2 standard, +new tools not previous available on Berkeley UNIX systems, +and many others. +Also, with only a very few exceptions, all the utilities from +\*(Ps that included proprietary source code have been replaced, +and their \*(4B counterparts are freely redistributable. +Normally, this replacement resulted in significant performance +improvements and the increase of the limits imposed on data by +the utility as well. +.PP +A summary of specific additions and changes are as follows: +.TS +lfC l. +amd An auto-mounter implementation. +ar Replacement of the historic archive format with a new one. +awk Replaced by gawk; see /usr/src/old/awk for the historic version. +bdes Utility implementing DES modes of operation described in FIPS PUB 81. +calendar Addition of an interface for system calendars. +cap_mkdb Utility for building hashed versions of termcap style databases. +cc Replacement of pcc with gcc suite. +chflags A utility for setting the per-file user and system flags. +chfn An editor based replacement for changing user information. +chpass An editor based replacement for changing user information. +chsh An editor based replacement for changing user information. +cksum The POSIX 1003.2 checksum utility; compatible with sum. +column A columnar text formatting utility. +cp POSIX 1003.2 compatible, able to copy special files. +csh Freely redistributable and 8-bit clean. +date User specified formats added. +dd New EBCDIC conversion tables, major performance improvements. +dev_mkdb Hashed interface to devices. +dm Dungeon master. +find Several new options and primaries, major performance improvements. +fstat Utility displaying information on files open on the system. +ftpd Connection logging added. +hexdump A binary dump utility, superseding od. +id The POSIX 1003.2 user identification utility. +inetd Tcpmux added. +jot A text formatting utility. +kdump A system-call tracing facility. +ktrace A system-call tracing facility. +kvm_mkdb Hashed interface to the kernel name list. +lam A text formatting utility. +lex A new, freely redistributable, significantly faster version. +locate A database of the system files, by name, constructed weekly. +logname The POSIX 1003.2 user identification utility. +mail.local New local mail delivery agent, replacing mail. +make Replaced with a new, more powerful make, supporting include files. +man Added support for man page location configuration. +mkdep A new utility for generating make dependency lists. +mkfifo The POSIX 1003.2 FIFO creation utility. +mtree A new utility for mapping file hierarchies to a file. +nfsstat An NFS statistics utility. +nvi A freely redistributable replacement for the ex/vi editors. +pax The POSIX 1003.2 replacement for cpio and tar. +printf The POSIX 1003.2 replacement for echo. +roff Replaced by groff; see /usr/src/old/roff for the historic versions. +rs New utility for text formatting. +shar An archive building utility. +sysctl MIB-style interface to system state. +tcopy Fast tape-to-tape copying and verification. +touch Time and file reference specifications. +tput The POSIX 1003.2 terminal display utility. +tr Addition of character classes. +uname The POSIX 1003.2 system identification utility. +vis A filter for converting and displaying non-printable characters. +xargs The POSIX 1003.2 argument list constructor utility. +yacc A new, freely redistributable, significantly faster version. +.TE +.PP +The new versions of +.Xr lex (1) +(``flex'') and +.Xr yacc (1) +(``zoo'') should be installed early on if attempting to +cross-compile \*(4B on another system. +Note that the new +.Xr lex +program is not completely backward compatible with historic versions of +.Xr lex , +although it is believed that all documented features are supported. +.PP +The +.Xr find +utility has two new options that are important to be aware of if you +intend to use NFS. +The ``fstype'' and ``prune'' options can be used together to prevent +find from crossing NFS mount points. +See +.Pn /etc/daily +for an example of their use. +.Sh 2 "Hints on converting from \*(Ps to \*(4B" +.PP +This section summarizes changes between +\*(Ps and \*(4B that are likely to +cause difficulty in doing the conversion. +It does not include changes in the network; +see section 5 for information on setting up the network. +.PP +Since the stat st_size field is now 64-bits instead of 32, +doing something like: +.DS +.ft CW +foo(st.st_size); +.DE +and then (improperly) defining foo with an ``int'' or ``long'' parameter: +.DS +.ft CW +foo(size) + int size; +{ + ... +} +.DE +will fail miserably (well, it might work on a little endian machine). +This problem showed up in +.Xr emacs (1) +as well as several other programs. +A related problem is improperly casting (or failing to cast) +the second argument to +.Xr lseek (2), +.Xr truncate (2), +or +.Xr ftruncate (2) +ala: +.DS +.ft CW +lseek(fd, (long)off, 0); +.DE +or +.DS +.ft CW +lseek(fd, 0, 0); +.DE +The best solution is to include +.Pn <unistd.h> +which has prototypes that catch these types of errors. +.PP +Determining the ``namelen'' parameter for a +.Xr connect (2) +call on a unix domain socket should use the ``SUN_LEN'' macro from +.Pn <sys/un.h> . +One old way that was used: +.DS +.ft CW +addrlen = strlen(unaddr.sun_path) + sizeof(unaddr.sun_family); +.DE +no longer works as there is an additional +.Pn sun_len +field. +.PP +The kernel's limit on the number of open files has been +increased from 20 to 64. +It is now possible to change this limit almost arbitrarily. +The standard I/O library +autoconfigures to the kernel limit. +Note that file (``_iob'') entries may be allocated by +.Xr malloc +from +.Xr fopen ; +this allocation has been known to cause problems with programs +that use their own memory allocators. +Memory allocation does not occur until after 20 files have been opened +by the standard I/O library. +.PP +.Xr Select +can be used with more than 32 descriptors +by using arrays of \fBint\fPs for the bit fields rather than single \fBint\fPs. +Programs that used +.Xr getdtablesize +as their first argument to +.Xr select +will no longer work correctly. +Usually the program can be modified to correctly specify the number +of bits in an \fBint\fP. +Alternatively the program can be modified to use an array of \fBint\fPs. +There are a set of macros available in +.Pn <sys/types.h> +to simplify this. +See +.Xr select (2). +.PP +Old core files will not be intelligible by the current debuggers +because of numerous changes to the user structure +and because the kernel stack has been enlarged. +The +.Xr a.out +header that was in the user structure is no longer present. +Locally-written debuggers that try to check the magic number +will need to be changed. +.PP +Files may not be deleted from directories having the ``sticky'' (ISVTX) bit +set in their modes +except by the owner of the file or of the directory, or by the superuser. +This is primarily to protect users' files in publicly-writable directories +such as +.Pn /tmp +and +.Pn /var/tmp . +All publicly-writable directories should have their ``sticky'' bits set +with ``chmod +t.'' +.PP +The following two sections contain additional notes about +changes in \*(4B that affect the installation of local files; +be sure to read them as well. diff --git a/share/doc/smm/01.setup/4.t b/share/doc/smm/01.setup/4.t new file mode 100644 index 00000000000..3841857c493 --- /dev/null +++ b/share/doc/smm/01.setup/4.t @@ -0,0 +1,713 @@ +.\" Copyright (c) 1980, 1986, 1988 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)4.t 8.1 (Berkeley) 7/29/93 +.\" +.ds LH "Installing/Operating \*(4B +.ds CF \*(Dy +.ds RH "System setup +.Sh 1 "System setup" +.PP +This section describes procedures used to set up a \*(4B UNIX system. +These procedures are used when a system is first installed +or when the system configuration changes. Procedures for normal +system operation are described in the next section. +.Sh 2 "Kernel configuration" +.PP +This section briefly describes the layout of the kernel code and +how files for devices are made. +For a full discussion of configuring +and building system images, consult the document ``Building +4.3BSD UNIX Systems with Config'' (SMM:2). +.Sh 3 "Kernel organization" +.PP +As distributed, the kernel source is in a +separate tar image. The source may be physically +located anywhere within any filesystem so long as +a symbolic link to the location is created for the file +.Pn /sys +(many files in +.Pn /usr/include +are normally symbolic links relative to +.Pn /sys ). +In further discussions of the system source all path names +will be given relative to +.Pn /sys . +.LP +The kernel is made up of several large generic parts: +.TS +l l l. +sys main kernel header files +kern kernel functions broken down as follows + init system startup, syscall dispatching, entry points + kern scheduling, descriptor handling and generic I/O + sys process management, signals + tty terminal handling and job control + vfs filesystem management + uipc interprocess communication (sockets) + subr miscellaneous support routines +vm virtual memory management +ufs local filesystems broken down as follows + ufs common local filesystem routines + ffs fast filesystem + lfs log-based filesystem + mfs memory based filesystem +nfs Sun-compatible network filesystem +miscfs miscellaneous filesystems broken down as follows + deadfs where rejected vnodes go to die + fdesc access to per-process file descriptors + fifofs IEEE Std1003.1 FIFOs + kernfs filesystem access to kernel data structures + lofs loopback filesystem + nullfs another loopback filesystem + portal associate processes with filesystem locations + specfs device special files + umapfs provide alternate uid/gid mappings +dev generic device drivers (SCSI, vnode, concatenated disk) +.TE +.LP +The networking code is organized by protocol +.TS +l l. +net routing and generic interface drivers +netinet Internet protocols (TCP, UDP, IP, etc) +netiso ISO protocols (TP-4, CLNP, CLTP, etc) +netns Xerox network systems protocols (IDP, SPP, etc) +netx25 CCITT X.25 protocols (X.25 Packet Level, HDLC/LAPB) +.TE +.LP +A separate subdirectory is provided for each machine architecture +.TS +l l. +hp300 HP 9000/300 series of Motorola 68000-based machines +hp code common to both HP 68k and (non-existent) PA-RISC ports +i386 Intel 386/486-based PC machines +luna68k Omron 68000-based workstations +news3400 Sony News MIPS-based workstations +pmax Digital 3100/5000 MIPS-based workstations +sparc Sun Microsystems SPARCstation 1, 1+, and 2 +tahoe (deprecated) CCI Power 6-series machines +vax (deprecated) Digital VAX machines +.TE +.LP +Each machine directory is subdivided by function; +for example the hp300 directory contains +.TS +l l. +include exported machine-dependent header files +hp300 machine-dependent support code and private header files +dev device drivers +conf configuration files +stand machine-dependent standalone code +.TE +.LP +Other kernel related directories +.TS +l l. +compile area to compile kernels +conf machine-independent configuration files +stand machine-independent standalone code +.TE +.Sh 3 "Devices and device drivers" +.PP +Devices supported by UNIX are implemented in the kernel +by drivers whose source is kept in +.Pn /sys/<architecture>/dev . +These drivers are loaded +into the system when included in a cpu specific configuration file +kept in the conf directory. Devices are accessed through special +files in the filesystem, made by the +.Xr mknod (8) +program and normally kept in the +.Pn /dev +directory. +For all the devices supported by the distribution system, the +files in +.Pn /dev +are created by the +.Pn /dev/MAKEDEV +shell script. +.PP +Determine the set of devices that you have and create a new +.Pn /dev +directory by running the MAKEDEV script. +First create a new directory +.Pn /newdev , +copy MAKEDEV into it, edit the file MAKEDEV.local +to provide an entry for local needs, +and run it to generate a +.Pn /newdev directory. +For instance, +.DS +\fB#\fP \fIcd /\fP +\fB#\fP \fImkdir newdev\fP +\fB#\fP \fIcp dev/MAKEDEV newdev/MAKEDEV\fP +\fB#\fP \fIcd newdev\fP +\fB#\fP \fIMAKEDEV \*(Dk0 pt0 std LOCAL\fP +.DE +Note the ``std'' argument causes standard devices such as +.Pn /dev/console , +the machine console, to be created. +.PP +You can then do +.DS +\fB#\fP \fIcd /\fP +\fB#\fP \fImv dev olddev ; mv newdev dev\fP +\fB#\fP \fIsync\fP +.DE +to install the new device directory. +.Sh 3 "Building new system images" +.PP +The kernel configuration of each UNIX system is described by +a single configuration file, stored in the +.Pn /sys/<architecture>/conf +directory. +To learn about the format of this file and the procedure used +to build system images, +start by reading ``Building 4.3BSD UNIX Systems with Config'' (SMM:2), +look at the manual pages in section 4 +of the UNIX manual for the devices you have, +and look at the sample configuration files in the +.Pn /sys/<architecture>/conf +directory. +.PP +The configured system image +.Pn vmunix +should be copied to the root, and then booted to try it out. +It is best to name it +.Pn /newvmunix +so as not to destroy the working system until you are sure it does work: +.DS +\fB#\fP \fIcp vmunix /newvmunix\fP +\fB#\fP \fIsync\fP +.DE +It is also a good idea to keep the previous system around under some other +name. In particular, we recommend that you save the generic distribution +version of the system permanently as +.Pn /genvmunix +for use in emergencies. +To boot the new version of the system you should follow the +bootstrap procedures outlined in section 6.1. +After having booted and tested the new system, it should be installed as +.Pn /vmunix +before going into multiuser operation. +A systematic scheme for numbering and saving old versions +of the system may be useful. +.Sh 2 "Configuring terminals" +.PP +If UNIX is to support simultaneous +access from directly-connected terminals other than the console, +the file +.Pn /etc/ttys +(see +.Xr ttys (5)) +must be edited. +.PP +To add a new terminal device, be sure the device is configured into the system +and that the special files for the device have been made by +.Pn /dev/MAKEDEV . +Then, enable the appropriate lines of +.Pn /etc/ttys +by setting the ``status'' +field to \fBon\fP (or add new lines). +Note that lines in +.Pn /etc/ttys +are one-for-one with entries in the file of current users +(see +.Pn /var/run/utmp ), +and therefore it is best to make changes +while running in single-user mode +and to add all the entries for a new device at once. +.PP +Each line in the +.Pn /etc/ttys +file is broken into four tab separated +fields (comments are shown by a `#' character and extend to +the end of the line). For each terminal line the four fields +are: +the device (without a leading +.Pn /dev ), +the program +.Pn /sbin/init +should startup to service the line +(or \fBnone\fP if the line is to be left alone), +the terminal type (found in +.Pn /usr/share/misc/termcap ), +and optional status information describing if the terminal is +enabled or not and if it is ``secure'' (i.e. the super user should +be allowed to login on the line). +If the console is marked as ``insecure'', +then the root password is required to bring the machine up single-user. +All fields are character strings +with entries requiring embedded white space enclosed in double +quotes. +Thus a newly added terminal +.Pn /dev/tty00 +could be added as +.DS +tty00 "/usr/libexec/getty std.9600" vt100 on secure # mike's office +.DE +The std.9600 parameter provided to +.Pn /usr/libexec/getty +is used in searching the file +.Pn /etc/gettytab ; +it specifies a terminal's characteristics (such as baud rate). +To make custom terminal types, consult +.Xr gettytab (5) +before modifying +.Pn /etc/gettytab . +.PP +Dialup terminals should be wired so that carrier is asserted only when the +phone line is dialed up. +For non-dialup terminals, from which modem control is not available, +you must wire back the signals so that +the carrier appears to always be present. For further details, +find your terminal driver in section 4 of the manual. +.PP +For network terminals (i.e. pseudo terminals), no program should +be started up on the lines. Thus, the normal entry in +.Pn /etc/ttys +would look like +.DS +ttyp0 none network +.DE +(Note, the fourth field is not needed here.) +.PP +When the system is running multi-user, all terminals that are listed in +.Pn /etc/ttys +as \fBon\fP have their line enabled. +If, during normal operations, you wish +to disable a terminal line, you can edit the file +.Pn /etc/ttys +to change the terminal's status to \fBoff\fP and +then send a hangup signal to the +.Xr init +process, by doing +.DS +\fB#\fP \fIkill \-s HUP 1\fP +.DE +Terminals can similarly be enabled by changing the status field +from \fBoff\fP to \fBon\fP and sending a hangup signal to +.Xr init . +.PP +Note that if a special file is inaccessible when +.Xr init +tries to create a process for it, +.Xr init +will log a message to the +system error logging process (see +.Xr syslogd (8)) +and try to reopen the terminal every minute, reprinting the warning +message every 10 minutes. Messages of this sort are normally +printed on the console, though other actions may occur depending +on the configuration information found in +.Pn /etc/syslog.conf . +.PP +Finally note that you should change the names of any dialup +terminals to ttyd? +where ? is in [0-9a-zA-Z], as some programs use this property of the +names to determine if a terminal is a dialup. +Shell commands to do this should be put in the +.Pn /dev/MAKEDEV.local +script. +.PP +While it is possible to use truly arbitrary strings for terminal names, +the accounting and noticeably the +.Xr ps (1) +command make good use of the convention that tty names +(by default, and also after dialups are named as suggested above) +are distinct in the last 2 characters. +Change this and you may be sorry later, as the heuristic +.Xr ps (1) +uses based on these conventions will then break down and +.Xr ps +will run MUCH slower. +.Sh 2 "Adding users" +.PP +The procedure for adding a new user is described in +.Xr adduser (8). +You should add accounts for the initial user community, giving +each a directory and a password, and putting users who will wish +to share software in the same groups. +.PP +Several guest accounts have been provided on the distribution +system; these accounts are for people at Berkeley, +Bell Laboratories, and others +who have done major work on UNIX in the past. You can delete these accounts, +or leave them on the system if you expect that these people would have +occasion to login as guests on your system. +.Sh 2 "Site tailoring" +.PP +All programs that require the site's name, or some similar +characteristic, obtain the information through system calls +or from files located in +.Pn /etc . +Aside from parts of the +system related to the network, to tailor the system to your +site you must simply select a site name, then edit the file +.DS +/etc/netstart +.DE +The first lines in +.Pn /etc/netstart +use a variable to set the hostname, +.DS +hostname=\fImysitename\fP +/bin/hostname $hostname +.DE +to define the value returned by the +.Xr gethostname (2) +system call. If you are running the name server, your site +name should be your fully qualified domain name. Programs such as +.Xr getty (8), +.Xr mail (1), +.Xr wall (1), +and +.Xr uucp (1) +use this system call so that the binary images are site +independent. +.PP +You will also need to edit +.Pn /etc/netstart +to do the network interface initialization using +.Xr ifconfig (8). +If you are not sure how to do this, see sections 5.1, 5.2, and 5.3. +If you are not running a routing daemon and have +more than one Ethernet in your environment +you will need to set up a default route; +see section 5.4 for details. +Before bringing your system up multiuser, +you should ensure that the networking is properly configured. +The network is started by running +.Pn /etc/netstart . +Once started, you should test connectivity using +.Xr ping (8). +You should first test connectivity to yourself, +then another host on your Ethernet, +and finally a host on another Ethernet. +The +.Xr netstat (8) +program can be used to inspect and debug +your routes; see section 5.4. +.Sh 2 "Setting up the line printer system" +.PP +The line printer system consists of at least +the following files and commands: +.DS +.TS +l l. +/usr/bin/lpq spooling queue examination program +/usr/bin/lprm program to delete jobs from a queue +/usr/bin/lpr program to enter a job in a printer queue +/etc/printcap printer configuration and capability database +/usr/sbin/lpd line printer daemon, scans spooling queues +/usr/sbin/lpc line printer control program +/etc/hosts.lpd list of host allowed to use the printers +.TE +.DE +.PP +The file +.Pn /etc/printcap +is a master database describing line +printers directly attached to a machine and, also, printers +accessible across a network. The manual page +.Xr printcap (5) +describes the format of this database and also +shows the default values for such things as the directory +in which spooling is performed. The line printer system handles +multiple printers, multiple spooling queues, local and remote +printers, and also printers attached via serial lines that require +line initialization such as the baud rate. Raster output devices +such as a Varian or Versatec, and laser printers such as an Imagen, +are also supported by the line printer system. +.PP +Remote spooling via the network is handled with two spooling +queues, one on the local machine and one on the remote machine. +When a remote printer job is started with +.Xr lpr , +the job is queued locally and a daemon process created to oversee the +transfer of the job to the remote machine. If the destination +machine is unreachable, the job will remain queued until it is +possible to transfer the files to the spooling queue on the +remote machine. The +.Xr lpq +program shows the contents of spool +queues on both the local and remote machines. +.PP +To configure your line printers, consult the printcap manual page +and the accompanying document, ``4.3BSD Line Printer Spooler Manual'' (SMM:7). +A call to the +.Xr lpd +program should be present in +.Pn /etc/rc . +.Sh 2 "Setting up the mail system" +.PP +The mail system consists of the following commands: +.DS +.TS +l l. +/usr/bin/mail UCB mail program, described in \fImail\fP\|(1) +/usr/sbin/sendmail mail routing program +/var/spool/mail mail spooling directory +/var/spool/secretmail secure mail directory +/usr/bin/xsend secure mail sender +/usr/bin/xget secure mail receiver +/etc/aliases mail forwarding information +/usr/bin/newaliases command to rebuild binary forwarding database +/usr/bin/biff mail notification enabler +/usr/libexec/comsat mail notification daemon +.TE +.DE +Mail is normally sent and received using the +.Xr mail (1) +command (found in +.Pn /usr/bin/mail ), +which provides a front-end to edit the messages sent +and received, and passes the messages to +.Xr sendmail (8) +for routing. +The routing algorithm uses knowledge of the network name syntax, +aliasing and forwarding information, and network topology, as +defined in the configuration file +.Pn /usr/lib/sendmail.cf , +to process each piece of mail. +Local mail is delivered by giving it to the program +.Pn /usr/libexec/mail.local +that adds it to the mailboxes in the directory +.Pn /var/spool/mail/<username> , +using a locking protocol to avoid problems with simultaneous updates. +After the mail is delivered, the local mail delivery daemon +.Pn /usr/libexec/comsat +is notified, which in turn notifies users who have issued a +``\fIbiff\fP y'' command that mail has arrived. +.PP +Mail queued in the directory +.Pn /var/spool/mail +is normally readable only by the recipient. +To send mail that is secure against perusal +(except by a code-breaker) you should use the secret mail facility, +which encrypts the mail. +.PP +To set up the mail facility you should read the instructions in the +file READ_ME in the directory +.Pn /usr/src/usr.sbin/sendmail +and then adjust the necessary configuration files. +You should also set up the file +.Pn /etc/aliases +for your installation, creating mail groups as appropriate. +For more informations see +``Sendmail Installation and Operation Guide'' (SMM:8) and +``Sendmail \- An Internetwork Mail Router'' (SMM:9). +.Sh 3 "Setting up a UUCP connection" +.LP +The version of +.Xr uucp +included in \*(4B has the following features: +.IP \(bu 3 +support for many auto call units and dialers +in addition to the DEC DN11, +.IP \(bu 3 +breakup of the spooling area into multiple subdirectories, +.IP \(bu 3 +addition of an +.Pn L.cmds +file to control the set +of commands that may be executed by a remote site, +.IP \(bu 3 +enhanced ``expect-send'' sequence capabilities when +logging in to a remote site, +.IP \(bu 3 +new commands to be used in polling sites and +obtaining snap shots of +.Xr uucp +activity, +.IP \(bu 3 +additional protocols for different communication media. +.LP +This section gives a brief overview of +.Xr uucp +and points out the most important steps in its installation. +.PP +To connect two UNIX machines with a +.Xr uucp +network link using modems, +one site must have an automatic call unit +and the other must have a dialup port. +It is better if both sites have both. +.PP +You should first read the paper in the UNIX System Manager's Manual: +``Uucp Implementation Description'' (SMM:14). +It describes in detail the file formats and conventions, +and will give you a little context. +In addition, +the document ``setup.tblms'', +located in the directory +.Pn /usr/src/usr.bin/uucp/UUAIDS , +may be of use in tailoring the software to your needs. +.PP +The +.Xr uucp +support is located in three major directories: +.Pn /usr/bin, +.Pn /usr/lib/uucp, +and +.Pn /var/spool/uucp . +User commands are kept in +.Pn /usr/bin, +operational commands in +.Pn /usr/lib/uucp , +and +.Pn /var/spool/uucp +is used as a spooling area. +The commands in +.Pn /usr/bin +are: +.DS +.TS +l l. +/usr/bin/uucp file-copy command +/usr/bin/uux remote execution command +/usr/bin/uusend binary file transfer using mail +/usr/bin/uuencode binary file encoder (for \fIuusend\fP) +/usr/bin/uudecode binary file decoder (for \fIuusend\fP) +/usr/bin/uulog scans session log files +/usr/bin/uusnap gives a snap-shot of \fIuucp\fP activity +/usr/bin/uupoll polls remote system until an answer is received +/usr/bin/uuname prints a list of known uucp hosts +/usr/bin/uuq gives information about the queue +.TE +.DE +The important files and commands in +.Pn /usr/lib/uucp +are: +.DS +.TS +l l. +/usr/lib/uucp/L-devices list of dialers and hard-wired lines +/usr/lib/uucp/L-dialcodes dialcode abbreviations +/usr/lib/uucp/L.aliases hostname aliases +/usr/lib/uucp/L.cmds commands remote sites may execute +/usr/lib/uucp/L.sys systems to communicate with, how to connect, and when +/usr/lib/uucp/SEQF sequence numbering control file +/usr/lib/uucp/USERFILE remote site pathname access specifications +/usr/lib/uucp/uucico \fIuucp\fP protocol daemon +/usr/lib/uucp/uuclean cleans up garbage files in spool area +/usr/lib/uucp/uuxqt \fIuucp\fP remote execution server +.TE +.DE +while the spooling area contains the following important files and directories: +.DS +.TS +l l. +/var/spool/uucp/C. directory for command, ``C.'' files +/var/spool/uucp/D. directory for data, ``D.'', files +/var/spool/uucp/X. directory for command execution, ``X.'', files +/var/spool/uucp/D.\fImachine\fP directory for local ``D.'' files +/var/spool/uucp/D.\fImachine\fPX directory for local ``X.'' files +/var/spool/uucp/TM. directory for temporary, ``TM.'', files +/var/spool/uucp/LOGFILE log file of \fIuucp\fP activity +/var/spool/uucp/SYSLOG log file of \fIuucp\fP file transfers +.TE +.DE +.PP +To install +.Xr uucp +on your system, +start by selecting a site name +(shorter than 14 characters). +A +.Xr uucp +account must be created in the password file and a password set up. +Then, +create the appropriate spooling directories with mode 755 +and owned by user +.Xr uucp , +group \fIdaemon\fP. +.PP +If you have an auto-call unit, +the L.sys, L-dialcodes, and L-devices files should be created. +The L.sys file should contain +the phone numbers and login sequences +required to establish a connection with a +.Xr uucp +daemon on another machine. +For example, our L.sys file looks something like: +.DS +adiron Any ACU 1200 out0123456789- ogin-EOT-ogin uucp +cbosg Never Slave 300 +cbosgd Never Slave 300 +chico Never Slave 1200 out2010123456 +.DE +The first field is the name of a site, +the second shows when the machine may be called, +the third field specifies how the host is connected +(through an ACU, a hard-wired line, etc.), +then comes the phone number to use in connecting through an auto-call unit, +and finally a login sequence. +The phone number +may contain common abbreviations that are defined in the L-dialcodes file. +The device specification should refer to devices +specified in the L-devices file. +Listing only ACU causes the +.Xr uucp +daemon, +.Xr uucico , +to search for any available auto-call unit in L-devices. +Our L-dialcodes file is of the form: +.DS +ucb 2 +out 9% +.DE +while our L-devices file is: +.DS +ACU cul0 unused 1200 ventel +.DE +Refer to the README file in the +.Xr uucp +source directory for more information about installation. +.PP +As +.Xr uucp +operates it creates (and removes) many small +files in the directories underneath +.Pn /var/spool/uucp . +Sometimes files are left undeleted; +these are most easily purged with the +.Xr uuclean +program. +The log files can grow without bound unless trimmed back; +.Xr uulog +maintains these files. +Many useful aids in maintaining your +.Xr uucp +installation are included in a subdirectory UUAIDS beneath +.Pn /usr/src/usr.bin/uucp . +Peruse this directory and read the ``setup'' instructions also located there. diff --git a/share/doc/smm/01.setup/5.t b/share/doc/smm/01.setup/5.t new file mode 100644 index 00000000000..10b86dd3bda --- /dev/null +++ b/share/doc/smm/01.setup/5.t @@ -0,0 +1,586 @@ +.\" Copyright (c) 1980, 1986, 1988, 1993 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)5.t 8.1 (Berkeley) 7/27/93 +.\" +.ds lq `` +.ds rq '' +.ds LH "Installing/Operating \*(4B +.ds RH Network setup +.ds CF \*(Dy +.Sh 1 "Network setup" +.PP +\*(4B provides support for the standard Internet +protocols IP, ICMP, TCP, and UDP. These protocols may be used +on top of a variety of hardware devices ranging from +serial lines to local area network controllers +for the Ethernet. Network services are split between the +kernel (communication protocols) and user programs (user +services such as TELNET and FTP). This section describes +how to configure your system to use the Internet networking support. +\*(4B also supports the Xerox Network Systems (NS) protocols. +IDP and SPP are implemented in the kernel, +and other protocols such as Courier run at the user level. +\*(4B provides some support for the ISO OSI protocols CLNP +TP4, and ESIS. User level process +complete the application protocols such as X.400 and X.500. +.Sh 2 "System configuration" +.PP +To configure the kernel to include the Internet communication +protocols, define the INET option. +Xerox NS support is enabled with the NS option. +ISO OSI support is enabled with the ISO option. +In either case, include the pseudo-devices +``pty'', and ``loop'' in your machine's configuration +file. +The ``pty'' pseudo-device forces the pseudo terminal device driver +to be configured into the system, see +.Xr pty (4), +while the ``loop'' pseudo-device forces inclusion of the software loopback +interface driver. +The loop driver is used in network testing +and also by the error logging system. +.PP +If you are planning to use the Internet network facilities on a 10Mb/s +Ethernet, the pseudo-device ``ether'' should also be included +in the configuration; this forces inclusion of the Address Resolution +Protocol module used in mapping between 48-bit Ethernet +and 32-bit Internet addresses. +.PP +Before configuring the appropriate networking hardware, you should +consult the manual pages in section 4 of the Programmer's Manual +selecting the appropriate interfaces for your architecture. +.PP +All network interface drivers including the loopback interface, +require that their host address(es) be defined at boot time. +This is done with +.Xr ifconfig (8) +commands included in the +.Pn /etc/netstart +file. +Interfaces that are able to dynamically deduce the host +part of an address may check that the host part of the address is correct. +The manual page for each network interface +describes the method used to establish a host's address. +.Xr Ifconfig (8) +can also be used to set options for the interface at boot time. +Options are set independently for each interface, and +apply to all packets sent using that interface. +Alternatively, translations for such hosts may be set in advance +or ``published'' by a \*(4B host by use of the +.Xr arp (8) +command. +Note that the use of trailer link-level is now negotiated between \*(4B hosts +using ARP, +and it is thus no longer necessary to disable the use of trailers +with +.Xr ifconfig . +.PP +The OSI equivalent to ARP is ESIS (End System to Intermediate System Routing +Protocol); running this protocol is mandatory, however one can manually add +translations for machines that do not participate by use of the +.Xr route (8) +command. +Additional information is provided in the manual page describing +.Xr ESIS (4). +.PP +To use the pseudo terminals just configured, device +entries must be created in the +.Pn /dev +directory. To create 32 +pseudo terminals (plenty, unless you have a heavy network load) +execute the following commands. +.DS +\fB#\fP \fIcd /dev\fP +\fB#\fP \fIMAKEDEV pty0 pty1\fP +.DE +More pseudo terminals may be made by specifying +.Pn pty2 , +.Pn pty3 , +etc. The kernel normally includes support for 32 pseudo terminals +unless the configuration file specifies a different number. +Each pseudo terminal really consists of two files in +.Pn /dev : +a master and a slave. The master pseudo terminal file is named +.Pn /dev/ptyp? , +while the slave side is +.Pn /dev/ttyp? . +Pseudo terminals are also used by several programs not related to the network. +In addition to creating the pseudo terminals, +be sure to install them in the +.Pn /etc/ttys +file (with a `none' in the second column so no +.Xr getty +is started). +.Sh 2 "Local subnets" +.PP +In \*(4B the Internet support +includes the notion of ``subnets''. This is a mechanism +by which multiple local networks may appears as a single Internet +network to off-site hosts. Subnetworks are useful because +they allow a site to hide their local topology, requiring only a single +route in external gateways; +it also means that local network numbers may be locally administered. +The standard describing this change in Internet addressing is RFC-950. +.PP +To set up local subnets one must first decide how the available +address space (the Internet ``host part'' of the 32-bit address) +is to be partitioned. +Sites with a class A network +number have a 24-bit host address space with which to work, sites with a +class B network number have a 16-bit host address space, while sites with +a class C network number have an 8-bit host address space\**. +.FS +If you are unfamiliar with the Internet addressing structure, consult +``Address Mappings'', Internet RFC-796, J. Postel; available from +the Internet Network Information Center at SRI. +.FE +To define local subnets you must steal some bits +from the local host address space for use in extending the network +portion of the Internet address. This reinterpretation of Internet +addresses is done only for local networks; i.e. it is not visible +to hosts off-site. For example, if your site has a class B network +number, hosts on this network have an Internet address that contains +the network number, 16 bits, and the host number, another +16 bits. To define 254 local subnets, each +possessing at most 255 hosts, 8 bits may be taken from the local part. +(The use of subnets 0 and all-1's, 255 in this example, is discouraged +to avoid confusion about broadcast addresses.) +These new network +numbers are then constructed by concatenating the original 16-bit network +number with the extra 8 bits containing the local subnet number. +.PP +The existence of local subnets is communicated to the system at the time a +network interface is configured with the +.I netmask +option to the +.Xr ifconfig +program. A ``network mask'' is specified to define the +portion of the Internet address that is to be considered the network part +for that network. +This mask normally contains the bits corresponding to the standard +network part as well as the portion of the local part +that has been assigned to subnets. +If no mask is specified when the address is set, +it will be set according to the class of the network. +For example, at Berkeley (class B network 128.32) 8 bits +of the local part have been reserved for defining subnets; +consequently the +.Pn /etc/netstart +file contains lines of the form +.DS +.ft CW +/sbin/ifconfig le0 netmask 0xffffff00 128.32.1.7 +.DE +This specifies that for interface ``le0'', the upper 24 bits of +the Internet address should be used in calculating network numbers +(netmask 0xffffff00), and the interface's Internet address is +``128.32.1.7'' (host 7 on network 128.32.1). Hosts \fIm\fP on +sub-network \fIn\fP of this network would then have addresses of +the form ``128.32.\fIn\fP.\fIm\fP''; for example, host +99 on network 129 would have an address ``128.32.129.99''. +For hosts with multiple interfaces, the network mask should +be set for each interface, +although in practice only the mask of the first interface on each network +is really used. +.Sh 2 "Internet broadcast addresses" +.PP +The address defined as the broadcast address for Internet networks +according to RFC-919 is the address with a host part of all 1's. +The address used by 4.2BSD was the address with a host part of 0. +\*(4B uses the standard broadcast address (all 1's) by default, +but allows the broadcast address to be set (with +.Xr ifconfig ) +for each interface. +This allows networks consisting of both 4.2BSD, \*(Ps and \*(4B hosts +to coexist while the upgrade process proceeds. +In the presence of subnets, the broadcast address uses the subnet field +as for normal host addresses, with the remaining host part set to 1's +(or 0's, on a network that has not yet been converted). +\*(4B hosts recognize and accept packets +sent to the logical-network broadcast address as well as those sent +to the subnet broadcast address, and when using an all-1's broadcast, +also recognize and receive packets sent to host 0 as a broadcast. +.Sh 2 "Routing" +.PP +If your environment allows access to networks not directly +attached to your host you will need to set up routing information +to allow packets to be properly routed. Two schemes are +supported by the system. The first scheme +employs a routing table management daemon. +Optimally, you should use the routing daemon +.Xr gated +available from Cornell university. +We use it on our systems and it works well, +especially for multi-homed hosts using Serial Line IP (SLIP). +Unfortunately, we were not able to obtain permission to +include it on \*(4B. +.PP +If you do not wish to or cannot obtain +.Xr gated , +the distribution does include +.Xr routed (8) +to maintain the system routing tables. The routing daemon +uses a variant of the Xerox Routing Information Protocol +to maintain up to date routing tables in a cluster of local +area networks. By using the +.Pn /etc/gateways +file, the routing daemon can also be used to initialize static routes +to distant networks (see the next section for further discussion). +When the routing daemon is started up +(usually from +.Pn /etc/rc ) +it reads +.Pn /etc/gateways +if it exists and installs those routes defined there, +then broadcasts on each local network +to which the host is attached to find other instances of the routing +daemon. If any responses are received, the routing daemons +cooperate in maintaining a globally consistent view of routing +in the local environment. This view can be extended to include +remote sites also running the routing daemon by setting up suitable +entries in +.Pn /etc/gateways ; +consult +.Xr routed (8) +for a more thorough discussion. +.PP +The second approach is to define a default or wildcard +route to a smart +gateway and depend on the gateway to provide ICMP routing +redirect information to dynamically create a routing data +base. This is done by adding an entry of the form +.DS +.ft CW +/sbin/route add default \fIsmart-gateway\fP 1 +.DE +to +.Pn /etc/netstart ; +see +.Xr route (8) +for more information. The default route +will be used by the system as a ``last resort'' +in routing packets to their destination. Assuming the gateway +to which packets are directed is able to generate the proper +routing redirect messages, the system will then add routing +table entries based on the information supplied. This approach +has certain advantages over the routing daemon, but is +unsuitable in an environment where there are only bridges (i.e. +pseudo gateways that, for instance, do not generate routing +redirect messages). Further, if the +smart gateway goes down there is no alternative, save manual +alteration of the routing table entry, to maintaining service. +.PP +The system always listens, and processes, routing redirect +information, so it is possible to combine both of the above +facilities. For example, the routing table management process +might be used to maintain up to date information about routes +to geographically local networks, while employing the wildcard +routing techniques for ``distant'' networks. The +.Xr netstat (1) +program may be used to display routing table contents as well +as various routing oriented statistics. For example, +.DS +\fB#\fP \fInetstat \-r\fP +.DE +will display the contents of the routing tables, while +.DS +\fB#\fP \fInetstat \-r \-s\fP +.DE +will show the number of routing table entries dynamically +created as a result of routing redirect messages, etc. +.Sh 2 "Use of \*(4B machines as gateways" +.PP +Several changes have been made in \*(4B in the area of gateway support +(or packet forwarding, if one prefers). +A new configuration option, GATEWAY, is used when configuring +a machine to be used as a gateway. +This option increases the size of the routing hash tables in the kernel. +Unless configured with that option, +hosts with only a single non-loopback interface never attempt +to forward packets or to respond with ICMP error messages to misdirected +packets. +This change reduces the problems that may occur when different hosts +on a network disagree on the network number or broadcast address. +Another change is that \*(4B machines that forward packets back through +the same interface on which they arrived +will send ICMP redirects to the source host if it is on the same network. +This improves the interaction of \*(4B gateways with hosts that configure +their routes via default gateways and redirects. +The generation of redirects may be disabled with the configuration option +IPSENDREDIRECTS=0 or while the system is running by using the command: +.DS +.ft CW +sysctl -w net.inet.ip.redirect=0 +.DE +in environments where it may cause difficulties. +.Sh 2 "Network databases" +.PP +Several data files are used by the network library routines +and server programs. Most of these files are host independent +and updated only rarely. +.br +.ne 1i +.TS +lfC l l. +File Manual reference Use +_ +/etc/hosts \fIhosts\fP\|(5) local host names +/etc/networks \fInetworks\fP\|(5) network names +/etc/services \fIservices\fP\|(5) list of known services +/etc/protocols \fIprotocols\fP\|(5) protocol names +/etc/hosts.equiv \fIrshd\fP\|(8) list of ``trusted'' hosts +/etc/netstart \fIrc\fP\|(8) command script for initializing network +/etc/rc \fIrc\fP\|(8) command script for starting standard servers +/etc/rc.local \fIrc\fP\|(8) command script for starting local servers +/etc/ftpusers \fIftpd\fP\|(8) list of ``unwelcome'' ftp users +/etc/hosts.lpd \fIlpd\fP\|(8) list of hosts allowed to access printers +/etc/inetd.conf \fIinetd\fP\|(8) list of servers started by \fIinetd\fP +.TE +The files distributed are set up for Internet hosts. +Local networks and hosts should be added to describe the local +configuration; the Berkeley entries may serve as examples +(see also the section on +.Pn /etc/hosts ). +Network numbers will have to be chosen for each Ethernet. +For sites connected to the Internet, +the normal channels should be used for allocation of network +numbers (contact hostmaster@SRI-NIC.ARPA). +For other sites, +these could be chosen more or less arbitrarily, +but it is generally better to request official numbers +to avoid conversion if a connection to the Internet (or others on the Internet) +is ever established. +.Sh 3 "Network servers" +.PP +Most network servers are automatically started up at boot time +by the command file +.Pn /etc/rc +or by the Internet daemon (see below). +These include the following: +.TS +lfC l l. +Program Server Started by +_ +/usr/sbin/syslogd error logging server \f(CW/etc/rc\fP +/usr/sbin/named Internet name server \f(CW/etc/rc\fP +/sbin/routed routing table management daemon \f(CW/etc/rc\fP +/usr/sbin/rwhod system status daemon \f(CW/etc/rc\fP +/usr/sbin/timed time synchronization daemon \f(CW/etc/rc\fP +/usr/sbin/sendmail SMTP server \f(CW/etc/rc\fP +/usr/libexec/rshd shell server inetd +/usr/libexec/rexecd exec server inetd +/usr/libexec/rlogind login server inetd +/usr/libexec/telnetd TELNET server inetd +/usr/libexec/ftpd FTP server inetd +/usr/libexec/fingerd Finger server inetd +/usr/libexec/tftpd TFTP server inetd +.TE +Consult the manual pages and accompanying documentation (particularly +for named and sendmail) for details about their operation. +.PP +The use of +.Xr routed +and +.Xr rwhod +is controlled by shell +variables set in +.Pn /etc/netstart . +By default, +.Xr routed +is used, but +.Xr rwhod +is not; they are enabled by setting the variables \fIroutedflags\fP and +.Xr rwhod +to strings other than ``NO.'' +The value of \fIroutedflags\fP provides host-specific options to +.Xr routed . +For example, +.DS +.ft CW +routedflags=-q +rwhod=NO +.DE +would run +.Xr "routed -q" +and would not run +.Xr rwhod . +.PP +To have other network servers started as well, +commands of the following sort should be placed in the site-dependent file +.Pn /etc/rc.local . +.DS +.ft CW +if [ -f /usr/sbin/timed ]; then + /usr/sbin/timed & echo -n ' timed' >/dev/console +f\&i +.DE +.Sh 3 "Internet daemon" +.PP +In \*(4B most of the servers for user-visible services are started up by a +``super server'', the Internet daemon. The Internet +daemon, +.Pn /usr/sbin/inetd , +acts as a master server for +programs specified in its configuration file, +.Pn /etc/inetd.conf , +listening for service requests for these servers, and starting +up the appropriate program whenever a request is received. +The configuration file contains lines containing a service +name (as found in +.Pn /etc/services ), +the type of socket the +server expects (e.g. stream or dgram), the protocol to be +used with the socket (as found in +.Pn /etc/protocols ), +whether to wait for each server to complete before starting up another, +the user name by which the server should run, the server +program's name, and at most five arguments to pass to the +server program. +Some trivial services are implemented internally in +.Xr inetd , +and their servers are listed as ``internal.'' +For example, an entry for the file +transfer protocol server would appear as +.DS +.ft CW +ftp stream tcp nowait root /usr/libexec/ftpd ftpd +.DE +Consult +.Xr inetd (8) +for more detail on the format of the configuration file +and the operation of the Internet daemon. +.Sh 3 "The \f(CW/etc/hosts.equiv\fP file" +.PP +The remote login and shell servers use an +authentication scheme based on trusted hosts. The +.Pn hosts.equiv +file contains a list of hosts that are considered trusted +and, under a single administrative control. When a user +contacts a remote login or shell server requesting service, +the client process passes the user's name and the official +name of the host on which the client is located. In the simple +case, if the host's name is located in +.Pn hosts.equiv +and the user has an account on the server's machine, then service +is rendered (i.e. the user is allowed to log in, or the command +is executed). Users may expand this ``equivalence'' of +machines by installing a +.Pn \&.rhosts +file in their login directory. +The root login is handled specially, bypassing the +.Pn hosts.equiv +file, and using only the +.Pn /.rhosts +file. +.PP +Thus, to create a class of equivalent machines, the +.Pn hosts.equiv +file should contain the \fIofficial\fP names for those machines. +If you are running the name server, you may omit the domain part +of the host name for machines in your local domain. +For example, four machines on our local +network are considered trusted, so the +.Pn hosts.equiv +file is of the form: +.DS +.ft CW +vangogh.CS.Berkeley.EDU +picasso.CS.Berkeley.EDU +okeeffe.CS.Berkeley.EDU +.DE +.Sh 3 "The \f(CW/etc/ftpusers\fP file" +.PP +The FTP server included in the system provides support for an +anonymous FTP account. Because of the inherent security problems +with such a facility you should read this section carefully if +you consider providing such a service. +.PP +An anonymous account is enabled by creating a user +.Xr ftp . +When a client uses the anonymous account a +.Xr chroot (2) +system call is performed by the server to restrict the client +from moving outside that part of the filesystem where the +user ftp home directory is located. Because a +.Xr chroot +call is used, certain programs and files used by the server +process must be placed in the ftp home directory. +Further, one must be +sure that all directories and executable images are unwritable. +The following directory setup is recommended. The +use of the +.Xr awk +commands to copy the +.Pn /etc/passwd +and +.Pn /etc/group +files are \fBSTRONGLY\fP recommended. +.DS +\fB#\fP \fIcd ~ftp\fP +\fB#\fP \fIchmod 555 .; chown ftp .; chgrp ftp .\fP +\fB#\fP \fImkdir bin etc pub\fP +\fB#\fP \fIchown root bin etc\fP +\fB#\fP \fIchmod 555 bin etc\fP +\fB#\fP \fIchown ftp pub\fP +\fB#\fP \fIchmod 777 pub\fP +\fB#\fP \fIcd bin\fP +\fB#\fP \fIcp /bin/sh /bin/ls .\fP +\fB#\fP \fIchmod 111 sh ls\fP +\fB#\fP \fIcd ../etc\fP +\fB#\fP \fIawk -F: '{$2="*";print$1":"$2":"$3":"$4":"$5":"$6":"}' < /etc/passwd > passwd\fP +\fB#\fP \fIawk -F: '{$2="*";print$1":"$2":"}' < /etc/group > group\fP +\fB#\fP \fIchmod 444 passwd group\fP +.DE +When local users wish to place files in the anonymous +area, they must be placed in a subdirectory. In the +setup here, the directory +.Pn ~ftp/pub +is used. +.PP +Aside from the problems of directory modes and such, +the ftp server may provide a loophole for interlopers +if certain user accounts are allowed. +The file +.Pn /etc/ftpusers +is checked on each connection. +If the requested user name is located in the file, the +request for service is denied. This file normally has +the following names on our systems. +.DS +uucp +root +.DE +Accounts without passwords need not be listed in this file as the ftp +server will refuse service to these users. +Accounts with nonstandard shells (any not listed in +.Pn /etc/shells ) +will also be denied access via ftp. diff --git a/share/doc/smm/01.setup/6.t b/share/doc/smm/01.setup/6.t new file mode 100644 index 00000000000..d043474c1fd --- /dev/null +++ b/share/doc/smm/01.setup/6.t @@ -0,0 +1,663 @@ +.\" Copyright (c) 1980, 1986, 1988, 1993 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)6.t 8.1 (Berkeley) 7/27/93 +.\" +.ds LH "Installing/Operating \*(4B +.ds CF \*(Dy +.Sh 1 "System operation" +.PP +This section describes procedures used to operate a \*(4B UNIX system. +Procedures described here are used periodically, to reboot the system, +analyze error messages from devices, do disk backups, monitor +system performance, recompile system software and control local changes. +.Sh 2 "Bootstrap and shutdown procedures" +.PP +In a normal reboot, the system checks the disks and comes up multi-user +without intervention at the console. +Such a reboot +can be stopped (after it prints the date) with a ^C (interrupt). +This will leave the system in single-user mode, with only the console +terminal active. +(If the console has been marked ``insecure'' in +.Pn /etc/ttys +you must enter the root password to bring the machine to single-user mode.) +It is also possible to allow the filesystem checks to complete +and then to return to single-user mode by signaling +.Xr fsck (8) +with a QUIT signal (^\|\e). +.PP +To bring the system up to a multi-user configuration from the single-user +status, +all you have to do is hit ^D on the console. The system +will then execute +.Pn /etc/rc , +a multi-user restart script (and +.Pn /etc/rc.local ), +and come up on the terminals listed as +active in the file +.Pn /etc/ttys . +See +.Xr init (8) +and +.Xr ttys (5) for more details. +Note, however, that this does not cause a filesystem check to be done. +Unless the system was taken down cleanly, you should run +``fsck \-p'' or force a reboot with +.Xr reboot (8) +to have the disks checked. +.PP +To take the system down to a single user state you can use +.DS +\fB#\fP \fIkill 1\fP +.DE +or use the +.Xr shutdown (8) +command (which is much more polite, if there are other users logged in) +when you are running multi-user. +Either command will kill all processes and give you a shell on the console, +as if you had just booted. Filesystems remain mounted after the +system is taken single-user. If you wish to come up multi-user again, you +should do this by: +.DS +\fB#\fP \fIcd /\fP +\fB#\fP \fI/sbin/umount -a\fP +\fB#\fP \fI^D\fP +.DE +.PP +Each system shutdown, crash, processor halt and reboot +is recorded in the system log +with its cause. +.Sh 2 "Device errors and diagnostics" +.PP +When serious errors occur on peripherals or in the system, the system +prints a warning diagnostic on the console. +These messages are collected +by the system error logging process +.Xr syslogd (8) +and written into a system error log file +.Pn /var/log/messages . +Less serious errors are sent directly to +.Xr syslogd , +which may log them on the console. +The error priorities that are logged and the locations to which they are logged +are controlled by +.Pn /etc/syslog.conf . +See +.Xr syslogd (8) +for further details. +.PP +Error messages printed by the devices in the system are described with the +drivers for the devices in section 4 of the programmer's manual. +If errors occur suggesting hardware problems, you should contact +your hardware support group or field service. It is a good idea to +examine the error log file regularly +(e.g. with the command \fItail \-r /var/log/messages\fP). +.Sh 2 "Filesystem checks, backups, and disaster recovery" +.PP +Periodically (say every week or so in the absence of any problems) +and always (usually automatically) after a crash, +all the filesystems should be checked for consistency +by +.Xr fsck (1). +The procedures of +.Xr reboot (8) +should be used to get the system to a state where a filesystem +check can be done manually or automatically. +.PP +Dumping of the filesystems should be done regularly, +since once the system is going it is easy to +become complacent. +Complete and incremental dumps are easily done with +.Xr dump (8). +You should arrange to do a towers-of-hanoi dump sequence; we tune +ours so that almost all files are dumped on two tapes and kept for at +least a week in most every case. We take full dumps every month (and keep +these indefinitely). +Operators can execute ``dump w'' at login that will tell them what needs +to be dumped +(based on the +.Pn /etc/fstab +information). +Be sure to create a group +.B operator +in the file +.Pn /etc/group +so that dump can notify logged-in operators when it needs help. +.PP +More precisely, we have three sets of dump tapes: 10 daily tapes, +5 weekly sets of 2 tapes, and fresh sets of three tapes monthly. +We do daily dumps circularly on the daily tapes with sequence +`3 2 5 4 7 6 9 8 9 9 9 ...'. +Each weekly is a level 1 and the daily dump sequence level +restarts after each weekly dump. +Full dumps are level 0 and the daily sequence restarts after each full dump +also. +.PP +Thus a typical dump sequence would be: +.br +.ne 6 +.TS +center; +c c c c c +n n n l l. +tape name level number date opr size +_ +FULL 0 Nov 24, 1992 operator 137K +D1 3 Nov 28, 1992 operator 29K +D2 2 Nov 29, 1992 operator 34K +D3 5 Nov 30, 1992 operator 19K +D4 4 Dec 1, 1992 operator 22K +W1 1 Dec 2, 1992 operator 40K +D5 3 Dec 4, 1992 operator 15K +D6 2 Dec 5, 1992 operator 25K +D7 5 Dec 6, 1992 operator 15K +D8 4 Dec 7, 1992 operator 19K +W2 1 Dec 9, 1992 operator 118K +D9 3 Dec 11, 1992 operator 15K +D10 2 Dec 12, 1992 operator 26K +D1 5 Dec 15, 1992 operator 14K +W3 1 Dec 17, 1992 operator 71K +D2 3 Dec 18, 1992 operator 13K +FULL 0 Dec 22, 1992 operator 135K +.TE +We do weekly dumps often enough that daily dumps always fit on one tape. +.PP +Dumping of files by name is best done by +.Xr tar (1) +but the amount of data that can be moved in this way is limited +to a single tape. +Finally if there are enough drives entire +disks can be copied with +.Xr dd (1) +using the raw special files and an appropriate +blocking factor; the number of sectors per track is usually +a good value to use, consult +.Pn /etc/disktab . +.PP +It is desirable that full dumps of the root filesystem be +made regularly. +This is especially true when only one disk is available. +Then, if the +root filesystem is damaged by a hardware or software failure, you +can rebuild a workable disk doing a restore in the +same way that the initial root filesystem was created. +.PP +Exhaustion of user-file space is certain to occur +now and then; disk quotas may be imposed, or if you +prefer a less fascist approach, try using the programs +.Xr du (1), +.Xr df (1), +and +.Xr quot (8), +combined with threatening +messages of the day, and personal letters. +.Sh 2 "Moving filesystem data" +.PP +If you have the resources, +the best way to move a filesystem +is to dump it to a spare disk partition, or magtape, using +.Xr dump (8), +use +.Xr newfs (8) +to create the new filesystem, +and restore the filesystem using +.Xr restore (8). +Filesystems may also be moved by piping the output of +.Xr dump +to +.Xr restore . +The +.Xr restore +program uses an ``in-place'' algorithm that +allows filesystem dumps to be restored without concern for the +original size of the filesystem. Further, portions of a +filesystem may be selectively restored using a method similar +to the tape archive program. +.PP +If you have to merge a filesystem into another, existing one, +the best bet is to use +.Xr tar (1). +If you must shrink a filesystem, the best bet is to dump +the original and restore it onto the new filesystem. +If you +are playing with the root filesystem and only have one drive, +the procedure is more complicated. +If the only drive is a Winchester disk, this procedure may not be used +without overwriting the existing root or another partition. +What you do is the following: +.IP 1. +GET A SECOND PACK, OR USE ANOTHER DISK DRIVE!!!! +.IP 2. +Dump the root filesystem to tape using +.Xr dump (8). +.IP 3. +Bring the system down. +.IP 4. +Mount the new pack in the correct disk drive, if +using removable media. +.IP 5. +Load the distribution tape and install the new +root filesystem as you did when first installing the system. +Boot normally +using the newly created disk filesystem. +.PP +Note that if you change the disk partition tables or add new disk +drivers they should also be added to the standalone system in +.Pn /sys/<architecture>/stand , +and the default disk partition tables in +.Pn /etc/disktab +should be modified. +.Sh 2 "Monitoring system performance" +.PP +The +.Xr systat +program provided with the system is designed to be an aid to monitoring +systemwide activity. The default ``pigs'' mode shows a dynamic ``ps''. +By running in the ``vmstat'' mode +when the system is active you can judge the system activity in several +dimensions: job distribution, virtual memory load, paging and swapping +activity, device interrupts, and disk and cpu utilization. +Ideally, there should be few blocked (b) jobs, +there should be little paging or swapping activity, there should +be available bandwidth on the disk devices (most single arms peak +out at 20-30 tps in practice), and the user cpu utilization (us) should +be high (above 50%). +.PP +If the system is busy, then the count of active jobs may be large, +and several of these jobs may often be blocked (b). If the virtual +memory is active, then the paging demon will be running (sr will +be non-zero). It is healthy for the paging demon to free pages when +the virtual memory gets active; it is triggered by the amount of free +memory dropping below a threshold and increases its pace as free memory +goes to zero. +.PP +If you run in the ``vmstat'' mode +when the system is busy, you can find +imbalances by noting abnormal job distributions. If many +processes are blocked (b), then the disk subsystem +is overloaded or imbalanced. If you have several non-dma +devices or open teletype lines that are ``ringing'', or user programs +that are doing high-speed non-buffered input/output, then the system +time may go high (60-70% or higher). +It is often possible to pin down the cause of high system time by +looking to see if there is excessive context switching (cs), interrupt +activity (in) and per-device interrupt counts, +or system call activity (sy). Cumulatively on one of +our large machines we average about 60-200 context switches and interrupts +per second and about 50-500 system calls per second. +.PP +If the system is heavily loaded, or if you have little memory +for your load (2M is little in most any case), then the system +may be forced to swap. This is likely to be accompanied by a noticeable +reduction in system performance and pregnant pauses when interactive +jobs such as editors swap out. +If you expect to be in a memory-poor environment +for an extended period you might consider administratively +limiting system load. +.Sh 2 "Recompiling and reinstalling system software" +.PP +It is easy to regenerate either the entire system or a single utility, +and it is a good idea to try rebuilding pieces of the system to build +confidence in the procedures. +.LP +In general, there are six well-known targets supported by +all the makefiles on the system: +.IP all 9 +This entry is the default target, the same as if no target is specified. +This target builds the kernel, binary or library, as well as its +associated manual pages. +This target \fBdoes not\fP build the dependency files. +Some of the utilities require that a \fImake depend\fP be done before +a \fImake all\fP can succeed. +.IP depend +Build the include file dependency file, ``.depend'', which is +read by +.Xr make . +See +.Xr mkdep (1) +for further details. +.IP install +Install the kernel, binary or library, as well as its associated +manual pages. +See +.Xr install (1) +for further details. +.IP clean +Remove the kernel, binary or library, as well as any object files +created when building it. +.IP cleandir +The same as clean, except that the dependency files and formatted +manual pages are removed as well. +.IP obj +Build a shadow directory structure in the area referenced by +.Pn /usr/obj +and create a symbolic link in the current source directory to +referenced it, named ``obj''. +Once this shadow structure has been created, all the files created by +.Xr make +will live in the shadow structure, and +.Pn /usr/src +may be mounted read-only by multiple machines. +Doing a \fImake obj\fP in +.Pn /usr/src +will build the shadow directory structure for everything on the +system except for the contributed, old, and kernel software. +.PP +The system consists of three major parts: +the kernel itself, found in +.Pn /usr/src/sys , +the libraries , found in +.Pn /usr/src/lib , +and the user programs (the rest of +.Pn /usr/src ). +.PP +Deprecated software, found in +.Pn /usr/src/old , +often has old style makefiles; +some of it does not compile under \*(4B at all. +.PP +Contributed software, found in +.Pn /usr/src/contrib , +usually does not support the ``cleandir'', ``depend'', or ``obj'' targets. +.PP +The kernel does not support the ``obj'' shadow structure. +All kernels are compiled in subdirectories of +.Pn /usr/src/sys/compile +which is usually abbreviated as +.Pn /sys/compile . +If you want to mount your source tree read-only, +.Pn /usr/src/sys/compile +will have to be on a separate filesystem from +.Pn /usr/src . +Separation from +.Pn /usr/src +can be done by making +.Pn /usr/src/sys/compile +a symbolic link that references +.Pn /usr/obj/sys/compile . +If it is a symbolic link, the \fIS\fP variable in the kernel +Makefile must be changed from +.Pn \&../.. +to the absolute pathname needed to locate the kernel sources, usually +.Pn /usr/src/sys . +The symbolic link created by +.Xr config (8) +for +.Pn machine +must also be manually changed to an absolute pathname. +Finally, the +.Pn /usr/src/sys/libkern/obj +directory must be located in +.Pn /usr/obj/sys/libkern . +.PP +Each of the standard utilities and libraries may be built and +installed by changing directories into the correct location and +doing: +.DS +\fB#\fP \fImake\fP +\fB#\fP \fImake install\fP +.DE +Note, if system include files have changed between compiles, +.Xr make +will not do the correct dependency checks if the dependency +files have not been built using the ``depend'' target. +.PP +The entire library and utility suite for the system may be recompiled +from scratch by changing directory to +.Pn /usr/src +and doing: +.DS +\fB#\fP \fImake build\fP +.DE +This target installs the system include files, cleans the source +tree, builds and installs the libraries, and builds and installs +the system utilities. +.PP +To recompile a specific program, first determine where the binary +resides with the +.Xr whereis (1) +command, then change to the corresponding source directory and build +it with the Makefile in the directory. +For instance, to recompile ``passwd'', +all one has to do is: +.DS +\fB#\fP \fIwhereis passwd\fP +\fB/usr/bin/passwd\fP +\fB#\fP \fIcd /usr/src/usr.bin/passwd\fP +\fB#\fP \fImake\fP +\fB#\fP \fImake install\fP +.DE +this will compile and install the +.Xr passwd +utility. +.PP +If you wish to recompile and install all programs into a particular +target area you can override the default path prefix by doing: +.DS +\fB#\fP \fImake\fP +\fB#\fP \fImake DESTDIR=\fPpathname \fIinstall\fP +.DE +Similarly, the mode, owner, group, and other characteristics of +the installed object can be modified by changing other default +make variables. +See +.Xr make (1), +.Pn /usr/src/share/mk/bsd.README , +and the ``.mk'' scripts in the +.Pn /usr/share/mk +directory for more information. +.PP +If you modify the C library or system include files, to change a +system call for example, and want to rebuild and install everything, +you have to be a little careful. +You must ensure that the include files are installed before anything +is compiled, and that the libraries are installed before the remainder +of the source, otherwise the loaded images will not contain the new +routine from the library. +If include files have been modified, the following commands should +be done first: +.DS +\fB#\fP \fIcd /usr/src/include\fP +\fB#\fP \fImake install\fP +.DE +Then, if, for example, C library files have been modified, the +following commands should be executed: +.DS +\fB#\fP \fIcd /usr/src/lib/libc\fP +\fB#\fP \fImake depend\fP +\fB#\fP \fImake\fP +\fB#\fP \fImake install\fP +\fB#\fP \fIcd /usr/src\fP +\fB#\fP \fImake depend\fP +\fB#\fP \fImake\fP +\fB#\fP \fImake install\fP +.DE +Alternatively, the \fImake build\fP command described above will +accomplish the same tasks. +This takes several hours on a reasonably configured machine. +.Sh 2 "Making local modifications" +.PP +The source for locally written commands is normally stored in +.Pn /usr/src/local , +and their binaries are kept in +.Pn /usr/local/bin . +This isolation of local binaries allows +.Pn /usr/bin , +and +.Pn /bin +to correspond to the distribution tape (and to the manuals that +people can buy). +People using local commands should be made aware that they are not +in the base manual. +Manual pages for local commands should be installed in +.Pn /usr/local/man/cat[1-8]. +The +.Xr man (1) +command automatically finds manual pages placed in +/usr/local/man/cat[1-8] to encourage this practice (see +.Xr man.conf (5)). +.Sh 2 "Accounting" +.PP +UNIX optionally records two kinds of accounting information: +connect time accounting and process resource accounting. The connect +time accounting information is stored in the file +.Pn /var/log/wtmp , +which is summarized by the program +.Xr ac (8). +The process time accounting information is stored in the file +.Pn /var/account/acct +after it is enabled by +.Xr accton (8), +and is analyzed and summarized by the program +.Xr sa (8). +.PP +If you need to recharge for computing time, you can develop +procedures based on the information provided by these commands. +A convenient way to do this is to give commands to the clock daemon +.Pn /usr/sbin/cron +to be executed every day at a specified time. +This is done by adding lines to +.Pn /etc/crontab.local ; +see +.Xr cron (8) +for details. +.Sh 2 "Resource control" +.PP +Resource control in the current version of UNIX is more +elaborate than in most UNIX systems. The disk quota +facilities developed at the University of Melbourne have +been incorporated in the system and allow control over the +number of files and amount of disk space each user and/or group may use +on each filesystem. In addition, the resources consumed +by any single process can be limited by the mechanisms of +.Xr setrlimit (2). +As distributed, the latter mechanism +is voluntary, though sites may choose to modify the login +mechanism to impose limits not covered with disk quotas. +.PP +To use the disk quota facilities, the system must be +configured with ``options QUOTA''. Filesystems may then +be placed under the quota mechanism by creating a null file +.Pn quota.user +and/or +.Pn quota.group +at the root of the filesystem, running +.Xr quotacheck (8), +and modifying +.Pn /etc/fstab +to show that the filesystem is to run +with disk quotas (options userquota and/or groupquota). +The +.Xr quotaon (8) +program may then be run to enable quotas. +.PP +Individual quotas are applied by using the quota editor +.Xr edquota (8). +Users may view their quotas (but not those of other users) with the +.Xr quota (1) +program. The +.Xr repquota (8) +program may be used to summarize the quotas and current +space usage on a particular filesystem or filesystems. +.PP +Quotas are enforced with \fIsoft\fP and \fIhard\fP limits. +When a user and/or group first reaches a soft limit on a resource, a +message is generated on their terminal. If the user and/or group fails to +lower the resource usage below the soft limit +for longer than the time limit established for that filesystem +(default seven days) the system then treats the soft limit as a +\fIhard\fP limit and disallows any allocations until enough space is +reclaimed to bring the user and/or group back below the soft limit. +Hard limits are enforced strictly resulting in errors when a user +and/or group tries to create or write a file. Each time a hard limit is +exceeded the system will generate a message on the user's terminal. +.PP +Consult the auxiliary document, ``Disc Quotas in a UNIX Environment'' (SMM:4) +and the appropriate manual entries for more information. +.Sh 2 "Network troubleshooting" +.PP +If you have anything more than a trivial network configuration, +from time to time you are bound to run into problems. Before +blaming the software, first check your network connections. On +networks such as the Ethernet a +loose cable tap or misplaced power cable can result in severely +deteriorated service. The +.Xr netstat (1) +program may be of aid in tracking down hardware malfunctions. +In particular, look at the \fB\-i\fP and \fB\-s\fP options in the manual page. +.PP +Should you believe a communication protocol problem exists, +consult the protocol specifications and attempt to isolate the +problem in a packet trace. The SO_DEBUG option may be supplied +before establishing a connection on a socket, in which case the +system will trace all traffic and internal actions (such as timers +expiring) in a circular trace buffer. +This buffer may then be printed out with the +.Xr trpt (8) +program. +Most of the servers distributed with the system +accept a \fB\-d\fP option forcing +all sockets to be created with debugging turned on. +Consult the appropriate manual pages for more information. +.Sh 2 "Files that need periodic attention" +.PP +We conclude the discussion of system operations by listing +the files that require periodic attention or are system specific: +.TS +center; +lfC l. +/etc/fstab how disk partitions are used +/etc/disktab default disk partition sizes/labels +/etc/printcap printer database +/etc/gettytab terminal type definitions +/etc/remote names and phone numbers of remote machines for \fItip\fP(1) +/etc/group group memberships +/etc/motd message of the day +/etc/master.passwd password file; each account has a line +/etc/rc.local local system restart script; runs reboot; starts daemons +/etc/inetd.conf local internet servers +/etc/hosts local host name database +/etc/networks network name database +/etc/services network services database +/etc/hosts.equiv hosts under same administrative control +/etc/syslog.conf error log configuration for \fIsyslogd\fP\|(8) +/etc/ttys enables/disables ports +/etc/crontab commands that are run periodically +/etc/crontab.local local commands that are run periodically +/etc/aliases mail forwarding and distribution groups +/var/account/acct raw process account data +/var/log/messages system error log +/var/log/wtmp login session accounting +.TE +.pn 2 +.bp +.PX diff --git a/share/doc/smm/01.setup/Makefile b/share/doc/smm/01.setup/Makefile new file mode 100644 index 00000000000..ce68a5c0f54 --- /dev/null +++ b/share/doc/smm/01.setup/Makefile @@ -0,0 +1,15 @@ +# @(#)Makefile 8.1 (Berkeley) 7/27/93 + +DIR= smm/01.setup +SRCS= 0.t 1.t 2.t 3.t 4.t 5.t 6.t +FILES= ${SRCS} +MACROS= -ms + +paper.ps: ${SRCS} + ${TBL} ${SRCS} | ${ROFF} > ${.TARGET} + +install: ${SRCS} + install -c -o ${BINOWN} -g ${BINGRP} -m 444 \ + Makefile ${FILES} ${EXTRA} ${DESTDIR}${BINDIR}/${DIR} + +.include <bsd.doc.mk> diff --git a/share/doc/smm/01.setup/spell.ok b/share/doc/smm/01.setup/spell.ok new file mode 100644 index 00000000000..c4f58a2a14d --- /dev/null +++ b/share/doc/smm/01.setup/spell.ok @@ -0,0 +1,618 @@ +A1096A +AA +ACU +AMD +Automounter +BA +BLOCKSIZE +BSD +Bb +Bostic +Bourne +Bs +Bz +CCI +CCITT +CLNP +CLTP +COMPAT +CPU's +CS80 +CSRG +CW +Catseye +Cyl +DAT +DECstation +DESTDIR +DISK's +DISKTYPE +DMA +DN11 +DV +DaVinci +Dk +Dn +Dy +EBCDIC +EEPROM's +EINTR +EISA +EOT +ERESTART +ESIS +Emulex +Exabyte +FDDI +FIPS +FPU +FTAM +Filesystem +Filesystems +GCC +GENERIC.hp300 +GX +Gatorbox +HDLC +HIL +HP +HP's +HP300 +HP300s +HP433 +HP9000 +HPBSD +HPUXCOMPAT +Hibler +IB +ICMP +IDP +IDs +IFLNK +IP +IPC +IPSENDREDIRECTS +IPX +ISA +ISO +ISVTX +Intel +Jul +Karels +Kerberos +L.aliases +L.cmds +L.sys +LAN +LAPB +LFS +LH +LK201 +LOGFILE +Leffler +Luna +MAKEDEV.local +MB +MC68040 +MFS +MIB +MIPS +MISC +MMU +MT02 +Macklem +Makefile +Makefiles +Maxtor +McKusick +NFS +NIC.ARPA +NPSECT +NTP +OHPUX +OS +OSI +OSes +Omron +PCATCH +PDT +PMAD +PMAG +PMAX +PMAZ +POSIX +POSIX.1 +POSIX.2 +PSD:5 +PVRX +Pathnames +Pendry +Postel +README +RFC +RH +RISC +ROM +RS232 +RZ23 +RZ55 +RZ57 +SCSI +SEQF +SIOCGIFCONF +SLC +SMM:1 +SMM:10 +SMM:13 +SMM:14 +SMM:2 +SMM:3 +SMM:4 +SMM:6 +SMM:7 +SMM:8 +SMM:9 +SMTP +SPARC +SPARCstation +SPP +SRC +SUNOS +Sbus +Solaris +Standalone +Std1003.1 +Std1003.2 +SunOS +TAI +TAPE's +TBOOT +TCP +TIOCSCTTY +TK50 +TM +TP4 +TURBOchannel +TVRX +TZ +Tcpmux +Topcat +Tue +UCB +UDP +UFS +ULTRIX +ULTRIXCOMPAT +UNIX''SMM:1 +USERFILE +USL +UTC +UUAIDS +UX +Ux +VAX +VFS +VME +X11R5 +XX +Xinu +a,c,u,p +a.out +adaptor +adaptors +addrlen +adiron +adm +aka +aliases.db +amd +autochanger +autoconf +autoconfiguration +autoconfigures +bdes +bootblock +bootimage +bootp +bootpath +bootrom +bootsd +bootstrapped +bs +bsd +bsd.README +btree +bwtwo +c.f +callback +cbosg +cbosgd +centronics +cfb0 +cgsix +cgthree +changelist +chflags +chico +chpass +cksum +cleandir +cleanerd +clnp +cltp +conf +conformant +contrib +cpio +crontab +crontab.local +cs +csh.cshrc +csh.login +csh.logout +cshrc +ct +cul0 +db +dbopen +dc7085 +deadfs +dev +dgram +dialcode +dialcodes +dict +disk3 +diskful +disklabel +disklabels +disktab +dm +dm.conf +dm.config +dma +doc +endian +es +esis +ext +fd +fdesc +fifofs +files.HOST +fileservers +filesystem +filesystems +foo +frags +friend.host.inet.number +fsf +fstab +fstype +ftpusers +ftpwelcome +fts +funopen +gcc +genvmunix +getcap +gettytab +gid +gid's +gids +groff +groupquota +hangup +hanoi +heapsort +hexdump +hier +host.inet.number +hostmaster +hosts.equiv +hosts.lpd +hp +hp300 +hpib0 +inetd.conf +inline +inode +int +intr +iob +iso +kbyte +kbytes +kdb +kdump +kerberos +kerberosIV +kernfs +kmem +krb.conf +ksh +ktrace +kvm +labelling +lastlog +ld.so.cache +le +le0 +lfs +lib +libc +libdata +libedit +libexec +libkern +libkvm +libutil +localhost +lofs +logname +loopback +lq +lun +luna68k +magtape +mail.local +mail.rc +maillog +maillog.0 +maillog.1 +maillog.2 +maillog.3 +maillog.4 +maillog.5 +maillog.6 +maillog.7 +makefiles +man.conf +man0 +manl +master.passwd +maxine +mckusick +mdec +mediainit +mem +mergesort +mfb0 +mfs +misc +miscfs +mk +mkdb +mkdep +mkfifo +mmap +mnt +mono +motd +mountd +mqueue +msgbuf +mtree +my.domain +myfriend +myfriend.my.domain +myname +myname.my.domain +mysitename +namelen +net.inet.ip.redirect +netgroup +netinet +netiso +netmask +netns +netstart +netx25 +newdev +newlfs +news3400 +newvmunix +nfs +nfsd +nfsiod +nfsstat +nfssvc +nodump +nowait +npsect +nr +nrmt0 +nrst0 +nsect +nullfs +nvi +obj +ogin +ok +okeeffe.CS.Berkeley.EDU +olddev +oldroot +opr +osockaddr +out0123456789 +out2010123456 +pageable +pathname +pathnames +pcc +picasso.CS.Berkeley.EDU +pid +pm0 +pmake +pmap +pmax +posix +printcap +pt0 +pty +pty.c +pty0 +pty1 +pty2 +pty3 +ptyp +pwd.db +quota.group +quota.user +quotas.user +radixsort +rc.local +rct +rd +rd0 +rdsk +recno +rew +rf +rhosts +rmt0 +rmt12 +ro +roff +root.dump +root.image +routedflags +rq +rrd0d +rrz?a +rrz?c +rsd0d +rsd3a +rsf +rst0 +rw +rxx0 +rz +rz0 +rz1c +rz?a +sbin +sc14 +sc7 +scsi0 +scsiformat +sd +sd0 +sd2b +sd3 +sd3a +sd660 +secretmail +securelevel +securettys +sendmail.cf +setsid +setup.tblms +shar +shareable +sizeof +skel +sockaddr +sparc +specfs +spwd.db +sr +src +srvtab +st +st.st +standalone +std +std.9600 +stderr +stdin +stdout +subr +sunos +sw +sy +sysctl +sysctl.c +syslog.0 +syslog.1 +syslog.2 +syslog.3 +syslog.4 +syslog.5 +syslog.6 +syslog.7 +syslog.conf +tahoe +tapehost +tcp +termios +tmac +tmp +toor +tps +tput +tsleep +tty.c +tty00 +ttya +ttyb +ttyd +ttyp +ttyp0 +ttytype +types.h +tz +tz6 +ucb +ufs +uid +uid's +uids +uipc +umap +umapfs +un.h +unaddr.sun +uname +unistd.h +userid +username +userquota +usr.bin +usr.sbin +utmp +uucp.daemon +uucppublic +uw +vangogh.CS.Berkeley.EDU +var +vax +ventel +vfs +vm +vmunix +vmunix.net +vmunix.tape +vnode +vnodes +vol +vt100 +wildcard +wr +wsrc +wtmp +xargs +xbpf +xcfb0 +xf +xp +xpbf +xpf +xsf +xx +xx0 +xxx +your.host.inet.number +yymmddhhmm +zA +zoneinfo diff --git a/share/doc/smm/04.quotas/Makefile b/share/doc/smm/04.quotas/Makefile new file mode 100644 index 00000000000..f55290df9f0 --- /dev/null +++ b/share/doc/smm/04.quotas/Makefile @@ -0,0 +1,7 @@ +# @(#)Makefile 8.1 (Berkeley) 6/8/93 + +DIR= smm/04.quotas +SRCS= quotas.ms +MACROS= -ms + +.include <bsd.doc.mk> diff --git a/share/doc/smm/04.quotas/quotas.ms b/share/doc/smm/04.quotas/quotas.ms new file mode 100644 index 00000000000..36918302518 --- /dev/null +++ b/share/doc/smm/04.quotas/quotas.ms @@ -0,0 +1,318 @@ +.\" Copyright (c) 1983, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)quotas.ms 8.1 (Berkeley) 6/8/93 +.\" +.EH 'SMM:4-%''Disc Quotas in a \s-2UNIX\s+2 Environment' +.OH 'Disc Quotas in a \s-2UNIX\s+2 Environment''SMM:4-%' +.ND 5th July, 1983 +.TL +Disc Quotas in a \s-2UNIX\s+2\s-3\u*\d\s0 Environment +.FS +* UNIX is a trademark of Bell Laboratories. +.FE +.AU +Robert Elz +.AI +Department of Computer Science +University of Melbourne, +Parkville, +Victoria, +Australia. +.AB +.PP +In most computing environments, disc space is not +infinite. +The disc quota system provides a mechanism +to control usage of disc space, on an +individual basis. +.PP +Quotas may be set for each individual user, on any, or +all filesystems. +.PP +The quota system will warn users when they +exceed their allotted limit, but allow some +extra space for current work. +Repeatedly remaining over quota at logout, +will cause a fatal over quota condition eventually. +.PP +The quota system is an optional part of +\s-2VMUNIX\s0 that may be included when the +system is configured. +.AE +.NH 1 +Users' view of disc quotas +.PP +To most users, disc quotas will either be of no concern, +or a fact of life that cannot be avoided. +The +\fIquota\fP\|(1) +command will provide information on any disc quotas +that may have been imposed upon a user. +.PP +There are two individual possible quotas that may be +imposed, usually if one is, both will be. +A limit can be set on the amount of space a user +can occupy, and there may be a limit on the number +of files (inodes) he can own. +.PP +.I Quota +provides information on the quotas that have +been set by the system administrators, in each +of these areas, and current usage. +.PP +There are four numbers for each limit, the current +usage, soft limit (quota), hard limit, and number +of remaining login warnings. +The soft limit is the number of 1K blocks (or files) +that the user is expected to remain below. +Each time the user's usage goes past this limit, +he will be warned. +The hard limit cannot be exceeded. +If a user's usage reaches this number, further +requests for space (or attempts to create a file) +will fail with an EDQUOT error, and the first time +this occurs, a message will be written to the user's +terminal. +Only one message will be output, until space occupied +is reduced below the limit, and reaches it again, +in order to avoid continual noise from those +programs that ignore write errors. +.PP +Whenever a user logs in with a usage greater than +his soft limit, he will be warned, and his login +warning count decremented. +When he logs in under quota, the counter is reset +to its maximum value (which is a system configuration +parameter, that is typically 3). +If the warning count should ever reach zero (caused +by three successive logins over quota), the +particular limit that has been exceeded will be treated +as if the hard limit has been reached, and no +more resources will be allocated to the user. +The \fBonly\fP way to reset this condition is +to reduce usage below quota, then log in again. +.NH 2 +Surviving when quota limit is reached +.PP +In most cases, the only way to recover from over +quota conditions, is to abort whatever activity was in progress +on the filesystem that has reached its limit, remove +sufficient files to bring the limit back below quota, +and retry the failed program. +.PP +However, if you are in the editor and a write fails +because of an over quota situation, that is not +a suitable course of action, as it is most likely +that initially attempting to write the file +will have truncated its previous contents, so should +the editor be aborted without correctly writing the +file not only will the recent changes be lost, but +possibly much, or even all, of the data +that previously existed. +.PP +There are several possible safe exits for a user +caught in this situation. +He may use the editor \fB!\fP shell escape command to +examine his file space, and remove surplus files. +Alternatively, using \fIcsh\fP, he may suspend the +editor, remove some files, then resume it. +A third possibility, is to write the file to +some other filesystem (perhaps to a file on /tmp) +where the user's quota has not been exceeded. +Then after rectifying the quota situation, +the file can be moved back to the filesystem +it belongs on. +.NH 1 +Administering the quota system +.PP +To set up and establish the disc quota system, +there are several steps necessary to be performed +by the system administrator. +.PP +First, the system must be configured to include +the disc quota sub-system. +This is done by including the line: +.DS +options QUOTA +.DE +in the system configuration file, then running +\fIconfig\fP\|(8) +followed by a system configuration\s-3\u*\d\s0. +.FS +* See also the document ``Building 4.2BSD UNIX Systems with Config''. +.FE +.PP +Second, a decision as to what filesystems need to have +quotas applied needs to be made. +Usually, only filesystems that house users' home directories, +or other user files, will need to be subjected to +the quota system, though it may also prove useful to +also include \fB/usr\fR. +If possible, \fB/tmp\fP should usually be free of quotas. +.PP +Having decided on which filesystems quotas need to be +set upon, the administrator should then allocate the +available space amongst the competing needs. How this +should be done is (way) beyond the scope of this document. +.PP +Then, the +\fIedquota\fP\|(8) +command can be used to actually set the limits desired upon +each user. Where a number of users are to be given the +same quotas (a common occurrence) the \fB\-p\fP switch +to edquota will allow this to be easily accomplished. +.PP +Once the quotas are set, ready to operate, the system +must be informed to enforce quotas on the desired filesystems. +This is accomplished with the +\fIquotaon\fP\|(8) +command. +.I Quotaon +will either enable quotas for a particular filesystem, or +with the \fB\-a\fP switch, will enable quotas for each +filesystem indicated in \fB/etc/fstab\fP as using quotas. +See +\fIfstab\fP\|(5) +for details. +Most sites using the quota system, will include the +line +.DS C +/etc/quotaon -a +.DE +in \fB/etc/rc.local\fP. +.PP +Should quotas need to be disabled, the +\fIquotaoff\fP(8) +command will do that, however, should the filesystem be +about to be dismounted, the +\fIumount\fP\|(8) +command will disable quotas immediately before the +filesystem is unmounted. +This is actually an effect of the +\fIumount\fP\|(2) +system call, and it guarantees that the quota system +will not be disabled if the umount would fail +because the filesystem is not idle. +.PP +Periodically (certainly after each reboot, and when quotas +are first enabled for a filesystem), the records retained +in the quota file should be checked for consistency with +the actual number of blocks and files allocated to +the user. +The +\fIquotacheck\fP\|(8) +command can be used to accomplish this. +It is not necessary to dismount the filesystem, or disable +the quota system to run this command, though on +active filesystems inaccurate results may occur. +This does no real harm in most cases, another run of +.I quotacheck +when the filesystem is idle will certainly correct any inaccuracy. +.PP +The super-user may use the +\fIquota\fP\|(1) +command to examine the usage and quotas of any user, and +the +\fIrepquota\fP\|(8) +command may be used to check the usages and limits for +all users on a filesystem. +.NH 1 +Some implementation detail. +.PP +Disc quota usage and information is stored in a file on the +filesystem that the quotas are to be applied to. +Conventionally, this file is \fBquotas\fR in the root of +the filesystem. +While this name is not known to the system in any way, +several of the user level utilities "know" it, and +choosing any other name would not be wise. +.PP +The data in the file comprises an array of structures, indexed +by uid, one structure for each user on the system (whether +the user has a quota on this filesystem or not). +If the uid space is sparse, then the file may have holes +in it, which would be lost by copying, so it is best to +avoid this. +.PP +The system is informed of the existence of the quota +file by the +\fIsetquota\fP\|(2) +system call. +It then reads the quota entries for each user currently +active, then for any files open owned by users who +are not currently active. +Each subsequent open of a file on the filesystem, will +be accompanied by a pairing with its quota information. +In most cases this information will be retained in core, +either because the user who owns the file is running some +process, because other files are open owned by the same +user, or because some file (perhaps this one) was recently +accessed. +In memory, the quota information is kept hashed by user-id +and filesystem, and retained in an LRU chain so recently +released data can be easily reclaimed. +Information about those users whose last process has +recently terminated is also retained in this way. +.PP +Each time a block is accessed or released, and each time an inode +is allocated or freed, the quota system gets told +about it, and in the case of allocations, gets the +opportunity to object. +.PP +Measurements have shown +that the quota code uses a very small percentage of the system +cpu time consumed in writing a new block to disc. +.NH 1 +Acknowledgments +.PP +The current disc quota system is loosely based upon a very +early scheme implemented at the University of New South +Wales, and Sydney University in the mid 70's. That system +implemented a single combined limit for both files and blocks +on all filesystems. +.PP +A later system was implemented at the University of Melbourne +by the author, but was not kept highly accurately, eg: +chown's (etc) did not affect quotas, nor did i/o to a file +other than one owned by the instigator. +.PP +The current system has been running (with only minor modifications) +since January 82 at Melbourne. +It is actually just a small part of a much broader resource +control scheme, which is capable of controlling almost +anything that is usually uncontrolled in unix. The rest +of this is, as yet, still in a state where it is far too +subject to change to be considered for distribution. +.PP +For the 4.2BSD release, much work has been done to clean +up and sanely incorporate the quota code by Sam Leffler and +Kirk McKusick at The University of California at Berkeley. diff --git a/share/doc/smm/05.fastfs/0.t b/share/doc/smm/05.fastfs/0.t new file mode 100644 index 00000000000..9cc759b87f4 --- /dev/null +++ b/share/doc/smm/05.fastfs/0.t @@ -0,0 +1,159 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)0.t 8.1 (Berkeley) 6/8/93 +.\" +.EQ +delim $$ +.EN +.if n .ND +.TL +A Fast File System for UNIX* +.EH 'SMM:05-%''A Fast File System for \s-2UNIX\s+2' +.OH 'A Fast File System for \s-2UNIX\s+2''SMM:05-%' +.AU +Marshall Kirk McKusick, William N. Joy\(dg, +Samuel J. Leffler\(dd, Robert S. Fabry +.AI +Computer Systems Research Group +Computer Science Division +Department of Electrical Engineering and Computer Science +University of California, Berkeley +Berkeley, CA 94720 +.AB +.FS +* UNIX is a trademark of Bell Laboratories. +.FE +.FS +\(dg William N. Joy is currently employed by: +Sun Microsystems, Inc, 2550 Garcia Avenue, Mountain View, CA 94043 +.FE +.FS +\(dd Samuel J. Leffler is currently employed by: +Lucasfilm Ltd., PO Box 2009, San Rafael, CA 94912 +.FE +.FS +This work was done under grants from +the National Science Foundation under grant MCS80-05144, +and the Defense Advance Research Projects Agency (DoD) under +ARPA Order No. 4031 monitored by Naval Electronic System Command under +Contract No. N00039-82-C-0235. +.FE +A reimplementation of the UNIX file system is described. +The reimplementation provides substantially higher throughput +rates by using more flexible allocation policies +that allow better locality of reference and can +be adapted to a wide range of peripheral and processor characteristics. +The new file system clusters data that is sequentially accessed +and provides two block sizes to allow fast access to large files +while not wasting large amounts of space for small files. +File access rates of up to ten times faster than the traditional +UNIX file system are experienced. +Long needed enhancements to the programmers' +interface are discussed. +These include a mechanism to place advisory locks on files, +extensions of the name space across file systems, +the ability to use long file names, +and provisions for administrative control of resource usage. +.sp +.LP +Revised February 18, 1984 +.AE +.LP +.sp 2 +CR Categories and Subject Descriptors: +D.4.3 +.B "[Operating Systems]": +File Systems Management \- +.I "file organization, directory structures, access methods"; +D.4.2 +.B "[Operating Systems]": +Storage Management \- +.I "allocation/deallocation strategies, secondary storage devices"; +D.4.8 +.B "[Operating Systems]": +Performance \- +.I "measurements, operational analysis"; +H.3.2 +.B "[Information Systems]": +Information Storage \- +.I "file organization" +.sp +Additional Keywords and Phrases: +UNIX, +file system organization, +file system performance, +file system design, +application program interface. +.sp +General Terms: +file system, +measurement, +performance. +.bp +.ce +.B "TABLE OF CONTENTS" +.LP +.sp 1 +.nf +.B "1. Introduction" +.LP +.sp .5v +.nf +.B "2. Old file system +.LP +.sp .5v +.nf +.B "3. New file system organization +3.1. Optimizing storage utilization +3.2. File system parameterization +3.3. Layout policies +.LP +.sp .5v +.nf +.B "4. Performance +.LP +.sp .5v +.nf +.B "5. File system functional enhancements +5.1. Long file names +5.2. File locking +5.3. Symbolic links +5.4. Rename +5.5. Quotas +.LP +.sp .5v +.nf +.B Acknowledgements +.LP +.sp .5v +.nf +.B References diff --git a/share/doc/smm/05.fastfs/1.t b/share/doc/smm/05.fastfs/1.t new file mode 100644 index 00000000000..dbdafc4e758 --- /dev/null +++ b/share/doc/smm/05.fastfs/1.t @@ -0,0 +1,112 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)1.t 8.1 (Berkeley) 6/8/93 +.\" +.ds RH Introduction +.NH +Introduction +.PP +This paper describes the changes from the original 512 byte UNIX file +system to the new one released with the 4.2 Berkeley Software Distribution. +It presents the motivations for the changes, +the methods used to effect these changes, +the rationale behind the design decisions, +and a description of the new implementation. +This discussion is followed by a summary of +the results that have been obtained, +directions for future work, +and the additions and changes +that have been made to the facilities that are +available to programmers. +.PP +The original UNIX system that runs on the PDP-11\(dg +.FS +\(dg DEC, PDP, VAX, MASSBUS, and UNIBUS are +trademarks of Digital Equipment Corporation. +.FE +has simple and elegant file system facilities. File system input/output +is buffered by the kernel; +there are no alignment constraints on +data transfers and all operations are made to appear synchronous. +All transfers to the disk are in 512 byte blocks, which can be placed +arbitrarily within the data area of the file system. Virtually +no constraints other than available disk space are placed on file growth +[Ritchie74], [Thompson78].* +.FS +* In practice, a file's size is constrained to be less than about +one gigabyte. +.FE +.PP +When used on the VAX-11 together with other UNIX enhancements, +the original 512 byte UNIX file +system is incapable of providing the data throughput rates +that many applications require. +For example, +applications +such as VLSI design and image processing +do a small amount of processing +on a large quantities of data and +need to have a high throughput from the file system. +High throughput rates are also needed by programs +that map files from the file system into large virtual +address spaces. +Paging data in and out of the file system is likely +to occur frequently [Ferrin82b]. +This requires a file system providing +higher bandwidth than the original 512 byte UNIX +one that provides only about +two percent of the maximum disk bandwidth or about +20 kilobytes per second per arm [White80], [Smith81b]. +.PP +Modifications have been made to the UNIX file system to improve +its performance. +Since the UNIX file system interface +is well understood and not inherently slow, +this development retained the abstraction and simply changed +the underlying implementation to increase its throughput. +Consequently, users of the system have not been faced with +massive software conversion. +.PP +Problems with file system performance have been dealt with +extensively in the literature; see [Smith81a] for a survey. +Previous work to improve the UNIX file system performance has been +done by [Ferrin82a]. +The UNIX operating system drew many of its ideas from Multics, +a large, high performance operating system [Feiertag71]. +Other work includes Hydra [Almes78], +Spice [Thompson80], +and a file system for a LISP environment [Symbolics81]. +A good introduction to the physical latencies of disks is +described in [Pechura83]. +.ds RH Old file system +.sp 2 +.ne 1i diff --git a/share/doc/smm/05.fastfs/2.t b/share/doc/smm/05.fastfs/2.t new file mode 100644 index 00000000000..33d9ade4c6b --- /dev/null +++ b/share/doc/smm/05.fastfs/2.t @@ -0,0 +1,143 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)2.t 8.1 (Berkeley) 6/8/93 +.\" +.ds RH Old file system +.NH +Old File System +.PP +In the file system developed at Bell Laboratories +(the ``traditional'' file system), +each disk drive is divided into one or more +partitions. Each of these disk partitions may contain +one file system. A file system never spans multiple +partitions.\(dg +.FS +\(dg By ``partition'' here we refer to the subdivision of +physical space on a disk drive. In the traditional file +system, as in the new file system, file systems are really +located in logical disk partitions that may overlap. This +overlapping is made available, for example, +to allow programs to copy entire disk drives containing multiple +file systems. +.FE +A file system is described by its super-block, +which contains the basic parameters of the file system. +These include the number of data blocks in the file system, +a count of the maximum number of files, +and a pointer to the \fIfree list\fP, a linked +list of all the free blocks in the file system. +.PP +Within the file system are files. +Certain files are distinguished as directories and contain +pointers to files that may themselves be directories. +Every file has a descriptor associated with it called an +.I "inode". +An inode contains information describing ownership of the file, +time stamps marking last modification and access times for the file, +and an array of indices that point to the data blocks for the file. +For the purposes of this section, we assume that the first 8 blocks +of the file are directly referenced by values stored +in an inode itself*. +.FS +* The actual number may vary from system to system, but is usually in +the range 5-13. +.FE +An inode may also contain references to indirect blocks +containing further data block indices. +In a file system with a 512 byte block size, a singly indirect +block contains 128 further block addresses, +a doubly indirect block contains 128 addresses of further singly indirect +blocks, +and a triply indirect block contains 128 addresses of further doubly indirect +blocks. +.PP +A 150 megabyte traditional UNIX file system consists +of 4 megabytes of inodes followed by 146 megabytes of data. +This organization segregates the inode information from the data; +thus accessing a file normally incurs a long seek from the +file's inode to its data. +Files in a single directory are not typically allocated +consecutive slots in the 4 megabytes of inodes, +causing many non-consecutive blocks of inodes +to be accessed when executing +operations on the inodes of several files in a directory. +.PP +The allocation of data blocks to files is also suboptimum. +The traditional +file system never transfers more than 512 bytes per disk transaction +and often finds that the next sequential data block is not on the same +cylinder, forcing seeks between 512 byte transfers. +The combination of the small block size, +limited read-ahead in the system, +and many seeks severely limits file system throughput. +.PP +The first work at Berkeley on the UNIX file system attempted to improve both +reliability and throughput. +The reliability was improved by staging modifications +to critical file system information so that they could +either be completed or repaired cleanly by a program +after a crash [Kowalski78]. +The file system performance was improved by a factor of more than two by +changing the basic block size from 512 to 1024 bytes. +The increase was because of two factors: +each disk transfer accessed twice as much data, +and most files could be described without need to access +indirect blocks since the direct blocks contained twice as much data. +The file system with these changes will henceforth be referred to as the +.I "old file system." +.PP +This performance improvement gave a strong indication that +increasing the block size was a good method for improving +throughput. +Although the throughput had doubled, +the old file system was still using only about +four percent of the disk bandwidth. +The main problem was that although the free list was initially +ordered for optimal access, +it quickly became scrambled as files were created and removed. +Eventually the free list became entirely random, +causing files to have their blocks allocated randomly over the disk. +This forced a seek before every block access. +Although old file systems provided transfer rates of up +to 175 kilobytes per second when they were first created, +this rate deteriorated to 30 kilobytes per second after a +few weeks of moderate use because of this +randomization of data block placement. +There was no way of restoring the performance of an old file system +except to dump, rebuild, and restore the file system. +Another possibility, as suggested by [Maruyama76], +would be to have a process that periodically +reorganized the data on the disk to restore locality. +.ds RH New file system +.sp 2 +.ne 1i diff --git a/share/doc/smm/05.fastfs/3.t b/share/doc/smm/05.fastfs/3.t new file mode 100644 index 00000000000..0ab119659ae --- /dev/null +++ b/share/doc/smm/05.fastfs/3.t @@ -0,0 +1,594 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)3.t 8.1 (Berkeley) 6/8/93 +.\" +.ds RH New file system +.NH +New file system organization +.PP +In the new file system organization (as in the +old file system organization), +each disk drive contains one or more file systems. +A file system is described by its super-block, +located at the beginning of the file system's disk partition. +Because the super-block contains critical data, +it is replicated to protect against catastrophic loss. +This is done when the file system is created; +since the super-block data does not change, +the copies need not be referenced unless a head crash +or other hard disk error causes the default super-block +to be unusable. +.PP +To insure that it is possible to create files as large as +$2 sup 32$ bytes with only two levels of indirection, +the minimum size of a file system block is 4096 bytes. +The size of file system blocks can be any power of two +greater than or equal to 4096. +The block size of a file system is recorded in the +file system's super-block +so it is possible for file systems with different block sizes +to be simultaneously accessible on the same system. +The block size must be decided at the time that +the file system is created; +it cannot be subsequently changed without rebuilding the file system. +.PP +The new file system organization divides a disk partition +into one or more areas called +.I "cylinder groups". +A cylinder group is comprised of one or more consecutive +cylinders on a disk. +Associated with each cylinder group is some bookkeeping information +that includes a redundant copy of the super-block, +space for inodes, +a bit map describing available blocks in the cylinder group, +and summary information describing the usage of data blocks +within the cylinder group. +The bit map of available blocks in the cylinder group replaces +the traditional file system's free list. +For each cylinder group a static number of inodes +is allocated at file system creation time. +The default policy is to allocate one inode for each 2048 +bytes of space in the cylinder group, expecting this +to be far more than will ever be needed. +.PP +All the cylinder group bookkeeping information could be +placed at the beginning of each cylinder group. +However if this approach were used, +all the redundant information would be on the top platter. +A single hardware failure that destroyed the top platter +could cause the loss of all redundant copies of the super-block. +Thus the cylinder group bookkeeping information +begins at a varying offset from the beginning of the cylinder group. +The offset for each successive cylinder group is calculated to be +about one track further from the beginning of the cylinder group +than the preceding cylinder group. +In this way the redundant +information spirals down into the pack so that any single track, cylinder, +or platter can be lost without losing all copies of the super-block. +Except for the first cylinder group, +the space between the beginning of the cylinder group +and the beginning of the cylinder group information +is used for data blocks.\(dg +.FS +\(dg While it appears that the first cylinder group could be laid +out with its super-block at the ``known'' location, +this would not work for file systems +with blocks sizes of 16 kilobytes or greater. +This is because of a requirement that the first 8 kilobytes of the disk +be reserved for a bootstrap program and a separate requirement that +the cylinder group information begin on a file system block boundary. +To start the cylinder group on a file system block boundary, +file systems with block sizes larger than 8 kilobytes +would have to leave an empty space between the end of +the boot block and the beginning of the cylinder group. +Without knowing the size of the file system blocks, +the system would not know what roundup function to use +to find the beginning of the first cylinder group. +.FE +.NH 2 +Optimizing storage utilization +.PP +Data is laid out so that larger blocks can be transferred +in a single disk transaction, greatly increasing file system throughput. +As an example, consider a file in the new file system +composed of 4096 byte data blocks. +In the old file system this file would be composed of 1024 byte blocks. +By increasing the block size, disk accesses in the new file +system may transfer up to four times as much information per +disk transaction. +In large files, several +4096 byte blocks may be allocated from the same cylinder so that +even larger data transfers are possible before requiring a seek. +.PP +The main problem with +larger blocks is that most UNIX +file systems are composed of many small files. +A uniformly large block size wastes space. +Table 1 shows the effect of file system +block size on the amount of wasted space in the file system. +The files measured to obtain these figures reside on +one of our time sharing +systems that has roughly 1.2 gigabytes of on-line storage. +The measurements are based on the active user file systems containing +about 920 megabytes of formatted space. +.KF +.DS B +.TS +box; +l|l|l +a|n|l. +Space used % waste Organization +_ +775.2 Mb 0.0 Data only, no separation between files +807.8 Mb 4.2 Data only, each file starts on 512 byte boundary +828.7 Mb 6.9 Data + inodes, 512 byte block UNIX file system +866.5 Mb 11.8 Data + inodes, 1024 byte block UNIX file system +948.5 Mb 22.4 Data + inodes, 2048 byte block UNIX file system +1128.3 Mb 45.6 Data + inodes, 4096 byte block UNIX file system +.TE +Table 1 \- Amount of wasted space as a function of block size. +.DE +.KE +The space wasted is calculated to be the percentage of space +on the disk not containing user data. +As the block size on the disk +increases, the waste rises quickly, to an intolerable +45.6% waste with 4096 byte file system blocks. +.PP +To be able to use large blocks without undue waste, +small files must be stored in a more efficient way. +The new file system accomplishes this goal by allowing the division +of a single file system block into one or more +.I "fragments". +The file system fragment size is specified +at the time that the file system is created; +each file system block can optionally be broken into +2, 4, or 8 fragments, each of which is addressable. +The lower bound on the size of these fragments is constrained +by the disk sector size, +typically 512 bytes. +The block map associated with each cylinder group +records the space available in a cylinder group +at the fragment level; +to determine if a block is available, aligned fragments are examined. +Figure 1 shows a piece of a map from a 4096/1024 file system. +.KF +.DS B +.TS +box; +l|c c c c. +Bits in map XXXX XXOO OOXX OOOO +Fragment numbers 0-3 4-7 8-11 12-15 +Block numbers 0 1 2 3 +.TE +Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system. +.DE +.KE +Each bit in the map records the status of a fragment; +an ``X'' shows that the fragment is in use, +while a ``O'' shows that the fragment is available for allocation. +In this example, +fragments 0\-5, 10, and 11 are in use, +while fragments 6\-9, and 12\-15 are free. +Fragments of adjoining blocks cannot be used as a full block, +even if they are large enough. +In this example, +fragments 6\-9 cannot be allocated as a full block; +only fragments 12\-15 can be coalesced into a full block. +.PP +On a file system with a block size of 4096 bytes +and a fragment size of 1024 bytes, +a file is represented by zero or more 4096 byte blocks of data, +and possibly a single fragmented block. +If a file system block must be fragmented to obtain +space for a small amount of data, +the remaining fragments of the block are made +available for allocation to other files. +As an example consider an 11000 byte file stored on +a 4096/1024 byte file system. +This file would uses two full size blocks and one +three fragment portion of another block. +If no block with three aligned fragments is +available at the time the file is created, +a full size block is split yielding the necessary +fragments and a single unused fragment. +This remaining fragment can be allocated to another file as needed. +.PP +Space is allocated to a file when a program does a \fIwrite\fP +system call. +Each time data is written to a file, the system checks to see if +the size of the file has increased*. +.FS +* A program may be overwriting data in the middle of an existing file +in which case space would already have been allocated. +.FE +If the file needs to be expanded to hold the new data, +one of three conditions exists: +.IP 1) +There is enough space left in an already allocated +block or fragment to hold the new data. +The new data is written into the available space. +.IP 2) +The file contains no fragmented blocks (and the last +block in the file +contains insufficient space to hold the new data). +If space exists in a block already allocated, +the space is filled with new data. +If the remainder of the new data contains more than +a full block of data, a full block is allocated and +the first full block of new data is written there. +This process is repeated until less than a full block +of new data remains. +If the remaining new data to be written will +fit in less than a full block, +a block with the necessary fragments is located, +otherwise a full block is located. +The remaining new data is written into the located space. +.IP 3) +The file contains one or more fragments (and the +fragments contain insufficient space to hold the new data). +If the size of the new data plus the size of the data +already in the fragments exceeds the size of a full block, +a new block is allocated. +The contents of the fragments are copied +to the beginning of the block +and the remainder of the block is filled with new data. +The process then continues as in (2) above. +Otherwise, if the new data to be written will +fit in less than a full block, +a block with the necessary fragments is located, +otherwise a full block is located. +The contents of the existing fragments +appended with the new data +are written into the allocated space. +.PP +The problem with expanding a file one fragment at a +a time is that data may be copied many times as a +fragmented block expands to a full block. +Fragment reallocation can be minimized +if the user program writes a full block at a time, +except for a partial block at the end of the file. +Since file systems with different block sizes may reside on +the same system, +the file system interface has been extended to provide +application programs the optimal size for a read or write. +For files the optimal size is the block size of the file system +on which the file is being accessed. +For other objects, such as pipes and sockets, +the optimal size is the underlying buffer size. +This feature is used by the Standard +Input/Output Library, +a package used by most user programs. +This feature is also used by +certain system utilities such as archivers and loaders +that do their own input and output management +and need the highest possible file system bandwidth. +.PP +The amount of wasted space in the 4096/1024 byte new file system +organization is empirically observed to be about the same as in the +1024 byte old file system organization. +A file system with 4096 byte blocks and 512 byte fragments +has about the same amount of wasted space as the 512 byte +block UNIX file system. +The new file system uses less space +than the 512 byte or 1024 byte +file systems for indexing information for +large files and the same amount of space +for small files. +These savings are offset by the need to use +more space for keeping track of available free blocks. +The net result is about the same disk utilization +when a new file system's fragment size +equals an old file system's block size. +.PP +In order for the layout policies to be effective, +a file system cannot be kept completely full. +For each file system there is a parameter, termed +the free space reserve, that +gives the minimum acceptable percentage of file system +blocks that should be free. +If the number of free blocks drops below this level +only the system administrator can continue to allocate blocks. +The value of this parameter may be changed at any time, +even when the file system is mounted and active. +The transfer rates that appear in section 4 were measured on file +systems kept less than 90% full (a reserve of 10%). +If the number of free blocks falls to zero, +the file system throughput tends to be cut in half, +because of the inability of the file system to localize +blocks in a file. +If a file system's performance degrades because +of overfilling, it may be restored by removing +files until the amount of free space once again +reaches the minimum acceptable level. +Access rates for files created during periods of little +free space may be restored by moving their data once enough +space is available. +The free space reserve must be added to the +percentage of waste when comparing the organizations given +in Table 1. +Thus, the percentage of waste in +an old 1024 byte UNIX file system is roughly +comparable to a new 4096/512 byte file system +with the free space reserve set at 5%. +(Compare 11.8% wasted with the old file system +to 6.9% waste + 5% reserved space in the +new file system.) +.NH 2 +File system parameterization +.PP +Except for the initial creation of the free list, +the old file system ignores the parameters of the underlying hardware. +It has no information about either the physical characteristics +of the mass storage device, +or the hardware that interacts with it. +A goal of the new file system is to parameterize the +processor capabilities and +mass storage characteristics +so that blocks can be allocated in an +optimum configuration-dependent way. +Parameters used include the speed of the processor, +the hardware support for mass storage transfers, +and the characteristics of the mass storage devices. +Disk technology is constantly improving and +a given installation can have several different disk technologies +running on a single processor. +Each file system is parameterized so that it can be +adapted to the characteristics of the disk on which +it is placed. +.PP +For mass storage devices such as disks, +the new file system tries to allocate new blocks +on the same cylinder as the previous block in the same file. +Optimally, these new blocks will also be +rotationally well positioned. +The distance between ``rotationally optimal'' blocks varies greatly; +it can be a consecutive block +or a rotationally delayed block +depending on system characteristics. +On a processor with an input/output channel that does not require +any processor intervention between mass storage transfer requests, +two consecutive disk blocks can often be accessed +without suffering lost time because of an intervening disk revolution. +For processors without input/output channels, +the main processor must field an interrupt and +prepare for a new disk transfer. +The expected time to service this interrupt and +schedule a new disk transfer depends on the +speed of the main processor. +.PP +The physical characteristics of each disk include +the number of blocks per track and the rate at which +the disk spins. +The allocation routines use this information to calculate +the number of milliseconds required to skip over a block. +The characteristics of the processor include +the expected time to service an interrupt and schedule a +new disk transfer. +Given a block allocated to a file, +the allocation routines calculate the number of blocks to +skip over so that the next block in the file will +come into position under the disk head in the expected +amount of time that it takes to start a new +disk transfer operation. +For programs that sequentially access large amounts of data, +this strategy minimizes the amount of time spent waiting for +the disk to position itself. +.PP +To ease the calculation of finding rotationally optimal blocks, +the cylinder group summary information includes +a count of the available blocks in a cylinder +group at different rotational positions. +Eight rotational positions are distinguished, +so the resolution of the +summary information is 2 milliseconds for a typical 3600 +revolution per minute drive. +The super-block contains a vector of lists called +.I "rotational layout tables". +The vector is indexed by rotational position. +Each component of the vector +lists the index into the block map for every data block contained +in its rotational position. +When looking for an allocatable block, +the system first looks through the summary counts for a rotational +position with a non-zero block count. +It then uses the index of the rotational position to find the appropriate +list to use to index through +only the relevant parts of the block map to find a free block. +.PP +The parameter that defines the +minimum number of milliseconds between the completion of a data +transfer and the initiation of +another data transfer on the same cylinder +can be changed at any time, +even when the file system is mounted and active. +If a file system is parameterized to lay out blocks with +a rotational separation of 2 milliseconds, +and the disk pack is then moved to a system that has a +processor requiring 4 milliseconds to schedule a disk operation, +the throughput will drop precipitously because of lost disk revolutions +on nearly every block. +If the eventual target machine is known, +the file system can be parameterized for it +even though it is initially created on a different processor. +Even if the move is not known in advance, +the rotational layout delay can be reconfigured after the disk is moved +so that all further allocation is done based on the +characteristics of the new host. +.NH 2 +Layout policies +.PP +The file system layout policies are divided into two distinct parts. +At the top level are global policies that use file system +wide summary information to make decisions regarding +the placement of new inodes and data blocks. +These routines are responsible for deciding the +placement of new directories and files. +They also calculate rotationally optimal block layouts, +and decide when to force a long seek to a new cylinder group +because there are insufficient blocks left +in the current cylinder group to do reasonable layouts. +Below the global policy routines are +the local allocation routines that use a locally optimal scheme to +lay out data blocks. +.PP +Two methods for improving file system performance are to increase +the locality of reference to minimize seek latency +as described by [Trivedi80], and +to improve the layout of data to make larger transfers possible +as described by [Nevalainen77]. +The global layout policies try to improve performance +by clustering related information. +They cannot attempt to localize all data references, +but must also try to spread unrelated data +among different cylinder groups. +If too much localization is attempted, +the local cylinder group may run out of space +forcing the data to be scattered to non-local cylinder groups. +Taken to an extreme, +total localization can result in a single huge cluster of data +resembling the old file system. +The global policies try to balance the two conflicting +goals of localizing data that is concurrently accessed +while spreading out unrelated data. +.PP +One allocatable resource is inodes. +Inodes are used to describe both files and directories. +Inodes of files in the same directory are frequently accessed together. +For example, the ``list directory'' command often accesses +the inode for each file in a directory. +The layout policy tries to place all the inodes of +files in a directory in the same cylinder group. +To ensure that files are distributed throughout the disk, +a different policy is used for directory allocation. +A new directory is placed in a cylinder group that has a greater +than average number of free inodes, +and the smallest number of directories already in it. +The intent of this policy is to allow the inode clustering policy +to succeed most of the time. +The allocation of inodes within a cylinder group is done using a +next free strategy. +Although this allocates the inodes randomly within a cylinder group, +all the inodes for a particular cylinder group can be read with +8 to 16 disk transfers. +(At most 16 disk transfers are required because a cylinder +group may have no more than 2048 inodes.) +This puts a small and constant upper bound on the number of +disk transfers required to access the inodes +for all the files in a directory. +In contrast, the old file system typically requires +one disk transfer to fetch the inode for each file in a directory. +.PP +The other major resource is data blocks. +Since data blocks for a file are typically accessed together, +the policy routines try to place all data +blocks for a file in the same cylinder group, +preferably at rotationally optimal positions in the same cylinder. +The problem with allocating all the data blocks +in the same cylinder group is that large files will +quickly use up available space in the cylinder group, +forcing a spill over to other areas. +Further, using all the space in a cylinder group +causes future allocations for any file in the cylinder group +to also spill to other areas. +Ideally none of the cylinder groups should ever become completely full. +The heuristic solution chosen is to +redirect block allocation +to a different cylinder group +when a file exceeds 48 kilobytes, +and at every megabyte thereafter.* +.FS +* The first spill over point at 48 kilobytes is the point +at which a file on a 4096 byte block file system first +requires a single indirect block. This appears to be +a natural first point at which to redirect block allocation. +The other spillover points are chosen with the intent of +forcing block allocation to be redirected when a +file has used about 25% of the data blocks in a cylinder group. +In observing the new file system in day to day use, the heuristics appear +to work well in minimizing the number of completely filled +cylinder groups. +.FE +The newly chosen cylinder group is selected from those cylinder +groups that have a greater than average number of free blocks left. +Although big files tend to be spread out over the disk, +a megabyte of data is typically accessible before +a long seek must be performed, +and the cost of one long seek per megabyte is small. +.PP +The global policy routines call local allocation routines with +requests for specific blocks. +The local allocation routines will +always allocate the requested block +if it is free, otherwise it +allocates a free block of the requested size that is +rotationally closest to the requested block. +If the global layout policies had complete information, +they could always request unused blocks and +the allocation routines would be reduced to simple bookkeeping. +However, maintaining complete information is costly; +thus the implementation of the global layout policy +uses heuristics that employ only partial information. +.PP +If a requested block is not available, the local allocator uses +a four level allocation strategy: +.IP 1) +Use the next available block rotationally closest +to the requested block on the same cylinder. It is assumed +here that head switching time is zero. On disk +controllers where this is not the case, it may be possible +to incorporate the time required to switch between disk platters +when constructing the rotational layout tables. This, however, +has not yet been tried. +.IP 2) +If there are no blocks available on the same cylinder, +use a block within the same cylinder group. +.IP 3) +If that cylinder group is entirely full, +quadratically hash the cylinder group number to choose +another cylinder group to look for a free block. +.IP 4) +Finally if the hash fails, apply an exhaustive search +to all cylinder groups. +.PP +Quadratic hash is used because of its speed in finding +unused slots in nearly full hash tables [Knuth75]. +File systems that are parameterized to maintain at least +10% free space rarely use this strategy. +File systems that are run without maintaining any free +space typically have so few free blocks that almost any +allocation is random; +the most important characteristic of +the strategy used under such conditions is that the strategy be fast. +.ds RH Performance +.sp 2 +.ne 1i diff --git a/share/doc/smm/05.fastfs/4.t b/share/doc/smm/05.fastfs/4.t new file mode 100644 index 00000000000..15e3923dfbe --- /dev/null +++ b/share/doc/smm/05.fastfs/4.t @@ -0,0 +1,252 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)4.t 8.1 (Berkeley) 6/8/93 +.\" +.ds RH Performance +.NH +Performance +.PP +Ultimately, the proof of the effectiveness of the +algorithms described in the previous section +is the long term performance of the new file system. +.PP +Our empirical studies have shown that the inode layout policy has +been effective. +When running the ``list directory'' command on a large directory +that itself contains many directories (to force the system +to access inodes in multiple cylinder groups), +the number of disk accesses for inodes is cut by a factor of two. +The improvements are even more dramatic for large directories +containing only files, +disk accesses for inodes being cut by a factor of eight. +This is most encouraging for programs such as spooling daemons that +access many small files, +since these programs tend to flood the +disk request queue on the old file system. +.PP +Table 2 summarizes the measured throughput of the new file system. +Several comments need to be made about the conditions under which these +tests were run. +The test programs measure the rate at which user programs can transfer +data to or from a file without performing any processing on it. +These programs must read and write enough data to +insure that buffering in the +operating system does not affect the results. +They are also run at least three times in succession; +the first to get the system into a known state +and the second two to insure that the +experiment has stabilized and is repeatable. +The tests used and their results are +discussed in detail in [Kridle83]\(dg. +.FS +\(dg A UNIX command that is similar to the reading test that we used is +``cp file /dev/null'', where ``file'' is eight megabytes long. +.FE +The systems were running multi-user but were otherwise quiescent. +There was no contention for either the CPU or the disk arm. +The only difference between the UNIBUS and MASSBUS tests +was the controller. +All tests used an AMPEX Capricorn 330 megabyte Winchester disk. +As Table 2 shows, all file system test runs were on a VAX 11/750. +All file systems had been in production use for at least +a month before being measured. +The same number of system calls were performed in all tests; +the basic system call overhead was a negligible portion of +the total running time of the tests. +.KF +.DS B +.TS +box; +c c|c s s +c c|c c c. +Type of Processor and Read +File System Bus Measured Speed Bandwidth % CPU +_ +old 1024 750/UNIBUS 29 Kbytes/sec 29/983 3% 11% +new 4096/1024 750/UNIBUS 221 Kbytes/sec 221/983 22% 43% +new 8192/1024 750/UNIBUS 233 Kbytes/sec 233/983 24% 29% +new 4096/1024 750/MASSBUS 466 Kbytes/sec 466/983 47% 73% +new 8192/1024 750/MASSBUS 466 Kbytes/sec 466/983 47% 54% +.TE +.ce 1 +Table 2a \- Reading rates of the old and new UNIX file systems. +.TS +box; +c c|c s s +c c|c c c. +Type of Processor and Write +File System Bus Measured Speed Bandwidth % CPU +_ +old 1024 750/UNIBUS 48 Kbytes/sec 48/983 5% 29% +new 4096/1024 750/UNIBUS 142 Kbytes/sec 142/983 14% 43% +new 8192/1024 750/UNIBUS 215 Kbytes/sec 215/983 22% 46% +new 4096/1024 750/MASSBUS 323 Kbytes/sec 323/983 33% 94% +new 8192/1024 750/MASSBUS 466 Kbytes/sec 466/983 47% 95% +.TE +.ce 1 +Table 2b \- Writing rates of the old and new UNIX file systems. +.DE +.KE +.PP +Unlike the old file system, +the transfer rates for the new file system do not +appear to change over time. +The throughput rate is tied much more strongly to the +amount of free space that is maintained. +The measurements in Table 2 were based on a file system +with a 10% free space reserve. +Synthetic work loads suggest that throughput deteriorates +to about half the rates given in Table 2 when the file +systems are full. +.PP +The percentage of bandwidth given in Table 2 is a measure +of the effective utilization of the disk by the file system. +An upper bound on the transfer rate from the disk is calculated +by multiplying the number of bytes on a track by the number +of revolutions of the disk per second. +The bandwidth is calculated by comparing the data rates +the file system is able to achieve as a percentage of this rate. +Using this metric, the old file system is only +able to use about 3\-5% of the disk bandwidth, +while the new file system uses up to 47% +of the bandwidth. +.PP +Both reads and writes are faster in the new system than in the old system. +The biggest factor in this speedup is because of the larger +block size used by the new file system. +The overhead of allocating blocks in the new system is greater +than the overhead of allocating blocks in the old system, +however fewer blocks need to be allocated in the new system +because they are bigger. +The net effect is that the cost per byte allocated is about +the same for both systems. +.PP +In the new file system, the reading rate is always at least +as fast as the writing rate. +This is to be expected since the kernel must do more work when +allocating blocks than when simply reading them. +Note that the write rates are about the same +as the read rates in the 8192 byte block file system; +the write rates are slower than the read rates in the 4096 byte block +file system. +The slower write rates occur because +the kernel has to do twice as many disk allocations per second, +making the processor unable to keep up with the disk transfer rate. +.PP +In contrast the old file system is about 50% +faster at writing files than reading them. +This is because the write system call is asynchronous and +the kernel can generate disk transfer +requests much faster than they can be serviced, +hence disk transfers queue up in the disk buffer cache. +Because the disk buffer cache is sorted by minimum seek distance, +the average seek between the scheduled disk writes is much +less than it would be if the data blocks were written out +in the random disk order in which they are generated. +However when the file is read, +the read system call is processed synchronously so +the disk blocks must be retrieved from the disk in the +non-optimal seek order in which they are requested. +This forces the disk scheduler to do long +seeks resulting in a lower throughput rate. +.PP +In the new system the blocks of a file are more optimally +ordered on the disk. +Even though reads are still synchronous, +the requests are presented to the disk in a much better order. +Even though the writes are still asynchronous, +they are already presented to the disk in minimum seek +order so there is no gain to be had by reordering them. +Hence the disk seek latencies that limited the old file system +have little effect in the new file system. +The cost of allocation is the factor in the new system that +causes writes to be slower than reads. +.PP +The performance of the new file system is currently +limited by memory to memory copy operations +required to move data from disk buffers in the +system's address space to data buffers in the user's +address space. These copy operations account for +about 40% of the time spent performing an input/output operation. +If the buffers in both address spaces were properly aligned, +this transfer could be performed without copying by +using the VAX virtual memory management hardware. +This would be especially desirable when transferring +large amounts of data. +We did not implement this because it would change the +user interface to the file system in two major ways: +user programs would be required to allocate buffers on page boundaries, +and data would disappear from buffers after being written. +.PP +Greater disk throughput could be achieved by rewriting the disk drivers +to chain together kernel buffers. +This would allow contiguous disk blocks to be read +in a single disk transaction. +Many disks used with UNIX systems contain either +32 or 48 512 byte sectors per track. +Each track holds exactly two or three 8192 byte file system blocks, +or four or six 4096 byte file system blocks. +The inability to use contiguous disk blocks +effectively limits the performance +on these disks to less than 50% of the available bandwidth. +If the next block for a file cannot be laid out contiguously, +then the minimum spacing to the next allocatable +block on any platter is between a sixth and a half a revolution. +The implication of this is that the best possible layout without +contiguous blocks uses only half of the bandwidth of any given track. +If each track contains an odd number of sectors, +then it is possible to resolve the rotational delay to any number of sectors +by finding a block that begins at the desired +rotational position on another track. +The reason that block chaining has not been implemented is because it +would require rewriting all the disk drivers in the system, +and the current throughput rates are already limited by the +speed of the available processors. +.PP +Currently only one block is allocated to a file at a time. +A technique used by the DEMOS file system +when it finds that a file is growing rapidly, +is to preallocate several blocks at once, +releasing them when the file is closed if they remain unused. +By batching up allocations, the system can reduce the +overhead of allocating at each write, +and it can cut down on the number of disk writes needed to +keep the block pointers on the disk +synchronized with the block allocation [Powell79]. +This technique was not included because block allocation +currently accounts for less than 10% of the time spent in +a write system call and, once again, the +current throughput rates are already limited by the speed +of the available processors. +.ds RH Functional enhancements +.sp 2 +.ne 1i diff --git a/share/doc/smm/05.fastfs/5.t b/share/doc/smm/05.fastfs/5.t new file mode 100644 index 00000000000..96d721ac20f --- /dev/null +++ b/share/doc/smm/05.fastfs/5.t @@ -0,0 +1,293 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)5.t 8.1 (Berkeley) 6/8/93 +.\" +.ds RH Functional enhancements +.NH +File system functional enhancements +.PP +The performance enhancements to the +UNIX file system did not require +any changes to the semantics or +data structures visible to application programs. +However, several changes had been generally desired for some +time but had not been introduced because they would require users to +dump and restore all their file systems. +Since the new file system already +required all existing file systems to +be dumped and restored, +these functional enhancements were introduced at this time. +.NH 2 +Long file names +.PP +File names can now be of nearly arbitrary length. +Only programs that read directories are affected by this change. +To promote portability to UNIX systems that +are not running the new file system, a set of directory +access routines have been introduced to provide a consistent +interface to directories on both old and new systems. +.PP +Directories are allocated in 512 byte units called chunks. +This size is chosen so that each allocation can be transferred +to disk in a single operation. +Chunks are broken up into variable length records termed +directory entries. A directory entry +contains the information necessary to map the name of a +file to its associated inode. +No directory entry is allowed to span multiple chunks. +The first three fields of a directory entry are fixed length +and contain: an inode number, the size of the entry, and the length +of the file name contained in the entry. +The remainder of an entry is variable length and contains +a null terminated file name, padded to a 4 byte boundary. +The maximum length of a file name in a directory is +currently 255 characters. +.PP +Available space in a directory is recorded by having +one or more entries accumulate the free space in their +entry size fields. This results in directory entries +that are larger than required to hold the +entry name plus fixed length fields. Space allocated +to a directory should always be completely accounted for +by totaling up the sizes of its entries. +When an entry is deleted from a directory, +its space is returned to a previous entry +in the same directory chunk by increasing the size of the +previous entry by the size of the deleted entry. +If the first entry of a directory chunk is free, then +the entry's inode number is set to zero to indicate +that it is unallocated. +.NH 2 +File locking +.PP +The old file system had no provision for locking files. +Processes that needed to synchronize the updates of a +file had to use a separate ``lock'' file. +A process would try to create a ``lock'' file. +If the creation succeeded, then the process +could proceed with its update; +if the creation failed, then the process would wait and try again. +This mechanism had three drawbacks. +Processes consumed CPU time by looping over attempts to create locks. +Locks left lying around because of system crashes had +to be manually removed (normally in a system startup command script). +Finally, processes running as system administrator +are always permitted to create files, +so were forced to use a different mechanism. +While it is possible to get around all these problems, +the solutions are not straight forward, +so a mechanism for locking files has been added. +.PP +The most general schemes allow multiple processes +to concurrently update a file. +Several of these techniques are discussed in [Peterson83]. +A simpler technique is to serialize access to a file with locks. +To attain reasonable efficiency, +certain applications require the ability to lock pieces of a file. +Locking down to the byte level has been implemented in the +Onyx file system by [Bass81]. +However, for the standard system applications, +a mechanism that locks at the granularity of a file is sufficient. +.PP +Locking schemes fall into two classes, +those using hard locks and those using advisory locks. +The primary difference between advisory locks and hard locks is the +extent of enforcement. +A hard lock is always enforced when a program tries to +access a file; +an advisory lock is only applied when it is requested by a program. +Thus advisory locks are only effective when all programs accessing +a file use the locking scheme. +With hard locks there must be some override +policy implemented in the kernel. +With advisory locks the policy is left to the user programs. +In the UNIX system, programs with system administrator +privilege are allowed override any protection scheme. +Because many of the programs that need to use locks must +also run as the system administrator, +we chose to implement advisory locks rather than +create an additional protection scheme that was inconsistent +with the UNIX philosophy or could +not be used by system administration programs. +.PP +The file locking facilities allow cooperating programs to apply +advisory +.I shared +or +.I exclusive +locks on files. +Only one process may have an exclusive +lock on a file while multiple shared locks may be present. +Both shared and exclusive locks cannot be present on +a file at the same time. +If any lock is requested when +another process holds an exclusive lock, +or an exclusive lock is requested when another process holds any lock, +the lock request will block until the lock can be obtained. +Because shared and exclusive locks are advisory only, +even if a process has obtained a lock on a file, +another process may access the file. +.PP +Locks are applied or removed only on open files. +This means that locks can be manipulated without +needing to close and reopen a file. +This is useful, for example, when a process wishes +to apply a shared lock, read some information +and determine whether an update is required, then +apply an exclusive lock and update the file. +.PP +A request for a lock will cause a process to block if the lock +can not be immediately obtained. +In certain instances this is unsatisfactory. +For example, a process that +wants only to check if a lock is present would require a separate +mechanism to find out this information. +Consequently, a process may specify that its locking +request should return with an error if a lock can not be immediately +obtained. +Being able to conditionally request a lock +is useful to ``daemon'' processes +that wish to service a spooling area. +If the first instance of the +daemon locks the directory where spooling takes place, +later daemon processes can +easily check to see if an active daemon exists. +Since locks exist only while the locking processes exist, +lock files can never be left active after +the processes exit or if the system crashes. +.PP +Almost no deadlock detection is attempted. +The only deadlock detection done by the system is that the file +to which a lock is applied must not already have a +lock of the same type (i.e. the second of two successive calls +to apply a lock of the same type will fail). +.NH 2 +Symbolic links +.PP +The traditional UNIX file system allows multiple +directory entries in the same file system +to reference a single file. Each directory entry +``links'' a file's name to an inode and its contents. +The link concept is fundamental; +inodes do not reside in directories, but exist separately and +are referenced by links. +When all the links to an inode are removed, +the inode is deallocated. +This style of referencing an inode does +not allow references across physical file +systems, nor does it support inter-machine linkage. +To avoid these limitations +.I "symbolic links" +similar to the scheme used by Multics [Feiertag71] have been added. +.PP +A symbolic link is implemented as a file that contains a pathname. +When the system encounters a symbolic link while +interpreting a component of a pathname, +the contents of the symbolic link is prepended to the rest +of the pathname, and this name is interpreted to yield the +resulting pathname. +In UNIX, pathnames are specified relative to the root +of the file system hierarchy, or relative to a process's +current working directory. Pathnames specified relative +to the root are called absolute pathnames. Pathnames +specified relative to the current working directory are +termed relative pathnames. +If a symbolic link contains an absolute pathname, +the absolute pathname is used, +otherwise the contents of the symbolic link is evaluated +relative to the location of the link in the file hierarchy. +.PP +Normally programs do not want to be aware that there is a +symbolic link in a pathname that they are using. +However certain system utilities +must be able to detect and manipulate symbolic links. +Three new system calls provide the ability to detect, read, and write +symbolic links; seven system utilities required changes +to use these calls. +.PP +In future Berkeley software distributions +it may be possible to reference file systems located on +remote machines using pathnames. When this occurs, +it will be possible to create symbolic links that span machines. +.NH 2 +Rename +.PP +Programs that create a new version of an existing +file typically create the +new version as a temporary file and then rename the temporary file +with the name of the target file. +In the old UNIX file system renaming required three calls to the system. +If a program were interrupted or the system crashed between these calls, +the target file could be left with only its temporary name. +To eliminate this possibility the \fIrename\fP system call +has been added. The rename call does the rename operation +in a fashion that guarantees the existence of the target name. +.PP +Rename works both on data files and directories. +When renaming directories, +the system must do special validation checks to insure +that the directory tree structure is not corrupted by the creation +of loops or inaccessible directories. +Such corruption would occur if a parent directory were moved +into one of its descendants. +The validation check requires tracing the descendents of the target +directory to insure that it does not include the directory being moved. +.NH 2 +Quotas +.PP +The UNIX system has traditionally attempted to share all available +resources to the greatest extent possible. +Thus any single user can allocate all the available space +in the file system. +In certain environments this is unacceptable. +Consequently, a quota mechanism has been added for restricting the +amount of file system resources that a user can obtain. +The quota mechanism sets limits on both the number of inodes +and the number of disk blocks that a user may allocate. +A separate quota can be set for each user on each file system. +Resources are given both a hard and a soft limit. +When a program exceeds a soft limit, +a warning is printed on the users terminal; +the offending program is not terminated +unless it exceeds its hard limit. +The idea is that users should stay below their soft limit between +login sessions, +but they may use more resources while they are actively working. +To encourage this behavior, +users are warned when logging in if they are over +any of their soft limits. +If users fails to correct the problem for too many login sessions, +they are eventually reprimanded by having their soft limit +enforced as their hard limit. +.ds RH Acknowledgements +.sp 2 +.ne 1i diff --git a/share/doc/smm/05.fastfs/6.t b/share/doc/smm/05.fastfs/6.t new file mode 100644 index 00000000000..40be6aaccbc --- /dev/null +++ b/share/doc/smm/05.fastfs/6.t @@ -0,0 +1,159 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)6.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.ds RH Acknowledgements +.SH +\s+2Acknowledgements\s0 +.PP +We thank Robert Elz for his ongoing interest in the new file system, +and for adding disk quotas in a rational and efficient manner. +We also acknowledge Dennis Ritchie for his suggestions +on the appropriate modifications to the user interface. +We appreciate Michael Powell's explanations on how +the DEMOS file system worked; +many of his ideas were used in this implementation. +Special commendation goes to Peter Kessler and Robert Henry for acting +like real users during the early debugging stage when file systems were +less stable than they should have been. +The criticisms and suggestions by the reviews contributed significantly +to the coherence of the paper. +Finally we thank our sponsors, +the National Science Foundation under grant MCS80-05144, +and the Defense Advance Research Projects Agency (DoD) under +ARPA Order No. 4031 monitored by Naval Electronic System Command under +Contract No. N00039-82-C-0235. +.ds RH References +.nr H2 1 +.sp 2 +.SH +\s+2References\s0 +.LP +.IP [Almes78] 20 +Almes, G., and Robertson, G. +"An Extensible File System for Hydra" +Proceedings of the Third International Conference on Software Engineering, +IEEE, May 1978. +.IP [Bass81] 20 +Bass, J. +"Implementation Description for File Locking", +Onyx Systems Inc, 73 E. Trimble Rd, San Jose, CA 95131 +Jan 1981. +.IP [Feiertag71] 20 +Feiertag, R. J. and Organick, E. I., +"The Multics Input-Output System", +Proceedings of the Third Symposium on Operating Systems Principles, +ACM, Oct 1971. pp 35-41 +.IP [Ferrin82a] 20 +Ferrin, T.E., +"Performance and Robustness Improvements in Version 7 UNIX", +Computer Graphics Laboratory Technical Report 2, +School of Pharmacy, University of California, +San Francisco, January 1982. +Presented at the 1982 Winter Usenix Conference, Santa Monica, California. +.IP [Ferrin82b] 20 +Ferrin, T.E., +"Performance Issuses of VMUNIX Revisited", +;login: (The Usenix Association Newsletter), Vol 7, #5, November 1982. pp 3-6 +.IP [Kridle83] 20 +Kridle, R., and McKusick, M., +"Performance Effects of Disk Subsystem Choices for +VAX Systems Running 4.2BSD UNIX", +Computer Systems Research Group, Dept of EECS, Berkeley, CA 94720, +Technical Report #8. +.IP [Kowalski78] 20 +Kowalski, T. +"FSCK - The UNIX System Check Program", +Bell Laboratory, Murray Hill, NJ 07974. March 1978 +.IP [Knuth75] 20 +Kunth, D. +"The Art of Computer Programming", +Volume 3 - Sorting and Searching, +Addison-Wesley Publishing Company Inc, Reading, Mass, 1975. pp 506-549 +.IP [Maruyama76] +Maruyama, K., and Smith, S. +"Optimal reorganization of Distributed Space Disk Files", +CACM, 19, 11. Nov 1976. pp 634-642 +.IP [Nevalainen77] 20 +Nevalainen, O., Vesterinen, M. +"Determining Blocking Factors for Sequential Files by Heuristic Methods", +The Computer Journal, 20, 3. Aug 1977. pp 245-247 +.IP [Pechura83] 20 +Pechura, M., and Schoeffler, J. +"Estimating File Access Time of Floppy Disks", +CACM, 26, 10. Oct 1983. pp 754-763 +.IP [Peterson83] 20 +Peterson, G. +"Concurrent Reading While Writing", +ACM Transactions on Programming Languages and Systems, +ACM, 5, 1. Jan 1983. pp 46-55 +.IP [Powell79] 20 +Powell, M. +"The DEMOS File System", +Proceedings of the Sixth Symposium on Operating Systems Principles, +ACM, Nov 1977. pp 33-42 +.IP [Ritchie74] 20 +Ritchie, D. M. and Thompson, K., +"The UNIX Time-Sharing System", +CACM 17, 7. July 1974. pp 365-375 +.IP [Smith81a] 20 +Smith, A. +"Input/Output Optimization and Disk Architectures: A Survey", +Performance and Evaluation 1. Jan 1981. pp 104-117 +.IP [Smith81b] 20 +Smith, A. +"Bibliography on File and I/O System Optimization and Related Topics", +Operating Systems Review, 15, 4. Oct 1981. pp 39-54 +.IP [Symbolics81] 20 +"Symbolics File System", +Symbolics Inc, 9600 DeSoto Ave, Chatsworth, CA 91311 +Aug 1981. +.IP [Thompson78] 20 +Thompson, K. +"UNIX Implementation", +Bell System Technical Journal, 57, 6, part 2. pp 1931-1946 +July-August 1978. +.IP [Thompson80] 20 +Thompson, M. +"Spice File System", +Carnegie-Mellon University, +Department of Computer Science, Pittsburg, PA 15213 +#CMU-CS-80, Sept 1980. +.IP [Trivedi80] 20 +Trivedi, K. +"Optimal Selection of CPU Speed, Device Capabilities, and File Assignments", +Journal of the ACM, 27, 3. July 1980. pp 457-473 +.IP [White80] 20 +White, R. M. +"Disk Storage Technology", +Scientific American, 243(2), August 1980. diff --git a/share/doc/smm/05.fastfs/Makefile b/share/doc/smm/05.fastfs/Makefile new file mode 100644 index 00000000000..059887e9b45 --- /dev/null +++ b/share/doc/smm/05.fastfs/Makefile @@ -0,0 +1,10 @@ +# @(#)Makefile 8.1 (Berkeley) 6/8/93 + +DIR= smm/05.fastfs +SRCS= 0.t 1.t 2.t 3.t 4.t 5.t 6.t +MACROS= -ms + +paper.ps: ${SRCS} + ${TBL} ${SRCS} | ${EQN} | ${ROFF} > ${.TARGET} + +.include <bsd.doc.mk> diff --git a/share/doc/smm/06.nfs/0.t b/share/doc/smm/06.nfs/0.t new file mode 100644 index 00000000000..4d77f560e2a --- /dev/null +++ b/share/doc/smm/06.nfs/0.t @@ -0,0 +1,75 @@ +.\" Copyright (c) 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This document is derived from software contributed to Berkeley by +.\" Rick Macklem at The University of Guelph. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)0.t 8.1 (Berkeley) 6/8/93 +.\" +.(l C +.sz 14 +.b "The 4.4BSD NFS Implementation" +.sp +.sz 10 +Rick Macklem +.i "University of Guelph" +.)l +.sp 2 +.ce 1 +.sz 12 +.b "ABSTRACT" +.eh 'SMM:06-%''The 4.4BSD NFS Implementation' +.oh 'The 4.4BSD NFS Implementation''SMM:06-%' +.pp +The 4.4BSD implementation of the Network File System (NFS)\** is +intended to interoperate with +.(f +\**Network File System (NFS) is believed to be a registered trademark of +Sun Microsystems Inc. +.)f +other NFS Version 2 Protocol (RFC1094) implementations but also +allows use of an alternate protocol that is hoped to provide better +performance in certain environments. +This paper will informally discuss these various protocol features and +their use. +There is a brief overview of the implementation followed +by several sections on various problem areas related to NFS +and some hints on how to deal with them. +.pp +Not Quite NFS (NQNFS) is an NFS like protocol designed to maintain full cache +consistency between clients in a crash tolerant manner. It is an adaptation +of the NFS protocol such that the server supports both NFS +and NQNFS clients while maintaining full consistency between the server and +NQNFS clients. +It borrows heavily from work done on Spritely-NFS [Srinivasan89], but uses +Leases [Gray89] to avoid the need to recover server state information +after a crash. +.sp diff --git a/share/doc/smm/06.nfs/1.t b/share/doc/smm/06.nfs/1.t new file mode 100644 index 00000000000..6804ed1fb4a --- /dev/null +++ b/share/doc/smm/06.nfs/1.t @@ -0,0 +1,588 @@ +.\" Copyright (c) 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This document is derived from software contributed to Berkeley by +.\" Rick Macklem at The University of Guelph. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)1.t 8.1 (Berkeley) 6/8/93 +.\" +.sh 1 "NFS Implementation" +.pp +The 4.4BSD implementation of NFS and the alternate protocol nicknamed +Not Quite NFS (NQNFS) are kernel resident, but make use of a few system +daemons. +The kernel implementation does not use an RPC library, handling the RPC +request and reply messages directly in \fImbuf\fR data areas. NFS +interfaces to the network using +sockets via. the kernel interface available in +\fIsys/kern/uipc_syscalls.c\fR as \fIsosend(), soreceive(),\fR... +There are connection management routines for support of sockets for connection +oriented protocols and timeout/retransmit support for datagram sockets on +the client side. +For connection oriented transport protocols, +such as TCP/IP, there is one connection +for each client to server mount point that is maintained until an umount. +If the connection breaks, the client will attempt a reconnect with a new +socket. +The client side can operate without any daemons running, but performance +will be improved by running nfsiod daemons that perform read-aheads +and write-behinds. +For the server side to function, the daemons portmap, mountd and +nfsd must be running. +The mountd daemon performs two important functions. +.ip 1) +Upon startup and after a hangup signal, mountd reads the exports +file and pushes the export information for each local file system down +into the kernel via. the mount system call. +.ip 2) +Mountd handles remote mount protocol (RFC1094, Appendix A) requests. +.lp +The nfsd master daemon forks off children that enter the kernel +via. the nfssvc system call. The children normally remain kernel +resident, providing a process context for the NFS RPC servers. The only +exception to this is when a Kerberos [Steiner88] +ticket is received and at that time +the nfsd exits the kernel temporarily to verify the ticket via. the +Kerberos libraries and then returns to the kernel with the results. +(This only happens for Kerberos mount points as described further under +Security.) +Meanwhile, the master nfsd waits to accept new connections from clients +using connection oriented transport protocols and passes the new sockets down +into the kernel. +The client side mount_nfs along with portmap and +mountd are the only parts of the NFS subsystem that make any +use of the Sun RPC library. +.sh 1 "Mount Problems" +.pp +There are several problems that can be encountered at the time of an NFS +mount, ranging from a unresponsive NFS server (crashed, network partitioned +from client, etc.) to various interoperability problems between different +NFS implementations. +.pp +On the server side, +if the 4.4BSD NFS server will be handling any PC clients, mountd will +require the \fB-n\fR option to enable non-root mount request servicing. +Running of a pcnfsd\** daemon will also be necessary. +.(f +\** Pcnfsd is available in source form from Sun Microsystems and many +anonymous ftp sites. +.)f +The server side requires that the daemons +mountd and nfsd be running and that +they be registered with portmap properly. +If problems are encountered, +the safest fix is to kill all the daemons and then restart them in +the order portmap, mountd and nfsd. +Other server side problems are normally caused by problems with the format +of the exports file, which is covered under +Security and in the exports man page. +.pp +On the client side, there are several mount options useful for dealing +with server problems. +In cases where a file system is not critical for system operation, the +\fB-b\fR +mount option may be specified so that mount_nfs will go into the +background for a mount attempt on an unresponsive server. +This is useful for mounts specified in +\fIfstab(5)\fR, +so that the system will not get hung while booting doing +\fBmount -a\fR +because a file server is not responsive. +On the other hand, if the file system is critical to system operation, this +option should not be used so that the client will wait for the server to +come up before completing bootstrapping. +There are also three mount options to help deal with interoperability issues +with various non-BSD NFS servers. The +\fB-P\fR +option specifies that the NFS +client use a reserved IP port number to satisfy some servers' security +requirements.\** +.(f +\**Any security benefit of this is highly questionable and as +such the BSD server does not require a client to use a reserved port number. +.)f +The +\fB-c\fR +option stops the NFS client from doing a \fIconnect\fR on the UDP +socket, so that the mount works with servers that send NFS replies from +port numbers other than the standard 2049.\** +.(f +\**The Encore Multimax is known +to require this. +.)f +Finally, the +\fB-g=\fInum\fR +option sets the maximum size of the group list in the credentials passed +to an NFS server in every RPC request. Although RFC1057 specifies a maximum +size of 16 for the group list, some servers can't handle that many. +If a user, particularly root doing a mount, +keeps getting access denied from a file server, try temporarily +reducing the number of groups that user is in to less than 5 +by editing /etc/group. If the user can then access the file system, slowly +increase the number of groups for that user until the limit is found and +then peg the limit there with the +\fB-g=\fInum\fR +option. +This implies that the server will only see the first \fInum\fR +groups that the user is in, which can cause some accessibility problems. +.pp +For sites that have many NFS servers, amd [Pendry93] +is a useful administration tool. +It also reduces the number of actual NFS mount points, alleviating problems +with commands such as df(1) that hang when any of the NFS servers is +unreachable. +.sh 1 "Dealing with Hung Servers" +.pp +There are several mount options available to help a client deal with +being hung waiting for response from a crashed or unreachable\** server. +.(f +\**Due to a network partitioning or similar. +.)f +By default, a hard mount will continue to try to contact the server +``forever'' to complete the system call. This type of mount is appropriate +when processes on the client that access files in the file system do not +tolerate file I/O systems calls that return -1 with \fIerrno == EINTR\fR +and/or access to the file system is critical for normal system operation. +.lp +There are two other alternatives: +.ip 1) +A soft mount (\fB-s\fR option) retries an RPC \fIn\fR +times and then the corresponding +system call returns -1 with errno set to EINTR. +For TCP transport, the actual RPC request is not retransmitted, but the +timeout intervals waiting for a reply from the server are done +in the same manner as UDP for this purpose. +The problem with this type of mount is that most applications do not +expect an EINTR error return from file I/O system calls (since it never +occurs for a local file system) and get confused by the error return +from the I/O system call. +The option +\fB-x=\fInum\fR +is used to set the RPC retry limit and if set too low, the error returns +will start occurring whenever the NFS server is slow due to heavy load. +Alternately, a large retry limit can result in a process hung for a long +time, due to a crashed server or network partitioning. +.ip 2) +An interruptible mount (\fB-i\fR option) checks to see if a termination signal +is pending for the process when waiting for server response and if it is, +the I/O system call posts an EINTR. Normally this results in the process +being terminated by the signal when returning from the system call. +This feature allows you to ``^C'' out of processes that are hung +due to unresponsive servers. +The problem with this approach is that signals that are caught by +a process are not recognized as termination signals +and the process will remain hung.\** +.(f +\**Unfortunately, there are also some resource allocation situations in the +BSD kernel where the termination signal will be ignored and the process +will not terminate. +.)f +.sh 1 "RPC Transport Issues" +.pp +The NFS Version 2 protocol runs over UDP/IP transport by +sending each Sun Remote Procedure Call (RFC1057) +request/reply message in a single UDP +datagram. Since UDP does not guarantee datagram delivery, the +Remote Procedure Call (RPC) layer +times out and retransmits an RPC request if +no RPC reply has been received. Since this round trip timeout (RTO) value +is for the entire RPC operation, including RPC message transmission to the +server, queuing at the server for an nfsd, performing the RPC and +sending the RPC reply message back to the client, it can be highly variable +for even a moderately loaded NFS server. +As a result, the RTO interval must be a conservation (large) estimate, in +order to avoid extraneous RPC request retransmits.\** +.(f +\**At best, an extraneous RPC request retransmit increases +the load on the server and at worst can result in damaged files +on the server when non-idempotent RPCs are redone [Juszczak89]. +.)f +Also, with an 8Kbyte read/write data size +(the default), the read/write reply/request will be an 8+Kbyte UDP datagram +that must normally be fragmented at the IP layer for transmission.\** +.(f +\**6 IP fragments for an Ethernet, +which has an maximum transmission unit of 1500bytes. +.)f +For IP fragments to be successfully reassembled into +the IP datagram at the receive end, all +fragments must be received within a fairly short ``time to live''. +If one fragment is lost/damaged in transit, +the entire RPC must be retransmitted and redone. +This problem can be exaggerated by a network interface on the receiver that +cannot handle the reception of back to back network packets. [Kent87a] +.pp +There are several tuning mount +options on the client side that can prove useful when trying to +alleviate performance problems related to UDP RPC transport. +The options +\fB-r=\fInum\fR +and +\fB-w=\fInum\fR +specify the maximum read or write data size respectively. +The size \fInum\fR +should be a power of 2 (4K, 2K, 1K) and adjusted downward from the +maximum of 8Kbytes +whenever IP fragmentation is causing problems. The best indicator of +IP fragmentation problems is a significant number of +\fIfragments dropped after timeout\fR +reported by the \fIip:\fR section of a \fBnetstat -s\fR +command on either the client or server. +Of course, if the fragments are being dropped at the server, it can be +fun figuring out which client(s) are involved. +The most likely candidates are clients that are not +on the same local area network as the +server or have network interfaces that do not receive several +back to back network packets properly. +.pp +By default, the 4.4BSD NFS client dynamically estimates the retransmit +timeout interval for the RPC and this appears to work reasonably well for +many environments. However, the +\fB-d\fR +flag can be specified to turn off +the dynamic estimation of retransmit timeout, so that the client will +use a static initial timeout interval.\** +.(f +\**After the first retransmit timeout, the initial interval is backed off +exponentially. +.)f +The +\fB-t=\fInum\fR +option can be used with +\fB-d\fR +to set the initial timeout interval to other than the default of 2 seconds. +The best indicator that dynamic estimation should be turned off would +be a significant number\** in the \fIX Replies\fR field and a +.(f +\**Even 0.1% of the total RPCs is probably significant. +.)f +large number in the \fIRetries\fR field +in the \fIRpc Info:\fR section as reported +by the \fBnfsstat\fR command. +On the server, there would be significant numbers of \fIInprog\fR recent +request cache hits in the \fIServer Cache Stats:\fR section as reported +by the \fBnfsstat\fR command, when run on the server. +.pp +The tradeoff is that a smaller timeout interval results in a better +average RPC response time, but increases the risk of extraneous retries +that in turn increase server load and the possibility of damaged files +on the server. It is probably best to err on the safe side and use a large +(>= 2sec) fixed timeout if the dynamic retransmit timeout estimation +seems to be causing problems. +.pp +An alternative to all this fiddling is to run NFS over TCP transport instead +of UDP. +Since the 4.4BSD TCP implementation provides reliable +delivery with congestion control, it avoids all of the above problems. +It also permits the use of read and write data sizes greater than the 8Kbyte +limit for UDP transport.\** +.(f +\**Read/write data sizes greater than 8Kbytes will not normally improve +performance unless the kernel constant MAXBSIZE is increased and the +file system on the server has a block size greater than 8Kbytes. +.)f +NFS over TCP usually delivers comparable to significantly better performance +than NFS over UDP +unless the client or server processor runs at less than 5-10MIPS. For a +slow processor, the extra CPU overhead of using TCP transport will become +significant and TCP transport may only be useful when the client +to server interconnect traverses congested gateways. +The main problem with using TCP transport is that it is only supported +between BSD clients and servers.\** +.(f +\**There are rumors of commercial NFS over TCP implementations on the horizon +and these may well be worth exploring. +.)f +.sh 1 "Other Tuning Tricks" +.pp +Another mount option that may improve performance over +certain network interconnects is \fB-a=\fInum\fR +which sets the number of blocks that the system will +attempt to read-ahead during sequential reading of a file. The default value +of 1 seems to be appropriate for most situations, but a larger value might +achieve better performance for some environments, such as a mount to a server +across a ``high bandwidth * round trip delay'' interconnect. +.pp +For the adventurous, playing with the size of the buffer cache +can also improve performance for some environments that use NFS heavily. +Under some workloads, a buffer cache of 4-6Mbytes can result in significant +performance improvements over 1-2Mbytes, both in client side system call +response time and reduced server RPC load. +The buffer cache size defaults to 10% of physical memory, +but this can be overridden by specifying the BUFPAGES option +in the machine's config file.\** +.(f +BUFPAGES is the number of physical machine pages allocated to the buffer cache. +ie. BUFPAGES * NBPG = buffer cache size in bytes +.)f +When increasing the size of BUFPAGES, it is also advisable to increase the +number of buffers NBUF by a corresponding amount. +Note that there is a tradeoff of memory allocated to the buffer cache versus +available for paging, which implies that making the buffer cache larger +will increase paging rate, with possibly disastrous results. +.sh 1 "Security Issues" +.pp +When a machine is running an NFS server it opens up a great big security hole. +For ordinary NFS, the server receives client credentials +in the RPC request as a user id +and a list of group ids and trusts them to be authentic! +The only tool available to restrict remote access to +file systems with is the exports(5) file, +so file systems should be exported with great care. +The exports file is read by mountd upon startup and after a hangup signal +is posted for it and then as much of the access specifications as possible are +pushed down into the kernel for use by the nfsd(s). +The trick here is that the kernel information is stored on a per +local file system mount point and client host address basis and cannot refer to +individual directories within the local server file system. +It is best to think of the exports file as referring to the various local +file systems and not just directory paths as mount points. +A local file system may be exported to a specific host, all hosts that +match a subnet mask or all other hosts (the world). The latter is very +dangerous and should only be used for public information. It is also +strongly recommended that file systems exported to ``the world'' be exported +read-only. +For each host or group of hosts, the file system can be exported read-only or +read/write. +You can also define one of three client user id to server credential +mappings to help control access. +Root (user id == 0) can be mapped to some default credentials while all other +user ids are accepted as given. +If the default credentials for user id equal zero +are root, then there is essentially no remapping. +Most NFS file systems are exported this way, most commonly mapping +user id == 0 to the credentials for the user nobody. +Since the client user id and group id list is used unchanged on the server +(except for root), this also implies that +the user id and group id space must be common between the client and server. +(ie. user id N on the client must refer to the same user on the server) +All user ids can be mapped to a default set of credentials, typically that of +the user nobody. This essentially gives world access to all +users on the corresponding hosts. +.pp +There is also a non-standard BSD +\fB-kerb\fR export option that requires the client provide +a KerberosIV rcmd service ticket to authenticate the user on the server. +If successful, the Kerberos principal is looked up in the server's password +and group databases to get a set of credentials and a map of client userid to +these credentials is then cached. +The use of TCP transport is strongly recommended, +since the scheme depends on the TCP connection to avert replay attempts. +Unfortunately, this option is only usable +between BSD clients and servers since it is +not compatible with other known ``kerberized'' NFS systems. +To enable use of this Kerberos option, both mount_nfs on the client and +nfsd on the server must be rebuilt with the -DKERBEROS option and +linked to KerberosIV libraries. +The file system is then exported to the client(s) with the \fB-kerb\fR option +in the exports file on the server +and the client mount specifies the +\fB-K\fR +and +\fB-T\fR +options. +The +\fB-m=\fIrealm\fR +mount option may be used to specify a Kerberos Realm for the ticket +(it must be the Kerberos Realm of the server) that is other than +the client's local Realm. +To access files in a \fB-kerb\fR mount point, the user must have a valid +TGT for the server's Realm, as provided by kinit or similar. +.pp +As well as the standard NFS Version 2 protocol (RFC1094) implementation, BSD +systems can use a variant of the protocol called Not Quite NFS (NQNFS) that +supports a variety of protocol extensions. +This protocol uses 64bit file offsets +and sizes, an \fIaccess rpc\fR, an \fIappend\fR option on the write rpc +and extended file attributes to support 4.4BSD file system functionality +more fully. +It also makes use of a variant of short term +\fIleases\fR [Gray89] with delayed write client caching, +in an effort to provide full cache consistency and better performance. +This protocol is available between 4.4BSD systems only and is used when +the \fB-q\fR mount option is specified. +It can be used with any of the aforementioned options for NFS, such as TCP +transport (\fB-T\fR) and KerberosIV authentication (\fB-K\fR). +Although this protocol is experimental, it is recommended over NFS for +mounts between 4.4BSD systems.\** +.(f +\**I would appreciate email from anyone who can provide +NFS vs. NQNFS performance measurements, +particularly fast clients, many clients or over an internetwork +connection with a large ``bandwidth * RTT'' product. +.)f +.sh 1 "Monitoring NFS Activity" +.pp +The basic command for monitoring NFS activity on clients and servers is +nfsstat. It reports cumulative statistics of various NFS activities, +such as counts of the various different RPCs and cache hit rates on the client +and server. Of particular interest on the server are the fields in the +\fIServer Cache Stats:\fR section, which gives numbers for RPC retries received +in the first three fields and total RPCs in the fourth. The first three fields +should remain a very small percentage of the total. If not, it +would indicate one or more clients doing retries too aggressively and the fix +would be to isolate these clients, +disable the dynamic RTO estimation on them and +make their initial timeout interval a conservative (ie. large) value. +.pp +On the client side, the fields in the \fIRpc Info:\fR section are of particular +interest, as they give an overall picture of NFS activity. +The \fITimedOut\fR field is the number of I/O system calls that returned -1 +for ``soft'' mounts and can be reduced +by increasing the retry limit or changing +the mount type to ``intr'' or ``hard''. +The \fIInvalid\fR field is a count of trashed RPC replies that are received +and should remain zero.\** +.(f +\**Some NFS implementations run with UDP checksums disabled, so garbage RPC +messages can be received. +.)f +The \fIX Replies\fR field counts the number of repeated RPC replies received +from the server and is a clear indication of a too aggressive RTO estimate. +Unfortunately, a good NFS server implementation will use a ``recent request +cache'' [Juszczak89] that will suppress the extraneous replies. +A large value for \fIRetries\fR indicates a problem, but +it could be any of: +.ip \(bu +a too aggressive RTO estimate +.ip \(bu +an overloaded NFS server +.ip \(bu +IP fragments being dropped (gateway, client or server) +.lp +and requires further investigation. +The \fIRequests\fR field is the total count of RPCs done on all servers. +.pp +The \fBnetstat -s\fR comes in useful during investigation of RPC transport +problems. +The field \fIfragments dropped after timeout\fR in +the \fIip:\fR section indicates IP fragments are +being lost and a significant number of these occurring indicates that the +use of TCP transport or a smaller read/write data size is in order. +A significant number of \fIbad checksums\fR reported in the \fIudp:\fR +section would suggest network problems of a more generic sort. +(cabling, transceiver or network hardware interface problems or similar) +.pp +There is a RPC activity logging facility for both the client and +server side in the kernel. +When logging is enabled by setting the kernel variable nfsrtton to +one, the logs in the kernel structures nfsrtt (for the client side) +and nfsdrt (for the server side) are updated upon the completion +of each RPC in a circular manner. +The pos element of the structure is the index of the next element +of the log array to be updated. +In other words, elements of the log array from \fIlog\fR[pos] to +\fIlog\fR[pos - 1] are in chronological order. +The include file <sys/nfsrtt.h> should be consulted for details on the +fields in the two log structures.\** +.(f +\**Unfortunately, a monitoring tool that uses these logs is still in the +planning (dreaming) stage. +.)f +.sh 1 "Diskless Client Support" +.pp +The NFS client does include kernel support for diskless/dataless operation +where the root file system and optionally the swap area is remote NFS mounted. +A diskless/dataless client is configured using a version of the +``swapvmunix.c'' file as provided in the directory \fIcontrib/diskless.nfs\fR. +If the swap device == NODEV, it specifies an NFS mounted swap area and should +be configured the same size as set up by diskless_setup when run on the server. +This file must be put in the \fIsys/compile/<machine_name>\fR kernel build +directory after the config command has been run, since config does +not know about specifying NFS root and swap areas. +The kernel variable mountroot must be set to nfs_mountroot instead of +ffs_mountroot and the kernel structure nfs_diskless must be filled in +properly. +There are some primitive system administration tools in the \fIcontrib/diskless.nfs\fR directory to assist in filling in +the nfs_diskless structure and in setting up an NFS server for +diskless/dataless clients. +The tools were designed to provide a bare bones capability, to allow maximum +flexibility when setting up different servers. +.lp +The tools are as follows: +.ip \(bu +diskless_offset.c - This little program reads a ``vmunix'' object file and +writes the file byte offset of the nfs_diskless structure in it to +standard out. It was kept separate because it sometimes has to +be compiled/linked in funny ways depending on the client architecture. +(See the comment at the beginning of it.) +.ip \(bu +diskless_setup.c - This program is run on the server and sets up files for a +given client. It mostly just fills in an nfs_diskless structure and +writes it out to either the "vmunix" file or a separate file called +/var/diskless/setup.<official-hostname> +.ip \(bu +diskless_boot.c - There are two functions in here that may be used +by a bootstrap server such as tftpd to permit sharing of the ``vmunix'' +object file for similar clients. This saves disk space on the bootstrap +server and simplify organization, but are not critical for correct operation. +They read the ``vmunix'' +file, but optionally fill in the nfs_diskless structure from a +separate "setup.<official-hostname>" file so that there is only +one copy of "vmunix" for all similar (same arch etc.) clients. +These functions use a text file called +/var/diskless/boot.<official-hostname> to control the netboot. +.lp +The basic setup steps are: +.ip \(bu +make a "vmunix" for the client(s) with mountroot() == nfs_mountroot() +and swdevt[0].sw_dev == NODEV if it is to do nfs swapping as well +(See the same swapvmunix.c file) +.ip \(bu +run diskless_offset on the vmunix file to find out the byte offset +of the nfs_diskless structure +.ip \(bu +Run diskless_setup on the server to set up the server and fill in the +nfs_diskless structure for that client. +The nfs_diskless structure can either be written into the +vmunix file (the -x option) or +saved in /var/diskless/setup.<official-hostname>. +.ip \(bu +Set up the bootstrap server. If the nfs_diskless structure was written into +the ``vmunix'' file, any vanilla bootstrap protocol such as bootp/tftp can +be used. If the bootstrap server has been modified to use the functions in +diskless_boot.c, then a +file called /var/diskless/boot.<official-hostname> +must be created. +It is simply a two line text file, where the first line is the pathname +of the correct ``vmunix'' file and the second line has the pathname of +the nfs_diskless structure file and its byte offset in it. +For example: +.br + /var/diskless/vmunix.pmax +.br + /var/diskless/setup.rickers.cis.uoguelph.ca 642308 +.br +.ip \(bu +Create a /var subtree for each client in an appropriate place on the server, +such as /var/diskless/var/<client-hostname>/... +By using the <client-hostname> to differentiate /var for each host, +/etc/rc can be modified to mount the correct /var from the server. diff --git a/share/doc/smm/06.nfs/2.t b/share/doc/smm/06.nfs/2.t new file mode 100644 index 00000000000..841dd5f9147 --- /dev/null +++ b/share/doc/smm/06.nfs/2.t @@ -0,0 +1,530 @@ +.\" Copyright (c) 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This document is derived from software contributed to Berkeley by +.\" Rick Macklem at The University of Guelph. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)2.t 8.1 (Berkeley) 6/8/93 +.\" +.sh 1 "Not Quite NFS, Crash Tolerant Cache Consistency for NFS" +.pp +Not Quite NFS (NQNFS) is an NFS like protocol designed to maintain full cache +consistency between clients in a crash tolerant manner. +It is an adaptation of the NFS protocol such that the server supports both NFS +and NQNFS clients while maintaining full consistency between the server and +NQNFS clients. +This section borrows heavily from work done on Spritely-NFS [Srinivasan89], +but uses Leases [Gray89] to avoid the need to recover server state information +after a crash. +The reader is strongly encouraged to read these references before +trying to grasp the material presented here. +.sh 2 "Overview" +.pp +The protocol maintains cache consistency by using a somewhat +Sprite [Nelson88] like protocol, +but is based on short term leases\** instead of hard state information +about open files. +.(f +\** A lease is a ticket permitting an activity that is +valid until some expiry time. +.)f +The basic principal is that the protocol will disable client caching of a +file whenever that file is write shared\**. +.(f +\** Write sharing occurs when at least one client is modifying a file while +other client(s) are reading the file. +.)f +Whenever a client wishes to cache data for a file it must hold a valid lease. +There are three types of leases: read caching, write caching and non-caching. +The latter type requires that all file operations be done synchronously with +the server via. RPCs. +A read caching lease allows for client data caching, but no file modifications +may be done. +A write caching lease allows for client caching of writes, +but requires that all writes be pushed to the server when the lease expires. +If a client has dirty buffers\** +.(f +\** Cached write data is not yet pushed (written) to the server. +.)f +when a write cache lease has almost expired, it will attempt to +extend the lease but is required to push the dirty buffers if extension fails. +A client gets leases by either doing a \fBGetLease RPC\fR or by piggybacking +a \fBGetLease Request\fR onto another RPC. Piggybacking is supported for the +frequent RPCs Getattr, Setattr, Lookup, Readlink, Read, Write and Readdir +in an effort to minimize the number of \fBGetLease RPCs\fR required. +All leases are at the granularity of a file, since all NFS RPCs operate on +individual files and NFS has no intrinsic notion of a file hierarchy. +Directories, symbolic links and file attributes may be read cached but +are not write cached. +The exception here is the attribute file_size, which is updated during cached +writing on the client to reflect a growing file. +.pp +It is the server's responsibility to ensure that consistency is maintained +among the NQNFS clients by disabling client caching whenever a server file +operation would cause inconsistencies. +The possibility of inconsistencies occurs whenever a client has +a write caching lease and any other client, +or local operations on the server, +tries to access the file or when +a modify operation is attempted on a file being read cached by client(s). +At this time, the server sends an \fBeviction notice\fR to all clients holding +the lease and then waits for lease termination. +Lease termination occurs when a \fBvacated the premises\fR message has been +received from all the clients that have signed the lease or when the lease +expires via. timeout. +The message pair \fBeviction notice\fR and \fBvacated the premises\fR roughly +correspond to a Sprite server\(->client callback, but are not implemented as an +actual RPC, to avoid the server waiting indefinitely for a reply from a dead +client. +.pp +Server consistency checking can be viewed as issuing intrinsic leases for a +file operation for the duration of the operation only. For example, the +\fBCreate RPC\fR will get an intrinsic write lease on the directory in which +the file is being created, disabling client read caches for that directory. +.pp +By relegating this responsibility to the server, consistency between the +server and NQNFS clients is maintained when NFS clients are modifying the +file system as well.\** +.(f +\** The NFS clients will continue to be \fIapproximately\fR consistent with +the server. +.)f +.pp +The leases are issued as time intervals to avoid the requirement of time of day +clock synchronization. There are three important time constants known to +the server. The \fBmaximum_lease_term\fR sets an upper bound on lease duration. +The \fBclock_skew\fR is added to all lease terms on the server to correct for +differing clock speeds between the client and server and \fBwrite_slack\fR is +the number of seconds the server is willing to wait for a client with +an expired write caching lease to push dirty writes. +.pp +The server maintains a \fBmodify_revision\fR number for each file. It is +defined as a unsigned quadword integer that is never zero and that must +increase whenever the corresponding file is modified on the server. +It is used +by the client to determine whether or not cached data for the file is +stale. +Generating this value is easier said than done. The current implementation +uses the following technique, which is believed to be adequate. +The high order longword is stored in the ufs inode and is initialized to one +when an inode is first allocated. +The low order longword is stored in main memory only and is initialized to +zero when an inode is read in from disk. +When the file is modified for the first time within a given second of +wall clock time, the high order longword is incremented by one and +the low order longword reset to zero. +For subsequent modifications within the same second of wall clock +time, the low order longword is incremented. If the low order longword wraps +around to zero, the high order longword is incremented again. +Since the high order longword only increments once per second and the inode +is pushed to disk frequently during file modification, this implies +0 \(<= Current\(miDisk \(<= 5. +When the inode is read in from disk, 10 +is added to the high order longword, which ensures that the quadword +is greater than any value it could have had before a crash. +This introduces apparent modifications every time the inode falls out of +the LRU inode cache, but this should only reduce the client caching performance +by a (hopefully) small margin. +.sh 2 "Crash Recovery and other Failure Scenarios" +.pp +The server must maintain the state of all the current leases held by clients. +The nice thing about short term leases is that maximum_lease_term seconds +after the server stops issuing leases, there are no current leases left. +As such, server crash recovery does not require any state recovery. After +rebooting, the server refuses to service any RPCs except for writes until +write_slack seconds after the last lease would have expired\**. +.(f +\** The last lease expiry time may be safely estimated as +"boottime+maximum_lease_term+clock_skew" for machines that cannot store +it in nonvolatile RAM. +.)f +By then, the server would not have any outstanding leases to recover the +state of and the clients have had at least write_slack seconds to push dirty +writes to the server and get the server sync'd up to date. After this, the +server simply services requests in a manner similar to NFS. +In an effort to minimize the effect of "recovery storms" [Baker91], +the server replies \fBtry_again_later\fR to the RPCs it is not +yet ready to service. +.pp +After a client crashes, the server may have to wait for a lease to timeout +before servicing a request if write sharing of a file with a cachable lease +on the client is about to occur. +As for the client, it simply starts up getting any leases it now needs. Any +outstanding leases for that client on the server prior to the crash will either be renewed or expire +via timeout. +.pp +Certain network partitioning failures are more problematic. If a client to +server network connection is severed just before a write caching lease expires, +the client cannot push the dirty writes to the server. After the lease expires +on the server, the server permits other clients to access the file with the +potential of getting stale data. Unfortunately I believe this failure scenario +is intrinsic in any delay write caching scheme unless the server is required to +wait \fBforever\fR for a client to regain contact\**. +.(f +\** Gray and Cheriton avoid this problem by using a \fBwrite through\fR policy. +.)f +Since the write caching lease has expired on the client, +it will sync up with the +server as soon as the network connection has been re-established. +.pp +There is another failure condition that can occur when the server is congested. +The worst case scenario would have the client pushing dirty writes to the server +but a large request queue on the server delays these writes for more than +\fBwrite_slack\fR seconds. It is hoped that a congestion control scheme using +the \fBtry_again_later\fR RPC reply after booting combined with +the following lease termination rule for write caching leases +can minimize the risk of this occurrence. +A write caching lease is only terminated on the server when there are have +been no writes to the file and the server has not been overloaded during +the previous write_slack seconds. The server has not been overloaded +is approximated by a test for sleeping nfsd(s) at the end of the write_slack +period. +.sh 2 "Server Disk Full" +.pp +There is a serious unresolved problem for delayed write caching with respect to +server disk space allocation. +When the disk on the file server is full, delayed write RPCs can fail +due to "out of space". +For NFS, this occurrence results in an error return from the close system +call on the file, since the dirty blocks are pushed on close. +Processes writing important files can check for this error return +to ensure that the file was written successfully. +For NQNFS, the dirty blocks are not pushed on close and as such the client +may not attempt the write RPC until after the process has done the close +which implies no error return from the close. +For the current prototype, +the only solution is to modify programs writing important +file(s) to call fsync and check for an error return from it instead of close. +.sh 2 "Protocol Details" +.pp +The protocol specification is identical to that of NFS [Sun89] except for +the following changes. +.ip \(bu +RPC Information +.(l + Program Number 300105 + Version Number 1 +.)l +.ip \(bu +Readdir_and_Lookup RPC +.(l + struct readdirlookargs { + fhandle file; + nfscookie cookie; + unsigned count; + unsigned duration; + }; + + struct entry { + unsigned cachable; + unsigned duration; + modifyrev rev; + fhandle entry_fh; + nqnfs_fattr entry_attrib; + unsigned fileid; + filename name; + nfscookie cookie; + entry *nextentry; + }; + + union readdirlookres switch (stat status) { + case NFS_OK: + struct { + entry *entries; + bool eof; + } readdirlookok; + default: + void; + }; + + readdirlookres + NQNFSPROC_READDIRLOOK(readdirlookargs) = 18; +.)l +Reads entries in a directory in a manner analogous to the NFSPROC_READDIR RPC +in NFS, but returns the file handle and attributes of each entry as well. +This allows the attribute and lookup caches to be primed. +.ip \(bu +Get Lease RPC +.(l + struct getleaseargs { + fhandle file; + cachetype readwrite; + unsigned duration; + }; + + union getleaseres switch (stat status) { + case NFS_OK: + bool cachable; + unsigned duration; + modifyrev rev; + nqnfs_fattr attributes; + default: + void; + }; + + getleaseres + NQNFSPROC_GETLEASE(getleaseargs) = 19; +.)l +Gets a lease for "file" valid for "duration" seconds from when the lease +was issued on the server\**. +.(f +\** To be safe, the client may only assume that the lease is valid +for ``duration'' seconds from when the RPC request was sent to the server. +.)f +The lease permits client caching if "cachable" is true. +The modify revision level and attributes for the file are also returned. +.ip \(bu +Eviction Message +.(l + void + NQNFSPROC_EVICTED (fhandle) = 21; +.)l +This message is sent from the server to the client. When the client receives +the message, it should flush data associated with the file represented by +"fhandle" from its caches and then send the \fBVacated Message\fR back to +the server. Flushing includes pushing any dirty writes via. write RPCs. +.ip \(bu +Vacated Message +.(l + void + NQNFSPROC_VACATED (fhandle) = 20; +.)l +This message is sent from the client to the server in response to the +\fBEviction Message\fR. See above. +.ip \(bu +Access RPC +.(l + struct accessargs { + fhandle file; + bool read_access; + bool write_access; + bool exec_access; + }; + + stat + NQNFSPROC_ACCESS(accessargs) = 22; +.)l +The access RPC does permission checking on the server for the given type +of access required by the client for the file. +Use of this RPC avoids accessibility problems caused by client->server uid +mapping. +.ip \(bu +Piggybacked Get Lease Request +.pp +The piggybacked get lease request is functionally equivalent to the Get Lease +RPC except that is attached to one of the other NQNFS RPC requests as follows. +A getleaserequest is prepended to all of the request arguments for NQNFS +and a getleaserequestres is inserted in all NFS result structures just after +the "stat" field only if "stat == NFS_OK". +.(l + union getleaserequest switch (cachetype type) { + case NQLREAD: + case NQLWRITE: + unsigned duration; + default: + void; + }; + + union getleaserequestres switch (cachetype type) { + case NQLREAD: + case NQLWRITE: + bool cachable; + unsigned duration; + modifyrev rev; + default: + void; + }; +.)l +The get lease request applies to the file that the attached RPC operates on +and the file attributes remain in the same location as for the NFS RPC reply +structure. +.ip \(bu +Three additional "stat" values +.pp +Three additional values have been added to the enumerated type "stat". +.(l + NQNFS_EXPIRED=500 + NQNFS_TRYLATER=501 + NQNFS_AUTHERR=502 +.)l +The "expired" value indicates that a lease has expired. +The "try later" +value is returned by the server when it wishes the client to retry the +RPC request after a short delay. It is used during crash recovery (Section 2) +and may also be useful for server congestion control. +The "authetication error" value is returned for kerberized mount points to +indicate that there is no cached authentication mapping and a Kerberos ticket +for the principal is required. +.sh 2 "Data Types" +.ip \(bu +cachetype +.(l + enum cachetype { + NQLNONE = 0, + NQLREAD = 1, + NQLWRITE = 2 + }; +.)l +Type of lease requested. NQLNONE is used to indicate no piggybacked lease +request. +.ip \(bu +modifyrev +.(l + typedef unsigned hyper modifyrev; +.)l +The "modifyrev" is a unsigned quadword integer value that is never zero +and increases every time the corresponding file is modified on the server. +.ip \(bu +nqnfs_time +.(l + struct nqnfs_time { + unsigned seconds; + unsigned nano_seconds; + }; +.)l +For NQNFS times are handled at nano second resolution instead of micro second +resolution for NFS. +.ip \(bu +nqnfs_fattr +.(l + struct nqnfs_fattr { + ftype type; + unsigned mode; + unsigned nlink; + unsigned uid; + unsigned gid; + unsigned hyper size; + unsigned blocksize; + unsigned rdev; + unsigned hyper bytes; + unsigned fsid; + unsigned fileid; + nqnfs_time atime; + nqnfs_time mtime; + nqnfs_time ctime; + unsigned flags; + unsigned generation; + modifyrev rev; + }; +.)l +The nqnfs_fattr structure is modified from the NFS fattr so that it stores +the file size as a 64bit quantity and the storage occupied as a 64bit number +of bytes. It also has fields added for the 4.4BSD va_flags and va_gen fields +as well as the file's modify rev level. +.ip \(bu +nqnfs_sattr +.(l + struct nqnfs_sattr { + unsigned mode; + unsigned uid; + unsigned gid; + unsigned hyper size; + nqnfs_time atime; + nqnfs_time mtime; + unsigned flags; + unsigned rdev; + }; +.)l +The nqnfs_sattr structure is modified from the NFS sattr structure in the +same manner as fattr. +.lp +The arguments to several of the NFS RPCs have been modified as well. Mostly, +these are minor changes to use 64bit file offsets or similar. The modified +argument structures follow. +.ip \(bu +Lookup RPC +.(l + struct lookup_diropargs { + unsigned duration; + fhandle dir; + filename name; + }; + + union lookup_diropres switch (stat status) { + case NFS_OK: + struct { + union getleaserequestres lookup_lease; + fhandle file; + nqnfs_fattr attributes; + } lookup_diropok; + default: + void; + }; + +.)l +The additional "duration" argument tells the server to get a lease for the +name being looked up if it is non-zero and the lease is specified +in "lookup_lease". +.ip \(bu +Read RPC +.(l + struct nqnfs_readargs { + fhandle file; + unsigned hyper offset; + unsigned count; + }; +.)l +.ip \(bu +Write RPC +.(l + struct nqnfs_writeargs { + fhandle file; + unsigned hyper offset; + bool append; + nfsdata data; + }; +.)l +The "append" argument is true for apeend only write operations. +.ip \(bu +Get Filesystem Attributes RPC +.(l + union nqnfs_statfsres (stat status) { + case NFS_OK: + struct { + unsigned tsize; + unsigned bsize; + unsigned blocks; + unsigned bfree; + unsigned bavail; + unsigned files; + unsigned files_free; + } info; + default: + void; + }; +.)l +The "files" field is the number of files in the file system and the "files_free" +is the number of additional files that can be created. +.sh 1 "Summary" +.pp +The configuration and tuning of an NFS environment tends to be a bit of a +mystic art, but hopefully this paper along with the man pages and other +reading will be helpful. Good Luck. diff --git a/share/doc/smm/06.nfs/Makefile b/share/doc/smm/06.nfs/Makefile new file mode 100644 index 00000000000..e36a0a613c7 --- /dev/null +++ b/share/doc/smm/06.nfs/Makefile @@ -0,0 +1,7 @@ +# @(#)Makefile 8.1 (Berkeley) 6/8/93 + +DIR= smm/06.nfs +SRCS= 0.t 1.t 2.t ref.t +MACROS= -me + +.include <bsd.doc.mk> diff --git a/share/doc/smm/06.nfs/ref.t b/share/doc/smm/06.nfs/ref.t new file mode 100644 index 00000000000..039363bb0a4 --- /dev/null +++ b/share/doc/smm/06.nfs/ref.t @@ -0,0 +1,123 @@ +.\" Copyright (c) 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This document is derived from software contributed to Berkeley by +.\" Rick Macklem at The University of Guelph. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)ref.t 8.1 (Berkeley) 6/8/93 +.\" +.sh 1 "Bibliography" +.ip [Baker91] 16 +Mary Baker and John Ousterhout, Availability in the Sprite Distributed +File System, In \fIOperating System Review\fR, (25)2, pg. 95-98, +April 1991. +.ip [Baker91a] 16 +Mary Baker, Private Email Communication, May 1991. +.ip [Burrows88] 16 +Michael Burrows, Efficient Data Sharing, Technical Report #153, +Computer Laboratory, University of Cambridge, Dec. 1988. +.ip [Gray89] 16 +Cary G. Gray and David R. Cheriton, Leases: An Efficient Fault-Tolerant +Mechanism for Distributed File Cache Consistency, In \fIProc. of the +Twelfth ACM Symposium on Operating Systems Principals\fR, Litchfield Park, +AZ, Dec. 1989. +.ip [Howard88] 16 +John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, +M. Satyanarayanan, Robert N. Sidebotham and Michael J. West, +Scale and Performance in a Distributed File System, \fIACM Trans. on +Computer Systems\fR, (6)1, pg 51-81, Feb. 1988. +.ip [Juszczak89] 16 +Chet Juszczak, Improving the Performance and Correctness of an NFS Server, +In \fIProc. Winter 1989 USENIX Conference,\fR pg. 53-63, San Diego, CA, January 1989. +.ip [Keith90] 16 +Bruce E. Keith, Perspectives on NFS File Server Performance Characterization, +In \fIProc. Summer 1990 USENIX Conference\fR, pg. 267-277, Anaheim, CA, +June 1990. +.ip [Kent87] 16 +Christopher. A. Kent, \fICache Coherence in Distributed Systems\fR, +Research Report 87/4, +Digital Equipment Corporation Western Research Laboratory, April 1987. +.ip [Kent87a] 16 +Christopher. A. Kent and Jeffrey C. Mogul, +\fIFragmentation Considered Harmful\fR, Research Report 87/3, +Digital Equipment Corporation Western Research Laboratory, Dec. 1987. +.ip [Macklem91] 16 +Rick Macklem, Lessons Learned Tuning the 4.3BSD Reno Implementation of the +NFS Protocol, In \fIProc. Winter USENIX Conference\fR, pg. 53-64, +Dallas, TX, January 1991. +.ip [Nelson88] 16 +Michael N. Nelson, Brent B. Welch, and John K. Ousterhout, Caching in the +Sprite Network File System, \fIACM Transactions on Computer Systems\fR (6)1 +pg. 134-154, February 1988. +.ip [Nowicki89] 16 +Bill Nowicki, Transport Issues in the Network File System, In +\fIComputer Communication Review\fR, pg. 16-20, Vol. 19, Number 2, April 1989. +.ip [Ousterhout90] 16 +John K. Ousterhout, Why Aren't Operating Systems Getting Faster As Fast as +Hardware? In \fIProc. Summer 1990 USENIX Conference\fR, pg. 247-256, Anaheim, +CA, June 1990. +.ip [Pendry93] 16 +Jan-Simon Pendry, 4.4 BSD Automounter Reference Manual, In +\fIsrc/usr.sbin/amd/doc directory of 4.4 BSD distribution tape\fR. +.ip [Reid90] 16 +Jim Reid, N(e)FS: the Protocol is the Problem, In +\fIProc. Summer 1990 UKUUG Conference\fR, +London, England, July 1990. +.ip [Sandberg85] 16 +Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon, +Design and Implementation of the Sun Network filesystem, In \fIProc. Summer +1985 USENIX Conference\fR, pages 119-130, Portland, OR, June 1985. +.ip [Schroeder85] 16 +Michael D. Schroeder, David K. Gifford and Roger M. Needham, A Caching +File System For A Programmer's Workstation, In \fIProc. of the Tenth +ACM Symposium on Operating Systems Principals\fR, pg. 25-34, Orcas Island, +WA, Dec. 1985. +.ip [Srinivasan89] 16 +V. Srinivasan and Jeffrey. C. Mogul, \fISpritely NFS: Implementation and +Performance of Cache-Consistency Protocols\fR, Research Report 89/5, +Digital Equipment Corporation Western Research Laboratory, May 1989. +.ip [Steiner88] 16 +Jennifer G. Steiner, Clifford Neuman and Jeffrey I. Schiller, +Kerberos: An Authentication Service for Open Network Systems, In +\fIProc. Winter 1988 USENIX Conference\fR, Dallas, TX, February 1988. +.ip [Stern] 16 +Hal Stern, \fIManaging NFS and NIS\fR, O'Reilly and Associates, +ISBN 0-937175-75-7. +.ip [Sun87] 16 +Sun Microsystems Inc., \fIXDR: External Data Representation Standard\fR, +RFC1014, Network Information Center, SRI International, June 1987. +.ip [Sun88] 16 +Sun Microsystems Inc., \fIRPC: Remote Procedure Call Protocol Specification Version 2\fR, +RFC1057, Network Information Center, SRI International, June 1988. +.ip [Sun89] 16 +Sun Microsystems Inc., \fINFS: Network File System Protocol Specification\fR, +ARPANET Working Group Requests for Comment, DDN Network Information Center, +SRI International, Menlo Park, CA, March 1989, RFC-1094. diff --git a/share/doc/smm/18.net/0.t b/share/doc/smm/18.net/0.t new file mode 100644 index 00000000000..d16e56fb995 --- /dev/null +++ b/share/doc/smm/18.net/0.t @@ -0,0 +1,184 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)0.t 8.1 (Berkeley) 6/10/93 +.\" +.de IR +\fI\\$1\fP\\$2 +.. +.if n .ND +.TL +Networking Implementation Notes +.br +4.4BSD Edition +.AU +Samuel J. Leffler, William N. Joy, Robert S. Fabry, and Michael J. Karels +.AI +Computer Systems Research Group +Computer Science Division +Department of Electrical Engineering and Computer Science +University of California, Berkeley +Berkeley, CA 94720 +.AB +.FS +* UNIX is a trademark of Bell Laboratories. +.FE +This report describes the internal structure of the +networking facilities developed for the 4.4BSD version +of the UNIX* operating system +for the VAX\(dg. These facilities +.FS +\(dg DEC, VAX, DECnet, and UNIBUS are trademarks of +Digital Equipment Corporation. +.FE +are based on several central abstractions which +structure the external (user) view of network communication +as well as the internal (system) implementation. +.PP +The report documents the internal structure of the networking system. +The ``Berkeley Software Architecture Manual, 4.4BSD Edition'' (PSD:5) +provides a description of the user interface to the networking facilities. +.sp +.LP +Revised June 10, 1993 +.AE +.LP +.\".de PT +.\".lt \\n(LLu +.\".pc % +.\".nr PN \\n% +.\".tl '\\*(LH'\\*(CH'\\*(RH' +.\".lt \\n(.lu +.\".. +.\".ds RH Contents +.OH 'Networking Implementation Notes''SMM:18-%' +.EH 'SMM:18-%''Networking Implementation Notes' +.bp +.ce +.B "TABLE OF CONTENTS" +.LP +.sp 1 +.nf +.B "1. Introduction" +.LP +.sp .5v +.nf +.B "2. Overview" +.LP +.sp .5v +.nf +.B "3. Goals +.LP +.sp .5v +.nf +.B "4. Internal address representation" +.LP +.sp .5v +.nf +.B "5. Memory management" +.LP +.sp .5v +.nf +.B "6. Internal layering +6.1. Socket layer +6.1.1. Socket state +6.1.2. Socket data queues +6.1.3. Socket connection queuing +6.2. Protocol layer(s) +6.3. Network-interface layer +6.3.1. UNIBUS interfaces +.LP +.sp .5v +.nf +.B "7. Socket/protocol interface" +.LP +.sp .5v +.nf +.B "8. Protocol/protocol interface" +8.1. pr_output +8.2. pr_input +8.3. pr_ctlinput +8.4. pr_ctloutput +.LP +.sp .5v +.nf +.B "9. Protocol/network-interface interface" +9.1. Packet transmission +9.2. Packet reception +.LP +.sp .5v +.nf +.B "10. Gateways and routing issues +10.1. Routing tables +10.2. Routing table interface +10.3. User level routing policies +.LP +.sp .5v +.nf +.B "11. Raw sockets" +11.1. Control blocks +11.2. Input processing +11.3. Output processing +.LP +.sp .5v +.nf +.B "12. Buffering and congestion control" +12.1. Memory management +12.2. Protocol buffering policies +12.3. Queue limiting +12.4. Packet forwarding +.LP +.sp .5v +.nf +.B "13. Out of band data" +.LP +.sp .5v +.nf +.B "14. Trailer protocols" +.LP +.sp .5v +.nf +.B Acknowledgements +.LP +.sp .5v +.nf +.B References +.bp +.de _d +.if t .ta .6i 2.1i 2.6i +.\" 2.94 went to 2.6, 3.64 to 3.30 +.if n .ta .84i 2.6i 3.30i +.. +.de _f +.if t .ta .5i 1.25i 2.5i +.\" 3.5i went to 3.8i +.if n .ta .7i 1.75i 3.8i +.. diff --git a/share/doc/smm/18.net/1.t b/share/doc/smm/18.net/1.t new file mode 100644 index 00000000000..ba5adb534e7 --- /dev/null +++ b/share/doc/smm/18.net/1.t @@ -0,0 +1,66 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)1.t 8.1 (Berkeley) 6/8/93 +.\" +.\".ds RH Introduction +.br +.ne 2i +.NH +\s+2Introduction\s0 +.PP +This report describes the internal structure of +facilities added to the +4.2BSD version of the UNIX operating system for +the VAX, +as modified in the 4.4BSD release. +The system facilities provide +a uniform user interface to networking +within UNIX. In addition, the implementation +introduces a structure for network communications which may be +used by system implementors in adding new networking +facilities. The internal structure is not visible +to the user, rather it is intended to aid implementors +of communication protocols and network services by +providing a framework which +promotes code sharing and minimizes implementation effort. +.PP +The reader is expected to be familiar with the C programming +language and system interface, as described in the +\fIBerkeley Software Architecture Manual, 4.4BSD Edition\fP [Joy86]. +Basic understanding of network +communication concepts is assumed; where required +any additional ideas are introduced. +.PP +The remainder of this document +provides a description of the system internals, +avoiding, when possible, those portions which are utilized only +by the interprocess communication facilities. diff --git a/share/doc/smm/18.net/2.t b/share/doc/smm/18.net/2.t new file mode 100644 index 00000000000..f5048897cba --- /dev/null +++ b/share/doc/smm/18.net/2.t @@ -0,0 +1,85 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)2.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH Overview +.br +.ne 2i +.NH +\s+2Overview\s0 +.PP +If we consider +the International Standards Organization's (ISO) +Open System Interconnection (OSI) model of +network communication [ISO81] [Zimmermann80], +the networking facilities +described here correspond to a portion of the +session layer (layer 3) and all of the transport and +network layers (layers 2 and 1, respectively). +.PP +The network layer provides possibly imperfect +data transport services with minimal addressing +structure. +Addressing at this level is normally host to host, +with implicit or explicit routing optionally supported +by the communicating agents. +.PP +At the transport +layer the notions of reliable transfer, data sequencing, +flow control, and service addressing are normally +included. Reliability is usually managed by +explicit acknowledgement of data delivered. Failure +to acknowledge a transfer results in retransmission of +the data. Sequencing may be handled by tagging +each message handed to the network layer by a +\fIsequence number\fP and maintaining +state at the endpoints of communication to utilize +received sequence numbers in reordering data which +arrives out of order. +.PP +The session layer facilities may provide forms of +addressing which are mapped into formats required +by the transport layer, service authentication +and client authentication, etc. Various systems +also provide services such as data encryption and +address and protocol translation. +.PP +The following sections begin by describing some of the common +data structures and utility routines, then examine +the internal layering. The contents of each layer +and its interface are considered. Certain of the +interfaces are protocol implementation specific. For +these cases examples have been drawn from the Internet [Cerf78] +protocol family. Later sections cover routing issues, +the design of the raw socket interface and other +miscellaneous topics. diff --git a/share/doc/smm/18.net/3.t b/share/doc/smm/18.net/3.t new file mode 100644 index 00000000000..1d1fddd1fb5 --- /dev/null +++ b/share/doc/smm/18.net/3.t @@ -0,0 +1,59 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)3.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH Goals +.br +.ne 2i +.NH +\s+2Goals\s0 +.PP +The networking system was designed with the goal of supporting +multiple \fIprotocol families\fP and addressing styles. This required +information to be ``hidden'' in common data structures which +could be manipulated by all the pieces of the system, but which +required interpretation only by the protocols which ``controlled'' +it. The system described here attempts to minimize +the use of shared data structures to those kept by a suite of +protocols (a \fIprotocol family\fP), and those used for rendezvous +between ``synchronous'' and ``asynchronous'' portions of the +system (e.g. queues of data packets are filled at interrupt +time and emptied based on user requests). +.PP +A major goal of the system was to provide a framework within +which new protocols and hardware could be easily be supported. +To this end, a great deal of effort has been extended to +create utility routines which hide many of the more +complex and/or hardware dependent chores of networking. +Later sections describe the utility routines and the underlying +data structures they manipulate. diff --git a/share/doc/smm/18.net/4.t b/share/doc/smm/18.net/4.t new file mode 100644 index 00000000000..afa6913fcad --- /dev/null +++ b/share/doc/smm/18.net/4.t @@ -0,0 +1,67 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)4.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Address representation +.br +.ne 2i +.NH +\s+2Internal address representation\s0 +.PP +Common to all portions of the system are two data structures. +These structures are used to represent +addresses and various data objects. +Addresses, internally are described by the \fIsockaddr\fP structure, +.DS +._f +struct sockaddr { + short sa_family; /* data format identifier */ + char sa_data[14]; /* address */ +}; +.DE +All addresses belong to one or more \fIaddress families\fP +which define their format and interpretation. +The \fIsa_family\fP field indicates the address family to which the address +belongs, and the \fIsa_data\fP field contains the actual data value. +The size of the data field, 14 bytes, was selected based on a study +of current address formats.* +Specific address formats use private structure definitions +that define the format of the data field. +The system interface supports larger address structures, +although address-family-independent support facilities, for example routing +and raw socket interfaces, provide only 14 bytes for address storage. +Protocols that do not use those facilities (e.g, the current Unix domain) +may use larger data areas. +.FS +* Later versions of the system may support variable length addresses. +.FE diff --git a/share/doc/smm/18.net/5.t b/share/doc/smm/18.net/5.t new file mode 100644 index 00000000000..d4fb8e38fa2 --- /dev/null +++ b/share/doc/smm/18.net/5.t @@ -0,0 +1,184 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)5.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Memory management +.br +.ne 2i +.NH +\s+2Memory management\s0 +.PP +A single mechanism is used for data storage: memory buffers, or +\fImbuf\fP's. An mbuf is a structure of the form: +.DS +._f +struct mbuf { + struct mbuf *m_next; /* next buffer in chain */ + u_long m_off; /* offset of data */ + short m_len; /* amount of data in this mbuf */ + short m_type; /* mbuf type (accounting) */ + u_char m_dat[MLEN]; /* data storage */ + struct mbuf *m_act; /* link in higher-level mbuf list */ +}; +.DE +The \fIm_next\fP field is used to chain mbufs together on linked +lists, while the \fIm_act\fP field allows lists of mbuf chains to be +accumulated. By convention, the mbufs common to a single object +(for example, a packet) are chained together with the \fIm_next\fP +field, while groups of objects are linked via the \fIm_act\fP +field (possibly when in a queue). +.PP +Each mbuf has a small data area for storing information, \fIm_dat\fP. +The \fIm_len\fP field indicates the amount of data, while the \fIm_off\fP +field is an offset to the beginning of the data from the base of the +mbuf. Thus, for example, the macro \fImtod\fP, which converts a pointer +to an mbuf to a pointer to the data stored in the mbuf, has the form +.DS +._d +#define mtod(\fIx\fP,\fIt\fP) ((\fIt\fP)((int)(\fIx\fP) + (\fIx\fP)->m_off)) +.DE +(note the \fIt\fP parameter, a C type cast, which is used to cast +the resultant pointer for proper assignment). +.PP +In addition to storing data directly in the mbuf's data area, data +of page size may be also be stored in a separate area of memory. +The mbuf utility routines maintain +a pool of pages for this purpose and manipulate a private page map +for such pages. +An mbuf with an external data area may be recognized by the larger +offset to the data area; +this is formalized by the macro M_HASCL(\fIm\fP), which is true +if the mbuf whose address is \fIm\fP has an external page cluster. +An array of reference counts on pages is also maintained +so that copies of pages may be made without core to core +copying (copies are created simply by duplicating the reference to the data +and incrementing the associated reference counts for the pages). +Separate data pages are currently used only +when copying data from a user process into the kernel, +and when bringing data in at the hardware level. Routines which +manipulate mbufs are not normally aware whether data is stored directly in +the mbuf data array, or if it is kept in separate pages. +.PP +The following may be used to allocate and free mbufs: +.LP +m = m_get(wait, type); +.br +MGET(m, wait, type); +.IP +The subroutine \fIm_get\fP and the macro \fIMGET\fP +each allocate an mbuf, placing its address in \fIm\fP. +The argument \fIwait\fP is either M_WAIT or M_DONTWAIT according +to whether allocation should block or fail if no mbuf is available. +The \fItype\fP is one of the predefined mbuf types for use in accounting +of mbuf allocation. +.IP "MCLGET(m);" +This macro attempts to allocate an mbuf page cluster +to associate with the mbuf \fIm\fP. +If successful, the length of the mbuf is set to CLSIZE, +the size of the page cluster. +.LP +n = m_free(m); +.br +MFREE(m,n); +.IP +The routine \fIm_free\fP and the macro \fIMFREE\fP +each free a single mbuf, \fIm\fP, and any associated external storage area, +placing a pointer to its successor in the chain it heads, if any, in \fIn\fP. +.IP "m_freem(m);" +This routine frees an mbuf chain headed by \fIm\fP. +.PP +The following utility routines are available for manipulating mbuf +chains: +.IP "m = m_copy(m0, off, len);" +.br +The \fIm_copy\fP routine create a copy of all, or part, of a +list of the mbufs in \fIm0\fP. \fILen\fP bytes of data, starting +\fIoff\fP bytes from the front of the chain, are copied. +Where possible, reference counts on pages are used instead +of core to core copies. The original mbuf chain must have at +least \fIoff\fP + \fIlen\fP bytes of data. If \fIlen\fP is +specified as M_COPYALL, all the data present, offset +as before, is copied. +.IP "m_cat(m, n);" +.br +The mbuf chain, \fIn\fP, is appended to the end of \fIm\fP. +Where possible, compaction is performed. +.IP "m_adj(m, diff);" +.br +The mbuf chain, \fIm\fP is adjusted in size by \fIdiff\fP +bytes. If \fIdiff\fP is non-negative, \fIdiff\fP bytes +are shaved off the front of the mbuf chain. If \fIdiff\fP +is negative, the alteration is performed from back to front. +No space is reclaimed in this operation; alterations are +accomplished by changing the \fIm_len\fP and \fIm_off\fP +fields of mbufs. +.IP "m = m_pullup(m0, size);" +.br +After a successful call to \fIm_pullup\fP, the mbuf at +the head of the returned list, \fIm\fP, is guaranteed +to have at least \fIsize\fP +bytes of data in contiguous memory within the data area of the mbuf +(allowing access via a pointer, obtained using the \fImtod\fP macro, +and allowing the mbuf to be located from a pointer to the data area +using \fIdtom\fP, defined below). +If the original data was less than \fIsize\fP bytes long, +\fIlen\fP was greater than the size of an mbuf data +area (112 bytes), or required resources were unavailable, +\fIm\fP is 0 and the original mbuf chain is deallocated. +.IP +This routine is particularly useful when verifying packet +header lengths on reception. For example, if a packet is +received and only 8 of the necessary 16 bytes required +for a valid packet header are present at the head of the list +of mbufs representing the packet, the remaining 8 bytes +may be ``pulled up'' with a single \fIm_pullup\fP call. +If the call fails the invalid packet will have been discarded. +.PP +By insuring that mbufs always reside on 128 byte boundaries, +it is always possible to locate the mbuf associated with a data +area by masking off the low bits of the virtual address. +This allows modules to store data structures in mbufs and +pass them around without concern for locating the original +mbuf when it comes time to free the structure. +Note that this works only with objects stored in the internal data +buffer of the mbuf. +The \fIdtom\fP macro is used to convert a pointer into an mbuf's +data area to a pointer to the mbuf, +.DS +#define dtom(x) ((struct mbuf *)((int)x & ~(MSIZE-1))) +.DE +.PP +Mbufs are used for dynamically allocated data structures such as +sockets as well as memory allocated for packets and headers. Statistics are +maintained on mbuf usage and can be viewed by users using the +\fInetstat\fP\|(1) program. diff --git a/share/doc/smm/18.net/6.t b/share/doc/smm/18.net/6.t new file mode 100644 index 00000000000..601988cdea8 --- /dev/null +++ b/share/doc/smm/18.net/6.t @@ -0,0 +1,664 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)6.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Internal layering +.br +.ne 2i +.NH +\s+2Internal layering\s0 +.PP +The internal structure of the network system is divided into +three layers. These +layers correspond to the services provided by the socket +abstraction, those provided by the communication protocols, +and those provided by the hardware interfaces. The communication +protocols are normally layered into two or more individual +cooperating layers, though they are collectively viewed +in the system as one layer providing services supportive +of the appropriate socket abstraction. +.PP +The following sections describe the properties of each layer +in the system and the interfaces to which each must conform. +.NH 2 +Socket layer +.PP +The socket layer deals with the interprocess communication +facilities provided by the system. A socket is a bidirectional +endpoint of communication which is ``typed'' by the semantics +of communication it supports. The system calls described in +the \fIBerkeley Software Architecture Manual\fP [Joy86] +are used to manipulate sockets. +.PP +A socket consists of the following data structure: +.DS +._f +struct socket { + short so_type; /* generic type */ + short so_options; /* from socket call */ + short so_linger; /* time to linger while closing */ + short so_state; /* internal state flags */ + caddr_t so_pcb; /* protocol control block */ + struct protosw *so_proto; /* protocol handle */ + struct socket *so_head; /* back pointer to accept socket */ + struct socket *so_q0; /* queue of partial connections */ + short so_q0len; /* partials on so_q0 */ + struct socket *so_q; /* queue of incoming connections */ + short so_qlen; /* number of connections on so_q */ + short so_qlimit; /* max number queued connections */ + struct sockbuf so_rcv; /* receive queue */ + struct sockbuf so_snd; /* send queue */ + short so_timeo; /* connection timeout */ + u_short so_error; /* error affecting connection */ + u_short so_oobmark; /* chars to oob mark */ + short so_pgrp; /* pgrp for signals */ +}; +.DE +.PP +Each socket contains two data queues, \fIso_rcv\fP and \fIso_snd\fP, +and a pointer to routines which provide supporting services. +The type of the socket, +\fIso_type\fP is defined at socket creation time and used in selecting +those services which are appropriate to support it. The supporting +protocol is selected at socket creation time and recorded in +the socket data structure for later use. Protocols are defined +by a table of procedures, the \fIprotosw\fP structure, which will +be described in detail later. A pointer to a protocol-specific +data structure, +the ``protocol control block,'' is also present in the socket structure. +Protocols control this data structure, which normally includes a +back pointer to the parent socket structure to allow easy +lookup when returning information to a user +(for example, placing an error number in the \fIso_error\fP +field). The other entries in the socket structure are used in +queuing connection requests, validating user requests, storing +socket characteristics (e.g. +options supplied at the time a socket is created), and maintaining +a socket's state. +.PP +Processes ``rendezvous at a socket'' in many instances. For instance, +when a process wishes to extract data from a socket's receive queue +and it is empty, or lacks sufficient data to satisfy the request, +the process blocks, supplying the address of the receive queue as +a ``wait channel' to be used in notification. When data arrives +for the process and is placed in the socket's queue, the blocked +process is identified by the fact it is waiting ``on the queue.'' +.NH 3 +Socket state +.PP +A socket's state is defined from the following: +.DS +.ta \w'#define 'u +\w'SS_ISDISCONNECTING 'u +\w'0x000 'u +#define SS_NOFDREF 0x001 /* no file table ref any more */ +#define SS_ISCONNECTED 0x002 /* socket connected to a peer */ +#define SS_ISCONNECTING 0x004 /* in process of connecting to peer */ +#define SS_ISDISCONNECTING 0x008 /* in process of disconnecting */ +#define SS_CANTSENDMORE 0x010 /* can't send more data to peer */ +#define SS_CANTRCVMORE 0x020 /* can't receive more data from peer */ +#define SS_RCVATMARK 0x040 /* at mark on input */ + +#define SS_PRIV 0x080 /* privileged */ +#define SS_NBIO 0x100 /* non-blocking ops */ +#define SS_ASYNC 0x200 /* async i/o notify */ +.DE +.PP +The state of a socket is manipulated both by the protocols +and the user (through system calls). +When a socket is created, the state is defined based on the type of socket. +It may change as control actions are performed, for example connection +establishment. +It may also change according to the type of +input/output the user wishes to perform, as indicated by options +set with \fIfcntl\fP. ``Non-blocking'' I/O implies that +a process should never be blocked to await resources. Instead, any +call which would block returns prematurely +with the error EWOULDBLOCK, or the service request may be partially +fulfilled, e.g. a request for more data than is present. +.PP +If a process requested ``asynchronous'' notification of events +related to the socket, the SIGIO signal is posted to the process +when such events occur. +An event is a change in the socket's state; +examples of such occurrences are: space +becoming available in the send queue, new data available in the +receive queue, connection establishment or disestablishment, etc. +.PP +A socket may be marked ``privileged'' if it was created by the +super-user. Only privileged sockets may +bind addresses in privileged portions of an address space +or use ``raw'' sockets to access lower levels of the network. +.NH 3 +Socket data queues +.PP +A socket's data queue contains a pointer to the data stored in +the queue and other entries related to the management of +the data. The following structure defines a data queue: +.DS +._f +struct sockbuf { + u_short sb_cc; /* actual chars in buffer */ + u_short sb_hiwat; /* max actual char count */ + u_short sb_mbcnt; /* chars of mbufs used */ + u_short sb_mbmax; /* max chars of mbufs to use */ + u_short sb_lowat; /* low water mark */ + short sb_timeo; /* timeout */ + struct mbuf *sb_mb; /* the mbuf chain */ + struct proc *sb_sel; /* process selecting read/write */ + short sb_flags; /* flags, see below */ +}; +.DE +.PP +Data is stored in a queue as a chain of mbufs. +The actual count of data characters as well as high and low water marks are +used by the protocols in controlling the flow of data. +The amount of buffer space (characters of mbufs and associated data pages) +is also recorded along with the limit on buffer allocation. +The socket routines cooperate in implementing the flow control +policy by blocking a process when it requests to send data and +the high water mark has been reached, or when it requests to +receive data and less than the low water mark is present +(assuming non-blocking I/O has not been specified).* +.FS +* The low-water mark is always presumed to be 0 +in the current implementation. +.FE +.PP +When a socket is created, the supporting protocol ``reserves'' space +for the send and receive queues of the socket. +The limit on buffer allocation is set somewhat higher than the limit +on data characters +to account for the granularity of buffer allocation. +The actual storage associated with a +socket queue may fluctuate during a socket's lifetime, but it is assumed +that this reservation will always allow a protocol to acquire enough memory +to satisfy the high water marks. +.PP +The timeout and select values are manipulated by the socket routines +in implementing various portions of the interprocess communications +facilities and will not be described here. +.PP +Data queued at a socket is stored in one of two styles. +Stream-oriented sockets queue data with no addresses, headers +or record boundaries. +The data are in mbufs linked through the \fIm_next\fP field. +Buffers containing access rights may be present within the chain +if the underlying protocol supports passage of access rights. +Record-oriented sockets, including datagram sockets, +queue data as a list of packets; the sections of packets are distinguished +by the types of the mbufs containing them. +The mbufs which comprise a record are linked through the \fIm_next\fP field; +records are linked from the \fIm_act\fP field of the first mbuf +of one packet to the first mbuf of the next. +Each packet begins with an mbuf containing the ``from'' address +if the protocol provides it, +then any buffers containing access rights, and finally any buffers +containing data. +If a record contains no data, +no data buffers are required unless neither address nor access rights +are present. +.PP +A socket queue has a number of flags used in synchronizing access +to the data and in acquiring resources: +.DS +._d +#define SB_LOCK 0x01 /* lock on data queue (so_rcv only) */ +#define SB_WANT 0x02 /* someone is waiting to lock */ +#define SB_WAIT 0x04 /* someone is waiting for data/space */ +#define SB_SEL 0x08 /* buffer is selected */ +#define SB_COLL 0x10 /* collision selecting */ +.DE +The last two flags are manipulated by the system in implementing +the select mechanism. +.NH 3 +Socket connection queuing +.PP +In dealing with connection oriented sockets (e.g. SOCK_STREAM) +the two ends are considered distinct. One end is termed +\fIactive\fP, and generates connection requests. The other +end is called \fIpassive\fP and accepts connection requests. +.PP +From the passive side, a socket is marked with +SO_ACCEPTCONN when a \fIlisten\fP call is made, +creating two queues of sockets: \fIso_q0\fP for connections +in progress and \fIso_q\fP for connections already made and +awaiting user acceptance. +As a protocol is preparing incoming connections, it creates +a socket structure queued on \fIso_q0\fP by calling the routine +\fIsonewconn\fP(). When the connection +is established, the socket structure is then transferred +to \fIso_q\fP, making it available for an \fIaccept\fP. +.PP +If an SO_ACCEPTCONN socket is closed with sockets on either +\fIso_q0\fP or \fIso_q\fP, these sockets are dropped, +with notification to the peers as appropriate. +.NH 2 +Protocol layer(s) +.PP +Each socket is created in a communications domain, +which usually implies both an addressing structure (address family) +and a set of protocols which implement various socket types within the domain +(protocol family). +Each domain is defined by the following structure: +.DS +.ta .5i +\w'struct 'u +\w'(*dom_externalize)(); 'u +struct domain { + int dom_family; /* PF_xxx */ + char *dom_name; + int (*dom_init)(); /* initialize domain data structures */ + int (*dom_externalize)(); /* externalize access rights */ + int (*dom_dispose)(); /* dispose of internalized rights */ + struct protosw *dom_protosw, *dom_protoswNPROTOSW; + struct domain *dom_next; +}; +.DE +.PP +At boot time, each domain configured into the kernel +is added to a linked list of domain. +The initialization procedure of each domain is then called. +After that time, the domain structure is used to locate protocols +within the protocol family. +It may also contain procedure references +for externalization of access rights at the receiving socket +and the disposal of access rights that are not received. +.PP +Protocols are described by a set of entry points and certain +socket-visible characteristics, some of which are used in +deciding which socket type(s) they may support. +.PP +An entry in the ``protocol switch'' table exists for each +protocol module configured into the system. It has the following form: +.DS +.ta .5i +\w'struct 'u +\w'domain *pr_domain; 'u +struct protosw { + short pr_type; /* socket type used for */ + struct domain *pr_domain; /* domain protocol a member of */ + short pr_protocol; /* protocol number */ + short pr_flags; /* socket visible attributes */ +/* protocol-protocol hooks */ + int (*pr_input)(); /* input to protocol (from below) */ + int (*pr_output)(); /* output to protocol (from above) */ + int (*pr_ctlinput)(); /* control input (from below) */ + int (*pr_ctloutput)(); /* control output (from above) */ +/* user-protocol hook */ + int (*pr_usrreq)(); /* user request */ +/* utility hooks */ + int (*pr_init)(); /* initialization routine */ + int (*pr_fasttimo)(); /* fast timeout (200ms) */ + int (*pr_slowtimo)(); /* slow timeout (500ms) */ + int (*pr_drain)(); /* flush any excess space possible */ +}; +.DE +.PP +A protocol is called through the \fIpr_init\fP entry before any other. +Thereafter it is called every 200 milliseconds through the +\fIpr_fasttimo\fP entry and +every 500 milliseconds through the \fIpr_slowtimo\fP for timer based actions. +The system will call the \fIpr_drain\fP entry if it is low on space and +this should throw away any non-critical data. +.PP +Protocols pass data between themselves as chains of mbufs using +the \fIpr_input\fP and \fIpr_output\fP routines. \fIPr_input\fP +passes data up (towards +the user) and \fIpr_output\fP passes it down (towards the network); control +information passes up and down on \fIpr_ctlinput\fP and \fIpr_ctloutput\fP. +The protocol is responsible for the space occupied by any of the +arguments to these entries and must either pass it onward or dispose of it. +(On output, the lowest level reached must free buffers storing the arguments; +on input, the highest level is responsible for freeing buffers.) +.PP +The \fIpr_usrreq\fP routine interfaces protocols to the socket +code and is described below. +.PP +The \fIpr_flags\fP field is constructed from the following values: +.DS +.ta \w'#define 'u +\w'PR_CONNREQUIRED 'u +8n +#define PR_ATOMIC 0x01 /* exchange atomic messages only */ +#define PR_ADDR 0x02 /* addresses given with messages */ +#define PR_CONNREQUIRED 0x04 /* connection required by protocol */ +#define PR_WANTRCVD 0x08 /* want PRU_RCVD calls */ +#define PR_RIGHTS 0x10 /* passes capabilities */ +.DE +Protocols which are connection-based specify the PR_CONNREQUIRED +flag so that the socket routines will never attempt to send data +before a connection has been established. If the PR_WANTRCVD flag +is set, the socket routines will notify the protocol when the user +has removed data from the socket's receive queue. This allows +the protocol to implement acknowledgement on user receipt, and +also update windowing information based on the amount of space +available in the receive queue. The PR_ADDR field indicates that any +data placed in the socket's receive queue will be preceded by the +address of the sender. The PR_ATOMIC flag specifies that each \fIuser\fP +request to send data must be performed in a single \fIprotocol\fP send +request; it is the protocol's responsibility to maintain record +boundaries on data to be sent. The PR_RIGHTS flag indicates that the +protocol supports the passing of capabilities; this is currently +used only by the protocols in the UNIX protocol family. +.PP +When a socket is created, the socket routines scan the protocol +table for the domain +looking for an appropriate protocol to support the type of +socket being created. The \fIpr_type\fP field contains one of the +possible socket types (e.g. SOCK_STREAM), while the \fIpr_domain\fP +is a back pointer to the domain structure. +The \fIpr_protocol\fP field contains the protocol number of the +protocol, normally a well-known value. +.NH 2 +Network-interface layer +.PP +Each network-interface configured into a system defines a +path through which packets may be sent and received. +Normally a hardware device is associated with this interface, +though there is no requirement for this (for example, all +systems have a software ``loopback'' interface used for +debugging and performance analysis). +In addition to manipulating the hardware device, an interface +module is responsible +for encapsulation and decapsulation of any link-layer header +information required to deliver a message to its destination. +The selection of which interface to use in delivering packets +is a routing decision carried out at a +higher level than the network-interface layer. +An interface may have addresses in one or more address families. +The address is set at boot time using an \fIioctl\fP on a socket +in the appropriate domain; this operation is implemented by the protocol +family, after verifying the operation through the device \fIioctl\fP entry. +.PP +An interface is defined by the following structure, +.DS +.ta .5i +\w'struct 'u +\w'ifaddr *if_addrlist; 'u +struct ifnet { + char *if_name; /* name, e.g. ``en'' or ``lo'' */ + short if_unit; /* sub-unit for lower level driver */ + short if_mtu; /* maximum transmission unit */ + short if_flags; /* up/down, broadcast, etc. */ + short if_timer; /* time 'til if_watchdog called */ + struct ifaddr *if_addrlist; /* list of addresses of interface */ + struct ifqueue if_snd; /* output queue */ + int (*if_init)(); /* init routine */ + int (*if_output)(); /* output routine */ + int (*if_ioctl)(); /* ioctl routine */ + int (*if_reset)(); /* bus reset routine */ + int (*if_watchdog)(); /* timer routine */ + int if_ipackets; /* packets received on interface */ + int if_ierrors; /* input errors on interface */ + int if_opackets; /* packets sent on interface */ + int if_oerrors; /* output errors on interface */ + int if_collisions; /* collisions on csma interfaces */ + struct ifnet *if_next; +}; +.DE +Each interface address has the following form: +.DS +.ta \w'#define 'u +\w'struct 'u +\w'struct 'u +\w'sockaddr ifa_addr; 'u-\w'struct 'u +struct ifaddr { + struct sockaddr ifa_addr; /* address of interface */ + union { + struct sockaddr ifu_broadaddr; + struct sockaddr ifu_dstaddr; + } ifa_ifu; + struct ifnet *ifa_ifp; /* back-pointer to interface */ + struct ifaddr *ifa_next; /* next address for interface */ +}; +.ta \w'#define 'u +\w'ifa_broadaddr 'u +\w'ifa_ifu.ifu_broadaddr 'u +#define ifa_broadaddr ifa_ifu.ifu_broadaddr /* broadcast address */ +#define ifa_dstaddr ifa_ifu.ifu_dstaddr /* other end of p-to-p link */ +.DE +The protocol generally maintains this structure as part of a larger +structure containing additional information concerning the address. +.PP +Each interface has a send queue and routines used for +initialization, \fIif_init\fP, and output, \fIif_output\fP. +If the interface resides on a system bus, the routine \fIif_reset\fP +will be called after a bus reset has been performed. +An interface may also +specify a timer routine, \fIif_watchdog\fP; +if \fIif_timer\fP is non-zero, it is decremented once per second +until it reaches zero, at which time the watchdog routine is called. +.PP +The state of an interface and certain characteristics are stored in +the \fIif_flags\fP field. The following values are possible: +.DS +._d +#define IFF_UP 0x1 /* interface is up */ +#define IFF_BROADCAST 0x2 /* broadcast is possible */ +#define IFF_DEBUG 0x4 /* turn on debugging */ +#define IFF_LOOPBACK 0x8 /* is a loopback net */ +#define IFF_POINTOPOINT 0x10 /* interface is point-to-point link */ +#define IFF_NOTRAILERS 0x20 /* avoid use of trailers */ +#define IFF_RUNNING 0x40 /* resources allocated */ +#define IFF_NOARP 0x80 /* no address resolution protocol */ +.DE +If the interface is connected to a network which supports transmission +of \fIbroadcast\fP packets, the IFF_BROADCAST flag will be set and +the \fIifa_broadaddr\fP field will contain the address to be used in +sending or accepting a broadcast packet. If the interface is associated +with a point-to-point hardware link (for example, a DEC DMR-11), the +IFF_POINTOPOINT flag will be set and \fIifa_dstaddr\fP will contain the +address of the host on the other side of the connection. These addresses +and the local address of the interface, \fIif_addr\fP, are used in +filtering incoming packets. The interface sets IFF_RUNNING after +it has allocated system resources and posted an initial read on the +device it manages. This state bit is used to avoid multiple allocation +requests when an interface's address is changed. The IFF_NOTRAILERS +flag indicates the interface should refrain from using a \fItrailer\fP +encapsulation on outgoing packets, or (where per-host negotiation +of trailers is possible) that trailer encapsulations should not be requested; +\fItrailer\fP protocols are described +in section 14. The IFF_NOARP flag indicates the interface should not +use an ``address resolution protocol'' in mapping internetwork addresses +to local network addresses. +.PP +Various statistics are also stored in the interface structure. These +may be viewed by users using the \fInetstat\fP(1) program. +.PP +The interface address and flags may be set with the SIOCSIFADDR and +SIOCSIFFLAGS \fIioctl\fP\^s. SIOCSIFADDR is used initially to define each +interface's address; SIOGSIFFLAGS can be used to mark +an interface down and perform site-specific configuration. +The destination address of a point-to-point link is set with SIOCSIFDSTADDR. +Corresponding operations exist to read each value. +Protocol families may also support operations to set and read the broadcast +address. +In addition, the SIOCGIFCONF \fIioctl\fP retrieves a list of interface +names and addresses for all interfaces and protocols on the host. +.NH 3 +UNIBUS interfaces +.PP +All hardware related interfaces currently reside on the UNIBUS. +Consequently a common set of utility routines for dealing +with the UNIBUS has been developed. Each UNIBUS interface +utilizes a structure of the following form: +.DS +.ta \w'#define 'u +\w'ifw_xtofree 'u +\w'pte ifu_wmap[IF_MAXNUBAMR]; 'u +struct ifubinfo { + short iff_uban; /* uba number */ + short iff_hlen; /* local net header length */ + struct uba_regs *iff_uba; /* uba regs, in vm */ + short iff_flags; /* used during uballoc's */ +}; +.DE +Additional structures are associated with each receive and transmit buffer, +normally one each per interface; for read, +.DS +.ta \w'#define 'u +\w'ifw_xtofree 'u +\w'pte ifu_wmap[IF_MAXNUBAMR]; 'u +struct ifrw { + caddr_t ifrw_addr; /* virt addr of header */ + short ifrw_bdp; /* unibus bdp */ + short ifrw_flags; /* type, etc. */ +#define IFRW_W 0x01 /* is a transmit buffer */ + int ifrw_info; /* value from ubaalloc */ + int ifrw_proto; /* map register prototype */ + struct pte *ifrw_mr; /* base of map registers */ +}; +.DE +and for write, +.DS +.ta \w'#define 'u +\w'ifw_xtofree 'u +\w'pte ifu_wmap[IF_MAXNUBAMR]; 'u +struct ifxmt { + struct ifrw ifrw; + caddr_t ifw_base; /* virt addr of buffer */ + struct pte ifw_wmap[IF_MAXNUBAMR]; /* base pages for output */ + struct mbuf *ifw_xtofree; /* pages being dma'd out */ + short ifw_xswapd; /* mask of clusters swapped */ + short ifw_nmr; /* number of entries in wmap */ +}; +.ta \w'#define 'u +\w'ifw_xtofree 'u +\w'pte ifu_wmap[IF_MAXNUBAMR]; 'u +#define ifw_addr ifrw.ifrw_addr +#define ifw_bdp ifrw.ifrw_bdp +#define ifw_flags ifrw.ifrw_flags +#define ifw_info ifrw.ifrw_info +#define ifw_proto ifrw.ifrw_proto +#define ifw_mr ifrw.ifrw_mr +.DE +One of each of these structures is conveniently packaged for interfaces +with single buffers for each direction, as follows: +.DS +.ta \w'#define 'u +\w'ifw_xtofree 'u +\w'pte ifu_wmap[IF_MAXNUBAMR]; 'u +struct ifuba { + struct ifubinfo ifu_info; + struct ifrw ifu_r; + struct ifxmt ifu_xmt; +}; +.ta \w'#define 'u +\w'ifw_xtofree 'u +#define ifu_uban ifu_info.iff_uban +#define ifu_hlen ifu_info.iff_hlen +#define ifu_uba ifu_info.iff_uba +#define ifu_flags ifu_info.iff_flags +#define ifu_w ifu_xmt.ifrw +#define ifu_xtofree ifu_xmt.ifw_xtofree +.DE +.PP +The \fIif_ubinfo\fP structure contains the general information needed +to characterize the I/O-mapped buffers for the device. +In addition, there is a structure describing each buffer, including +UNIBUS resources held by the interface. +Sufficient memory pages and bus map registers are allocated to each buffer +upon initialization according to the maximum packet size and header length. +The kernel virtual address of the buffer is held in \fIifrw_addr\fP, +and the map registers begin +at \fIifrw_mr\fP. UNIBUS map register \fIifrw_mr\fP\^[\-1] +maps the local network header +ending on a page boundary. UNIBUS data paths are +reserved for read and for +write, given by \fIifrw_bdp\fP. The prototype of the map +registers for read and for write is saved in \fIifrw_proto\fP. +.PP +When write transfers are not at least half-full pages on page boundaries, +the data are just copied into the pages mapped on the UNIBUS +and the transfer is started. +If a write transfer is at least half a page long and on a page +boundary, UNIBUS page table entries are swapped to reference +the pages, and then the initial pages are +remapped from \fIifw_wmap\fP when the transfer completes. +The mbufs containing the mapped pages are placed on the \fIifw_xtofree\fP +queue to be freed after transmission. +.PP +When read transfers give at least half a page of data to be input, page +frames are allocated from a network page list and traded +with the pages already containing the data, mapping the allocated +pages to replace the input pages for the next UNIBUS data input. +.PP +The following utility routines are available for use in +writing network interface drivers; all use the +structures described above. +.LP +if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx); +.br +if_ubainit(ifuba, uban, hlen, nmr); +.IP +\fIif_ubaminit\fP allocates resources on UNIBUS adapter \fIuban\fP, +storing the information in the \fIifubinfo\fP, \fIifrw\fP and \fIifxmt\fP +structures referenced. +The \fIifr\fP and \fIifx\fP parameters are pointers to arrays +of \fIifrw\fP and \fIifxmt\fP structures whose dimensions +are \fInr\fP and \fInx\fP, respectively. +\fIif_ubainit\fP is a simpler, backwards-compatible interface used +for hardware with single buffers of each type. +They are called only at boot time or after a UNIBUS reset. +One data path (buffered or unbuffered, +depending on the \fIifu_flags\fP field) is allocated for each buffer. +The \fInmr\fP parameter indicates +the number of UNIBUS mapping registers required to map a maximal +sized packet onto the UNIBUS, while \fIhlen\fP specifies the size +of a local network header, if any, which should be mapped separately +from the data (see the description of trailer protocols in chapter 14). +Sufficient UNIBUS mapping registers and pages of memory are allocated +to initialize the input data path for an initial read. For the output +data path, mapping registers and pages of memory are also allocated +and mapped onto the UNIBUS. The pages associated with the output +data path are held in reserve in the event a write requires copying +non-page-aligned data (see \fIif_wubaput\fP below). +If \fIif_ubainit\fP is called with memory pages already allocated, +they will be used instead of allocating new ones (this normally +occurs after a UNIBUS reset). +A 1 is returned when allocation and initialization are successful, +0 otherwise. +.LP +m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp); +.br +m = if_rubaget(ifuba, totlen, off0, ifp); +.IP +\fIif_ubaget\fP and \fIif_rubaget\fP pull input data +out of an interface receive buffer and into an mbuf chain. +The first interface passes pointers to the \fIifubinfo\fP structure +for the interface and the \fIifrw\fP structure for the receive buffer; +the second call may be used for single-buffered devices. +\fItotlen\fP specifies the length of data to be obtained, not counting the +local network header. If \fIoff0\fP is non-zero, it indicates +a byte offset to a trailing local network header which should be +copied into a separate mbuf and prepended to the front of the resultant mbuf +chain. When the data amount to at least a half a page, +the previously mapped data pages are remapped +into the mbufs and swapped with fresh pages, thus avoiding +any copy. +The receiving interface is recorded as \fIifp\fP, a pointer to an \fIifnet\fP +structure, for the use of the receiving network protocol. +A 0 return value indicates a failure to allocate resources. +.LP +if_wubaput(ifubinfo, ifx, m); +.br +if_wubaput(ifuba, m); +.IP +\fIif_ubaput\fP and \fIif_wubaput\fP map a chain of mbufs +onto a network interface in preparation for output. +The first interface is used by devices with multiple transmit buffers. +The chain includes any local network +header, which is copied so that it resides in the mapped and +aligned I/O space. +Page-aligned data that are page-aligned in the output buffer +are mapped to the UNIBUS in place of the normal buffer page, +and the corresponding mbuf is placed on a queue to be freed after transmission. +Any other mbufs which contained non-page-sized +data portions are copied to the I/O space and then freed. +Pages mapped from a previous output operation (no longer needed) +are unmapped. diff --git a/share/doc/smm/18.net/7.t b/share/doc/smm/18.net/7.t new file mode 100644 index 00000000000..8ccecc881de --- /dev/null +++ b/share/doc/smm/18.net/7.t @@ -0,0 +1,256 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)7.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.br +.ne 30v +.\".ds RH "Socket/protocol interface +.NH +\s+2Socket/protocol interface\s0 +.PP +The interface between the socket routines and the communication +protocols is through the \fIpr_usrreq\fP routine defined in the +protocol switch table. The following requests to a protocol +module are possible: +.DS +._d +#define PRU_ATTACH 0 /* attach protocol */ +#define PRU_DETACH 1 /* detach protocol */ +#define PRU_BIND 2 /* bind socket to address */ +#define PRU_LISTEN 3 /* listen for connection */ +#define PRU_CONNECT 4 /* establish connection to peer */ +#define PRU_ACCEPT 5 /* accept connection from peer */ +#define PRU_DISCONNECT 6 /* disconnect from peer */ +#define PRU_SHUTDOWN 7 /* won't send any more data */ +#define PRU_RCVD 8 /* have taken data; more room now */ +#define PRU_SEND 9 /* send this data */ +#define PRU_ABORT 10 /* abort (fast DISCONNECT, DETATCH) */ +#define PRU_CONTROL 11 /* control operations on protocol */ +#define PRU_SENSE 12 /* return status into m */ +#define PRU_RCVOOB 13 /* retrieve out of band data */ +#define PRU_SENDOOB 14 /* send out of band data */ +#define PRU_SOCKADDR 15 /* fetch socket's address */ +#define PRU_PEERADDR 16 /* fetch peer's address */ +#define PRU_CONNECT2 17 /* connect two sockets */ +/* begin for protocols internal use */ +#define PRU_FASTTIMO 18 /* 200ms timeout */ +#define PRU_SLOWTIMO 19 /* 500ms timeout */ +#define PRU_PROTORCV 20 /* receive from below */ +#define PRU_PROTOSEND 21 /* send to below */ +.DE +A call on the user request routine is of the form, +.DS +._f +error = (*protosw[].pr_usrreq)(so, req, m, addr, rights); +int error; struct socket *so; int req; struct mbuf *m, *addr, *rights; +.DE +The mbuf data chain \fIm\fP is supplied for output operations +and for certain other operations where it is to receive a result. +The address \fIaddr\fP is supplied for address-oriented requests +such as PRU_BIND and PRU_CONNECT. +The \fIrights\fP parameter is an optional pointer to an mbuf +chain containing user-specified capabilities (see the \fIsendmsg\fP +and \fIrecvmsg\fP system calls). The protocol is responsible for +disposal of the data mbuf chains on output operations. +A non-zero return value gives a +UNIX error number which should be passed to higher level software. +The following paragraphs describe each +of the requests possible. +.IP PRU_ATTACH +.br +When a protocol is bound to a socket (with the \fIsocket\fP +system call) the protocol module is called with this +request. It is the responsibility of the protocol module to +allocate any resources necessary. +The ``attach'' request +will always precede any of the other requests, and should not +occur more than once. +.IP PRU_DETACH +.br +This is the antithesis of the attach request, and is used +at the time a socket is deleted. The protocol module may +deallocate any resources assigned to the socket. +.IP PRU_BIND +.br +When a socket is initially created it has no address bound +to it. This request indicates that an address should be bound to +an existing socket. The protocol module must verify that the +requested address is valid and available for use. +.IP PRU_LISTEN +.br +The ``listen'' request indicates the user wishes to listen +for incoming connection requests on the associated socket. +The protocol module should perform any state changes needed +to carry out this request (if possible). A ``listen'' request +always precedes a request to accept a connection. +.IP PRU_CONNECT +.br +The ``connect'' request indicates the user wants to a establish +an association. The \fIaddr\fP parameter supplied describes +the peer to be connected to. The effect of a connect request +may vary depending on the protocol. Virtual circuit protocols, +such as TCP [Postel81b], use this request to initiate establishment of a +TCP connection. Datagram protocols, such as UDP [Postel80], simply +record the peer's address in a private data structure and use +it to tag all outgoing packets. There are no restrictions +on how many times a connect request may be used after an attach. +If a protocol supports the notion of \fImulti-casting\fP, it +is possible to use multiple connects to establish a multi-cast +group. Alternatively, an association may be broken by a +PRU_DISCONNECT request, and a new association created with a +subsequent connect request; all without destroying and creating +a new socket. +.IP PRU_ACCEPT +.br +Following a successful PRU_LISTEN request and the arrival +of one or more connections, this request is made to +indicate the user +has accepted the first connection on the queue of +pending connections. The protocol module should fill +in the supplied address buffer with the address of the +connected party. +.IP PRU_DISCONNECT +.br +Eliminate an association created with a PRU_CONNECT request. +.IP PRU_SHUTDOWN +.br +This call is used to indicate no more data will be sent and/or +received (the \fIaddr\fP parameter indicates the direction of +the shutdown, as encoded in the \fIsoshutdown\fP system call). +The protocol may, at its discretion, deallocate any data +structures related to the shutdown and/or notify a connected peer +of the shutdown. +.IP PRU_RCVD +.br +This request is made only if the protocol entry in the protocol +switch table includes the PR_WANTRCVD flag. +When a user removes data from the receive queue this request +will be sent to the protocol module. It may be used to trigger +acknowledgements, refresh windowing information, initiate +data transfer, etc. +.IP PRU_SEND +.br +Each user request to send data is translated into one or more +PRU_SEND requests (a protocol may indicate that a single user +send request must be translated into a single PRU_SEND request by +specifying the PR_ATOMIC flag in its protocol description). +The data to be sent is presented to the protocol as a list of +mbufs and an address is, optionally, supplied in the \fIaddr\fP +parameter. The protocol is responsible for preserving the data +in the socket's send queue if it is not able to send it immediately, +or if it may need it at some later time (e.g. for retransmission). +.IP PRU_ABORT +.br +This request indicates an abnormal termination of service. The +protocol should delete any existing association(s). +.IP PRU_CONTROL +.br +The ``control'' request is generated when a user performs a +UNIX \fIioctl\fP system call on a socket (and the ioctl is not +intercepted by the socket routines). It allows protocol-specific +operations to be provided outside the scope of the common socket +interface. The \fIaddr\fP parameter contains a pointer to a static +kernel data area where relevant information may be obtained or returned. +The \fIm\fP parameter contains the actual \fIioctl\fP request code +(note the non-standard calling convention). +The \fIrights\fP parameter contains a pointer to an \fIifnet\fP structure +if the \fIioctl\fP operation pertains to a particular network interface. +.IP PRU_SENSE +.br +The ``sense'' request is generated when the user makes an \fIfstat\fP +system call on a socket; it requests status of the associated socket. +This currently returns a standard \fIstat\fP structure. +It typically contains only the +optimal transfer size for the connection (based on buffer size, +windowing information and maximum packet size). +The \fIm\fP parameter contains a pointer +to a static kernel data area where the status buffer should be placed. +.IP PRU_RCVOOB +.br +Any ``out-of-band'' data presently available is to be returned. An +mbuf is passed to the protocol module, and the protocol +should either place +data in the mbuf or attach new mbufs to the one supplied if there is +insufficient space in the single mbuf. +An error may be returned if out-of-band data is not (yet) available +or has already been consumed. +The \fIaddr\fP parameter contains any options such as MSG_PEEK +to examine data without consuming it. +.IP PRU_SENDOOB +.br +Like PRU_SEND, but for out-of-band data. +.IP PRU_SOCKADDR +.br +The local address of the socket is returned, if any is currently +bound to it. The address (with protocol specific format) is returned +in the \fIaddr\fP parameter. +.IP PRU_PEERADDR +.br +The address of the peer to which the socket is connected is returned. +The socket must be in a SS_ISCONNECTED state for this request to +be made to the protocol. The address format (protocol specific) is +returned in the \fIaddr\fP parameter. +.IP PRU_CONNECT2 +.br +The protocol module is supplied two sockets and requested to +establish a connection between the two without binding any +addresses, if possible. This call is used in implementing +the +.IR socketpair (2) +system call. +.PP +The following requests are used internally by the protocol modules +and are never generated by the socket routines. In certain instances, +they are handed to the \fIpr_usrreq\fP routine solely for convenience +in tracing a protocol's operation (e.g. PRU_SLOWTIMO). +.IP PRU_FASTTIMO +.br +A ``fast timeout'' has occurred. This request is made when a timeout +occurs in the protocol's \fIpr_fastimo\fP routine. The \fIaddr\fP +parameter indicates which timer expired. +.IP PRU_SLOWTIMO +.br +A ``slow timeout'' has occurred. This request is made when a timeout +occurs in the protocol's \fIpr_slowtimo\fP routine. The \fIaddr\fP +parameter indicates which timer expired. +.IP PRU_PROTORCV +.br +This request is used in the protocol-protocol interface, not by the +routines. It requests reception of data destined for the protocol and +not the user. No protocols currently use this facility. +.IP PRU_PROTOSEND +.br +This request allows a protocol to send data destined for another +protocol module, not a user. The details of how data is marked +``addressed to protocol'' instead of ``addressed to user'' are +left to the protocol modules. No protocols currently use this facility. diff --git a/share/doc/smm/18.net/8.t b/share/doc/smm/18.net/8.t new file mode 100644 index 00000000000..e65e65665f1 --- /dev/null +++ b/share/doc/smm/18.net/8.t @@ -0,0 +1,166 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)8.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Protocol/protocol interface +.br +.ne 2i +.NH +\s+2Protocol/protocol interface\s0 +.PP +The interface between protocol modules is through the \fIpr_usrreq\fP, +\fIpr_input\fP, \fIpr_output\fP, \fIpr_ctlinput\fP, and +\fIpr_ctloutput\fP routines. The calling conventions for all +but the \fIpr_usrreq\fP routine are expected to be specific to +the protocol +modules and are not guaranteed to be consistent across protocol +families. We +will examine the conventions used for some of the Internet +protocols in this section as an example. +.NH 2 +pr_output +.PP +The Internet protocol UDP uses the convention, +.DS +error = udp_output(inp, m); +int error; struct inpcb *inp; struct mbuf *m; +.DE +where the \fIinp\fP, ``\fIin\fP\^ternet +\fIp\fP\^rotocol \fIc\fP\^ontrol \fIb\fP\^lock'', +passed between modules conveys per connection state information, and +the mbuf chain contains the data to be sent. UDP +performs consistency checks, appends its header, calculates a +checksum, etc. before passing the packet on. +UDP is based on the Internet Protocol, IP [Postel81a], as its transport. +UDP passes a packet to the IP module for output as follows: +.DS +error = ip_output(m, opt, ro, flags); +int error; struct mbuf *m, *opt; struct route *ro; int flags; +.DE +.PP +The call to IP's output routine is more complicated than that for +UDP, as befits the additional work the IP module must do. +The \fIm\fP parameter is the data to be sent, and the \fIopt\fP +parameter is an optional list of IP options which should +be placed in the IP packet header. The \fIro\fP parameter is +is used in making routing decisions (and passing them back to the +caller for use in subsequent calls). The +final parameter, \fIflags\fP contains flags indicating whether the +user is allowed to transmit a broadcast packet +and if routing is to be performed. The broadcast flag may +be inconsequential if the underlying hardware does not support the +notion of broadcasting. +.PP +All output routines return 0 on success and a UNIX error number +if a failure occurred which could be detected immediately +(no buffer space available, no route to destination, etc.). +.NH 2 +pr_input +.PP +Both UDP and TCP use the following calling convention, +.DS +(void) (*protosw[].pr_input)(m, ifp); +struct mbuf *m; struct ifnet *ifp; +.DE +Each mbuf list passed is a single packet to be processed by +the protocol module. +The interface from which the packet was received is passed as the second +parameter. +.PP +The IP input routine is a VAX software interrupt level routine, +and so is not called with any parameters. It instead communicates +with network interfaces through a queue, \fIipintrq\fP, which is +identical in structure to the queues used by the network interfaces +for storing packets awaiting transmission. +The software interrupt is enabled by the network interfaces +when they place input data on the input queue. +.NH 2 +pr_ctlinput +.PP +This routine is used to convey ``control'' information to a +protocol module (i.e. information which might be passed to the +user, but is not data). +.PP +The common calling convention for this routine is, +.DS +(void) (*protosw[].pr_ctlinput)(req, addr); +int req; struct sockaddr *addr; +.DE +The \fIreq\fP parameter is one of the following, +.DS +.ta \w'#define 'u +\w'PRC_UNREACH_NEEDFRAG 'u +8n +#define PRC_IFDOWN 0 /* interface transition */ +#define PRC_ROUTEDEAD 1 /* select new route if possible */ +#define PRC_QUENCH 4 /* some said to slow down */ +#define PRC_MSGSIZE 5 /* message size forced drop */ +#define PRC_HOSTDEAD 6 /* normally from IMP */ +#define PRC_HOSTUNREACH 7 /* ditto */ +#define PRC_UNREACH_NET 8 /* no route to network */ +#define PRC_UNREACH_HOST 9 /* no route to host */ +#define PRC_UNREACH_PROTOCOL 10 /* dst says bad protocol */ +#define PRC_UNREACH_PORT 11 /* bad port # */ +#define PRC_UNREACH_NEEDFRAG 12 /* IP_DF caused drop */ +#define PRC_UNREACH_SRCFAIL 13 /* source route failed */ +#define PRC_REDIRECT_NET 14 /* net routing redirect */ +#define PRC_REDIRECT_HOST 15 /* host routing redirect */ +#define PRC_REDIRECT_TOSNET 14 /* redirect for type of service & net */ +#define PRC_REDIRECT_TOSHOST 15 /* redirect for tos & host */ +#define PRC_TIMXCEED_INTRANS 18 /* packet lifetime expired in transit */ +#define PRC_TIMXCEED_REASS 19 /* lifetime expired on reass q */ +#define PRC_PARAMPROB 20 /* header incorrect */ +.DE +while the \fIaddr\fP parameter is the address to which the condition applies. +Many of the requests have obviously been +derived from ICMP (the Internet Control Message Protocol [Postel81c]), +and from error messages defined in the 1822 host/IMP convention +[BBN78]. Mapping tables exist to convert +control requests to UNIX error codes which are delivered +to a user. +.NH 2 +pr_ctloutput +.PP +This is the routine that implements per-socket options at the protocol +level for \fIgetsockopt\fP and \fIsetsockopt\fP. +The calling convention is, +.DS +error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp); +int op; struct socket *so; int level, optname; struct mbuf **mp; +.DE +where \fIop\fP is one of PRCO_SETOPT or PRCO_GETOPT, +\fIso\fP is the socket from whence the call originated, +and \fIlevel\fP and \fIoptname\fP are the protocol level and option name +supplied by the user. +The results of a PRCO_GETOPT call are returned in an mbuf whose address +is placed in \fImp\fP before return. +On a PRCO_SETOPT call, \fImp\fP contains the address of an mbuf +containing the option data; the mbuf should be freed before return. diff --git a/share/doc/smm/18.net/9.t b/share/doc/smm/18.net/9.t new file mode 100644 index 00000000000..506037a4297 --- /dev/null +++ b/share/doc/smm/18.net/9.t @@ -0,0 +1,124 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)9.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Protocol/network-interface +.br +.ne 2i +.NH +\s+2Protocol/network-interface interface\s0 +.PP +The lowest layer in the set of protocols which comprise a +protocol family must interface itself to one or more network +interfaces in order to transmit and receive +packets. It is assumed that +any routing decisions have been made before handing a packet +to a network interface, in fact this is absolutely necessary +in order to locate any interface at all (unless, of course, +one uses a single ``hardwired'' interface). There are two +cases with which to be concerned, transmission of a packet +and receipt of a packet; each will be considered separately. +.NH 2 +Packet transmission +.PP +Assuming a protocol has a handle on an interface, \fIifp\fP, +a (struct ifnet\ *), +it transmits a fully formatted packet with the following call, +.DS +error = (*ifp->if_output)(ifp, m, dst) +int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst; +.DE +The output routine for the network interface transmits the packet +\fIm\fP to the \fIdst\fP address, or returns an error indication +(a UNIX error number). In reality transmission may +not be immediate or successful; normally the output +routine simply queues the packet on its send queue and primes +an interrupt driven routine to actually transmit the packet. +For unreliable media, such as the Ethernet, ``successful'' +transmission simply means that the packet has been placed on the cable +without a collision. On the other hand, an 1822 interface guarantees +proper delivery or an error indication for each message transmitted. +The model employed in the networking system attaches no promises +of delivery to the packets handed to a network interface, and thus +corresponds more closely to the Ethernet. Errors returned by the +output routine are only those that can be detected immediately, +and are normally trivial in nature (no buffer space, +address format not handled, etc.). +No indication is received if errors are detected after the call has returned. +.NH 2 +Packet reception +.PP +Each protocol family must have one or more ``lowest level'' protocols. +These protocols deal with internetwork addressing and are responsible +for the delivery of incoming packets to the proper protocol processing +modules. In the PUP model [Boggs78] these protocols are termed Level +1 protocols, +in the ISO model, network layer protocols. In this system each such +protocol module has an input packet queue assigned to it. Incoming +packets received by a network interface are queued for the protocol +module, and a VAX software interrupt is posted to initiate processing. +.PP +Three macros are available for queuing and dequeuing packets: +.IP "IF_ENQUEUE(ifq, m)" +.br +This places the packet \fIm\fP at the tail of the queue \fIifq\fP. +.IP "IF_DEQUEUE(ifq, m)" +.br +This places a pointer to the packet at the head of queue \fIifq\fP +in \fIm\fP +and removes the packet from the queue. +A zero value will be returned in \fIm\fP if the queue is empty. +.IP "IF_DEQUEUEIF(ifq, m, ifp)" +.br +Like IF_DEQUEUE, this removes the next packet from the head of a queue +and returns it in \fIm\fP. +A pointer to the interface on which the packet was received +is placed in \fIifp\fP, a (struct ifnet\ *). +.IP "IF_PREPEND(ifq, m)" +.br +This places the packet \fIm\fP at the head of the queue \fIifq\fP. +.PP +Each queue has a maximum length associated with it as a simple form +of congestion control. The macro IF_QFULL(ifq) returns 1 if the queue +is filled, in which case the macro IF_DROP(ifq) should be used to +increment the count of the number of packets dropped, and the offending +packet is dropped. For example, the following code fragment is commonly +found in a network interface's input routine, +.DS +._f +if (IF_QFULL(inq)) { + IF_DROP(inq); + m_freem(m); +} else + IF_ENQUEUE(inq, m); +.DE diff --git a/share/doc/smm/18.net/Makefile b/share/doc/smm/18.net/Makefile new file mode 100644 index 00000000000..fde100a0040 --- /dev/null +++ b/share/doc/smm/18.net/Makefile @@ -0,0 +1,10 @@ +# @(#)Makefile 8.1 (Berkeley) 6/10/93 + +DIR= smm/18.net +SRCS= 0.t 1.t 2.t 3.t 4.t 5.t 6.t 7.t 8.t 9.t a.t b.t c.t d.t e.t f.t +MACROS= -ms + +paper.ps: ${SRCS} + ${TBL} ${SRCS} | ${ROFF} > ${.TARGET} + +.include <bsd.doc.mk> diff --git a/share/doc/smm/18.net/a.t b/share/doc/smm/18.net/a.t new file mode 100644 index 00000000000..dddba57807b --- /dev/null +++ b/share/doc/smm/18.net/a.t @@ -0,0 +1,219 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)a.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Gateways and routing +.br +.ne 2i +.NH +\s+2Gateways and routing issues\s0 +.PP +The system has been designed with the expectation that it will +be used in an internetwork environment. The ``canonical'' +environment was envisioned to be a collection of local area +networks connected at one or more points through hosts with +multiple network interfaces (one on each local area network), +and possibly a connection to a long haul network (for example, +the ARPANET). In such an environment, issues of +gatewaying and packet routing become very important. Certain +of these issues, such as congestion +control, have been handled in a simplistic manner or specifically +not addressed. +Instead, where possible, the network system +attempts to provide simple mechanisms upon which more involved +policies may be implemented. As some of these problems become +better understood, the solutions developed will be incorporated +into the system. +.PP +This section will describe the facilities provided for packet +routing. The simplistic mechanisms provided for congestion +control are described in chapter 12. +.NH 2 +Routing tables +.PP +The network system maintains a set of routing tables for +selecting a network interface to use in delivering a +packet to its destination. These tables are of the form: +.DS +.ta \w'struct 'u +\w'u_long 'u +\w'sockaddr rt_gateway; 'u +struct rtentry { + u_long rt_hash; /* hash key for lookups */ + struct sockaddr rt_dst; /* destination net or host */ + struct sockaddr rt_gateway; /* forwarding agent */ + short rt_flags; /* see below */ + short rt_refcnt; /* no. of references to structure */ + u_long rt_use; /* packets sent using route */ + struct ifnet *rt_ifp; /* interface to give packet to */ +}; +.DE +.PP +The routing information is organized in two separate tables, one +for routes to a host and one for routes to a network. The +distinction between hosts and networks is necessary so +that a single mechanism may be used +for both broadcast and multi-drop type networks, and +also for networks built from point-to-point links (e.g +DECnet [DEC80]). +.PP +Each table is organized as a hashed set of linked lists. +Two 32-bit hash values are calculated by routines defined for +each address family; one based on the destination being +a host, and one assuming the target is the network portion +of the address. Each hash value is used to +locate a hash chain to search (by taking the value modulo the +hash table size) and the entire 32-bit value is then +used as a key in scanning the list of routes. Lookups are +applied first to the routing +table for hosts, then to the routing table for networks. +If both lookups fail, a final lookup is made for a ``wildcard'' +route (by convention, network 0). +The first appropriate route discovered is used. +By doing this, routes to a specific host on a network may be +present as well as routes to the network. This also allows a +``fall back'' network route to be defined to a ``smart'' gateway +which may then perform more intelligent routing. +.PP +Each routing table entry contains a destination (the desired final destination), +a gateway to which to send the packet, +and various flags which indicate the route's status and type (host or +network). A count +of the number of packets sent using the route is kept, along +with a count of ``held references'' to the dynamically +allocated structure to insure that memory reclamation +occurs only when the route is not in use. Finally, a pointer to the +a network interface is kept; packets sent using +the route should be handed to this interface. +.PP +Routes are typed in two ways: either as host or network, and as +``direct'' or ``indirect''. The host/network +distinction determines how to compare the \fIrt_dst\fP field +during lookup. If the route is to a network, only a packet's +destination network is compared to the \fIrt_dst\fP entry stored +in the table. If the route is to a host, the addresses must +match bit for bit. +.PP +The distinction between ``direct'' and ``indirect'' routes indicates +whether the destination is directly connected to the source. +This is needed when performing local network encapsulation. If +a packet is destined for a peer at a host or network which is +not directly connected to the source, the internetwork packet +header will +contain the address of the eventual destination, while +the local network header will address the intervening +gateway. Should the destination be directly connected, these addresses +are likely to be identical, or a mapping between the two exists. +The RTF_GATEWAY flag indicates that the route is to an ``indirect'' +gateway agent, and that the local network header should be filled in +from the \fIrt_gateway\fP field instead of +from the final internetwork destination address. +.PP +It is assumed that multiple routes to the same destination will not +be present; only one of multiple routes, that most recently installed, +will be used. +.PP +Routing redirect control messages are used to dynamically +modify existing routing table entries as well as dynamically +create new routing table entries. On hosts where exhaustive +routing information is too expensive to maintain (e.g. work +stations), the +combination of wildcard routing entries and routing redirect +messages can be used to provide a simple routing management +scheme without the use of a higher level policy process. +Current connections may be rerouted after notification of the protocols +by means of their \fIpr_ctlinput\fP entries. +Statistics are kept by the routing table routines +on the use of routing redirect messages and their +affect on the routing tables. These statistics may be viewed using +.IR netstat (1). +.PP +Status information other than routing redirect control messages +may be used in the future, but at present they are ignored. +Likewise, more intelligent ``metrics'' may be used to describe +routes in the future, possibly based on bandwidth and monetary +costs. +.NH 2 +Routing table interface +.PP +A protocol accesses the routing tables through +three routines, +one to allocate a route, one to free a route, and one +to process a routing redirect control message. +The routine \fIrtalloc\fP performs route allocation; it is +called with a pointer to the following structure containing +the desired destination: +.DS +._f +struct route { + struct rtentry *ro_rt; + struct sockaddr ro_dst; +}; +.DE +The route returned is assumed ``held'' by the caller until +released with an \fIrtfree\fP call. Protocols which implement +virtual circuits, such as TCP, hold onto routes for the duration +of the circuit's lifetime, while connection-less protocols, +such as UDP, allocate and free routes whenever their destination address +changes. +.PP +The routine \fIrtredirect\fP is called to process a routing redirect +control message. It is called with a destination address, +the new gateway to that destination, and the source of the redirect. +Redirects are accepted only from the current router for the destination. +If a non-wildcard route +exists to the destination, the gateway entry in the route is modified +to point at the new gateway supplied. Otherwise, a new routing +table entry is inserted reflecting the information supplied. Routes +to interfaces and routes to gateways which are not directly accessible +from the host are ignored. +.NH 2 +User level routing policies +.PP +Routing policies implemented in user processes manipulate the +kernel routing tables through two \fIioctl\fP calls. The +commands SIOCADDRT and SIOCDELRT add and delete routing entries, +respectively; the tables are read through the /dev/kmem device. +The decision to place policy decisions in a user process implies +that routing table updates may lag a bit behind the identification of +new routes, or the failure of existing routes, but this period +of instability is normally very small with proper implementation +of the routing process. Advisory information, such as ICMP +error messages and IMP diagnostic messages, may be read from +raw sockets (described in the next section). +.PP +Several routing policy processes have already been implemented. The +system standard +``routing daemon'' uses a variant of the Xerox NS Routing Information +Protocol [Xerox82] to maintain up-to-date routing tables in our local +environment. Interaction with other existing routing protocols, +such as the Internet EGP (Exterior Gateway Protocol), has been +accomplished using a similar process. diff --git a/share/doc/smm/18.net/b.t b/share/doc/smm/18.net/b.t new file mode 100644 index 00000000000..2e39a8ad82f --- /dev/null +++ b/share/doc/smm/18.net/b.t @@ -0,0 +1,145 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)b.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Raw sockets +.br +.ne 2i +.NH +\s+2Raw sockets\s0 +.PP +A raw socket is an object which allows users direct access +to a lower-level protocol. Raw sockets are intended for knowledgeable +processes which wish to take advantage of some protocol +feature not directly accessible through the normal interface, or +for the development of new protocols built atop existing lower level +protocols. For example, a new version of TCP might be developed at the +user level by utilizing a raw IP socket for delivery of packets. +The raw IP socket interface attempts to provide an identical interface +to the one a protocol would have if it were resident in the kernel. +.PP +The raw socket support is built around a generic raw socket interface, +(possibly) augmented by protocol-specific processing routines. +This section will describe the core of the raw socket interface. +.NH 2 +Control blocks +.PP +Every raw socket has a protocol control block of the following form: +.DS +.ta \w'struct 'u +\w'caddr_t 'u +\w'sockproto rcb_proto; 'u +struct rawcb { + struct rawcb *rcb_next; /* doubly linked list */ + struct rawcb *rcb_prev; + struct socket *rcb_socket; /* back pointer to socket */ + struct sockaddr rcb_faddr; /* destination address */ + struct sockaddr rcb_laddr; /* socket's address */ + struct sockproto rcb_proto; /* protocol family, protocol */ + caddr_t rcb_pcb; /* protocol specific stuff */ + struct mbuf *rcb_options; /* protocol specific options */ + struct route rcb_route; /* routing information */ + short rcb_flags; +}; +.DE +All the control blocks are kept on a doubly linked list for +performing lookups during packet dispatch. Associations may +be recorded in the control block and used by the output routine +in preparing packets for transmission. +The \fIrcb_proto\fP structure contains the protocol family and protocol +number with which the raw socket is associated. +The protocol, family and addresses are +used to filter packets on input; this will be described in more +detail shortly. If any protocol-specific information is required, +it may be attached to the control block using the \fIrcb_pcb\fP +field. +Protocol-specific options for transmission in outgoing packets +may be stored in \fIrcb_options\fP. +.PP +A raw socket interface is datagram oriented. That is, each send +or receive on the socket requires a destination address. This +address may be supplied by the user or stored in the control block +and automatically installed in the outgoing packet by the output +routine. Since it is not possible to determine whether an address +is present or not in the control block, two flags, RAW_LADDR and +RAW_FADDR, indicate if a local and foreign address are present. +Routing is expected to be performed by the underlying protocol +if necessary. +.NH 2 +Input processing +.PP +Input packets are ``assigned'' to raw sockets based on a simple +pattern matching scheme. Each network interface or protocol +gives unassigned packets +to the raw input routine with the call: +.DS +raw_input(m, proto, src, dst) +struct mbuf *m; struct sockproto *proto, struct sockaddr *src, *dst; +.DE +The data packet then has a generic header prepended to it of the +form +.DS +._f +struct raw_header { + struct sockproto raw_proto; + struct sockaddr raw_dst; + struct sockaddr raw_src; +}; +.DE +and it is placed in a packet queue for the ``raw input protocol'' module. +Packets taken from this queue are copied into any raw sockets that +match the header according to the following rules, +.IP 1) +The protocol family of the socket and header agree. +.IP 2) +If the protocol number in the socket is non-zero, then it agrees +with that found in the packet header. +.IP 3) +If a local address is defined for the socket, the address format +of the local address is the same as the destination address's and +the two addresses agree bit for bit. +.IP 4) +The rules of 3) are applied to the socket's foreign address and the packet's +source address. +.LP +A basic assumption is that addresses present in the +control block and packet header (as constructed by the network +interface and any raw input protocol module) are in a canonical +form which may be ``block compared''. +.NH 2 +Output processing +.PP +On output the raw \fIpr_usrreq\fP routine +passes the packet and a pointer to the raw control block to the +raw protocol output routine for any processing required before +it is delivered to the appropriate network interface. The +output routine is normally the only code required to implement +a raw socket interface. diff --git a/share/doc/smm/18.net/c.t b/share/doc/smm/18.net/c.t new file mode 100644 index 00000000000..2c7f752cd6e --- /dev/null +++ b/share/doc/smm/18.net/c.t @@ -0,0 +1,151 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)c.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Buffering and congestion control +.br +.ne 2i +.NH +\s+2Buffering and congestion control\s0 +.PP +One of the major factors in the performance of a protocol is +the buffering policy used. Lack of a proper buffering policy +can force packets to be dropped, cause falsified windowing +information to be emitted by protocols, fragment host memory, +degrade the overall host performance, etc. Due to problems +such as these, most systems allocate a fixed pool of memory +to the networking system and impose +a policy optimized for ``normal'' network operation. +.PP +The networking system developed for UNIX is little different in this +respect. At boot time a fixed amount of memory is allocated by +the networking system. At later times more system memory +may be requested as the need arises, but at no time is +memory ever returned to the system. It is possible to +garbage collect memory from the network, but difficult. In +order to perform this garbage collection properly, some +portion of the network will have to be ``turned off'' as +data structures are updated. The interval over which this +occurs must kept small compared to the average inter-packet +arrival time, or too much traffic may +be lost, impacting other hosts on the network, as well as +increasing load on the interconnecting mediums. In our +environment we have not experienced a need for such compaction, +and thus have left the problem unresolved. +.PP +The mbuf structure was introduced in chapter 5. In this +section a brief description will be given of the allocation +mechanisms, and policies used by the protocols in performing +connection level buffering. +.NH 2 +Memory management +.PP +The basic memory allocation routines manage a private page map, +the size of which determines the maximum amount of memory +that may be allocated by the network. +A small amount of memory is allocated at boot time +to initialize the mbuf and mbuf page cluster free lists. +When the free lists are exhausted, more memory is requested +from the system memory allocator if space remains in the map. +If memory cannot be allocated, +callers may block awaiting free memory, +or the failure may be reflected to the caller immediately. +The allocator will not block awaiting free map entries, however, +as exhaustion of the page map usually indicates that buffers have been lost +due to a ``leak.'' +The private page table is used by the network buffer management +routines in remapping pages to +be logically contiguous as the need arises. In addition, an +array of reference counts parallels the page table and is used +when multiple references to a page are present. +.PP +Mbufs are 128 byte structures, 8 fitting in a 1Kbyte +page of memory. When data is placed in mbufs, +it is copied or remapped into logically contiguous pages of +memory from the network page pool if possible. +Data smaller than half of the size +of a page is copied into one or more 112 byte mbuf data areas. +.NH 2 +Protocol buffering policies +.PP +Protocols reserve fixed amounts of +buffering for send and receive queues at socket creation time. These +amounts define the high and low water marks used by the socket routines +in deciding when to block and unblock a process. The reservation +of space does not currently +result in any action by the memory management +routines. +.PP +Protocols which provide connection level flow control do this +based on the amount of space in the associated socket queues. That +is, send windows are calculated based on the amount of free space +in the socket's receive queue, while receive windows are adjusted +based on the amount of data awaiting transmission in the send queue. +Care has been taken to avoid the ``silly window syndrome'' described +in [Clark82] at both the sending and receiving ends. +.NH 2 +Queue limiting +.PP +Incoming packets from the network are always received unless +memory allocation fails. However, each Level 1 protocol +input queue +has an upper bound on the queue's length, and any packets +exceeding that bound are discarded. It is possible for a host to be +overwhelmed by excessive network traffic (for instance a host +acting as a gateway from a high bandwidth network to a low bandwidth +network). As a ``defensive'' mechanism the queue limits may be +adjusted to throttle network traffic load on a host. +Consider a host willing to devote some percentage of +its machine to handling network traffic. +If the cost of handling an +incoming packet can be calculated so that an acceptable +``packet handling rate'' +can be determined, then input queue lengths may be dynamically +adjusted based on a host's network load and the number of packets +awaiting processing. Obviously, discarding packets is +not a satisfactory solution to a problem such as this +(simply dropping packets is likely to increase the load on a network); +the queue lengths were incorporated mainly as a safeguard mechanism. +.NH 2 +Packet forwarding +.PP +When packets can not be forwarded because of memory limitations, +the system attempts to generate a ``source quench'' message. In addition, +any other problems encountered during packet forwarding are also +reflected back to the sender in the form of ICMP packets. This +helps hosts avoid unneeded retransmissions. +.PP +Broadcast packets are never forwarded due to possible dire +consequences. In an early stage of network development, broadcast +packets were forwarded and a ``routing loop'' resulted in network +saturation and every host on the network crashing. diff --git a/share/doc/smm/18.net/d.t b/share/doc/smm/18.net/d.t new file mode 100644 index 00000000000..675bece2aa7 --- /dev/null +++ b/share/doc/smm/18.net/d.t @@ -0,0 +1,73 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)d.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Out of band data +.br +.ne 2i +.NH +\s+2Out of band data\s0 +.PP +Out of band data is a facility peculiar to the stream socket +abstraction defined. Little agreement appears to exist as +to what its semantics should be. TCP defines the notion of +``urgent data'' as in-line, while the NBS protocols [Burruss81] +and numerous others provide a fully independent logical +transmission channel along which out of band data is to be +sent. +In addition, the amount of the data which may be sent as an out +of band message varies from protocol to protocol; everything +from 1 bit to 16 bytes or more. +.PP +A stream socket's notion of out of band data has been defined +as the lowest reasonable common denominator (at least reasonable +in our minds); +clearly this is subject to debate. Out of band data is expected +to be transmitted out of the normal sequencing and flow control +constraints of the data stream. A minimum of 1 byte of out of +band data and one outstanding out of band message are expected to +be supported by the protocol supporting a stream socket. +It is a protocol's prerogative to support larger-sized messages, or +more than one outstanding out of band message at a time. +.PP +Out of band data is maintained by the protocol and is usually not +stored in the socket's receive queue. +A socket-level option, SO_OOBINLINE, +is provided to force out-of-band data to be placed in the normal +receive queue when urgent data is received; +this sometimes amelioriates problems due to loss of data +when multiple out-of-band +segments are received before the first has been passed to the user. +The PRU_SENDOOB and PRU_RCVOOB +requests to the \fIpr_usrreq\fP routine are used in sending and +receiving data. diff --git a/share/doc/smm/18.net/e.t b/share/doc/smm/18.net/e.t new file mode 100644 index 00000000000..77e8a2aaec7 --- /dev/null +++ b/share/doc/smm/18.net/e.t @@ -0,0 +1,129 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)e.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH "Trailer protocols +.br +.ne 2i +.NH +\s+2Trailer protocols\s0 +.PP +Core to core copies can be expensive. +Consequently, a great deal of effort was spent +in minimizing such operations. The VAX architecture +provides virtual memory hardware organized in +page units. To cut down on copy operations, data +is kept in page-sized units on page-aligned +boundaries whenever possible. This allows data +to be moved in memory simply by remapping the page +instead of copying. The mbuf and network +interface routines perform page table manipulations +where needed, hiding the complexities of the VAX +virtual memory hardware from higher level code. +.PP +Data enters the system in two ways: from the user, +or from the network (hardware interface). When data +is copied from the user's address space +into the system it is deposited in pages (if sufficient +data is present). +This encourages the user to transmit information in +messages which are a multiple of the system page size. +.PP +Unfortunately, performing a similar operation when taking +data from the network is very difficult. +Consider the format of an incoming packet. A packet +usually contains a local network header followed by +one or more headers used by the high level protocols. +Finally, the data, if any, follows these headers. Since +the header information may be variable length, DMA'ing the eventual +data for the user into a page aligned area of +memory is impossible without +\fIa priori\fP knowledge of the format (e.g., by supporting +only a single protocol header format). +.PP +To allow variable length header information to +be present and still ensure page alignment of data, +a special local network encapsulation may be used. +This encapsulation, termed a \fItrailer protocol\fP [Leffler84], +places the variable length header information after +the data. A fixed size local network +header is then prepended to the resultant packet. +The local network header contains the size of the +data portion (in units of 512 bytes), and a new \fItrailer protocol +header\fP, inserted before the variable length +information, contains the size of the variable length +header information. The following trailer +protocol header is used to store information +regarding the variable length protocol header: +.DS +._f +struct { + short protocol; /* original protocol no. */ + short length; /* length of trailer */ +}; +.DE +.PP +The processing of the trailer protocol is very +simple. On output, the local network header indicates that +a trailer encapsulation is being used. +The header also includes an indication +of the number of data pages present before the trailer +protocol header. The trailer protocol header is +initialized to contain the actual protocol identifier and the +variable length header size, and is appended to the data +along with the variable length header information. +.PP +On input, the interface routines identify the +trailer encapsulation +by the protocol type stored in the local network header, +then calculate the number of +pages of data to find the beginning of the trailer. +The trailing information is copied into a separate +mbuf and linked to the front of the resultant packet. +.PP +Clearly, trailer protocols require cooperation between +source and destination. In addition, they are normally +cost effective only when sizable packets are used. The +current scheme works because the local network encapsulation +header is a fixed size, allowing DMA operations +to be performed at a known offset from the first data page +being received. Should the local network header be +variable length this scheme fails. +.PP +Statistics collected indicate that as much as 200Kb/s +can be gained by using a trailer protocol with +1Kbyte packets. The average size of the variable +length header was 40 bytes (the size of a +minimal TCP/IP packet header). If hardware +supports larger sized packets, even greater gains +may be realized. diff --git a/share/doc/smm/18.net/f.t b/share/doc/smm/18.net/f.t new file mode 100644 index 00000000000..18995fd18db --- /dev/null +++ b/share/doc/smm/18.net/f.t @@ -0,0 +1,117 @@ +.\" Copyright (c) 1983, 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)f.t 8.1 (Berkeley) 6/8/93 +.\" +.nr H2 1 +.\".ds RH Acknowledgements +.br +.ne 2i +.SH +\s+2Acknowledgements\s0 +.PP +The internal structure of the system is patterned +after the Xerox PUP architecture [Boggs79], while in certain +places the Internet +protocol family has had a great deal of influence in the design. +The use of software interrupts for process invocation +is based on similar facilities found in +the VMS operating system. +Many of the +ideas related to protocol modularity, memory management, and network +interfaces are based on Rob Gurwitz's TCP/IP implementation for the +4.1BSD version of UNIX on the VAX [Gurwitz81]. +Greg Chesson explained his use of trailer encapsulations in Datakit, +instigating their use in our system. +.\".ds RH References +.nr H2 1 +.sp 2 +.ne 2i +.SH +\s+2References\s0 +.LP +.IP [Boggs79] 20 +Boggs, D. R., J. F. Shoch, E. A. Taft, and R. M. Metcalfe; +\fIPUP: An Internetwork Architecture\fP. Report CSL-79-10. +XEROX Palo Alto Research Center, July 1979. +.IP [BBN78] 20 +Bolt Beranek and Newman; +Specification for the Interconnection of Host and IMP. +BBN Technical Report 1822. May 1978. +.IP [Cerf78] 20 +Cerf, V. G.; The Catenet Model for Internetworking. +Internet Working Group, IEN 48. July 1978. +.IP [Clark82] 20 +Clark, D. D.; Window and Acknowledgement Strategy in TCP, RFC-813. +Network Information Center, SRI International. July 1982. +.IP [DEC80] 20 +Digital Equipment Corporation; \fIDECnet DIGITAL Network +Architecture \- General Description\fP. Order No. +AA-K179A-TK. October 1980. +.IP [Gurwitz81] 20 +Gurwitz, R. F.; VAX-UNIX Networking Support Project \- Implementation +Description. Internetwork Working Group, IEN 168. +January 1981. +.IP [ISO81] 20 +International Organization for Standardization. +\fIISO Open Systems Interconnection \- Basic Reference Model\fP. +ISO/TC 97/SC 16 N 719. August 1981. +.IP [Joy86] 20 +Joy, W.; Fabry, R.; Leffler, S.; McKusick, M.; and Karels, M.; +Berkeley Software Architecture Manual, 4.4BSD Edition. +\fIUNIX Programmer's Supplementary Documents\fP, Vol. 1 (PSD:5). +Computer Systems Research Group, +University of California, Berkeley. +May, 1986. +.IP [Leffler84] 20 +Leffler, S.J. and Karels, M.J.; Trailer Encapsulations, RFC-893. +Network Information Center, SRI International. +April 1984. +.IP [Postel80] 20 +Postel, J. User Datagram Protocol, RFC-768. +Network Information Center, SRI International. May 1980. +.IP [Postel81a] 20 +Postel, J., ed. Internet Protocol, RFC-791. +Network Information Center, SRI International. September 1981. +.IP [Postel81b] 20 +Postel, J., ed. Transmission Control Protocol, RFC-793. +Network Information Center, SRI International. September 1981. +.IP [Postel81c] 20 +Postel, J. Internet Control Message Protocol, RFC-792. +Network Information Center, SRI International. September 1981. +.IP [Xerox81] 20 +Xerox Corporation. \fIInternet Transport Protocols\fP. +Xerox System Integration Standard 028112. December 1981. +.IP [Zimmermann80] 20 +Zimmermann, H. OSI Reference Model \- The ISO Model of +Architecture for Open Systems Interconnection. +\fIIEEE Transactions on Communications\fP. Com-28(4); 425-432. +April 1980. diff --git a/share/doc/smm/18.net/spell.ok b/share/doc/smm/18.net/spell.ok new file mode 100644 index 00000000000..f9a387bc75d --- /dev/null +++ b/share/doc/smm/18.net/spell.ok @@ -0,0 +1,307 @@ +A,1986A +AA +ACCEPTCONN +ADDR +ARPANET +ASYNC +BBN +BBN78 +Beranek +Boggs +Boggs78 +Boggs79 +Burruss81 +CANTRCVMORE +CANTSENDMORE +CLSIZE +COLL +CONNECT2 +CONNREQUIRED +COPYALL +CSL +Catenet +Cerf +Cerf78 +Chesson +Clark82 +Com +DEC80 +DECnet +DEQUEUE +DEQUEUEIF +DETATCH +DMA +DMA'ing +DMR +DONTWAIT +Datagram +Datakit +EGP +EWOULDBLOCK +Ethernet +FADDR +FASTTIMO +Fabry +GETOPT +Gurwitz +Gurwitz's +Gurwitz81 +HASCL +HOSTDEAD +HOSTUNREACH +ICMP +IEN +IFDOWN +IFF +IFRW +INTRANS +IP +IP's +ISCONNECTED +ISCONNECTING +ISDISCONNECTING +ISO +ISO81 +Ircb +Joy86 +K179A +Karels +LADDR +LOOPBACK +Leffler +Leffler84 +M.J +MAXNUBAMR +MCLGET +MFREE +MGET +MLEN +MSG +MSGSIZE +MSIZE +Mbufs +McKusick +Metcalfe +NBIO +NEEDFRAG +NOARP +NOFDREF +NOTRAILERS +NS +Notes''SMM:15 +OOBINLINE +OSI +PARAMPROB +PEERADDR +PF +POINTOPOINT +PRC +PRCO +PRIV +PROTORCV +PROTOSEND +PRU +PS1:6 +Postel +Postel80 +Postel81a +Postel81b +Postel81c +QFULL +RCVATMARK +RCVD +RCVOOB +RFC +RH +ROUTEDEAD +RTF +S.J +SB +SEL +SENDOOB +SETOPT +SIGIO +SIOCADDRT +SIOCDELRT +SIOCGIFCONF +SIOCSIFADDR +SIOCSIFDSTADDR +SIOCSIFFLAGS +SIOGSIFFLAGS +SLOWTIMO +SMM:15 +SOCKADDR +SRCFAIL +SS +Shoch +TCP +TIMXCEED +TOSHOST +TOSNET +UDP +UNIBUS +VAX +VMS +Vol +WANTRCVD +Xerox81 +Xerox82 +Zimmermann +Zimmermann80 +addr +addrlist +adj +amelioriates +async +bdp +broadaddr +caddr +csma +ctlinput +ctloutput +daemon +dat +datagram +decapsulation +dequeuing +dev +dma'd +dom +dst +dstaddr +dtom +faddr +fastimo +fasttimo +fcntl +freem +fstat +getsockopt +hardwired +hiwat +hlen +ierrors +ifa +ifaddr +iff +ifnet +ifp +ifq +ifqueue +ifr +ifrw +ifrw.ifrw +ifu +ifu.ifu +ifuba +ifubinfo +ifw +ifx +ifxmt +info +info.iff +init +inp +inpcb +inq +ip +ipackets +ipintrq +laddr +len +loopback +lowat +m,n +mb +mbcnt +mbmax +mbuf +mbuf's +mbufs +mp +mr +mtod +mtu +netstat +nmr +nr +nx +oerrors +off0 +ontrol +oob +oobmark +op +opackets +ops +optname +pcb +pgrp +prev +proc +proto +protosw +protoswNPROTOSW +pte +pullup +q0len +qlen +qlimit +rawcb +rcb +rcv +recvmsg +ref +refcnt +regs +req +rerouted +ro +rotocol +rt +rtalloc +rtentry +rtfree +rtredirect +rubaget +sb +sel +sendmsg +setsockopt +slowtimo +snd +sockaddr +sockbuf +socketpair +sockproto +sonewconn +soshutdown +src +ta +ternet +timeo +totlen +uba +ubaalloc +ubaget +ubainit +uballoc's +ubaminit +uban +ubaput +ubinfo +udp +unibus +usrreq +virt +vm +wildcard +wmap +wubaput +x,t +xmt +xmt.ifrw +xmt.ifw +xswapd +xtofree +xxx diff --git a/share/doc/smm/Makefile b/share/doc/smm/Makefile new file mode 100644 index 00000000000..0c6ccced405 --- /dev/null +++ b/share/doc/smm/Makefile @@ -0,0 +1,39 @@ +# from: @(#)Makefile 8.1 (Berkeley) 6/10/93 +# $Id: Makefile,v 1.1 1995/10/18 08:44:15 deraadt Exp $ + +# The following modules do not build/install: +# 10.named, 13.amd + +# Missing: +# 02.config 11.timedop 12.timed + +# Missing from 4.4BSD-Lite: +# 14.uucpimpl 15.uucpnet 16.security 17.password + +BINDIR= /usr/share/doc/smm +FILES= 00.contents Makefile Title +SUBDIR= 01.setup 04.quotas 05.fastfs 06.nfs 18.net +.if exists(03.fsck) +SUBDIR+= 03.fsck +.endif +.if exists(07.lpd) +SUBDIR+= 07.lpd +.endif +.if exists(08.sendmailop) +SUBDIR+= 08.sendmailop +.endif +.if exists(09.sendmail) +SUBDIR+= 09.sendmail +.endif + +Title.ps: ${FILES} + groff Title > ${.TARGET} + +contents.ps: ${FILES} + groff -ms 00.contents > ${.TARGET} + +beforeinstall: + install -c -o ${BINOWN} -g ${BINGRP} -m 444 ${FILES} \ + ${DESTDIR}${BINDIR} + +.include <bsd.subdir.mk> diff --git a/share/doc/smm/Title b/share/doc/smm/Title new file mode 100644 index 00000000000..4dd1b89439f --- /dev/null +++ b/share/doc/smm/Title @@ -0,0 +1,203 @@ +.\" Copyright (c) 1986, 1993 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)Title 8.2 (Berkeley) 4/19/94 +.\" +.ps 18 +.vs 22 +.sp 2.75i +.ft B +.ce 2 +UNIX System Manager's Manual +(SMM) +.ps 14 +.vs 16 +.sp |4i +.ce 2 +4.4 Berkeley Software Distribution +.sp |5.75i +.ft R +.pt 12 +.vs 16 +.ce +June, 1993 +.sp |8.2i +.ce 5 +Computer Systems Research Group +Computer Science Division +Department of Electrical Engineering and Computer Science +University of California +Berkeley, California 94720 +.bp +\& +.sp |1i +.hy 0 +.ps 10 +.vs 12p +Copyright 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993 +The Regents of the University of California. All rights reserved. +.sp 2 +Other than the specific manual pages and documents listed below +as copyrighted by AT&T, +redistribution and use of this manual in source and binary forms, +with or without modification, are permitted provided that the +following conditions are met: +.sp 0.5 +.in +0.2i +.ta 0.2i +.ti -0.2i +1) Redistributions of this manual must retain the copyright +notices on this page, this list of conditions and the following disclaimer. +.ti -0.2i +2) Software or documentation that incorporates part of this manual must +reproduce the copyright notices on this page, this list of conditions and +the following disclaimer in the documentation and/or other materials +provided with the distribution. +.ti -0.2i +3) All advertising materials mentioning features or use of this software +must display the following acknowledgement: +``This product includes software developed by the University of +California, Berkeley and its contributors.'' +.ti -0.2i +4) Neither the name of the University nor the names of its contributors +may be used to endorse or promote products derived from this software +without specific prior written permission. +.in -0.2i +.sp +\fB\s-1THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +SUCH DAMAGE.\s+1\fP +.sp 2 +The Institute of Electrical and Electronics Engineers and the American +National Standards Committee X3, on Information Processing Systems have +given us permission to reprint portions of their documentation. +.sp +In the following statement, the phrase ``this text'' refers to portions +of the system documentation. +.sp 0.5 +``Portions of this text are reprinted and reproduced in +electronic form in 4.4BSD from IEEE Std 1003.1-1988, IEEE +Standard Portable Operating System Interface for Computer Environments +(POSIX), copyright 1988 by the Institute of Electrical and Electronics +Engineers, Inc. In the event of any discrepancy between these versions +and the original IEEE Standard, the original IEEE Standard is the referee +document.'' +.sp +In the following statement, the phrase ``This material'' refers to portions +of the system documentation. +.sp 0.5 +``This material is reproduced with permission from American National +Standards Committee X3, on Information Processing Systems. Computer and +Business Equipment Manufacturers Association (CBEMA), 311 First St., NW, +Suite 500, Washington, DC 20001-2178. The developmental work of +Programming Language C was completed by the X3J11 Technical Committee.'' +.sp 2 +Manual pages cron.8, icheck.8, ncheck.8, and sa.8 +and documents SMM:15, 16, and 17 +are copyright 1979, AT&T Bell Laboratories, Incorporated. +Document SMM:14 is a modification of an earlier document that +is copyrighted 1979 by AT&T Bell Laboratories, Incorporated. +Holders of \x'-1p'UNIX\v'-4p'\s-3TM\s0\v'4p'/32V, +System III, or System V software licenses are +permitted to copy these documents, or any portion of them, +as necessary for licensed use of the software, +provided this copyright notice and statement of permission +are included. +.sp 2 +The views and conclusions contained in this manual are those of the +authors and should not be interpreted as representing official policies, +either expressed or implied, of the Regents of the University of California. +.br +.ll 6.5i +.lt 6.5i +.po .75i +.in 0i +.af % i +.ds ET\" +.de HD +.po 0 +.lt 7.4i +.tl '''' +.lt +.po +'sp 18p +.if o .tl '\\*(ET''\\*(OT' +.if e .tl '\\*(OT''\\*(ET' +'sp 18p +.ns +.. +.de FO +'sp 18p +.if e .tl '\s9\\*(Dt''\\*(Ed\s0' +.if o .tl '\s9\\*(Ed''\\*(Dt\s0' +'bp +.. +.wh 0 HD +.wh -60p FO +.bp 5 +.ds ET \s9\f2Table \|of \|Contents\fP\s0 +.ds OT - % - +.ce +\f3TABLE \|OF \|CONTENTS\fP +.nr x .5i +.in +\nxu +.nf +.ta \n(.lu-\nxuR +.de xx +\\$1\f3 \a \fP\\$2 +.. +.de t +.sp 1v +.ne .5i +.cs 3 +.ti -.5i +.ss 18 +\f3\s9\\$2. \\$3\s0\fP +.ss 12 +.if t .sp .5v +.cs 3 36 +.ds Ed Section \\$2 +.ds Dt \\$3 +.so \\$1 +.. +.t /usr/src/share/man/man0/toc8 8 "System Maintenance" +.in -.5i +.cs 3 +.if n .ta 8n 16n 24n 32n 40n 48n 56n 64n 72n 80n +.if t .ta .5i 1i 1.5i 2i 2.5i 3i 3.5i 4i 4.5i 5i 5.5i 6i 6.5i |