author | Theo de Raadt <deraadt@cvs.openbsd.org> | 1995-10-18 08:53:40 +0000
committer | Theo de Raadt <deraadt@cvs.openbsd.org> | 1995-10-18 08:53:40 +0000
commit | d6583bb2a13f329cf0332ef2570eb8bb8fc0e39c (patch)
tree | ece253b876159b39c620e62b6c9b1174642e070e | share/doc/papers/sysperf/3.t
initial import of NetBSD tree
Diffstat (limited to 'share/doc/papers/sysperf/3.t')
-rw-r--r-- | share/doc/papers/sysperf/3.t | 694
1 file changed, 694 insertions, 0 deletions
diff --git a/share/doc/papers/sysperf/3.t b/share/doc/papers/sysperf/3.t new file mode 100644 index 00000000000..832ad4255ab --- /dev/null +++ b/share/doc/papers/sysperf/3.t @@ -0,0 +1,694 @@ +.\" Copyright (c) 1985 The Regents of the University of California. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)3.t 5.1 (Berkeley) 4/17/91 +.\" +.ds RH Results of our observations +.NH +Results of our observations +.PP +When 4.2BSD was first installed on several large timesharing systems +the degradation in performance was significant. +Informal measurements showed 4.2BSD providing 80% of the throughput +of 4.1BSD (based on load averages observed under a normal timesharing load). +Many of the initial problems found were because of programs that were +not part of 4.1BSD. Using the techniques described in the previous +section and standard process profiling several problems were identified. +Later work concentrated on the operation of the kernel itself. +In this section we discuss the problems uncovered; in the next +section we describe the changes made to the system. +.NH 2 +User programs +.PP +.NH 3 +Mail system +.PP +The mail system was the first culprit identified as a major +contributor to the degradation in system performance. +At Lucasfilm the mail system is heavily used +on one machine, a VAX-11/780 with eight megabytes of memory.\** +.FS +\** During part of these observations the machine had only four +megabytes of memory. +.FE +Message +traffic is usually between users on the same machine and ranges from +person-to-person telephone messages to per-organization distribution +lists. After conversion to 4.2BSD, it was +immediately noticed that mail to distribution lists of 20 or more people +caused the system load to jump by anywhere from 3 to 6 points. 
+The number of processes spawned by the \fIsendmail\fP program and +the messages sent from \fIsendmail\fP to the system logging +process, \fIsyslog\fP, generated significant load both from their +execution and their interference with basic system operation. The +number of context switches and disk transfers often doubled while +\fIsendmail\fP operated; the system call rate jumped dramatically. +System accounting information consistently +showed \fIsendmail\fP as the top cpu user on the system. +.NH 3 +Network servers +.PP +The network services provided in 4.2BSD add new capabilities to the system, +but are not without cost. The system uses one daemon process to accept +requests for each network service provided. The presence of many +such daemons increases the numbers of active processes and files, +and requires a larger configuration to support the same number of users. +The overhead of the routing and status updates can consume +several percent of the cpu. +Remote logins and shells incur more overhead +than their local equivalents. +For example, a remote login uses three processes and a +pseudo-terminal handler in addition to the local hardware terminal +handler. When using a screen editor, sending and echoing a single +character involves four processes on two machines. +The additional processes, context switching, network traffic, and +terminal handler overhead can roughly triple the load presented by one +local terminal user. +.NH 2 +System overhead +.PP +To measure the costs of various functions in the kernel, +a profiling system was run for a 17-hour +period on one of our general timesharing machines. +While this is not as reproducible as a synthetic workload, +it certainly represents a realistic test. +This test was run on several occasions over a three-month period. +Despite the long period of time that elapsed +between the test runs, the shape of the profiles, +as measured by the number of times each system call +entry point was called, was remarkably similar. +.PP +These profiles turned up several bottlenecks that are +discussed in the next section. +Several of these were new to 4.2BSD, +but most were caused by overloading of mechanisms +which worked acceptably well in previous BSD systems. +The general conclusion from our measurements was that +the ratio of system to user time had increased from +45% system / 55% user in 4.1BSD to 57% system / 43% user +in 4.2BSD. +.NH 3 +Micro-operation benchmarks +.PP +To compare certain basic system operations +between 4.1BSD and 4.2BSD a suite of benchmark +programs was constructed and run on a VAX-11/750 with 4.5 megabytes +of physical memory and two disks on a MASSBUS controller. +Tests were run with the machine operating in single user mode +under both 4.1BSD and 4.2BSD. Paging was localized to the drive +where the root file system was located. +.PP +The benchmark programs were modeled after the Kashtan benchmarks +[Kashtan80], with identical sources compiled under each system. +The programs and their intended purpose are described briefly +before the presentation of the results. The benchmark scripts +were run twice with the results shown as the average of +the two runs. +The source code for each program and the shell scripts used during +the benchmarks are included in the Appendix. +.PP +The set of tests shown in Table 1 was concerned with +system operations other than paging. The intent of most +benchmarks is clear.
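+For example, the \fIsyscall\fP test is essentially the loop sketched
+below, timed over its 100,000 iterations (a sketch only; the actual
+benchmark sources appear in the Appendix):
+.DS
+#include <unistd.h>
+
+int
+main(void)
+{
+        int i;
+
+        for (i = 0; i < 100000; i++)
+                (void)getpid();         /* cheapest system call available */
+        return (0);
+}
+.DE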
The result of running \fIsignocsw\fP is +deducted from the \fIcsw\fP benchmark to calculate the context +switch overhead. The \fIexec\fP tests use two different jobs to gauge +the cost of overlaying a larger program with a smaller one +and vice versa. The +``null job'' and ``big job'' differ solely in the size of their data +segments, 1 kilobyte versus 256 kilobytes. In both cases the +text segment of the parent is larger than that of the child.\** +.FS +\** These tests should also have measured the cost of expanding the +text segment; unfortunately time did not permit running additional tests. +.FE +All programs were compiled into the default load format that causes +the text segment to be demand paged out of the file system and shared +between processes. +.KF +.DS L +.TS +center box; +l | l. +Test Description +_ +syscall perform 100,000 \fIgetpid\fP system calls +csw perform 10,000 context switches using signals +signocsw send 10,000 signals to yourself +pipeself4 send 10,000 4-byte messages to yourself +pipeself512 send 10,000 512-byte messages to yourself +pipediscard4 send 10,000 4-byte messages to child who discards +pipediscard512 send 10,000 512-byte messages to child who discards +pipeback4 exchange 10,000 4-byte messages with child +pipeback512 exchange 10,000 512-byte messages with child +forks0 fork-exit-wait 1,000 times +forks1k sbrk(1024), fault page, fork-exit-wait 1,000 times +forks100k sbrk(102400), fault pages, fork-exit-wait 1,000 times +vforks0 vfork-exit-wait 1,000 times +vforks1k sbrk(1024), fault page, vfork-exit-wait 1,000 times +vforks100k sbrk(102400), fault pages, vfork-exit-wait 1,000 times +execs0null fork-exec ``null job''-exit-wait 1,000 times +execs0null (1K env) execs0null above, with 1K environment added +execs1knull sbrk(1024), fault page, fork-exec ``null job''-exit-wait 1,000 times +execs1knull (1K env) execs1knull above, with 1K environment added +execs100knull sbrk(102400), fault pages, fork-exec ``null job''-exit-wait 1,000 times +vexecs0null vfork-exec ``null job''-exit-wait 1,000 times +vexecs1knull sbrk(1024), fault page, vfork-exec ``null job''-exit-wait 1,000 times +vexecs100knull sbrk(102400), fault pages, vfork-exec ``null job''-exit-wait 1,000 times +execs0big fork-exec ``big job''-exit-wait 1,000 times +execs1kbig sbrk(1024), fault page, fork-exec ``big job''-exit-wait 1,000 times +execs100kbig sbrk(102400), fault pages, fork-exec ``big job''-exit-wait 1,000 times +vexecs0big vfork-exec ``big job''-exit-wait 1,000 times +vexecs1kbig sbrk(1024), fault pages, vfork-exec ``big job''-exit-wait 1,000 times +vexecs100kbig sbrk(102400), fault pages, vfork-exec ``big job''-exit-wait 1,000 times +.TE +.ce +Table 1. Kernel Benchmark programs. +.DE +.KE +.PP +The results of these tests are shown in Table 2. If the 4.1BSD results +are scaled to reflect their being run on a VAX-11/750, they +correspond closely to those found in [Joy80].\** +.FS +\** We assume that a VAX-11/750 runs at 60% of the speed of a VAX-11/780 +(not considering floating point operations). +.FE +.KF +.DS L +.TS +center box; +c s s s s s s s s s +c || c s s || c s s || c s s +c || c s s || c s s || c s s +c || c | c | c || c | c | c || c | c | c +l || n | n | n || n | n | n || n | n | n. 
+Berkeley Software Distribution UNIX Systems +_ +Test Elapsed Time User Time System Time +\^ _ _ _ +\^ 4.1 4.2 4.3 4.1 4.2 4.3 4.1 4.2 4.3 += +syscall 28.0 29.0 23.0 4.5 5.3 3.5 23.9 23.7 20.4 +csw 45.0 60.0 45.0 3.5 4.3 3.3 19.5 25.4 19.0 +signocsw 16.5 23.0 16.0 1.9 3.0 1.1 14.6 20.1 15.2 +pipeself4 21.5 29.0 26.0 1.1 1.1 0.8 20.1 28.0 25.6 +pipeself512 47.5 59.0 55.0 1.2 1.2 1.0 46.1 58.3 54.2 +pipediscard4 32.0 42.0 36.0 3.2 3.7 3.0 15.5 18.8 15.6 +pipediscard512 61.0 76.0 69.0 3.1 2.1 2.0 29.7 36.4 33.2 +pipeback4 57.0 75.0 66.0 2.9 3.2 3.3 25.1 34.2 29.7 +pipeback512 110.0 138.0 125.0 3.1 3.4 2.2 52.2 65.7 57.7 +forks0 37.5 41.0 22.0 0.5 0.3 0.3 34.5 37.6 21.5 +forks1k 40.0 43.0 22.0 0.4 0.3 0.3 36.0 38.8 21.6 +forks100k 217.5 223.0 176.0 0.7 0.6 0.4 214.3 218.4 175.2 +vforks0 34.5 37.0 22.0 0.5 0.6 0.5 27.3 28.5 17.9 +vforks1k 35.0 37.0 22.0 0.6 0.8 0.5 27.2 28.6 17.9 +vforks100k 35.0 37.0 22.0 0.6 0.8 0.6 27.6 28.9 17.9 +execs0null 97.5 92.0 66.0 3.8 2.4 0.6 68.7 82.5 48.6 +execs0null (1K env) 197.0 229.0 75.0 4.1 2.6 0.9 167.8 212.3 62.6 +execs1knull 99.0 100.0 66.0 4.1 1.9 0.6 70.5 86.8 48.7 +execs1knull (1K env) 199.0 230.0 75.0 4.2 2.6 0.7 170.4 214.9 62.7 +execs100knull 283.5 278.0 216.0 4.8 2.8 1.1 251.9 269.3 202.0 +vexecs0null 100.0 92.0 66.0 5.1 2.7 1.1 63.7 76.8 45.1 +vexecs1knull 100.0 91.0 66.0 5.2 2.8 1.1 63.2 77.1 45.1 +vexecs100knull 100.0 92.0 66.0 5.1 3.0 1.1 64.0 77.7 45.6 +execs0big 129.0 201.0 101.0 4.0 3.0 1.0 102.6 153.5 92.7 +execs1kbig 130.0 202.0 101.0 3.7 3.0 1.0 104.7 155.5 93.0 +execs100kbig 318.0 385.0 263.0 4.8 3.1 1.1 286.6 339.1 247.9 +vexecs0big 128.0 200.0 101.0 4.6 3.5 1.6 98.5 149.6 90.4 +vexecs1kbig 125.0 200.0 101.0 4.7 3.5 1.3 98.9 149.3 88.6 +vexecs100kbig 126.0 200.0 101.0 4.2 3.4 1.3 99.5 151.0 89.0 +.TE +.ce +Table 2. Kernel Benchmark results (all times in seconds). +.DE +.KE +.PP +In studying the measurements we found that the basic system call +and context switch overhead did not change significantly +between 4.1BSD and 4.2BSD. The \fIsignocsw\fP results were caused by +the changes to the \fIsignal\fP interface, resulting +in an additional subroutine invocation for each call, not +to mention additional complexity in the system's implementation. +.PP +The times for the use of pipes are significantly higher under +4.2BSD because of their implementation on top of the interprocess +communication facilities. Under 4.1BSD pipes were implemented +without the complexity of the socket data structures and with +simpler code. Further, while not obviously a factor here, +4.2BSD pipes have less system buffer space provided to them than +4.1BSD pipes. +.PP +The \fIexec\fP tests shown in Table 2 were performed with 34 bytes of +environment information under 4.1BSD and 40 bytes under 4.2BSD. +To figure the cost of passing data through the environment, +the execs0null and execs1knull tests were rerun with +1065 additional bytes of data. The results are shown in Table 3. +.KF +.DS L +.TS +center box; +c || c s || c s || c s +c || c s || c s || c s +c || c | c || c | c || c | c +l || n | n || n | n || n | n. +Test Real User System +\^ _ _ _ +\^ 4.1 4.2 4.1 4.2 4.1 4.2 += +execs0null 197.0 229.0 4.1 2.6 167.8 212.3 +execs1knull 199.0 230.0 4.2 2.6 170.4 214.9 +.TE +.ce +Table 3. Benchmark results with ``large'' environment (all times in seconds). +.DE +.KE +These results show that passing argument data under 4.2BSD is significantly +slower than under 4.1BSD: 121 microseconds per byte versus 93 microseconds per byte.
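+.PP
+These per-byte figures can be recovered, approximately, by dividing
+the added real time (Table 3 versus Table 2) by the extra data passed,
+1000 \fIexec\fPs of 1065 bytes each; for the execs1knull runs:
+.DS
+4.1BSD: (199 - 99) seconds / 1,065,000 bytes, about 94 microseconds per byte
+4.2BSD: (230 - 100) seconds / 1,065,000 bytes, about 122 microseconds per byte
+.DE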
Even using +this factor to adjust the basic overhead of an \fIexec\fP system +call, this facility is more costly under 4.2BSD than under 4.1BSD. +.NH 3 +Path name translation +.PP +The single most expensive function performed by the kernel +is path name translation. +This has been true in almost every UNIX kernel [Mosher80]; +we find that our general timesharing systems do about +500,000 name translations per day. +.PP +Name translations became more expensive in 4.2BSD for several reasons. +The single most expensive addition was the symbolic link. +Symbolic links +have the effect of increasing the average number of components +in path names to be translated. +As an insidious example, +consider the system manager who decides to change /tmp +to be a symbolic link to /usr/tmp. +A name such as /tmp/tmp1234 that previously required two component +translations +now requires four component translations plus the cost of reading +the contents of the symbolic link. +.PP +The new directory format also changes the characteristics of +name translation. +The more complex format requires more computation to determine +where to place new entries in a directory. +Conversely, the additional information allows the system to only +look at active entries when searching, +hence directories that had once grown large +but currently have few active entries can be searched quickly. +The new format also stores the length of each name so that +costly string comparisons are only done on names that are the +same length as the name being sought. +.PP +The net effect of the changes is that the average time to +translate a path name in 4.2BSD is 24.2 milliseconds, +representing 40% of the time processing system calls, +that is, 19% of the total cycles in the kernel, +or 11% of all cycles executed on the machine. +The times are shown in Table 4. We have no comparable times +for \fInamei\fP under 4.1 though they are certain to +be significantly less. +.KF +.DS L +.TS +center box; +l r r. +part time % of kernel +_ +self 14.3 ms/call 11.3% +child 9.9 ms/call 7.9% +_ +total 24.2 ms/call 19.2% +.TE +.ce +Table 4. Call times for \fInamei\fP in 4.2BSD. +.DE +.KE +.NH 3 +Clock processing +.PP +Nearly 25% of the time spent in the kernel is spent in the clock +processing routines. +(This is a clear indication that to avoid sampling bias when profiling the +kernel with our tools +we need to drive them from an independent clock.) +These routines are responsible for implementing timeouts, +scheduling the processor, +maintaining kernel statistics, +and tending various hardware operations such as +draining the terminal input silos. +Only minimal work is done in the hardware clock interrupt +routine (at high priority); the rest is performed (at a lower priority) +in a software interrupt handler scheduled by the hardware interrupt +handler. +In the worst case, with a clock rate of 100 Hz +and with every hardware interrupt scheduling a software +interrupt, the processor must field 200 interrupts per second. +The overhead of simply trapping and returning +is 3% of the machine cycles; +figuring out that there is nothing to do +requires an additional 2%. +.NH 3 +Terminal multiplexors +.PP +The terminal multiplexors supported by 4.2BSD have programmable receiver +silos that may be used in two ways. +With the silo disabled, each character received causes an interrupt +to the processor. +Enabling the receiver silo allows the silo to fill before +generating an interrupt, allowing multiple characters to be read +for each interrupt.
+At low rates of input, received characters will not be processed +for some time unless the silo is emptied periodically. +The 4.2BSD kernel uses the input silos of each terminal multiplexor, +and empties each silo on each clock interrupt. +This allows high input rates without the cost of per-character interrupts +while assuring low latency. +However, as character input rates on most machines are usually +low (about 25 characters per second), +this can result in excessive overhead. +At the current clock rate of 100 Hz, a machine with 5 terminal multiplexors +configured makes 500 calls to the receiver interrupt routines per second. +In addition, to achieve acceptable input latency +for flow control, each clock interrupt must schedule +a software interrupt to run the silo draining routines.\** +.FS +\** It is not possible to check the input silos at +the time of the actual clock interrupt without modifying the terminal +line disciplines, as the input queues may not be in a consistent state \**. +.FE +\** This implies that the worst case estimate for clock processing +is the basic overhead for clock processing. +.NH 3 +Process table management +.PP +In 4.2BSD there are numerous places in the kernel where a linear search +of the process table is performed: +.IP \(bu 3 +in \fIexit\fP to locate and wakeup a process's parent; +.IP \(bu 3 +in \fIwait\fP when searching for \fB\s-2ZOMBIE\s+2\fP and +\fB\s-2STOPPED\s+2\fP processes; +.IP \(bu 3 +in \fIfork\fP when allocating a new process table slot and +counting the number of processes already created by a user; +.IP \(bu 3 +in \fInewproc\fP, to verify +that a process id assigned to a new process is not currently +in use; +.IP \(bu 3 +in \fIkill\fP and \fIgsignal\fP to locate all processes to +which a signal should be delivered; +.IP \(bu 3 +in \fIschedcpu\fP when adjusting the process priorities every +second; and +.IP \(bu 3 +in \fIsched\fP when locating a process to swap out and/or swap +in. +.LP +These linear searches can incur significant overhead. The rule +for calculating the size of the process table is: +.ce +nproc = 20 + 8 * maxusers +.sp +that means a 48 user system will have a 404 slot process table. +With the addition of network services in 4.2BSD, as many as a dozen +server processes may be maintained simply to await incoming requests. +These servers are normally created at boot time which causes them +to be allocated slots near the beginning of the process table. This +means that process table searches under 4.2BSD are likely to take +significantly longer than under 4.1BSD. System profiling shows +that as much as 20% of the time spent in the kernel on a loaded +system (a VAX-11/780) can be spent in \fIschedcpu\fP and, on average, +5-10% of the kernel time is spent in \fIschedcpu\fP. +The other searches of the proc table are similarly affected. +This shows the system can no longer tolerate using linear searches of +the process table. +.NH 3 +File system buffer cache +.PP +The trace facilities described in section 2.3 were used +to gather statistics on the performance of the buffer cache. +We were interested in measuring the effectiveness of the +cache and the read-ahead policies. +With the file system block size in 4.2BSD four to +eight times that of a 4.1BSD file system, we were concerned +that large amounts of read-ahead might be performed without +being used. Also, we were interested in seeing if the +rules used to size the buffer cache at boot time were severely +affecting the overall cache operation. 
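+.PP
+For reference, the cache is keyed by device and block number, and the
+hit rate reported below is simply the fraction of block read requests
+satisfied without starting a disk transfer.  A minimal sketch of such a
+lookup (an illustration only, not the 4.2BSD code; the hash size and
+names here are invented) is:
+.DS
+#define BUFHASH 64                      /* hypothetical hash table size */
+
+struct buf {
+        struct buf *b_next;             /* hash chain link */
+        int        b_dev;               /* device containing the block */
+        long       b_blkno;             /* block number on that device */
+        char       *b_data;             /* cached contents */
+};
+
+struct buf *bufhash[BUFHASH];
+long cache_hits, cache_lookups;
+
+struct buf *
+lookup_cached_block(int dev, long blkno)
+{
+        struct buf *bp;
+
+        cache_lookups++;
+        for (bp = bufhash[blkno % BUFHASH]; bp != 0; bp = bp->b_next)
+                if (bp->b_dev == dev && bp->b_blkno == blkno) {
+                        cache_hits++;   /* no disk read needed */
+                        return (bp);
+                }
+        return (0);                     /* caller must read the disk */
+}
+.DE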
+.PP +The tracing package was run over a three-hour period during +a peak mid-afternoon interval on a VAX-11/780 with four megabytes +of physical memory. +This resulted in a buffer cache containing 400 kilobytes of memory +spread among 50 to 200 buffers +(the actual number of buffers depends on the size mix of +disk blocks being read at any given time). +The pertinent configuration information is shown in Table 5. +.KF +.DS L +.TS +center box; +l l l l. +Controller Drive Device File System +_ +DEC MASSBUS DEC RP06 hp0d /usr + hp0b swap +Emulex SC780 Fujitsu Eagle hp1a /usr/spool/news + hp1b swap + hp1e /usr/src + hp1d /u0 (users) + Fujitsu Eagle hp2a /tmp + hp2b swap + hp2d /u1 (users) + Fujitsu Eagle hp3a / +.TE +.ce +Table 5. Active file systems during buffer cache tests. +.DE +.KE +.PP +During the test period the load average ranged from 2 to 13 +with an average of 5. +The system had no idle time, 43% user time, and 57% system time. +The system averaged 90 interrupts per second +(excluding the system clock interrupts), +220 system calls per second, +and 50 context switches per second (40 voluntary, 10 involuntary). +.PP +The active virtual memory (the sum of the address space sizes of +all jobs that have run in the previous twenty seconds) +over the period ranged from 2 to 6 megabytes with an average +of 3.5 megabytes. +There was no swapping, though the page daemon was inspecting +about 25 pages per second. +.PP +On average, 250 requests to read disk blocks were initiated +per second. +These include read requests for file blocks made by user +programs as well as requests initiated by the system. +System reads include requests for indexing information to determine +where a file's next data block resides, +file system layout maps to allocate new data blocks, +and requests for directory contents needed to do path name translations. +.PP +On average, an 85% cache hit rate was observed for read requests. +Thus only 37 disk reads were initiated per second. +In addition, 5 read-ahead requests were made each second, +filling about 20% of the buffer pool. +Despite the policies to rapidly reuse read-ahead buffers +that remain unclaimed, more than 90% of the read-ahead +buffers were used. +.PP +These measurements showed that the buffer cache was working +effectively. Independent tests have also shown that the size +of the buffer cache may be reduced significantly on memory-poor +systems without severe effects; +we have not yet tested this hypothesis ourselves [Shannon83]. +.NH 3 +Network subsystem +.PP +The overhead associated with the +network facilities found in 4.2BSD is often +difficult to gauge without profiling the system. +This is because most input processing is performed +in modules scheduled with software interrupts. +As a result, the system time spent performing protocol +processing is rarely attributed to the processes that +really receive the data. Since the protocols supported +by 4.2BSD can involve significant overhead, this was a serious +concern. Results from a profiled kernel show that an average +of 5% of the system time is spent +performing network input and timer processing in our environment +(a 3Mb/s Ethernet with most traffic using TCP). +This figure can vary significantly depending on +the network hardware used, the average message +size, and whether packet reassembly is required at the network +layer.
On one machine that we profiled over a 17-hour +period (our gateway to the ARPANET), +206,000 input messages accounted for 2.4% of the system time, +while another 0.6% of the system time was spent performing +protocol timer processing. +This machine was configured with an ACC LH/DH IMP interface +and a DMA 3Mb/s Ethernet controller. +.PP +The performance of TCP over slower long-haul networks +was degraded substantially by two problems. +The first problem was a bug that prevented round-trip timing measurements +from being made, thus increasing retransmissions unnecessarily. +The second was a problem with the maximum segment size chosen by TCP, +which was well-tuned for Ethernet but poorly chosen for +the ARPANET, where it caused packet fragmentation. (The maximum +segment size was actually negotiated upwards to a value that +resulted in excessive fragmentation.) +.PP +When benchmarked in Ethernet environments, the main memory buffer management +of the network subsystem presented some performance anomalies. +The overhead of processing small ``mbufs'' severely affected throughput for a +substantial range of message sizes. +In spite of the fact that most system utilities made use of the throughput-optimal +1024-byte size, user processes faced large degradations for some +arbitrary sizes. This was especially true for TCP/IP transmissions [Cabrera84, +Cabrera85]. +.NH 3 +Virtual memory subsystem +.PP +We ran a set of tests intended to exercise the virtual +memory system under both 4.1BSD and 4.2BSD. +The tests are described in Table 6. +The test programs dynamically allocated +a 7.3 megabyte array (using \fIsbrk\fP\|(2)) then referenced +pages in the array either sequentially, in a purely random +fashion, or such that the distance between +successive pages accessed was randomly selected from a Gaussian +distribution. In the last case, successive runs were made with +increasing standard deviations. +.KF +.DS L +.TS +center box; +l | l. +Test Description +_ +seqpage sequentially touch pages, 10 iterations +seqpage-v as above, but first make \fIvadvise\fP\|(2) call +randpage touch random page 30,000 times +randpage-v as above, but first make \fIvadvise\fP call +gausspage.1 30,000 Gaussian accesses, standard deviation of 1 +gausspage.10 as above, standard deviation of 10 +gausspage.30 as above, standard deviation of 30 +gausspage.40 as above, standard deviation of 40 +gausspage.50 as above, standard deviation of 50 +gausspage.60 as above, standard deviation of 60 +gausspage.80 as above, standard deviation of 80 +gausspage.inf as above, standard deviation of 10,000 +.TE +.ce +Table 6. Paging benchmark programs. +.DE +.KE +.PP +The results in Table 7 show how the additional +memory requirements +of 4.2BSD can generate more work for the paging system. +Under 4.1BSD, +the system used 0.5 of the 4.5 megabytes of physical memory +on the test machine; +under 4.2BSD it used nearly 1 megabyte of physical memory.\** +.FS +\** The 4.1BSD system used for testing was really a 4.1a +system configured +with networking facilities and code to support +remote file access. The +4.2BSD system also included the remote file access code. +Since both +systems would be larger than similarly configured ``vanilla'' +4.1BSD or 4.2BSD systems, we consider our conclusions to still be valid. +.FE +This resulted in more page faults and, hence, more system time.
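+.PP
+The page-reference patterns behind these fault counts are simple; the
+\fIrandpage\fP test is sketched below (a sketch only; the actual sources
+appear in the Appendix, and the page size and array layout here are
+assumed).  The \fIgausspage\fP variants differ only in how the next page
+number is chosen.
+.DS
+#include <stdlib.h>
+#include <unistd.h>
+
+#define ARRAYSIZE (7300 * 1024L)        /* roughly 7.3 megabytes */
+#define PAGESIZE  1024                  /* assumed 4.2BSD page size */
+#define TOUCHES   30000
+
+int
+main(void)
+{
+        char *base;
+        long npages, i;
+
+        base = sbrk(ARRAYSIZE);         /* extend the data segment */
+        if (base == (char *)-1)
+                return (1);
+        npages = ARRAYSIZE / PAGESIZE;
+        for (i = 0; i < TOUCHES; i++)
+                base[(random() % npages) * PAGESIZE]++; /* fault a page in */
+        return (0);
+}
+.DE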
+.PP +To establish a common ground on which to compare the paging +routines of each system, we checked instead the average page fault +service times for those test runs that had a statistically significant +number of random page faults. These figures, shown in Table 8, indicate +no significant difference between the two systems in +the area of page fault servicing. We currently have +no explanation for the results of the sequential +paging tests. +.KF +.DS L +.TS +center box; +l || c s || c s || c s || c s +l || c s || c s || c s || c s +l || c | c || c | c || c | c || c | c +l || n | n || n | n || n | n || n | n. +Test Real User System Page Faults +\^ _ _ _ _ +\^ 4.1 4.2 4.1 4.2 4.1 4.2 4.1 4.2 += +seqpage 959 1126 16.7 12.8 197.0 213.0 17132 17113 +seqpage-v 579 812 3.8 5.3 216.0 237.7 8394 8351 +randpage 571 569 6.7 7.6 64.0 77.2 8085 9776 +randpage-v 572 562 6.1 7.3 62.2 77.5 8126 9852 +gausspage.1 25 24 23.6 23.8 0.8 0.8 8 8 +gausspage.10 26 26 22.7 23.0 3.2 3.6 2 2 +gausspage.30 34 33 25.0 24.8 8.6 8.9 2 2 +gausspage.40 42 81 23.9 25.0 11.5 13.6 3 260 +gausspage.50 113 175 24.2 26.2 19.6 26.3 784 1851 +gausspage.60 191 234 27.6 26.7 27.4 36.0 2067 3177 +gausspage.80 312 329 28.0 27.9 41.5 52.0 3933 5105 +gausspage.inf 619 621 82.9 85.6 68.3 81.5 8046 9650 +.TE +.ce +Table 7. Paging benchmark results (all times in seconds). +.DE +.KE +.KF +.DS L +.TS +center box; +c || c s || c s +c || c s || c s +c || c | c || c | c +l || n | n || n | n. +Test Page Faults PFST +\^ _ _ +\^ 4.1 4.2 4.1 4.2 += +randpage 8085 9776 791 789 +randpage-v 8126 9852 765 786 +gausspage.inf 8046 9650 848 844 +.TE +.ce +Table 8. Page fault service times (all times in microseconds). +.DE +.KE