summaryrefslogtreecommitdiff
path: root/share/doc/smm/05.fastfs/3.t
diff options
context:
space:
mode:
Diffstat (limited to 'share/doc/smm/05.fastfs/3.t')
-rw-r--r--share/doc/smm/05.fastfs/3.t592
1 files changed, 0 insertions, 592 deletions
diff --git a/share/doc/smm/05.fastfs/3.t b/share/doc/smm/05.fastfs/3.t
deleted file mode 100644
index 813b0e6f6ce..00000000000
--- a/share/doc/smm/05.fastfs/3.t
+++ /dev/null
@@ -1,592 +0,0 @@
-.\" $OpenBSD: 3.t,v 1.3 2003/06/02 23:30:11 millert Exp $
-.\"
-.\" Copyright (c) 1986, 1993
-.\" The Regents of the University of California. All rights reserved.
-.\"
-.\" Redistribution and use in source and binary forms, with or without
-.\" modification, are permitted provided that the following conditions
-.\" are met:
-.\" 1. Redistributions of source code must retain the above copyright
-.\" notice, this list of conditions and the following disclaimer.
-.\" 2. Redistributions in binary form must reproduce the above copyright
-.\" notice, this list of conditions and the following disclaimer in the
-.\" documentation and/or other materials provided with the distribution.
-.\" 3. Neither the name of the University nor the names of its contributors
-.\" may be used to endorse or promote products derived from this software
-.\" without specific prior written permission.
-.\"
-.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
-.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
-.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
-.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
-.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
-.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
-.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
-.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
-.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
-.\" SUCH DAMAGE.
-.\"
-.\" @(#)3.t 8.1 (Berkeley) 6/8/93
-.\"
-.ds RH New file system
-.NH
-New file system organization
-.PP
-In the new file system organization (as in the
-old file system organization),
-each disk drive contains one or more file systems.
-A file system is described by its super-block,
-located at the beginning of the file system's disk partition.
-Because the super-block contains critical data,
-it is replicated to protect against catastrophic loss.
-This is done when the file system is created;
-since the super-block data does not change,
-the copies need not be referenced unless a head crash
-or other hard disk error causes the default super-block
-to be unusable.
-.PP
-To insure that it is possible to create files as large as
-$2 sup 32$ bytes with only two levels of indirection,
-the minimum size of a file system block is 4096 bytes.
-The size of file system blocks can be any power of two
-greater than or equal to 4096.
-The block size of a file system is recorded in the
-file system's super-block
-so it is possible for file systems with different block sizes
-to be simultaneously accessible on the same system.
-The block size must be decided at the time that
-the file system is created;
-it cannot be subsequently changed without rebuilding the file system.
-.PP
-The new file system organization divides a disk partition
-into one or more areas called
-.I "cylinder groups".
-A cylinder group is comprised of one or more consecutive
-cylinders on a disk.
-Associated with each cylinder group is some bookkeeping information
-that includes a redundant copy of the super-block,
-space for inodes,
-a bit map describing available blocks in the cylinder group,
-and summary information describing the usage of data blocks
-within the cylinder group.
-The bit map of available blocks in the cylinder group replaces
-the traditional file system's free list.
-For each cylinder group a static number of inodes
-is allocated at file system creation time.
-The default policy is to allocate one inode for each 2048
-bytes of space in the cylinder group, expecting this
-to be far more than will ever be needed.
-.PP
-All the cylinder group bookkeeping information could be
-placed at the beginning of each cylinder group.
-However if this approach were used,
-all the redundant information would be on the top platter.
-A single hardware failure that destroyed the top platter
-could cause the loss of all redundant copies of the super-block.
-Thus the cylinder group bookkeeping information
-begins at a varying offset from the beginning of the cylinder group.
-The offset for each successive cylinder group is calculated to be
-about one track further from the beginning of the cylinder group
-than the preceding cylinder group.
-In this way the redundant
-information spirals down into the pack so that any single track, cylinder,
-or platter can be lost without losing all copies of the super-block.
-Except for the first cylinder group,
-the space between the beginning of the cylinder group
-and the beginning of the cylinder group information
-is used for data blocks.\(dg
-.FS
-\(dg While it appears that the first cylinder group could be laid
-out with its super-block at the ``known'' location,
-this would not work for file systems
-with blocks sizes of 16 kilobytes or greater.
-This is because of a requirement that the first 8 kilobytes of the disk
-be reserved for a bootstrap program and a separate requirement that
-the cylinder group information begin on a file system block boundary.
-To start the cylinder group on a file system block boundary,
-file systems with block sizes larger than 8 kilobytes
-would have to leave an empty space between the end of
-the boot block and the beginning of the cylinder group.
-Without knowing the size of the file system blocks,
-the system would not know what roundup function to use
-to find the beginning of the first cylinder group.
-.FE
-.NH 2
-Optimizing storage utilization
-.PP
-Data is laid out so that larger blocks can be transferred
-in a single disk transaction, greatly increasing file system throughput.
-As an example, consider a file in the new file system
-composed of 4096 byte data blocks.
-In the old file system this file would be composed of 1024 byte blocks.
-By increasing the block size, disk accesses in the new file
-system may transfer up to four times as much information per
-disk transaction.
-In large files, several
-4096 byte blocks may be allocated from the same cylinder so that
-even larger data transfers are possible before requiring a seek.
-.PP
-The main problem with
-larger blocks is that most UNIX
-file systems are composed of many small files.
-A uniformly large block size wastes space.
-Table 1 shows the effect of file system
-block size on the amount of wasted space in the file system.
-The files measured to obtain these figures reside on
-one of our time sharing
-systems that has roughly 1.2 gigabytes of on-line storage.
-The measurements are based on the active user file systems containing
-about 920 megabytes of formatted space.
-.KF
-.DS B
-.TS
-box;
-l|l|l
-a|n|l.
-Space used % waste Organization
-_
-775.2 Mb 0.0 Data only, no separation between files
-807.8 Mb 4.2 Data only, each file starts on 512 byte boundary
-828.7 Mb 6.9 Data + inodes, 512 byte block UNIX file system
-866.5 Mb 11.8 Data + inodes, 1024 byte block UNIX file system
-948.5 Mb 22.4 Data + inodes, 2048 byte block UNIX file system
-1128.3 Mb 45.6 Data + inodes, 4096 byte block UNIX file system
-.TE
-Table 1 \- Amount of wasted space as a function of block size.
-.DE
-.KE
-The space wasted is calculated to be the percentage of space
-on the disk not containing user data.
-As the block size on the disk
-increases, the waste rises quickly, to an intolerable
-45.6% waste with 4096 byte file system blocks.
-.PP
-To be able to use large blocks without undue waste,
-small files must be stored in a more efficient way.
-The new file system accomplishes this goal by allowing the division
-of a single file system block into one or more
-.I "fragments".
-The file system fragment size is specified
-at the time that the file system is created;
-each file system block can optionally be broken into
-2, 4, or 8 fragments, each of which is addressable.
-The lower bound on the size of these fragments is constrained
-by the disk sector size,
-typically 512 bytes.
-The block map associated with each cylinder group
-records the space available in a cylinder group
-at the fragment level;
-to determine if a block is available, aligned fragments are examined.
-Figure 1 shows a piece of a map from a 4096/1024 file system.
-.KF
-.DS B
-.TS
-box;
-l|c c c c.
-Bits in map XXXX XXOO OOXX OOOO
-Fragment numbers 0-3 4-7 8-11 12-15
-Block numbers 0 1 2 3
-.TE
-Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system.
-.DE
-.KE
-Each bit in the map records the status of a fragment;
-an ``X'' shows that the fragment is in use,
-while a ``O'' shows that the fragment is available for allocation.
-In this example,
-fragments 0\-5, 10, and 11 are in use,
-while fragments 6\-9, and 12\-15 are free.
-Fragments of adjoining blocks cannot be used as a full block,
-even if they are large enough.
-In this example,
-fragments 6\-9 cannot be allocated as a full block;
-only fragments 12\-15 can be coalesced into a full block.
-.PP
-On a file system with a block size of 4096 bytes
-and a fragment size of 1024 bytes,
-a file is represented by zero or more 4096 byte blocks of data,
-and possibly a single fragmented block.
-If a file system block must be fragmented to obtain
-space for a small amount of data,
-the remaining fragments of the block are made
-available for allocation to other files.
-As an example consider an 11000 byte file stored on
-a 4096/1024 byte file system.
-This file would uses two full size blocks and one
-three fragment portion of another block.
-If no block with three aligned fragments is
-available at the time the file is created,
-a full size block is split yielding the necessary
-fragments and a single unused fragment.
-This remaining fragment can be allocated to another file as needed.
-.PP
-Space is allocated to a file when a program does a \fIwrite\fP
-system call.
-Each time data is written to a file, the system checks to see if
-the size of the file has increased*.
-.FS
-* A program may be overwriting data in the middle of an existing file
-in which case space would already have been allocated.
-.FE
-If the file needs to be expanded to hold the new data,
-one of three conditions exists:
-.IP 1)
-There is enough space left in an already allocated
-block or fragment to hold the new data.
-The new data is written into the available space.
-.IP 2)
-The file contains no fragmented blocks (and the last
-block in the file
-contains insufficient space to hold the new data).
-If space exists in a block already allocated,
-the space is filled with new data.
-If the remainder of the new data contains more than
-a full block of data, a full block is allocated and
-the first full block of new data is written there.
-This process is repeated until less than a full block
-of new data remains.
-If the remaining new data to be written will
-fit in less than a full block,
-a block with the necessary fragments is located,
-otherwise a full block is located.
-The remaining new data is written into the located space.
-.IP 3)
-The file contains one or more fragments (and the
-fragments contain insufficient space to hold the new data).
-If the size of the new data plus the size of the data
-already in the fragments exceeds the size of a full block,
-a new block is allocated.
-The contents of the fragments are copied
-to the beginning of the block
-and the remainder of the block is filled with new data.
-The process then continues as in (2) above.
-Otherwise, if the new data to be written will
-fit in less than a full block,
-a block with the necessary fragments is located,
-otherwise a full block is located.
-The contents of the existing fragments
-appended with the new data
-are written into the allocated space.
-.PP
-The problem with expanding a file one fragment at a
-a time is that data may be copied many times as a
-fragmented block expands to a full block.
-Fragment reallocation can be minimized
-if the user program writes a full block at a time,
-except for a partial block at the end of the file.
-Since file systems with different block sizes may reside on
-the same system,
-the file system interface has been extended to provide
-application programs the optimal size for a read or write.
-For files the optimal size is the block size of the file system
-on which the file is being accessed.
-For other objects, such as pipes and sockets,
-the optimal size is the underlying buffer size.
-This feature is used by the Standard
-Input/Output Library,
-a package used by most user programs.
-This feature is also used by
-certain system utilities such as archivers and loaders
-that do their own input and output management
-and need the highest possible file system bandwidth.
-.PP
-The amount of wasted space in the 4096/1024 byte new file system
-organization is empirically observed to be about the same as in the
-1024 byte old file system organization.
-A file system with 4096 byte blocks and 512 byte fragments
-has about the same amount of wasted space as the 512 byte
-block UNIX file system.
-The new file system uses less space
-than the 512 byte or 1024 byte
-file systems for indexing information for
-large files and the same amount of space
-for small files.
-These savings are offset by the need to use
-more space for keeping track of available free blocks.
-The net result is about the same disk utilization
-when a new file system's fragment size
-equals an old file system's block size.
-.PP
-In order for the layout policies to be effective,
-a file system cannot be kept completely full.
-For each file system there is a parameter, termed
-the free space reserve, that
-gives the minimum acceptable percentage of file system
-blocks that should be free.
-If the number of free blocks drops below this level
-only the system administrator can continue to allocate blocks.
-The value of this parameter may be changed at any time,
-even when the file system is mounted and active.
-The transfer rates that appear in section 4 were measured on file
-systems kept less than 90% full (a reserve of 10%).
-If the number of free blocks falls to zero,
-the file system throughput tends to be cut in half,
-because of the inability of the file system to localize
-blocks in a file.
-If a file system's performance degrades because
-of overfilling, it may be restored by removing
-files until the amount of free space once again
-reaches the minimum acceptable level.
-Access rates for files created during periods of little
-free space may be restored by moving their data once enough
-space is available.
-The free space reserve must be added to the
-percentage of waste when comparing the organizations given
-in Table 1.
-Thus, the percentage of waste in
-an old 1024 byte UNIX file system is roughly
-comparable to a new 4096/512 byte file system
-with the free space reserve set at 5%.
-(Compare 11.8% wasted with the old file system
-to 6.9% waste + 5% reserved space in the
-new file system.)
-.NH 2
-File system parameterization
-.PP
-Except for the initial creation of the free list,
-the old file system ignores the parameters of the underlying hardware.
-It has no information about either the physical characteristics
-of the mass storage device,
-or the hardware that interacts with it.
-A goal of the new file system is to parameterize the
-processor capabilities and
-mass storage characteristics
-so that blocks can be allocated in an
-optimum configuration-dependent way.
-Parameters used include the speed of the processor,
-the hardware support for mass storage transfers,
-and the characteristics of the mass storage devices.
-Disk technology is constantly improving and
-a given installation can have several different disk technologies
-running on a single processor.
-Each file system is parameterized so that it can be
-adapted to the characteristics of the disk on which
-it is placed.
-.PP
-For mass storage devices such as disks,
-the new file system tries to allocate new blocks
-on the same cylinder as the previous block in the same file.
-Optimally, these new blocks will also be
-rotationally well positioned.
-The distance between ``rotationally optimal'' blocks varies greatly;
-it can be a consecutive block
-or a rotationally delayed block
-depending on system characteristics.
-On a processor with an input/output channel that does not require
-any processor intervention between mass storage transfer requests,
-two consecutive disk blocks can often be accessed
-without suffering lost time because of an intervening disk revolution.
-For processors without input/output channels,
-the main processor must field an interrupt and
-prepare for a new disk transfer.
-The expected time to service this interrupt and
-schedule a new disk transfer depends on the
-speed of the main processor.
-.PP
-The physical characteristics of each disk include
-the number of blocks per track and the rate at which
-the disk spins.
-The allocation routines use this information to calculate
-the number of milliseconds required to skip over a block.
-The characteristics of the processor include
-the expected time to service an interrupt and schedule a
-new disk transfer.
-Given a block allocated to a file,
-the allocation routines calculate the number of blocks to
-skip over so that the next block in the file will
-come into position under the disk head in the expected
-amount of time that it takes to start a new
-disk transfer operation.
-For programs that sequentially access large amounts of data,
-this strategy minimizes the amount of time spent waiting for
-the disk to position itself.
-.PP
-To ease the calculation of finding rotationally optimal blocks,
-the cylinder group summary information includes
-a count of the available blocks in a cylinder
-group at different rotational positions.
-Eight rotational positions are distinguished,
-so the resolution of the
-summary information is 2 milliseconds for a typical 3600
-revolution per minute drive.
-The super-block contains a vector of lists called
-.I "rotational layout tables".
-The vector is indexed by rotational position.
-Each component of the vector
-lists the index into the block map for every data block contained
-in its rotational position.
-When looking for an allocatable block,
-the system first looks through the summary counts for a rotational
-position with a non-zero block count.
-It then uses the index of the rotational position to find the appropriate
-list to use to index through
-only the relevant parts of the block map to find a free block.
-.PP
-The parameter that defines the
-minimum number of milliseconds between the completion of a data
-transfer and the initiation of
-another data transfer on the same cylinder
-can be changed at any time,
-even when the file system is mounted and active.
-If a file system is parameterized to lay out blocks with
-a rotational separation of 2 milliseconds,
-and the disk pack is then moved to a system that has a
-processor requiring 4 milliseconds to schedule a disk operation,
-the throughput will drop precipitously because of lost disk revolutions
-on nearly every block.
-If the eventual target machine is known,
-the file system can be parameterized for it
-even though it is initially created on a different processor.
-Even if the move is not known in advance,
-the rotational layout delay can be reconfigured after the disk is moved
-so that all further allocation is done based on the
-characteristics of the new host.
-.NH 2
-Layout policies
-.PP
-The file system layout policies are divided into two distinct parts.
-At the top level are global policies that use file system
-wide summary information to make decisions regarding
-the placement of new inodes and data blocks.
-These routines are responsible for deciding the
-placement of new directories and files.
-They also calculate rotationally optimal block layouts,
-and decide when to force a long seek to a new cylinder group
-because there are insufficient blocks left
-in the current cylinder group to do reasonable layouts.
-Below the global policy routines are
-the local allocation routines that use a locally optimal scheme to
-lay out data blocks.
-.PP
-Two methods for improving file system performance are to increase
-the locality of reference to minimize seek latency
-as described by [Trivedi80], and
-to improve the layout of data to make larger transfers possible
-as described by [Nevalainen77].
-The global layout policies try to improve performance
-by clustering related information.
-They cannot attempt to localize all data references,
-but must also try to spread unrelated data
-among different cylinder groups.
-If too much localization is attempted,
-the local cylinder group may run out of space
-forcing the data to be scattered to non-local cylinder groups.
-Taken to an extreme,
-total localization can result in a single huge cluster of data
-resembling the old file system.
-The global policies try to balance the two conflicting
-goals of localizing data that is concurrently accessed
-while spreading out unrelated data.
-.PP
-One allocatable resource is inodes.
-Inodes are used to describe both files and directories.
-Inodes of files in the same directory are frequently accessed together.
-For example, the ``list directory'' command often accesses
-the inode for each file in a directory.
-The layout policy tries to place all the inodes of
-files in a directory in the same cylinder group.
-To ensure that files are distributed throughout the disk,
-a different policy is used for directory allocation.
-A new directory is placed in a cylinder group that has a greater
-than average number of free inodes,
-and the smallest number of directories already in it.
-The intent of this policy is to allow the inode clustering policy
-to succeed most of the time.
-The allocation of inodes within a cylinder group is done using a
-next free strategy.
-Although this allocates the inodes randomly within a cylinder group,
-all the inodes for a particular cylinder group can be read with
-8 to 16 disk transfers.
-(At most 16 disk transfers are required because a cylinder
-group may have no more than 2048 inodes.)
-This puts a small and constant upper bound on the number of
-disk transfers required to access the inodes
-for all the files in a directory.
-In contrast, the old file system typically requires
-one disk transfer to fetch the inode for each file in a directory.
-.PP
-The other major resource is data blocks.
-Since data blocks for a file are typically accessed together,
-the policy routines try to place all data
-blocks for a file in the same cylinder group,
-preferably at rotationally optimal positions in the same cylinder.
-The problem with allocating all the data blocks
-in the same cylinder group is that large files will
-quickly use up available space in the cylinder group,
-forcing a spill over to other areas.
-Further, using all the space in a cylinder group
-causes future allocations for any file in the cylinder group
-to also spill to other areas.
-Ideally none of the cylinder groups should ever become completely full.
-The heuristic solution chosen is to
-redirect block allocation
-to a different cylinder group
-when a file exceeds 48 kilobytes,
-and at every megabyte thereafter.*
-.FS
-* The first spill over point at 48 kilobytes is the point
-at which a file on a 4096 byte block file system first
-requires a single indirect block. This appears to be
-a natural first point at which to redirect block allocation.
-The other spillover points are chosen with the intent of
-forcing block allocation to be redirected when a
-file has used about 25% of the data blocks in a cylinder group.
-In observing the new file system in day to day use, the heuristics appear
-to work well in minimizing the number of completely filled
-cylinder groups.
-.FE
-The newly chosen cylinder group is selected from those cylinder
-groups that have a greater than average number of free blocks left.
-Although big files tend to be spread out over the disk,
-a megabyte of data is typically accessible before
-a long seek must be performed,
-and the cost of one long seek per megabyte is small.
-.PP
-The global policy routines call local allocation routines with
-requests for specific blocks.
-The local allocation routines will
-always allocate the requested block
-if it is free, otherwise it
-allocates a free block of the requested size that is
-rotationally closest to the requested block.
-If the global layout policies had complete information,
-they could always request unused blocks and
-the allocation routines would be reduced to simple bookkeeping.
-However, maintaining complete information is costly;
-thus the implementation of the global layout policy
-uses heuristics that employ only partial information.
-.PP
-If a requested block is not available, the local allocator uses
-a four level allocation strategy:
-.IP 1)
-Use the next available block rotationally closest
-to the requested block on the same cylinder. It is assumed
-here that head switching time is zero. On disk
-controllers where this is not the case, it may be possible
-to incorporate the time required to switch between disk platters
-when constructing the rotational layout tables. This, however,
-has not yet been tried.
-.IP 2)
-If there are no blocks available on the same cylinder,
-use a block within the same cylinder group.
-.IP 3)
-If that cylinder group is entirely full,
-quadratically hash the cylinder group number to choose
-another cylinder group to look for a free block.
-.IP 4)
-Finally if the hash fails, apply an exhaustive search
-to all cylinder groups.
-.PP
-Quadratic hash is used because of its speed in finding
-unused slots in nearly full hash tables [Knuth75].
-File systems that are parameterized to maintain at least
-10% free space rarely use this strategy.
-File systems that are run without maintaining any free
-space typically have so few free blocks that almost any
-allocation is random;
-the most important characteristic of
-the strategy used under such conditions is that the strategy be fast.
-.ds RH Performance
-.sp 2
-.ne 1i