X3T9.2/88-139

NOTE: This document was generated by William E. Burr of NIST (was NBS). The August working group requested that it be distributed to X3T9.2 as a possible basis for the Direct-access Device model. Bill gave me an ASCII version of his file, which I have reformatted for WordStar. The figures are not included in this version. -- John Lohmeyer

A Rotating Direct Access Storage Functional Reference Model

Introduction

This model provides a conceptual framework for describing the functions of direct access rotating storage, such as magnetic and optical disk drives, used in computer systems, for the purpose of describing where I/O interfaces lie in the hierarchy of functions. It is intended to be descriptive of the way in which such storage systems are conventionally attached to computer systems.

At the high end of this model there is always an application process running on a computer and requesting access to data. At the very bottom there is always some rotating recording surface containing bit serial tracks of information, read by some physical process (such as magnetization of ferromagnetic particles). Access involves positioning a transducer at the desired location along a planar path parallel to the rotating surface (normally by motion using a device generically called an actuator, or, less commonly, by selecting one of many fixed transducers) and then rotating the surface until the desired location passes under the transducer. Most commonly the rotating surface is a disk; in some cases, however, it is a cylinder.

The remaining functions may be split in many ways among the software of the host computer, the logic and programs of various intermediate controllers, and the logic or processors actually imbedded in the drive itself. Various interface standards facilitate this splitting of functions at different levels.
The model, shown in Figure 1, is split into two parallel sides: the Positioning Side, involved with positioning the transducer and locating specific blocks of data, and the Data Side, concerned with reading and writing that data. The Data Side may feed positioning information back to the Positioning Side.

Physical Recording Level

Actuator

The tracks of data may be either concentric circles or spirals. In either case some actuator, typically a stepper motor in open loop systems or a servo motor in closed loop systems, is used to position the transducer(s) on or over some data track. In magnetic disks a number of parallel surfaces are frequently attached to a single shaft, and the actuator moves a number of transducers, one or more for each surface, all at the same time. There may be more than one actuator per rotating shaft or spindle, in which case each actuator usually accesses nonoverlapping, concentric bands of data and is considered to be a separate logical storage device. Magnetic disks nearly always use a concentric circular track arrangement, while optical disks frequently use spiral tracks.

With multiple surfaces, the tracks are conventionally considered to be arranged as cylinders of tracks, one on each surface at the same distance from the axis of rotation, and positioning involves the selection of the particular transducer associated with the desired data location. Drives have been built which activate several transducers in parallel; they are not common. Some synchronizing signal, detected either by the data transducer or by some separate fixed transducer, is used to find the "index" point on the track, and data is then often located by rotational position from that point.

Recording

Two types of recording are in common use: magnetic and optical.
In magnetic recording, the transducer must be in close proximity to the recording surface, either in contact, as in flexible disk devices, or "flying" on a thin air bearing above the recording surface, as in most hard disks. Optical drives can locate the transducer at some distance from the surface. In magnetic recording an electric current in the transducer, called a "head," is used to magnetize material in the recording surface. In writable optical drives a laser beam is used to produce some change in the state of the surface, often a bubble or pit. In CD ROM drives a stamping process is used to produce pits in the surface.

When reading magnetic drives, a change in the direction of magnetization, that is, a flux transition, induces a signal in the head. An optical surface is read by shining a laser beam, less intense than that required to write, on the surface. In general, what is detected in either case is a state transition, whether of the direction of magnetization or of the optical characteristics of the surface. The surface is considered to be divided into a number of equally sized "bit cells"; a transition detected in a bit cell is usually considered to be a code one, while no transition is a zero.

Coding Level

While the Physical Recording Level is concerned primarily with positioning the transducer and using it to produce or sense recording medium state transitions, the Coding Level converts data to the required states, and state transitions back to data, as well as, in some cases, using signals in the recording surface to adjust the position of the actuator.

Position Control Function

Open loop actuators require no feedback from the recording surface itself for positioning; however, the track to track densities and positioning times which can be achieved with open loop systems are limited. Every shaft wobbles a little as it spins. Materials in the recording surface and the actuator expand and contract with changes in temperature.
There are limits to how precisely a stepper motor can position itself. Closed loop systems use a signal on the recording surface itself, or on a separate reference surface, or both, to generate a feedback signal for a servo motor actuator and actively follow the track as the surface rotates. Much higher track to track densities are possible with closed loop positioners.

Position control signals between the Position Control and Actuator functions in open loop systems simply consist of a series of plus or minus step pulses, one for each track to be moved in or out. In closed loop systems, a continuous plus or minus current is supplied to the servo motor until the transducer is properly positioned, and whenever any deviation from the track is sensed. The input to the Position Control function from the higher mapping layers is typically the binary number of the desired cylinder position.

Data Separator Function

The bits formed by the transitions on the recording surface are not themselves data bits; they are code bits. At high recording densities, it is impossible to maintain a sufficiently precise time base to tell whether there are precisely 1000 zeroes in a row, or 1001. Therefore a clock signal must be combined with the data signal. Moreover, the formatter will require certain "out of band" signals to be used as delimiters to mark the start of blocks on the tracks. The Data Separator, sometimes called the encoder/decoder, converts data bits to code bits and vice versa. The codes employed are called "self clocking" codes; they achieve self clocking, in general, by limiting the number of consecutive code zeros which are allowed to some small number. Early self clocking codes were quite inefficient, requiring two flux transitions per data bit. The fundamental limit on recording densities is caused by physical phenomena which limit how close together on the medium transitions can physically occur.
Yet the bandwidth of the transducer and the read/write channel can often resolve significantly higher frequencies than could be produced by placing transitions as close together as the medium and transducers allow. Therefore a class of codes generally called Run Length Limited (RLL) codes has been developed which guarantees that the transitions are no closer together than allowed, but which commonly achieves information storage densities as high as 1.5 data bits per minimum flux transition distance. These codes do this by maintaining a code clock with a period less than the minimum physically possible time interval between transitions, and by guaranteeing that at least one (or more, depending upon the code) code zero (no transition) follows each code one (transition). This requires very precise, stable clock recovery circuitry, high bandwidth read/write channels, and quite complex, often adaptive, encoder-decoder circuits, but results in dramatic increases in storage capacity.

The interface between the Coding and Physical Recording Levels carries code bits which represent medium state transitions. The interface between the Format and Coding Levels carries uncoded data, clock, and out of band synchronizing signals which mark the start of separate blocks.

Format Level

Data are stored on the recording surface in separate blocks or sectors which can be written independently. A "gap" is required between sectors to permit independent write operations. Figure 2 illustrates the format of a typical sector, which in fact consists of two independently written blocks: a header block and a data block. Each block begins with a sync field, which is used to lock the recovered clock to the clock used to record. This is followed by a special starting delimiter code, which marks the start of the block.
The header block typically contains an identifier field, specifically identifying the block address; a status field, identifying whether the sector is good or the medium contains known flaws in the sector and should not be used; an error detecting code to ensure that the block was read correctly; and a trailer field, which is used to provide a continuing read clock for the decoder for a short period following the end of the block. The header block may also contain fields used to indicate the address of a replacement block, should the data block contain a flaw. In some variable block systems, the header block also contains a length field, giving the length of the following data block.

The data block is preceded by the gap separating it from the header, and begins with a sync field and a starting delimiter. Then comes the actual user data field, typically followed by an error correcting code (ECC) and a trailer.

The surface of magnetic disks is carefully tested, and defective sectors are flagged in the header and generally are not used. However, defects may be missed or may occur at later times due to chemical changes, contamination, or physical damage caused by contact of the head with the medium (sometimes called "head crashes"). Thus errors may occur even after a block has been written and successfully read. Sophisticated error correcting codes are written with the data in an attempt to overcome this problem. They can be used to detect errors in the recorded data and, if the errors are not too extensive, to correct them.

Some format data may be written at the factory (particularly for optical media). Often there is a "hard" format process which uses the Formatter to write the necessary header blocks for the entire device. Subsequent application level data access does not ordinarily change this header information.
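The sector format just described can be sketched in program form. The field names and byte counts below are purely illustrative assumptions, chosen only to show the ordering of the fields within the two independently written blocks; they are not taken from any particular drive or standard.

```python
# Illustrative sketch of a sector: a header block and a data block,
# each independently written with its own sync field and starting
# delimiter.  All sizes are hypothetical examples.

HEADER_FIELDS = [
    ("sync", 12),        # locks the recovered clock to the recording clock
    ("delimiter", 1),    # out-of-band mark for the start of the block
    ("id", 4),           # cylinder/head/sector address of this sector
    ("status", 1),       # good / defective / reassigned flags
    ("crc", 2),          # error detecting code over the header
    ("trailer", 1),      # keeps the read clock running past block end
]

DATA_FIELDS = [
    ("sync", 12),
    ("delimiter", 1),
    ("data", 512),       # the user data field
    ("ecc", 6),          # error correcting code over the data
    ("trailer", 1),
]

def sector_length(gap=15):
    """Total bytes in one sector: header block, inter-block gap
    (permitting independent writes), and data block."""
    header = sum(size for _, size in HEADER_FIELDS)
    data = sum(size for _, size in DATA_FIELDS)
    return header + gap + data
```

With these example sizes, each 512 byte user data field costs a further 56 bytes of format overhead, which is why larger data fields yield higher usable capacity from the same surface.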
There is also sometimes a file level "soft" format operation in some systems which imposes some further logical file structure on the data sectors, but does not change the headers. One must be particularly careful in performing a hard format operation to preserve surface defect information; this is a dangerous operation for the uninitiated. Soft formats may destroy user information, but rarely render the device unusable. Soft format operations may also generate another level of defect mapping in the file system.

Note also that logically successive sectors are not necessarily physically successive. This mapping of logically successive sectors, called interleaving, usually takes place at hard format time and can have a dramatic effect on system performance.

The Format Level interface to the Coding Level passes data bits, synchronizing signals, and clock signals. It operates in exact lock step with the rotation of the medium surface. The Formatter may convert the bit serial data of the Coding Level to byte or word parallel data, including a serializer/deserializer function (often called a SERDES). If so, this implies a small amount of buffering in the Format Level, and a slight decoupling of the interface to the Storage Control Level from the rotation of the medium. Still, at this point the two are fairly tightly bound, and this is still very much a "real time" interface. The data transferred between the Storage Control and Format Levels, however, has been broken into identified fields as described above.

Primary Error Handling Level

At this level, recording errors are detected and sometimes corrected, and known defective sectors are remapped to good media.

Defect Mapping Function

Storage media contain defective spots or areas which cannot be read or written properly. In magnetic media these are determined by extensive testing of the medium, and bad spots can generally be located.
Magnetic disk drives generally come from the factory with a list of known bad spots, generally expressed as cylinder, head, and bits from index. New bad spots may later occur due to chemical changes, contamination, or head crashes. In most magnetic disk systems a status field is included in the header of every sector to flag sectors containing known defects. Some strategy is then used to reassign the sector to some reserve or "spare" sectors. There may also be a header block for the entire track, often called "R0". This may flag defective tracks or specific bad spots on the track.

A wide variety of strategies is used to remap defective sectors to good locations or to avoid bad spots. There may be a spare sector or sectors on the same track, there may be separate cylinders assigned to spares, or there may be some combination. The header may identify the specific location of the reassigned sector, or it may simply flag the sector as defective, and some automatic algorithm may then be used to locate the replacement sector. Whatever method is used, when an application tries to access a defective sector, the Error Detection Function detects the error flag and the Defect Mapping Function finds the replacement sector.

The problem is more difficult with write once optical media, since the medium cannot be fully tested until after data is written upon it. Optical recording generates much higher "raw" error rates than are tolerated in magnetic recording. Powerful error correcting codes are recorded with the data. Some level of correctable errors is generally tolerated. Nevertheless, for safety, it is generally necessary to read the data immediately after it is written, to determine whether it was satisfactorily recorded. Some optical drives can do this on the same rotation as the write, while others require a second rotation (magnetic drives almost never automatically check data after a write).
Some ordinarily unwritten field in the data must then be used to flag a sector which is found to be defective, and it is reassigned to another location on the medium.

Error Detection Function

When data is recorded, powerful error detecting and error correcting codes are generated by circuitry in this function and recorded with the data. When the data is read, the codes are regenerated in similar fashion, and the regenerated ECC is effectively compared to the recorded code. If they are not the same, there has been an error. Most of these ECCs are devised so that they can be used to correct some errors which they detect. If there are enough errors in a sector, there is a very small possibility that a really major error will be undetected. With the codes used in magnetic disks, the calculation of the correction, in the rare event of an error, is usually a time consuming, higher level software process.

Error correction is a particular concern with write once optical media. There is no way to fully test a write once surface without destroying it. Moreover, raw channel error rates for optical media are much higher than for magnetic media. Therefore very elaborate, extensive, interleaved error correction facilities, which can correct many errors in the same block, are routine in optical devices. Some codes may be used which facilitate real time correction of errors in hardware, as the data is read (Hamming codes are the simplest example). A second level of software error correction may also be applied. A substantial fraction of the total recorded bits on the recording medium are typically devoted to error correction. This is required because of the high code bit error rates. Nonetheless, the code bit areal densities which can be achieved with optical recording are so high that the achievable data bit densities after correction are still much higher than for magnetic media, and the error rates, after correction, are acceptable.
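The generate-and-compare operation described above can be illustrated with a simple cyclic redundancy check. Real drives use far more powerful codes (Fire and Reed-Solomon type codes capable of correction), so the CRC-16/CCITT polynomial below is a simplified stand-in showing only the detection mechanism: the same code is regenerated on read and compared with the recorded one.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT (polynomial 0x1021, initial value 0xFFFF):
    the same kind of shift-register calculation a drive's feedback
    shift register hardware performs at the recording rate."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def write_block(data: bytes) -> bytes:
    """On write, generate the check code and record it with the data."""
    crc = crc16_ccitt(data)
    return data + bytes([crc >> 8, crc & 0xFF])

def read_block(block: bytes) -> bytes:
    """On read, regenerate the code and compare it to the recorded one;
    a mismatch means there has been an error."""
    data, stored = block[:-2], block[-2:]
    if crc16_ccitt(data) != int.from_bytes(stored, "big"):
        raise IOError("data error detected")  # hand off to error recovery
    return data
```

Note that a code like this only detects errors; the correction step, as the text observes, is a separate and much slower process applied to buffered data.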
Storage Control Level

Below this level, functions are typically implemented primarily in dedicated logic circuits and require more or less special purpose, dedicated, "real time" interfaces. At this level general purpose programmable devices largely take over. Some functions of this level may actually be implemented in the host computer, particularly in older systems. When the entire storage control function is outboard from the host computer, the storage device subsystem may be said to be "intelligent". Although there is still a fairly close performance coupling to the rotation of the storage device (slipped revolutions kill performance), these storage peripherals are themselves special purpose computers and can take their place on peer to peer computer interfaces, and even local networks, with other computers. Rotational performance considerations may, however, preclude the overhead of general purpose communications protocols when such devices are placed on LANs, and special purpose protocols may be required. The typical 16.7 ms rotation time of hard magnetic disk storage devices may simply be too fast for general purpose communications protocols. The latencies of wide area store and forward networks, as well as their typically low data rates, tend to preclude their use as storage interfaces at this level.

Address Mapping Function

It is useful to shield the host computer File Management Level from the need to know the detailed cylinder, track, and sector geometries of particular storage devices. Therefore some interfaces provide a mapping function: they present a contiguous linear address space of sectors for an entire volume to the higher layers, and convert this linear sector address to the required cylinder, head, and sector address. The higher layer then need only know the total storage capacity and logical block size of the device, and not its internal structure.
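The mapping function just described reduces to simple integer arithmetic, sketched below. The geometry figures used in the test values (9 heads, 17 sectors per track) are hypothetical examples, not a property of any particular drive.

```python
def to_physical(lba, heads, sectors_per_track):
    """Convert a linear sector address, as presented to the higher
    layers, into the (cylinder, head, sector) address the lower
    levels require."""
    cylinder, remainder = divmod(lba, heads * sectors_per_track)
    head, sector = divmod(remainder, sectors_per_track)
    return cylinder, head, sector

def to_linear(cylinder, head, sector, heads, sectors_per_track):
    """The inverse mapping: the linear block number seen by the
    File Management Level."""
    return (cylinder * heads + head) * sectors_per_track + sector
```

Because the mapping is invertible, the higher layer can treat the volume as one contiguous run of blocks while the Storage Control Level recovers the geometry on every access.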
Buffering Function

While a very small amount of buffering is inherent in the serialization/deserialization process of the Formatter, up to the point where entire sectors are buffered in randomly addressable memory, the Data Side interface functions are very tightly coupled to the rotation of the recording medium. Buffering the entire sector, or more, substantially decouples the functions from rotation. This decoupling is not complete; if the next successive sector to be accessed is not successfully buffered, a "slipped revolution" results, with severe performance penalties.

In older systems, this buffering first occurred in the main memory of the host computer system; this put severe constraints on the memory to I/O channel interface and on the ability of the host CPU to service I/O interrupts. In more modern systems, this sector buffering occurs in an intermediate controller. This controller may be imbedded in the drive itself, in a stand alone cabinet, or on a card in the cardframe of the host computer.

There is now a strong tendency to go from simple sector buffers to much larger "cache" buffers. Cache buffers are comparatively large semiconductor memory buffers used to speed apparent access to the rotating storage medium by holding recently accessed sectors in the buffer, and by various strategies for trying to anticipate the requests of the host and preloading the anticipated sectors into the buffer. For example, when sector N is read, a request for sector N+1 often follows. Obviously, the success of such schemes is very dependent on file structures, but it is easy to read a whole track into the buffer while obtaining the one sector specifically requested.

Error Recovery Function

Error detection is generally supported by dedicated feedback shift register hardware, capable of generating the ECC at the recording rate and participating in the real time recording stream.
Error recovery, which is much more time consuming, is generally performed by some sort of programmable computer operating on buffered read data. That is, the Error Detection Function operates in real time with recording, while the Error Recovery Function generally does not. If errors are infrequent, the recovery process can be time consuming yet have little effect on overall performance.

The ECC residues are used to compute a correction to data having "recoverable" errors after it is read. Maximum correction is somewhat in opposition to certain detection of uncorrectable errors: if the ECC is held constant, any Error Recovery Function trades increased recovery capability for an increase in the probability of a "false correction". The Error Recovery Function may also try rereading data sectors until the ECC checks correctly, making micro adjustments to the actuator and retrying the read in the hope of reading correctly, or reading many times and averaging the bit values to obtain the best possible data to which to apply correction. Data manipulation error recovery cannot occur below the level where the entire sector is buffered; with some interfaces this is in the host computer itself, while in others it is in the controller.

Volume Management Level

This level is traditionally implemented in the host computer, although today it is frequently moved to storage servers. Indeed, the notion of the host computer is becoming old fashioned; increasingly, the central element in many systems is the storage server, which may support many computers. Applications typically deal with files, and the file system traditionally views storage as a number of volumes. In most systems, there is considered to be one "on line" volume per actuator. The volume is conceptually the storage medium itself, and it may be removable; therefore not all volumes may be online at any one time.
Traditionally, a request for an offline volume has resulted in a message to an operator to mount the volume on some available drive. However, automated volume retrieval devices exist, and their use is growing. There is usually some level of formatting associated with the volume, in addition to the low level sector formatting, which involves the creation of labels, directories, and, in many cases, maps of sectors into files or available space. This may provide an additional level for mapping around defective media discovered after the "hard" format operation.

File Management Level

Applications deal with files. Computer operating systems traditionally provide some file structure. Files may be confined to a single volume or they may be allowed to span volumes. They may have complex interrelationships. Traditionally, all the details of the file systems have been implemented in host computers, and the storage subsystem has had no "knowledge" of them. Today, there is a trend toward implementing "file servers" or "data base machines," which take over all the details of the management of the storage system. At this level, the services to the Application Level are largely decoupled from storage rotation, so even wide area networks and general purpose communications protocols can sometimes be used in the interface, with acceptable performance. In many cases, the storage server or data base machine is simply a general purpose computer dedicated to this task; however, some database machines are specially designed for high performance query operations. When a storage subsystem provides an interface at this level, it is almost always attached to the actual rotating storage devices through identifiable standard interfaces at some lower level.

Interfaces and Levels

The Flexible Disk Interface and the common ST-506 de facto standard lie between the Physical Recording Level and the Coding Level.
Several recording codes may be used across these interfaces, yielding different storage densities for similar media. The ESDI, Rigid Disk, and SMD interfaces all lie between the Format Level and the Coding Level. Coding is determined in the drive, not the controller. In each of these cases, the host controller acts as the interface master and the device as the slave, but when actual data transfers take place the controller must synchronize itself to the rotation of the recording surface with a precision on the order of a few nanoseconds. These are very much real time interfaces.

The IPI-2 interface is between a controller implementing the Error Detection Level and a device including the Format Level. Again the controller is the master and the drive the slave, but when transfers take place, the controller marches in lockstep with the drive surface, with a precision on the order of a microsecond.

The original IBM I/O Channel interface lies between the Error Detection and Storage Control Levels. The channel runs between a host computer and a storage controller, and the host serves as bus master, but, as ever, the host must slave itself to the continuous rotation of the disk with a precision on the order of a microsecond. In addition, it must provide a bus turnaround response within gap times of about 80 microseconds. More recent adaptations of the interface have added buffering in the controller and significantly decoupled the interface from lockstep with rotation.

The SCSI and IPI-3 interfaces are between the Storage Control and Volume Management Levels. In the case of SCSI, computer and storage device are peers. In both cases, the only critical event is the overflow or underrun of controller buffers after positioning and rotation (causing slipped revolutions), and the transfer is not slaved to the precise data rate seen by the recording head. The transfer can be faster or slower than the recording rate.
In the fastest ordinary disk drives a rotation takes 16.7 milliseconds, and average positioning times are of the same general magnitude, so bus connection and command interpretation times on the order of a millisecond can be tolerated well. The LDDI and FDDI interfaces, with appropriate command protocols, are also suitable network interfaces for this level, as well as for higher level file and storage server devices.
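The timing figures quoted throughout this model follow from the spindle speed by simple arithmetic. The sketch below assumes the common 3600 rpm spindle, which yields the 16.7 millisecond rotation cited above; on average half a revolution passes before a desired sector reaches the head, so rotational latency is half the rotation time.

```python
def rotation_ms(rpm):
    """Time for one revolution, in milliseconds."""
    return 60_000.0 / rpm

def avg_latency_ms(rpm):
    """Average rotational latency: on average, half a revolution
    passes before the desired sector arrives under the head."""
    return rotation_ms(rpm) / 2.0
```

At 3600 rpm this gives a 16.7 ms rotation and an 8.3 ms average latency, which is why millisecond-scale bus connection and command interpretation overheads are tolerable at this level, while a slipped revolution costs a full extra 16.7 ms.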