X3T9.2/88-139

NOTE: This document was generated by William E. Burr of NIST (was NBS). The August working group requested that it be distributed to X3T9.2 as a possible basis for the Direct-access Device model. Bill gave me an ASCII version of his file, which I have reformatted for WordStar. The figures are not included in this version. -- John Lohmeyer

A Rotating Direct Access Storage Functional Reference Model

Introduction

This model provides a conceptual framework for describing the functions of direct access rotating storage, such as magnetic and optical disk drives, used in computer systems, for the purpose of describing where I/O interfaces lie in the hierarchy of functions. It is intended to be descriptive of the way in which such storage systems are conventionally attached to computer systems.

At the high end of this model there is always an application process running on a computer and requesting access to data. At the very bottom there is always some rotating recording surface containing bit serial tracks of information, read by some physical process (such as magnetization of ferromagnetic particles). Access involves positioning a transducer at the desired location along a planar path parallel to the rotating surface (normally by motion using a device generically called an actuator, or, less commonly, by selecting one of many fixed transducers) and then rotating the surface until the desired location passes under the transducer. Most commonly the rotating surface is a disk; in some cases, however, it is a cylinder.

The remaining functions may be split in many ways among the software of the host computer, the logic and programs of various intermediate controllers, and the logic or processors actually imbedded in the drive itself. Various interface standards facilitate this splitting of functions at different levels.
The model, shown in Figure 1, is split into two parallel sides: the Positioning Side, involved with positioning the transducer and locating specific blocks of data, and the Data Side, concerned with reading and writing that data. The Data Side may feed positioning information back to the Positioning Side.

Physical Recording Level

Actuator

The tracks of data may be either concentric circles or spirals. In either case some actuator, typically a stepper motor in open loop systems or a servo motor in closed loop systems, is used to position the transducer(s) on or over some data track. In magnetic disks a number of parallel surfaces are frequently attached to a single shaft, and the actuator moves a number of transducers, one or more for each surface, all at the same time. There may be more than one actuator per rotating shaft or spindle, in which case each actuator usually accesses nonoverlapping, concentric bands of data and is considered to be a separate logical storage device. Magnetic disks nearly always use a concentric circular track arrangement, while optical disks frequently use spiral tracks.

With multiple surfaces, the tracks are conventionally considered to be arranged as cylinders of tracks, one on each surface at the same distance from the axis of rotation, and positioning involves the selection of the particular transducer associated with the desired data location. Drives have been built which activate several transducers in parallel; they are not common. Some synchronizing signal, detected either by the data transducer or by some separate fixed transducer, is used to find the "index" point on the track, and data is then often located by rotational position from that point.

Recording

Two types of recording are in common use: magnetic and optical.
In magnetic recording, the transducer must be in close proximity to the recording surface, either in contact, as in flexible disk devices, or "flying" on a thin air bearing above the recording surface, as in most hard disks. Optical drives can locate the transducer at some distance from the surface. In magnetic recording an electric current in the transducer, called a "head," is used to magnetize material in the recording surface. In writable optical drives a laser beam is used to produce some change in the state of the surface, often a bubble or pit. In CD ROM drives a stamping process is used to produce pits in the surface.

When reading magnetic drives, a change in the direction of magnetization, that is, a flux transition, induces a signal in the head. An optical surface is read by shining a laser beam, less intense than that required to write, on the surface. In general, what is detected in either case is a state transition, whether of the direction of magnetization or of the optical characteristics of the surface. The surface is considered to be divided into a number of equally sized "bit cells"; a transition detected in a bit cell is usually considered to be a code one, while no transition is a zero.

Coding Level

While the Physical Recording Level is concerned primarily with positioning the transducer and using it to produce or sense recording medium state transitions, the Coding Level converts data to the required states, and state transitions back to data, as well as, in some cases, using signals in the recording surface to adjust the position of the actuator.

Position Control Function

Open loop actuators require no feedback from the recording surface itself for positioning; however, the track to track densities and positioning times which can be achieved with open loop systems are limited. Every shaft wobbles a little as it spins. Materials in the recording surface and the actuator expand and contract with changes in temperature.
There are limits to how precisely a stepper motor can position itself. Closed loop systems use a signal on the recording surface itself, or on a separate reference surface, or both, to generate a feedback signal for a servo motor actuator and actively follow the track as the surface rotates. Much higher track to track densities are possible with closed loop positioners.

Position control signals between the Position Control and Actuator functions in open loop systems simply consist of a series of plus or minus step pulses, one for each track to be moved in or out. In closed loop systems, a continuous plus or minus current is supplied to the servo motor until the transducer is properly positioned, and whenever any deviation from the track is sensed. The input to the Position Control function from the higher mapping layers is typically the binary number of the desired cylinder position.

Data Separator Function

The bits formed by the transitions on the recording surface are not themselves data bits; they are code bits. At high recording densities, it is impossible to maintain a sufficiently precise time base to tell whether there are precisely 1000 zeroes in a row, or 1001. Therefore a clock signal must be combined with the data signal. Moreover, the formatter will require certain "out of band" signals to be used as delimiters to mark the start of blocks on the tracks. The Data Separator, sometimes called the encoder/decoder, converts data bits to code bits and vice versa. The codes employed are called "self clocking" codes; they achieve self clocking, in general, by limiting the number of consecutive code zeros which are allowed to some small number. Early self clocking codes were quite inefficient, requiring two flux transitions per data bit. The fundamental limit on recording densities is caused by physical phenomena which limit how close together on the medium transitions can physically occur.
Yet the bandwidth of the transducer and the read/write channel can often resolve significantly higher frequencies than could be produced by placing transitions as close together as the medium and transducers allow. Therefore a class of codes generally called Run Length Limited (RLL) codes has been developed which guarantees that the transitions are no closer together than allowed, but which commonly achieves information storage densities as high as 1.5 data bits per minimum flux transition distance. These codes do this by maintaining a code clock with a period less than the minimum physically possible time interval between transitions, and by guaranteeing that at least one (or more, depending upon the code) code zero (no transition) follows each code one (transition). This requires very precise, stable clock recovery circuitry, high bandwidth read/write channels, and quite complex, often adaptive, encoder-decoder circuits, but results in dramatic increases in storage capacity.

The interface between the Coding and Physical Recording Levels carries code bits which represent medium state transitions. The interface between the Format and Coding Levels carries uncoded data, clock, and out of band synchronizing signals which mark the start of separate blocks.

Format Level

Data are stored on the recording surface in separate blocks or sectors which can be written independently. A "gap" is required between sectors to permit independent write operations. Figure 2 illustrates the format of a typical sector, which in fact consists of two independently written blocks: a header block and a data block. Each block begins with a sync field, which is used to lock the recovered clock to the clock used to record. This is followed by a special starting delimiter code, which marks the start of the block.
The header block typically contains an identifier field, specifically identifying the block address; a status field, identifying whether the sector is good or the medium contains known flaws in the sector and should not be used; an error detecting code to ensure that the block was read correctly; and a trailer field, which is used to provide a continuing read clock for the decoder for a short period following the end of the block. The header block may also contain fields used to indicate the address of a replacement block, should the data block contain a flaw. In some variable block systems, the header block also contains a length field, giving the length of the following data block.

The data block is preceded by the gap separating it from the header, and begins with a sync field and a starting delimiter. Then comes the actual user data field, typically followed by an error correcting code (ECC) and a trailer.

The surface of magnetic disks is carefully tested, and defective sectors are flagged in the header and generally are not used. However, defects may be missed or may occur at later times due to chemical changes, contamination, or physical damage caused by contact of the head with the medium (sometimes called "head crashes"). Thus errors may occur even after a block has been written and successfully read. Sophisticated error correcting codes are written with the data in an attempt to overcome this problem. They can be used to detect errors in the recorded data and, if the errors are not too extensive, to correct them.

Some format data may be written at the factory (particularly for optical media). Often there is a "hard" format process which uses the Formatter to write the necessary header blocks for the entire device. Subsequent application level data access does not ordinarily change this header information.
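The sector format just described can be sketched in program form. The field names and byte counts below are purely illustrative assumptions, chosen only to show the ordering of the fields within the two independently written blocks; they are not taken from any particular drive or standard.

```python
# Illustrative sketch of a sector: a header block and a data block,
# each independently written with its own sync field and starting
# delimiter.  All sizes are hypothetical examples.

HEADER_FIELDS = [
    ("sync", 12),        # locks the recovered clock to the recording clock
    ("delimiter", 1),    # out-of-band mark for the start of the block
    ("id", 4),           # cylinder/head/sector address of this sector
    ("status", 1),       # good / defective / reassigned flags
    ("crc", 2),          # error detecting code over the header
    ("trailer", 1),      # keeps the read clock running past block end
]

DATA_FIELDS = [
    ("sync", 12),
    ("delimiter", 1),
    ("data", 512),       # the user data field
    ("ecc", 6),          # error correcting code over the data
    ("trailer", 1),
]

def sector_length(gap=15):
    """Total bytes in one sector: header block, inter-block gap
    (permitting independent writes), and data block."""
    header = sum(size for _, size in HEADER_FIELDS)
    data = sum(size for _, size in DATA_FIELDS)
    return header + gap + data
```

With these example sizes, each 512 byte user data field costs a further 56 bytes of format overhead, which is why larger data fields yield higher usable capacity from the same surface.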
There is also sometimes a file level "soft" format operation in some systems which imposes some further logical file structure on the data sectors, but does not change the headers. One must be particularly careful in performing a hard format operation to preserve surface defect information; this is a dangerous operation for the uninitiated. Soft formats may destroy user information, but rarely render the device unusable. Soft format operations may also generate another level of defect mapping in the file system.

Note also that logically successive sectors are not necessarily physically successive. This mapping of logically successive sectors, called interleaving, usually takes place at hard format time and can have a dramatic effect on system performance.

The Format Level interface to the Coding Level passes data bits, synchronizing signals, and clock signals. It operates in exact lock step with the rotation of the medium surface. The Formatter may convert the bit serial data of the Coding Level to byte or word parallel data, including a serializer/deserializer function (often called a SERDES). If so, this implies a small amount of buffering in the Format Level, and a slight decoupling of the interface to the Storage Control Level from the rotation of the medium. Still, at this point the two are fairly tightly bound, and this is still very much a "real time" interface. The data transferred between the Storage Control and Format Levels, however, has been broken into identified fields as described above.

Primary Error Handling Level

At this level, recording errors are detected and sometimes corrected, and known defective sectors are remapped to good media.

Defect Mapping Function

Storage media contain defective spots or areas which cannot be read or written properly. In magnetic media these are determined by extensive testing of the medium, and bad spots can generally be located.
Magnetic disk drives generally come from the factory with a list of known bad spots, generally expressed as cylinder, head, and bits from index. New bad spots may later occur due to chemical changes, contamination, or head crashes. In most magnetic disk systems a status field is included in the header of every sector to flag sectors containing known defects. Some strategy is then used to reassign the sector to some reserve or "spare" sectors. There may also be a header block for the entire track, often called "R0". This may flag defective tracks or specific bad spots on the track.

A wide variety of strategies is used to remap defective sectors to good locations or to avoid bad spots. There may be a spare sector or sectors on the same track, there may be separate cylinders assigned to spares, or there may be some combination. The header may identify the specific location of the reassigned sector, or it may simply flag the sector as defective, and some automatic algorithm may then be used to locate the replacement sector. Whatever method is used, when an application tries to access a defective sector, the Error Detection Function detects the error flag and the Defect Mapping Function finds the replacement sector.

The problem is more difficult with write once optical media, since the medium cannot be fully tested until after data is written upon it. Optical recording generates much higher "raw" error rates than are tolerated in magnetic recording. Powerful error correcting codes are recorded with the data. Some level of correctable errors is generally tolerated. Nevertheless, for safety, it is generally necessary to read the data immediately after it is written, to determine whether it was satisfactorily recorded. Some optical drives can do this on the same rotation as the write, while others require a second rotation (magnetic drives almost never automatically check data after a write).
Some ordinarily unwritten field in the data must then be used to flag a sector which is found to be defective, and it is reassigned to another location on the medium.

Error Detection Function

When data is recorded, powerful error detecting and error correcting codes are generated by circuitry in this function and recorded with the data. When the data is read, the codes are regenerated in similar fashion, and the regenerated ECC is effectively compared to the recorded code. If they are not the same, there has been an error. Most of these ECCs are devised so that they can be used to correct some errors which they detect. If there are enough errors in a sector, there is a very small possibility that a really major error will be undetected. With the codes used in magnetic disks, the calculation of the correction, in the rare event of an error, is usually a time consuming, higher level software process.

Error correction is a particular concern with write once optical media. There is no way to fully test a write once surface without destroying it. Moreover, raw channel error rates for optical media are much higher than for magnetic media. Therefore very elaborate, extensive, interleaved error correction facilities, which can correct many errors in the same block, are routine in optical devices. Some codes may be used which facilitate real time correction of errors in hardware, as the data is read (Hamming codes are the simplest example). A second level of software error correction may also be applied. A substantial fraction of the total recorded bits on the recording medium are typically devoted to error correction. This is required because of the high code bit error rates. Nonetheless, the code bit areal densities which can be achieved with optical recording are so high that the achievable data bit densities after correction are still much higher than for magnetic media, and the error rates, after correction, are acceptable.
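The generate-and-compare operation described above can be illustrated with a simple cyclic redundancy check. Real drives use far more powerful codes (Fire and Reed-Solomon type codes capable of correction), so the CRC-16/CCITT polynomial below is a simplified stand-in showing only the detection mechanism: the same code is regenerated on read and compared with the recorded one.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT (polynomial 0x1021, initial value 0xFFFF):
    the same kind of shift-register calculation a drive's feedback
    shift register hardware performs at the recording rate."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def write_block(data: bytes) -> bytes:
    """On write, generate the check code and record it with the data."""
    crc = crc16_ccitt(data)
    return data + bytes([crc >> 8, crc & 0xFF])

def read_block(block: bytes) -> bytes:
    """On read, regenerate the code and compare it to the recorded one;
    a mismatch means there has been an error."""
    data, stored = block[:-2], block[-2:]
    if crc16_ccitt(data) != int.from_bytes(stored, "big"):
        raise IOError("data error detected")  # hand off to error recovery
    return data
```

Note that a code like this only detects errors; the correction step, as the text observes, is a separate and much slower process applied to buffered data.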
Storage Control Level

Below this level, functions are typically implemented primarily in dedicated logic circuits and require more or less special purpose, dedicated, "real time" interfaces. At this level general purpose programmable devices largely take over. Some functions of this level may actually be implemented in the host computer, particularly in older systems. When the entire storage control function is outboard from the host computer, the storage device subsystem may be said to be "intelligent". Although there is still a fairly close performance coupling to the rotation of the storage device (slipped revolutions kill performance), these storage peripherals are themselves special purpose computers and can take their place on peer to peer computer interfaces, and even local networks, with other computers. Rotational performance considerations may, however, preclude the overhead of general purpose communications protocols when such devices are placed on LANs, and special purpose protocols may be required. The typical 16.7 ms rotation time of hard magnetic disk storage devices may simply be too fast for general purpose communications protocols. The latencies of wide area store and forward networks, as well as their typically low data rates, tend to preclude their use as storage interfaces at this level.

Address Mapping Function

It is useful to shield the host computer File Management Level from the need to know the detailed cylinder, track, and sector geometries of particular storage devices. Therefore some interfaces provide a mapping function: they present a contiguous linear address space of sectors for an entire volume to the higher layers, and convert this linear sector address to the required cylinder, head, and sector address. The higher layer then need only know the total storage capacity and logical block size of the device, and not its internal structure.
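The mapping function just described reduces to simple integer arithmetic, sketched below. The geometry figures used in the test values (9 heads, 17 sectors per track) are hypothetical examples, not a property of any particular drive.

```python
def to_physical(lba, heads, sectors_per_track):
    """Convert a linear sector address, as presented to the higher
    layers, into the (cylinder, head, sector) address the lower
    levels require."""
    cylinder, remainder = divmod(lba, heads * sectors_per_track)
    head, sector = divmod(remainder, sectors_per_track)
    return cylinder, head, sector

def to_linear(cylinder, head, sector, heads, sectors_per_track):
    """The inverse mapping: the linear block number seen by the
    File Management Level."""
    return (cylinder * heads + head) * sectors_per_track + sector
```

Because the mapping is invertible, the higher layer can treat the volume as one contiguous run of blocks while the Storage Control Level recovers the geometry on every access.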
Buffering Function

While a very small amount of buffering is inherent in the serialization/deserialization process of the Formatter, up to the point where entire sectors are buffered in randomly addressable memory, the Data Side interface functions are very tightly coupled to the rotation of the recording medium. Buffering the entire sector, or more, substantially decouples the functions from rotation. This decoupling is not complete; if the next successive sector to be accessed is not successfully buffered, a "slipped revolution" results, with severe performance penalties.

In older systems, this buffering first occurred in the main memory of the host computer system; this put severe constraints on the memory to I/O channel interface and on the ability of the host CPU to service I/O interrupts. In more modern systems, this sector buffering occurs in an intermediate controller. This controller may be imbedded in the drive itself, in a stand alone cabinet, or on a card in the cardframe of the host computer.

There is now a strong tendency to go from simple sector buffers to much larger "cache" buffers. Cache buffers are comparatively large semiconductor memory buffers used to speed apparent access to the rotating storage medium by holding recently accessed sectors in the buffer, and by various strategies for trying to anticipate the requests of the host and preloading the anticipated sectors into the buffer. For example, when sector N is read, a request for sector N+1 often follows. Obviously, the success of such schemes is very dependent on file structures, but it is easy to read a whole track into the buffer while obtaining the one sector specifically requested.

Error Recovery Function

Error detection is generally supported by dedicated feedback shift register hardware, capable of generating the ECC at the recording rate and participating in the real time recording stream.
Error recovery, which is much more time consuming, is generally performed by some sort of programmable computer operating on buffered read data. That is, the Error Detection Function operates in real time with recording, while the Error Recovery Function generally does not. If errors are infrequent, the recovery process can be time consuming yet have little effect on overall performance.

The ECC residues are used to compute a correction to data having "recoverable" errors after it is read. Maximum correction is somewhat in opposition to certain detection of uncorrectable errors: if the ECC is held constant, any Error Recovery Function trades increased recovery capability for an increase in the probability of a "false correction". The Error Recovery Function may also try rereading data sectors until the ECC checks correctly, making micro adjustments to the actuator and retrying the read in the hope of reading correctly, or reading many times and averaging the bit values to obtain the best possible data to which to apply correction. Data manipulation error recovery cannot occur below the level where the entire sector is buffered; with some interfaces this is in the host computer itself, while in others it is in the controller.

Volume Management Level

This level is traditionally implemented in the host computer, although today it is frequently moved to storage servers. Indeed, the notion of the host computer is becoming old fashioned; increasingly, the central element in many systems is the storage server, which may support many computers. Applications typically deal with files, and the file system traditionally views storage as a number of volumes. In most systems, there is considered to be one "on line" volume per actuator. The volume is conceptually the storage medium itself, and it may be removable; therefore not all volumes may be online at any one time.
Traditionally, a request for an offline volume has resulted in a message to an operator to mount the volume on some available drive. However, automated volume retrieval devices exist, and their use is growing. There is usually some level of formatting associated with the volume, in addition to the low level sector formatting, which involves the creation of labels, directories, and, in many cases, maps of sectors into files or available space. This may provide an additional level for mapping around defective media discovered after the "hard" format operation.

File Management Level

Applications deal with files. Computer operating systems traditionally provide some file structure. Files may be confined to a single volume or they may be allowed to span volumes. They may have complex interrelationships. Traditionally, all the details of the file systems have been implemented in host computers, and the storage subsystem has had no "knowledge" of them. Today, there is a trend toward implementing "file servers" or "data base machines," which take over all the details of the management of the storage system. At this level, the services to the Application Level are largely decoupled from storage rotation, so even wide area networks and general purpose communications protocols can sometimes be used in the interface, with acceptable performance. In many cases, the storage server or data base machine is simply a general purpose computer dedicated to this task; however, some database machines are specially designed for high performance query operations. When a storage subsystem provides an interface at this level, it is almost always attached to the actual rotating storage devices through identifiable standard interfaces at some lower level.

Interfaces and Levels

The Flexible Disk Interface and the common ST-506 de facto standard lie between the Physical Recording Level and the Coding Level.
Several recording codes may be used across these interfaces, yielding different storage densities for similar media. The ESDI, Rigid Disk, and SMD interfaces all lie between the Format Level and the Coding Level. Coding is determined in the drive, not the controller. In each of these cases, the host controller acts as the interface master and the device as the slave, but when actual data transfers take place the controller must synchronize itself to the rotation of the recording surface with a precision on the order of a few nanoseconds. These are very much real time interfaces.

The IPI-2 interface is between a controller implementing the Error Detection Level and a device including the Format Level. Again the controller is the master and the drive the slave, but when transfers take place, the controller marches in lockstep with the drive surface, with a precision on the order of a microsecond.

The original IBM I/O Channel interface lies between the Error Detection and Storage Control Levels. The channel runs between a host computer and a storage controller, and the host serves as bus master, but, as ever, the host must slave itself to the continuous rotation of the disk with a precision on the order of a microsecond. In addition, it must provide a bus turnaround response within gap times of about 80 microseconds. More recent adaptations of the interface have added buffering in the controller and significantly decoupled the interface from lockstep with rotation.

The SCSI and IPI-3 interfaces are between the Storage Control and Volume Management Levels. In the case of SCSI, computer and storage device are peers. In both cases, the only critical event is the overflow or underrun of controller buffers after positioning and rotation (causing slipped revolutions), and the transfer is not slaved to the precise data rate seen by the recording head. The transfer can be faster or slower than the recording rate.
In the fastest ordinary disk drives a rotation takes 16.7 milliseconds, and average positioning times are of the same general magnitude, so bus connection and command interpretation times on the order of a millisecond can be tolerated well. The LDDI and FDDI interfaces, with appropriate command protocols, are also suitable network interfaces for this level, as well as for higher level file and storage server devices.
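The timing figures quoted throughout this model follow from the spindle speed by simple arithmetic. The sketch below assumes the common 3600 rpm spindle, which yields the 16.7 millisecond rotation cited above; on average half a revolution passes before a desired sector reaches the head, so rotational latency is half the rotation time.

```python
def rotation_ms(rpm):
    """Time for one revolution, in milliseconds."""
    return 60_000.0 / rpm

def avg_latency_ms(rpm):
    """Average rotational latency: on average, half a revolution
    passes before the desired sector arrives under the head."""
    return rotation_ms(rpm) / 2.0
```

At 3600 rpm this gives a 16.7 ms rotation and an 8.3 ms average latency, which is why millisecond-scale bus connection and command interpretation overheads are tolerable at this level, while a slipped revolution costs a full extra 16.7 ms.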