From:	James McGrath				X3T9.2/91-24  Rev 0
	Quantum Corporation
	580 Cottonwood Ave
	Milpitas, CA 95035

To:	John Lohmeyer
	Chairman, X3T9.2
	3718 N. Rock Road
	Wichita, KS 67226

Date:	February 18, 1991

Subject: Requirements for Implementing Tagged Command Queuing




This document provides a model for a tagged command queuing
environment detailed enough to enable target and initiator
manufacturers to implement a system with well defined data integrity
characteristics.  It does not attempt to restrict implementation
options beyond that necessary to achieve this limited goal.
Specifically, it does not provide a comprehensive model of command
queuing detailed enough to enable someone to provide a "standard
implementation" of command queuing.  My belief is that such an
effort would be premature and would limit the value the market can
provide to customers.

Although this model can be easily adapted to any device type, the
direct access disk drive is used to provide a concrete example.  This
is the first draft, and comments are appreciated.



Definition of the State of a SCSI Device

    The state of any SCSI device must contain enough information to
    predict both what data will be returned to the user and the
    manner in which the SCSI bus operates during the execution of
    any possible set of SCSI commands.  Such items as the mode 
    parameters, format parameters, and mapping of user data to
    logical blocks are included in the state.  State information is
    not sufficient to predict all the access time characteristics of
    the device.  Specifically, the contents of any read cached data
    need not be contained within the state.

    The key criterion for data integrity is that the initiator's
    understanding of the device state matches the actual device
    state.  When these differ, the device is said to be in a
    non-synchronized state.  When these match, the device is said to
    be in a synchronized state.  Points in time that mark the
    transition from synchronized to non-synchronized are called
    non-synchronization points.  Points in time that mark the
    transition from non-synchronized to synchronized are called
    synchronization points.

    Please note that the term synchronization is used here in a
    completely different manner than in "synchronous data transfer,"
    or "synchronized spindles."  It is only weakly related to
    "synchronize cache."

    The relation to data integrity is simple.  If the device is
    synchronized, then anything can be done to it (e.g. reset, power
    cycle, error condition) without the user losing data it believes
    has been stored on the media.  If the device is in a
    non-synchronized state, then the user could lose data under at
    least one condition (e.g. reset, power cycle, error condition).
    Since loss of power is typically the most severe condition
    a device can be subjected to, it will be used as the dominant
    scenario for loss of data integrity. 




A Simple Example of a Synchronized and Non-Synchronized Device

    As a concrete example, take a system with a single initiator and
    single target using SCSI-1.  Assume the target buffers data for
    speed matching, but does not implement caching.

    In this instance the device is synchronized when no I/O process
    is outstanding at the device (i.e. the device is idle).  The
    initiator has, in theory, enough information (i.e. the entire
    stream of all past commands to the target) to accurately predict
    exactly what data will be returned by any subsequent READ
    command.  It also has enough data to anticipate how the SCSI bus
    will operate so that data transfer can be done correctly (e.g. it
    knows whether synchronous data transfer has been negotiated).

    If the device lost power while synchronized, then the data that
    could be retrieved upon application of power would match the
    expectations of the user.  Data integrity is 100%.

    However, the device is not synchronized while it is executing a
    WRITE command.  During command execution the initiator does not
    have enough information to tell what has been written to the
    media until the final command status is transferred from the
    device.  If power is lost while the write command is in progress,
    then the user data mapped to the logical blocks being written can
    be any data at all (the old data, the new data, a mixture of the
    two, or unreadable data).  Data integrity is not 100%.
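
    The case analysis above can be summarized in a short Python
    sketch.  It is illustrative only and not part of this proposal:
    the names Phase, BlockOutcome, possible_outcomes, and
    is_synchronized are hypothetical, and the model assumes a single
    initiator, a single WRITE, a target without write caching, and a
    GOOD final status.

```python
from enum import Enum


class Phase(Enum):
    IDLE = 1        # no WRITE has been sent
    EXECUTING = 2   # WRITE sent, final status not yet received
    COMPLETED = 3   # final (GOOD) status received


class BlockOutcome(Enum):
    OLD = "old data"
    NEW = "new data"
    MIXED = "mixture of old and new data"
    UNREADABLE = "unreadable data"


def possible_outcomes(phase):
    """Contents the initiator must allow for in the target blocks if
    power is lost while the WRITE is in the given phase."""
    if phase is Phase.IDLE:
        return {BlockOutcome.OLD}
    if phase is Phase.COMPLETED:
        return {BlockOutcome.NEW}
    # While the WRITE executes, any of the four outcomes is possible.
    return set(BlockOutcome)


def is_synchronized(phase):
    """Synchronized means the initiator can predict the block contents."""
    return len(possible_outcomes(phase)) == 1
```

    In this model the device is synchronized in the IDLE and
    COMPLETED phases and non-synchronized in the EXECUTING phase.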

    This limitation has always been present in disk devices that use
    disk controllers with data buffering capabilities.  System
    designers have compensated for it in a variety of ways.  One
    technique is to ensure that the drive can never be halted (e.g.
    lose power) in a non-synchronized state.  This has rarely
    been implemented due to the extremely high cost of guarding
    against all possible failure modes.

    Instead, file systems have been designed so that the last data
    written to the drive can always be lost without violating the
    integrity of the file system.  While the user data within those
    blocks can be lost, all other data can still be accessed.  Note
    that with the exception of these blocks, the system would not be
    able to distinguish between losing power before or during the
    last command.  Of course, the user who input the data would
    still detect its absence.




Synchronization and Queuing

    The introduction of command queuing (either multiple initiator
    non-tagged or tagged command queuing) increases the amount of
    time the device spends in a non-synchronized state.  Once again,
    when the device is in the idle state (no commands outstanding at
    the device), the device is synchronized.  Whenever any I/O
    process is outstanding at the device, the device may be in a
    non-synchronized state.

    The initiator does in practice have some control over whether the
    device is synchronized when multiple I/O processes are outstanding
    at the device.  If no command bytes have yet been transferred for
    an I/O process, then the target could not have changed the drive
    state in response to the command.  Thus "queuing" multiple I/O
    processes, but only allowing the SCSI bus transaction to proceed
    past the initial nexus identification for one command at a time,
    would reduce the data integrity risks to the same ones as in a
    non-queuing environment.  Alas, this conservative implementation
    also eliminates most of the benefits of command queuing.

    The next level of data integrity risk would be the transferring
    of all command data for each I/O process.  If the I/O processes
    are either untagged or unordered and tagged, then the drive could
    reorder the commands in a manner that would result in large
    periods of non-synchronization.  Imagine the execution of a
    queued FORMAT UNIT and a queued MODE SELECT command.

    In practice the greatest elements of risk can be eliminated by
    either the target or the initiator.  The target can refuse to
    overlap or reorder the device state transitions of any commands
    other than READ, WRITE, and WRITE AND VERIFY.  While this may be
    a wise implementation strategy, nothing in the standard limits
    the target to such an implementation.

    The initiator can also limit the target by not allowing the queuing
    of command sequences that contain any command other than READ,
    WRITE, and WRITE AND VERIFY ordered tagged commands.  Note that
    the tagged queuing protocol can still be used - the initiator
    would simply hold off sending any command until the current non
    READ/WRITE/WRITE AND VERIFY command finishes executing.
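
    One way an initiator might apply this restriction is sketched
    below in Python.  This is a hedged illustration, not a required
    implementation: the class GatedInitiator and its methods are
    hypothetical, commands are represented as plain strings, and the
    sketch additionally assumes that a command outside the
    READ/WRITE/WRITE AND VERIFY set is only sent once the device is
    idle.

```python
from collections import deque

# Commands the initiator is willing to have queued together.
QUEUEABLE = {"READ", "WRITE", "WRITE AND VERIFY"}


class GatedInitiator:
    def __init__(self):
        self.pending = deque()   # accepted from the host, not yet sent
        self.outstanding = []    # sent to the target, no status yet

    def submit(self, command):
        """Accept a command and return the commands that may be sent now."""
        self.pending.append(command)
        return self.dispatch()

    def complete(self, command):
        """Record returned status and return the commands released by it."""
        self.outstanding.remove(command)
        return self.dispatch()

    def dispatch(self):
        sent = []
        while self.pending:
            nxt = self.pending[0]
            # Hold everything while a non-queueable command is executing.
            if any(c not in QUEUEABLE for c in self.outstanding):
                break
            # Assumption: a non-queueable command waits for an idle device.
            if nxt not in QUEUEABLE and self.outstanding:
                break
            self.outstanding.append(self.pending.popleft())
            sent.append(nxt)
        return sent
```

    With this policy a FORMAT UNIT submitted behind a READ and a
    WRITE is held until both have returned status, and a READ
    submitted behind the FORMAT UNIT is held until the FORMAT UNIT
    completes.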

    Note that it is device state transition, not command execution,
    that is important.  READ commands typically do not result in 
    device state transitions.  Thus read data for ordered tagged READ
    commands can be gathered from the media in any order, since
    without state transitions the initiator could never tell the
    difference between any particular ordering the target would
    impose.

    For WRITE commands the device state is always in transition.
    Thus executing a sequence of ordered tagged WRITE commands must 
    result in the same set of state transitions as the execution of
    the same sequence of WRITE commands in a non-queuing
    environment.  Since the reception of data from the host does not
    result in a state transition, the target can request data for the
    WRITE commands in any order.  In this respect it is similar to
    the previous example involving READ commands.

    Note that the above analysis indicates that it is always proper
    for the target to request or send data for any command.  This is
    true regardless of whether the command is an ordered, unordered,
    or head of queue tagged command.

    More generally, within a sequence of commands containing the
    ordered command X, the state transitions associated with all
    commands prior to X (..., X-2, X-1) must all be completed before
    any state transition for command X can take place.  In addition,
    the status for all commands prior to X must be returned to the
    initiator before the status for command X. 

    Within a sequence of unordered commands the state transitions
    associated with X may proceed in any order with respect to the
    state transitions for prior commands.  Note that the device
    can process state transitions in any order at all.  Since
    each command can involve multiple state transitions, multiple
    commands can be partially executed at any given time.  Status
    for commands can be returned in any order, but status for command
    X can only be returned after all of the state transitions for 
    command X have taken place.

    A head of queue command should be executed atomically (all of its
    state transitions and return status completed before performing
    any state transitions or returning status for any other command).
    Once the device recognizes that a head of queue command has been
    sent, then it shall halt state transitions for other commands as
    quickly as practical and shall not return status for any other
    command until after status is returned for the head of queue command.
    Note that only the return of status is directly visible to the
    initiator.  The initiator cannot make any assumptions regarding
    the order with respect to the head of queue command of state
    transitions made for other commands.




Tagged Queuing Examples


    The following time line illustrates some of the above principles.
    The abbreviations used are:

        CR[i] = command reception (getting the command bytes) for command i
        ST[i] = state transitions for command i
        CT[i] = command termination (sending status) for command i
        R = read command
        W = write command
        C = other command (e.g. FORMAT UNIT, TEST UNIT READY)
        O = ordered command
        U = unordered command

    Suppose the sequence of commands sent by the initiator is:

    command number:   0,  1,  2,  3,  4,  5,  6,  7,  8,  9
    command:         UR, UR, OC, OW, OW, OW, UW, UR, UW, OR

    Then the following precedence graphs delineate the points in time
    at which the three events associated with each command - CR, ST,
    and CT - can take place:


    for i = 0 to 8: CR[i] -> CR[i+1] (from definition of command numbering)

    for i = 0 to 9: CR[i] -> ST[i] -> CT[i] (always true for all commands)


    ST[0] --\                                     /-> ST[6] --\
              ST[2] --> ST[3] --> ST[4] --> ST[5] --> ST[7] --> ST[9]
    ST[1] --/                                     \-> ST[8] --/


    CT[0] --\                                     /-> CT[6] --\
              CT[2] --> CT[3] --> CT[4] --> CT[5] --> CT[7] --> CT[9]
    CT[1] --/                                     \-> CT[8] --/


    These relationships reveal more about what is allowed than what is
    not allowed.  In particular, all commands could be received around
    time T[0] and terminate around time T[N].  During the intervening
    time the only restriction is on the state transitions.  There are
    no restrictions between the execution of bus sequences for the
    various commands, nor between the bus sequences and the state
    transitions.  The only source of additional constraint is the
    need for state transitions to be preceded by some bus sequencing
    (i.e. you have to get the write data from the initiator before
    writing it to the media).
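
    The precedence graphs above can be derived mechanically from the
    command sequence.  The following Python sketch is one way to do
    so; it is illustrative only, the function names are hypothetical,
    and it covers only ordered and unordered commands (head of queue
    commands are not modeled).  It treats each ordered command as a
    barrier, which is the rule the graphs above follow.

```python
def precedence_edges(commands):
    """Return the (earlier, later) pairs that constrain state transitions.

    Each command is a string such as "UR" or "OW"; only the first
    letter matters: "O" for ordered, "U" for unordered.
    """
    edges = []
    last_ordered = None      # index of the most recent ordered command
    unordered_since = []     # unordered commands received after it
    for i, command in enumerate(commands):
        if command[0] == "O":
            # An ordered command waits for everything received before it.
            if unordered_since:
                edges += [(j, i) for j in unordered_since]
            elif last_ordered is not None:
                edges.append((last_ordered, i))
            last_ordered, unordered_since = i, []
        else:
            # An unordered command waits only for the last ordered one.
            if last_ordered is not None:
                edges.append((last_ordered, i))
            unordered_since.append(i)
    return edges


def respects(edges, order):
    """True if the sequence `order` of command numbers satisfies every edge."""
    position = {command: k for k, command in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)


SEQUENCE = ["UR", "UR", "OC", "OW", "OW", "OW", "UW", "UR", "UW", "OR"]
EDGES = precedence_edges(SEQUENCE)
```

    For the example sequence, EDGES contains exactly the arrows drawn
    in the ST graph (and, equally, the CT graph).  The helper
    respects() can then check a proposed order of state transitions:
    1, 0, 2, 3, 4, 5, 8, 6, 7, 9 is permitted, while any order that
    places command 2 before command 1 is not.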

    Throughout this entire period of time the device is in a
    non-synchronized state.  The initiator can limit the
    non-synchronized times only by throttling data to the target
    (i.e.  without write data the corresponding state transitions
    cannot take place).  This can be accomplished by using the
    DISCONNECT message when the target requests data.




Data Integrity Implications

    If power is lost while multiple I/O Processes are outstanding at
    the device, then the device is left in a non-synchronized state.
    If the commands were unordered, then the state transitions forced
    by the commands could have occurred in any order.  If the
    commands were ordered, then the state transitions can only occur
    in a certain order, but the initiator will not know how many
    transitions had occurred.

    Note that if ordered commands are used, then the data integrity
    liability is almost the same as that encountered in the simple
    non command queuing case.  The only additional complication is
    that for N queued commands, the last I commands (I = 0 to N) may
    not have been written to the disk rather than simply the last
    command.  Fortunately, operating systems have already solved this
    problem.  Since all operating systems (even simple ones like DOS)
    employ some form of I/O buffering, system software already must
    safeguard against the inability to execute the last set of N
    commands.  So using ordered command queuing should introduce no
    greater data integrity problem than system designers already
    experience in a non command queuing environment.

    If a condition occurs that halts the execution of a stream of
    ordered commands (e.g. device power loss, device error, bus
    reset), then an initiator recovery strategy that simply retries
    every I/O process outstanding at the device at the time of
    failure would always provide adequate recovery.  For added
    safety, the initiator might force these I/O processes to be
    executed one at a time.  This can be done by simply refusing to
    initiate an I/O process (i.e.  perform the SELECTION) until the
    device is idle.  This will eliminate the possibility that it was
    the act of queuing the commands that generated the initial
    failure.

    Unordered command queuing introduces a new dimension of potential
    problems.  State transitions could occur in any order.  While
    this is fine if all the state transitions are allowed to occur, a
    failure will leave the device in a state where any of the
    transitions may have occurred in any order.  Current operating
    systems have limited tolerance for devices left in this state.
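
    As a rough illustration of the difference between the ordered and
    unordered cases, the following Python sketch enumerates which of
    N queued commands may have taken effect at the moment of failure.
    It is a sketch under simplifying assumptions and not part of this
    proposal: it treats the state transitions of each command as
    atomic, which the model above does not require, and the function
    names are hypothetical.

```python
from itertools import combinations


def ordered_outcomes(n):
    """With ordered commands, the completed commands form a prefix of
    the queue: the first k commands, for some k from 0 to n."""
    return [frozenset(range(k)) for k in range(n + 1)]


def unordered_outcomes(n):
    """With unordered commands, any subset of the queued commands may
    have completed."""
    return [frozenset(subset)
            for k in range(n + 1)
            for subset in combinations(range(n), k)]
```

    For four queued commands this gives 5 possible outcomes with
    ordered commands against 16 with unordered commands, before even
    accounting for the relative order of overlapping writes.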

    One possible response is to make all commands that could force
    state transitions ordered - all other commands could be
    unordered.  For example, WRITEs would be ordered and READs
    unordered.  Note that while READs could affect state transitions
    through auto-reallocation, the order in which these transitions
    occur is not visible to the initiator and so does not matter.

    The problem with this approach is that performance would suffer.
    Not only could WRITEs not be reordered, but READs separated by
    WRITEs could not be reordered.

    Another possible response is to make only a limited set of
    commands that force state transitions ordered.  Assume all non
    READ/WRITE commands are ordered.  Since they constitute a minute
    percentage of executed commands, the performance impact is
    negligible.  In addition, some WRITE commands should also be
    ordered.  Typically operating systems are very sensitive to the
    ordering of some WRITE commands, but not to others.  Directory
    information updates are representative of the former, end user
    data updates of the latter.  Note that any operating system that
    currently allows the reordering of commands should have already
    classified WRITE commands into these two categories, so
    propagating this distinction down to the embedded disk drive
    should entail a minimal amount of work.
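
    A host driver might express this classification along the lines
    of the following Python sketch.  It is only an illustration of
    the policy described above: the function choose_tag and the
    order_sensitive flag are hypothetical, and how the operating
    system decides that a WRITE is order sensitive (e.g. a directory
    update) is outside the scope of the sketch.

```python
def choose_tag(command, order_sensitive=False):
    """Pick the queue tag type for a command under the limited-ordering
    policy: all non READ/WRITE commands and order-sensitive WRITEs are
    ordered, everything else is unordered."""
    if command not in ("READ", "WRITE"):
        return "ORDERED"
    if command == "WRITE" and order_sensitive:
        return "ORDERED"
    return "UNORDERED"
```

    For example, a directory update would be issued as
    choose_tag("WRITE", order_sensitive=True), while an end user data
    update would use the default and be left free for reordering.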




Non Power Loss Failure Modes

    While power loss is the most catastrophic error condition, errors
    during command execution will also introduce inconsistencies by
    preventing certain state transitions from occurring.  The
    initiator should configure the device (via the MODE SELECT
    command) to halt after sending any non GOOD status condition.  At
    this point the device is in a non-synchronized state - the
    initiator does not know what state transitions for the
    outstanding commands and the command in error have occurred.

    After getting additional sense data via a REQUEST SENSE command
    (executed by bypassing the stalled queue), the initiator should
    clear the queue (by sending the CLEAR QUEUE message).

    Once the queue has been cleared, the recovery strategy described
    under Data Integrity Implications applies.  For a stream of
    ordered commands, simply retrying every I/O process that was
    outstanding at the device at the time of failure would always
    provide adequate recovery, and for added safety the initiator
    might force these I/O processes to be executed one at a time.
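
    The recovery sequence described in this section can be sketched
    in Python as follows.  This is a minimal sketch, not a definitive
    implementation: FakeTarget is a hypothetical stand-in that only
    records what the initiator sends, its method names do not
    correspond to any real driver interface, the sense data it
    returns is an empty placeholder, and the commands are assumed to
    have been ordered.

```python
class FakeTarget:
    """Hypothetical stand-in for a target; records initiator actions."""

    def __init__(self):
        self.log = []

    def request_sense(self):
        self.log.append("REQUEST SENSE")
        return {}                      # placeholder for real sense data

    def clear_queue(self):
        self.log.append("CLEAR QUEUE")

    def execute(self, command):
        # Returns only after status has been received for the command.
        self.log.append(command)
        return "GOOD"


def recover_after_error(target, outstanding):
    """Fetch sense data, clear the queue, then retry each outstanding
    I/O process one at a time in its original order."""
    sense = target.request_sense()
    target.clear_queue()
    failed = []
    for command in outstanding:
        # The next command is not issued until this one returns status,
        # so the device is idle between retries.
        if target.execute(command) != "GOOD":
            failed.append(command)
    return sense, failed
```

    Retrying one command at a time costs performance, but it rules
    out the act of queuing itself as the cause of a repeated failure.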