From: James McGrath                                X3T9.2/91-24 Rev 0
      Quantum Corporation
      580 Cottonwood Ave
      Milpitas, CA 95035

To:   John Lohmeyer
      Chairman, X3T9.2
      3718 N. Rock Road
      Wichita, KS 67226

Date: February 18, 1991

Subject: Requirements for Implementing Tagged Command Queuing

This document provides a model for a tagged command queuing environment detailed enough to enable target and initiator manufacturers to implement a system with well defined data integrity characteristics. It does not attempt to restrict implementation options beyond that necessary to achieve this limited goal.

Specifically, it does not provide a comprehensive model of command queuing detailed enough to enable someone to provide a "standard implementation" of command queuing. My belief is that such an effort would be premature and would limit the value the market can provide to customers.

Although this model can be easily adapted to any device type, the direct access disk drive is used to provide a concrete example. This is the first draft, and comments are appreciated.

Definition of the State of a SCSI Device

The state of any SCSI device must contain enough information to predict both what data will be returned to the user and the manner in which the SCSI bus operates during the execution of any possible set of SCSI commands. Such items as the mode parameters, format parameters, and mapping of user data to logical blocks are included in the state.

State information is not sufficient to predict all the access time characteristics of the device. Specifically, the contents of any read cached data need not be contained within the state.

The key criterion for data integrity is that the initiator's understanding of the device state matches the actual device state. When these differ, the device is said to be in a non-synchronized state. When these match, the device is said to be in a synchronized state. Points in time that mark the transition from synchronized to non-synchronized are called non-synchronization points.
Points in time that mark the transition from non-synchronized to synchronized are called synchronization points.

Please note that the term synchronization is used here in a completely different manner than in "synchronous data transfer," or "synchronized spindles." It is only weakly related to "synchronize cache."

The relation to data integrity is simple. If the device is synchronized, then anything can be done to it (e.g. reset, power cycle, error condition) without the user losing data it believes has been stored on the media. If the device is in a non-synchronized state, then the user could lose data under at least one condition (e.g. reset, power cycle, error condition). Since loss of power is typically the most severe condition a device can be subjected to, it will be used as the dominant scenario for loss of data integrity.

A Simple Example of a Synchronized and non-Synchronized Device

As a concrete example, take a system with a single initiator and single target using SCSI-1. Assume the target buffers data for speed matching, but does not implement caching. In this instance the device is synchronized when no I/O process is outstanding at the device (i.e. the device is idle). The initiator has, in theory, enough information (i.e. the entire stream of all past commands to the target) to accurately predict exactly what data will be returned by any subsequent READ command. It also has enough data to anticipate how the SCSI bus will operate so that data transfer can be done correctly (e.g. it knows whether synchronous data transfer has been negotiated).

If the device lost power while synchronized, then the data that could be retrieved upon application of power would match the expectations of the user. Data integrity is 100%.

However, the device is not synchronized while it is executing a WRITE command.
During command execution the initiator does not have enough information to tell what has been written to the media until the final command status is transferred from the device. If power is lost while the write command is in progress, then the user data mapped to the logical blocks being written can be any data at all (the old data, the new data, a mixture of the two, or unreadable data). Data integrity is not 100%.

This limitation has always been present in disk devices that use disk controllers with data buffering capabilities. System designers have compensated for it in a variety of ways.

One technique is to ensure that the drive can never be halted (e.g. lose power) in a non-synchronized state. This has rarely been implemented due to the extremely high cost of guarding against all possible failure modes. Instead, file systems have been designed so that the last data written to the drive can always be lost without violating the integrity of the file system. While the user data within those blocks can be lost, all other data can still be accessed. Note that with the exception of these blocks, the system would not be able to distinguish between losing power before or during the last command. Of course, the user who input the data would still detect its absence.

Synchronization and Queuing

The introduction of command queuing (either multiple initiator non-tagged or tagged command queuing) increases the amount of time the device spends in a non-synchronized state. Once again, when the device is in the idle state (no commands outstanding at the device), the device is synchronized. Whenever any I/O process is outstanding at the device, the device may be in a non-synchronized state.

The initiator does in practice have some control over whether the device is synchronized when multiple I/O processes are outstanding at the device.
If no command bytes have yet been transferred for an I/O process, then the target could not have changed the drive state in response to the command. Thus "queuing" multiple I/O processes, but only allowing the SCSI bus transaction to proceed past the initial nexus identification for one command at a time, would reduce the data integrity risks to the same ones as in a non-queuing environment. Alas, this conservative implementation also eliminates most of the benefits of command queuing.

The next level of data integrity risk would be the transferring of all command data for each I/O process. If the I/O processes are either untagged or unordered and tagged, then the drive could reorder the commands in a manner that would result in large periods of non-synchronization. Imagine the execution of a queued FORMAT UNIT and a queued MODE SELECT command.

In practice the greatest elements of risk can be eliminated by either the target or the initiator. The target can refuse to overlap or reorder the device state transitions of any commands other than READ, WRITE, and WRITE AND VERIFY. While this may be a wise implementation strategy, nothing in the standard limits the target to such an implementation.

The initiator can also limit the target by not allowing the queuing of command sequences that contain any command other than READ, WRITE, and WRITE AND VERIFY ordered tagged commands. Note that the tagged queuing protocol can still be used - the initiator would simply hold off sending any command until the current non READ/WRITE/WRITE AND VERIFY command finishes executing.

Note that it is device state transition, not command execution, that is important. READ commands typically do not result in device state transitions. Thus read data for ordered tagged READ commands can be gathered from the media in any order, since without state transitions the initiator could never tell the difference between any particular ordering the target would impose.
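The initiator-side restriction described above (queue only READ, WRITE, and WRITE AND VERIFY, and let any other command run alone) can be sketched roughly as follows. This is only an illustration of the policy, not part of any SCSI interface: the class name InitiatorGate, its methods, and the use of command names as plain strings are all assumptions made for the example.

```python
# Hypothetical initiator-side gate: names and structure are illustrative only.
QUEUEABLE = {"READ", "WRITE", "WRITE AND VERIFY"}


class InitiatorGate:
    """Decides whether the initiator may start another I/O process."""

    def __init__(self):
        self.outstanding = {}  # queue tag -> command name

    def may_send(self, command):
        if not self.outstanding:
            return True   # idle device: any command may be started
        if command not in QUEUEABLE:
            return False  # other commands wait until the device is idle
        # A queueable command may join only other queueable commands.
        return all(c in QUEUEABLE for c in self.outstanding.values())

    def sent(self, tag, command):
        if not self.may_send(command):
            raise RuntimeError("command must be held off")
        self.outstanding[tag] = command

    def completed(self, tag):
        del self.outstanding[tag]


gate = InitiatorGate()
gate.sent(1, "WRITE")
print(gate.may_send("READ"))         # a READ may be queued behind the WRITE
print(gate.may_send("MODE SELECT"))  # MODE SELECT is held until the device is idle
```

The gate treats every command outside the three queueable ones as exclusive, which is the most conservative reading of the text; a real initiator could relax this for commands known not to cause state transitions.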
For WRITE commands the device state is always in transition. Thus executing a sequence of ordered tagged WRITE commands must result in the same set of state transitions as the execution of the same sequence of WRITE commands in a non-queuing environment. Since the reception of data from the host does not result in a state transition, the target can request data for the WRITE commands in any order. In this respect it is similar to the previous example involving READ commands.

Note that the above analysis indicates that it is always proper for the target to request or send data for any command. This is true regardless of whether the command is an ordered, unordered, or head of queue tagged command.

More generally, within a sequence of commands containing the ordered command X, the state transitions associated with all commands prior to X (..., X-2, X-1) must all be completed before any state transition for command X can take place. In addition, the status for all commands prior to X must be returned to the initiator before the status for command X.

Within a sequence of unordered commands the state transitions associated with X may proceed in any order with respect to the state transitions for prior commands. Note that the device can process state transitions in any order at all. Since each command can involve multiple state transitions, multiple commands can be partially executed at any given time. Status for commands can be returned in any order, but status for command X can only be returned after all of the state transitions for command X have taken place.

A head of queue command should be executed atomically (all of its state transitions and return status completed before performing any state transitions or returning status for any other command).
Once the device recognizes that a head of queue command has been sent, then it shall halt state transitions for other commands as quickly as practical and shall not return status for any other command until after status is returned for the head of queue command. Note that only the return of status is directly visible to the initiator. The initiator cannot make any assumptions regarding the order with respect to the head of queue command of state transitions made for other commands.

Tagged Queuing Examples

The following time line illustrates some of the above principles. The abbreviations used are:

    CR[i] = command reception (getting the command bytes) for command i
    ST[i] = state transitions for command i
    CT[i] = command termination (sending status) for command i
    R     = read command
    W     = write command
    C     = other command (e.g. FORMAT UNIT, TEST UNIT READY)
    O     = ordered command
    U     = unordered command

Suppose the sequence of commands sent by the initiator is:

    command number:   0,  1,  2,  3,  4,  5,  6,  7,  8,  9
    command:         UR, UR, OC, OW, OW, OW, UW, UR, UW, OR

Then the following precedence graphs delineate the points in time at which the three events associated with each command - CR, ST, and CT - can take place:

    for i = 0 to 8: CR[i] -> CR[i+1]           (from definition of command numbering)
    for i = 0 to 9: CR[i] -> ST[i] -> CT[i]    (always true for all commands)

    ST[0] --\                                    /-> ST[6] --\
             ST[2] --> ST[3] --> ST[4] --> ST[5] --> ST[7] --> ST[9]
    ST[1] --/                                    \-> ST[8] --/

    CT[0] --\                                    /-> CT[6] --\
             CT[2] --> CT[3] --> CT[4] --> CT[5] --> CT[7] --> CT[9]
    CT[1] --/                                    \-> CT[8] --/

These relationships reveal more about what is allowed than what is not allowed. In particular, all commands could be received around time T[0] and terminate around time T[N]. During the intervening time the only restriction is on the state transitions. There are no restrictions between the execution of bus sequences for the various commands, nor between the bus sequences and the state transitions.
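The ST precedence graph in the example can be derived mechanically from the O/U marking of each command. The following is a minimal sketch, assuming two simplifications that are not stated in the text: each command's state transitions are treated as a single event, and (as the example graph implies) an unordered command waits for the most recent ordered command received before it. The function names are made up for illustration.

```python
def precedence_edges(tags):
    """Return direct "must complete before" edges (i, j) between commands.

    tags[i] is "O" for an ordered command and "U" for an unordered one.
    """
    edges = []
    last_ordered = None   # index of the most recent ordered command
    unordered_since = []  # unordered commands received after it
    for i, tag in enumerate(tags):
        if tag == "O":
            # An ordered command waits for everything received before it;
            # the unordered commands already wait for last_ordered.
            if unordered_since:
                preds = unordered_since
            else:
                preds = [] if last_ordered is None else [last_ordered]
            last_ordered, unordered_since = i, []
        else:
            # Assumption: an unordered command waits only for the most
            # recent ordered command (inferred from the example graph).
            preds = [] if last_ordered is None else [last_ordered]
            unordered_since.append(i)
        edges.extend((p, i) for p in preds)
    return edges


def order_is_allowed(tags, order):
    """Check a proposed completion order against the precedence edges."""
    position = {cmd: n for n, cmd in enumerate(order)}
    return all(position[a] < position[b] for a, b in precedence_edges(tags))


# The example sequence from the text; the first letter is the O/U marking.
tags = [c[0] for c in "UR UR OC OW OW OW UW UR UW OR".split()]
print(precedence_edges(tags))
print(order_is_allowed(tags, [1, 0, 2, 3, 4, 5, 8, 6, 7, 9]))
```

For the example sequence the edges produced match the graph drawn above, and the same edges apply to the CT events. Head of queue commands are deliberately left out of this sketch.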
The only source of additional constraint is the need for state transitions to be preceded by some bus sequencing (i.e. you have to get the write data from the initiator before writing it to the media).

Throughout this entire period of time the device is in a non-synchronized state. The initiator can limit the non-synchronized times only by throttling data to the target (i.e. without write data the corresponding state transitions cannot take place). This can be accomplished by using the DISCONNECT message when the target requests data.

Data Integrity Implications

If power is lost while multiple I/O Processes are outstanding at the device, then the device is left in a non-synchronized state. If the commands were unordered, then the state transitions forced by the commands could have occurred in any order. If the commands were ordered, then the state transitions can only occur in a certain order, but the initiator will not know how many transitions had occurred.

Note that if ordered commands are used, then the data integrity liability is almost the same as that encountered in the simple non command queuing case. The only additional complication is that for N queued commands, the last I commands (I = 0 to N) may not have been written to the disk rather than simply the last command. Fortunately, operating systems have already solved this problem. Since all operating systems (even simple ones like DOS) employ some form of I/O buffering, system software already must safeguard against the inability to execute the last set of N commands. So using ordered command queuing should introduce no greater data integrity problem than system designers already experience in a non command queuing environment.

If a condition occurs that halts the execution of a stream of ordered commands (e.g.
device power loss, device error, bus reset), then an initiator recovery strategy that simply retries every I/O process outstanding at the device at the time of failure would always provide adequate recovery. For added safety, the initiator might force these I/O processes to be executed one at a time. This can be done by simply refusing to initiate an I/O process (i.e. perform the SELECTION) until the device is idle. This will eliminate the possibility that it was the act of queuing the commands that generated the initial failure.

Unordered command queuing introduces a new dimension of potential problems. State transitions could occur in any order. While this is fine if all the state transitions are allowed to occur, a failure will leave the device in a state where any of the transitions may have occurred in any order. Current operating systems have limited tolerance for devices left in this state.

One possible response is to make all commands that could force state transitions ordered - all other commands could be unordered. For example, WRITEs would be ordered and READs unordered. Note that while READs could affect state transitions through auto-reallocation, the order in which these transitions occur is not visible to the initiator and so does not matter. The problem with this approach is that performance would suffer. Not only could WRITEs not be reordered, but READs separated by WRITEs could not be reordered.

Another possible response is to make only a limited set of commands that force state transitions ordered. Assume all non READ/WRITE commands are ordered. Since they constitute a minute percentage of executed commands, the performance impact is negligible. In addition, some WRITE commands should also be ordered. Typically operating systems are very sensitive to the ordering of some WRITE commands, but not to others. Directory information updates are representative of the former, end user data updates of the latter.
Note that any operating system that currently allows the reordering of commands should have already classified WRITE commands into these two categories, so propagating this distinction down to the embedded disk drive should entail a minimal amount of work.

Non Power Loss Failure Modes

While power loss is the most catastrophic error condition, errors during command execution will also introduce inconsistencies by preventing certain state transitions from occurring. The initiator should configure the device (via the MODE SELECT command) to halt after sending any non GOOD status condition. At this point the device is in a non-synchronized state - the initiator does not know what state transitions for the outstanding commands and the command in error have occurred. After getting additional sense data via a REQUEST SENSE command (executed by bypassing the stalled queue), the initiator should clear the queue (by sending the CLEAR QUEUE message).

If a condition occurs that halts the execution of a stream of ordered commands (e.g. device power loss, device error, bus reset), then an initiator recovery strategy that simply retries every I/O process outstanding at the device at the time of failure would always provide adequate recovery. For added safety, the initiator might force these I/O processes to be executed one at a time. This can be done by simply refusing to initiate an I/O process (i.e. perform the SELECTION) until the device is idle. This will eliminate the possibility that it was the act of queuing the commands that generated the initial failure.
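The recovery sequence described above (fetch sense data, clear the queue, then retry the outstanding I/O processes one at a time) might be sketched as follows. The device object and its request_sense, clear_queue, and execute methods are hypothetical stand-ins invented for this example, and stopping at the first retry that fails is an added assumption rather than something the text requires.

```python
class FakeDevice:
    """Stand-in target used only to exercise the sketch; not a real driver API."""

    def __init__(self, failing=()):
        self.failing = set(failing)  # commands that will report an error
        self.log = []                # everything the initiator did, in order

    def request_sense(self):
        self.log.append("REQUEST SENSE")
        return "sense data"

    def clear_queue(self):
        self.log.append("CLEAR QUEUE")

    def execute(self, command):
        # Runs one command to completion, so the device is idle on return.
        self.log.append(command)
        return "CHECK CONDITION" if command in self.failing else "GOOD"


def recover(device, outstanding):
    """Retry, one at a time and in original order, every I/O process that
    was outstanding at the device when the failure occurred."""
    sense = device.request_sense()  # bypasses the stalled queue
    device.clear_queue()
    statuses = {}
    for command in outstanding:
        status = device.execute(command)  # next one starts only when idle
        statuses[command] = status
        if status != "GOOD":
            break  # assumption: stop so that later commands are not reordered
    return sense, statuses


device = FakeDevice()
sense, statuses = recover(device, ["WRITE 3", "WRITE 4", "WRITE 5"])
print(device.log)
print(statuses)
```

Because each retry is issued only after the previous one has returned status, the device is idle between commands and the retries cannot be reordered by the target.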