.PH ////
.PF "/RAID-3 using SCSI drives//Page %/"
.nf
May 7, 1990

TO:       X3T9.2 Committee
FROM:     Thomas Wicklund / Ciprico Inc.
Subject:  Issues involved with RAID-3 disk arrays using SCSI drives

This paper deals with issues involved when implementing a disk
array using SCSI disks.  It will focus on a RAID-3 type disk array.

A disk array causes several physical disk drives to look to the
host like a single large disk drive.  Disk arrays also typically
provide some degree of redundancy via an extra disk drive which can
be used to continue operating if a disk fails.

A RAID-3 disk array writes data in parallel to multiple disk
drives, thus increasing data transfer rate.  A block of data from
the host is divided between the disk drives, typically on a byte by
byte basis.

There have been several RAID-3 implementations in the industry,
including the Micropolis 1804 (now discontinued), Ciprico Rimfire
6600, Imprimis Array Master, and products from Maximum Strategy and
Storage Concepts.

Most array products up until now have been implemented using a
device level interface such as ESDI or IPI-2.  Since SCSI is the
most popular low cost disk interface, there is a lot of interest in
implementing RAID-3 products using SCSI disks.

There are a number of issues involved in implementing a RAID-3
device using SCSI disks.  This paper concentrates on problems when
running multiple SCSI disks on each SCSI bus in the array.  A SCSI
to SCSI bridge controller is used as an example.

Please note that SCSI is not the ideal choice for a RAID-3 product
for a number of reasons.  The big plus that SCSI has is its low
cost, both of disk drives (price equal to ESDI) and controller
hardware (SCSI protocol chips are much cheaper than disk
sequencers).
.bp
The basic structure of a SCSI to SCSI RAID-3 bridge controller is
shown below.

        PORT 1    PORT 2    PORT 3    PORT 4    PORT 5

       +------+  +------+  +------+  +------+  +------+
RANK 3 | SCSI |  | SCSI |  | SCSI |  | SCSI |  | SCSI |
       | DISK |  | DISK |  | DISK |  | DISK |  | DISK |
       +------+  +------+  +------+  +------+  +------+
          |         |         |         |         |
       +------+  +------+  +------+  +------+  +------+
RANK 2 | SCSI |  | SCSI |  | SCSI |  | SCSI |  | SCSI |
       | DISK |  | DISK |  | DISK |  | DISK |  | DISK |
       +------+  +------+  +------+  +------+  +------+
          |         |         |         |         |
       +------+  +------+  +------+  +------+  +------+
RANK 1 | SCSI |  | SCSI |  | SCSI |  | SCSI |  | SCSI |
       | DISK |  | DISK |  | DISK |  | DISK |  | DISK |
       +------+  +------+  +------+  +------+  +------+
          |         |         |         |         |
          |         |         |         |         |
     +---------------------------------------------------+
     |                  SCSI PORTS AND                   |
     |             PARITY / BYTE SPLIT LOGIC             |
     +---------------------------------------------------+
                              |
                              |
                           +------+
                           | HOST |
                           | SCSI |
                           | PORT |
                           +------+

The host side is a SCSI port, probably wide or fast SCSI in order
to handle the data rate.

The device side consists of a number of SCSI ports (5 in this
example).  Most disk operations sent from the host are executed by
sending the same operation to one disk on each SCSI port.  In this
example each physical disk would read and write blocks which are
1/4 the size of host blocks, with the fifth disk providing parity.

Each device side SCSI port may have one or more drives attached to
it.  Each group of corresponding drives, one on each port, is
termed a "rank" for the rest of this paper.

The RAID-3 controller may be implemented in a number of ways.  The
least expensive involves minimal buffering on the controller.  In
this implementation data is transferred between the host and the
disks by selecting the corresponding disk on each SCSI bus, then
reading or writing data.  The critical requirement is that the
corresponding disk from each SCSI port must be selected at the same
time.

A RAID-3 controller will enforce several restrictions on the disks
which are attached in order to ease implementation.  These are
briefly:

1. All disks in a given rank must be identical.
   This means identical manufacturer and model.

2. A given rank must be fully populated (4 or 5 disks) in order to
   operate at all.

3. Drives within a single rank are normally synchronized using
   spindle sync.  The controller may also be able to operate
   without spindle sync.

Implementation decisions used in this example are:

1. Only issue one command at a time to drives in each rank (no
   tagged commands).  This avoids deadlocks if the drives execute
   commands in a different order.

2. Multiple drives may be active on a given port (overlapped
   seeks).

3. Multiple ranks are treated as multiple logical units.

4. Performance features such as readahead are assumed to be
   implemented by the drive.


PROBLEMS USING SCSI DISKS:

If one SCSI disk is placed on each SCSI port in the RAID-3
controller there are no major implementation problems.  However, a
typical requirement is to allow multiple disks attached to each
SCSI port, thus increasing capacity without requiring an extra
controller.

When multiple SCSI disks are attached several problems exist which
don't exist when using device level interfaces.  These problems
occur because the SCSI target device controls reselection (while in
device level interfaces the controller handles all selection).
Thus it is possible for drives from different ranks to be connected
on different SCSI ports at the same time.  Since the RAID-3 device
requires that drives of the same rank be selected on all ports in
order to operate, a method is needed to avoid deadlock.

Potential deadlock cases are outlined below.  The examples here
assume that one port does something different, but the same
arguments apply to two or more ports.

1. The controller attempts to select a rank but on one port a
   different drive reselects.

2. Drives from one rank attempt to reselect with read data.  On
   one port a drive in a different rank reselects with data from a
   different command.

3. Drives from one rank attempt to reselect to request write data.
   On one port a drive in a different rank reselects for a
   different command.
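
As a rough illustration only, the condition common to the three
cases above can be modeled in a few lines of Python.  The function
names (ready_rank, conflicting_ports) and the list-per-port
representation are assumptions made for this sketch; they are not
part of SCSI or of any controller design described here.

```python
from collections import Counter
from typing import List, Optional, Sequence


def ready_rank(connected: Sequence[Optional[int]]) -> Optional[int]:
    """Return the rank that can transfer data now, or None.

    connected[i] is the rank of the drive currently connected on
    port i, or None if that port's bus is free.  A byte-split
    transfer can proceed only when every port is connected to a
    drive of the same rank.
    """
    first = connected[0]
    if first is not None and all(rank == first for rank in connected):
        return first
    return None


def conflicting_ports(connected: Sequence[Optional[int]]) -> List[int]:
    """List ports whose connected drive is not in the majority rank.

    These are the ports on which a drive from a different rank won
    reselection, which is what blocks the transfer.
    """
    ranks = [rank for rank in connected if rank is not None]
    if not ranks:
        return []
    majority, _count = Counter(ranks).most_common(1)[0]
    return [port for port, rank in enumerate(connected)
            if rank is not None and rank != majority]


# Rank 1 is connected on four ports; a rank 2 drive won port 3.
state = [1, 1, 2, 1, 1]
print(ready_rank(state))         # no rank is ready
print(conflicting_ports(state))  # the port holding the wrong rank
```

With one drive per port the list always holds a single rank, which
is why the single rank configuration has no such problem.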

The problem to be solved is how to get out of the deadlock
situation.


SOLUTIONS USING CURRENT SCSI (or very minor modifications):

The following are solutions which I can see using SCSI as currently
defined (proposed SCSI-2).  All provide a way for the initiator to
control selection using some algorithm to determine which rank to
service next.

A. Direct Control Of Arbitration

An obvious solution is to directly control arbitration on all ranks
and only allow the drives of a given rank to win arbitration if all
drives are arbitrating at the same time.

I'm not sure this can be implemented without custom hardware and
cabling (or perhaps not at all).  In general, an implementation
using off the shelf SCSI protocol chips is preferred.

This implementation also fails if a second controller is attached
to the SCSI bus (2 initiators for redundancy).

B. Initiator DISCONNECT message

The initiator can send a DISCONNECT message to drives which it
wants to wait until later.  This would be done by sending a
DISCONNECT to the target after receiving the IDENTIFY message if
the controller isn't ready for the drive.

Problems with this solution include:

1. The DISCONNECT message causes the drive to go away for 200us.
   This could delay reselection briefly.

2. A given drive might repeatedly reselect and be disconnected if
   waiting for a slow response from another rank.  This also adds
   to controller overhead.

C. Pre Fetch command

Problems associated with selection and read operations may be
reduced by implementing reads as Pre Fetch (with disconnect)
followed by Read without disconnect (with data assumed to come from
target cache).  Then any reselection will be assumed not to have
any data associated with it (a Pre Fetch completing).  The
controller would select all drives to either issue a Pre Fetch with
disconnect or a Read without disconnect.

Write operations could be handled through one of:

1. Don't disconnect during writes.
   Possibly precede the write with a SEEK command to go to the
   location; however, SEEK is currently a diagnostic operation.

2. Don't disconnect between the CDB and write data transfer.  Only
   disconnect after all data is sent and waiting for completion.
   This either means adding a requirement when selecting drives or
   adding a new DTDC value in the Disconnect / Reselect mode page.

Problems with this solution include:

1. In general, two commands for every operation.  This adds
   significant overhead to the system.

2. Long data operations (> drive buffer) must be done without
   disconnect.

D. Large Controller Buffer

Deadlocks from reselect aren't a problem if the controller has a
large enough buffer attached to each SCSI port.  Then it can
transfer data for each port independently, prefetching write data
and assembling read data when all is available.

Problems with this solution are:

1. As a rule of thumb, the controller must have enough buffer to
   hold data for all outstanding commands.  If the controller is
   being used in 1MB reads and writes, the controller needs a 7MB
   buffer.  Since RAID-3 devices are used for high transfer rate,
   requests will tend to be long.

2. If outstanding commands exceed buffer size, the controller must
   reduce drive activity to avoid deadlock.

3. Buffers are very expensive.

Summary of solutions using current SCSI:

These solutions require one of the following strategies:

- Custom hardware.

- Initiator sent DISCONNECT.

- Never disconnect if a data phase may follow (basically ensure
  that all reselections are short and non-data).

- Large buffers.

All of these solutions tend to reduce performance by either
increasing SCSI bus activity (lots of disconnects) or reducing bus
efficiency (by not disconnecting).


SOLUTIONS INVOLVING CHANGES TO SCSI:

The changes required must allow the initiator to control
reselection.  The changes are:

1. Tell the target not to reselect.

   a. Tell the target at initial selection not to reselect after
      disconnects.
      This would be used if the initiator can determine when to
      have the target continue without information from the
      target.

   b. On a reselection, tell the target to go away until the
      initiator wants it to continue.

2. Allow the initiator to select the target while a command is in
   progress (select without aborting current I/O process).

Some possible solutions are outlined here.

A. NCR proposal

The NCR proposal adds two messages:

1. DO NOT CALL ME (DNCM) to prevent target reselection.

2. CONTINUE after a selection to tell the target that this is

These two messages perform the functions listed above.

B. Modified DISCONNECT and selection

The DISCONNECT message could be extended to mean "disconnect and
don't reselect again".  This could be done via a new parameter in
the Disconnect / Reselect mode page.

This extension would handle case 1b above.  Case 1a might be
handled by a sequence like:

   a. Target sends DISCONNECT message.

   b. Initiator responds with ATN and DISCONNECT message.

   c. Target goes to BUS FREE.

This solution is rather inelegant since it changes the semantics of
the initiator sent DISCONNECT message.

Case 2 above could be handled by one of:

   a. Add a new form of selection at the low level (select with
      either MSG or C/D asserted) to identify a continued
      selection.  However, this requires protocol chip changes.

   b. Interpret a reselect from the initiator as a "continue I/O
      process".  Unfortunately, this reverses the initiator and
      target role at the low level, violating much of the
      philosophy of SCSI.

   c. Have the initiator send a command (similar to AEN) telling
      the target to go ahead and reselect.  This is potentially
      high overhead.

   d. Allow selection to an existing I/O process (perhaps only
      after initiator DISCONNECT message or under mode page
      control).  However, this may cause problems with RESET
      recovery.

   e. Add a bit to IDENTIFY informing the target that this
      selection is continuing an I/O process.

   f.
      Have the initiator select the existing I/O process then send
      an existing message which is re-interpreted as "continue".
      Candidates are DISCONNECT, INITIATE RECOVERY, or RESTORE
      POINTERS.

   g. Add a message to continue an existing I/O process.


SUMMARY

This paper summarizes the reasons for initiator controlled
reselection, specifically in the case of RAID-3 disk arrays, the
solutions possible using currently defined SCSI, and some possible
extensions to SCSI to handle this situation.

None of these solutions is particularly good or compatible with the
"philosophy" of SCSI.  Unfortunately, the marketplace is more
concerned with cost and buzzwords.  SCSI is low cost and it's the
current "hot" technology.  SCSI products solving the above problems
will be marketed; the question is whether they will use vendor
specific hooks, suffer performance problems, or unnecessarily
complicate the product.