.PH ////
.PF "/RAID-3 using SCSI drives//Page %/"
.nf
May 7, 1990

TO:     X3T9.2 Committee

FROM:   Thomas Wicklund / Ciprico Inc.

Subject: Issues involved with RAID-3 disk arrays using SCSI drives


This paper deals with issues involved when implementing a disk
array using SCSI disks.  It will focus on a RAID-3 type disk
array.

A disk array causes several physical disk drives to look to the
host like a single large disk drive.  Disk arrays also typically
provide some degree of redundancy via an extra disk drive which
can be used to continue operating if a disk fails.

A RAID-3 disk array writes data in parallel to multiple disk
drives, thus increasing data transfer rate.  A block of data from
the host is divided between the disk drives, typically on a byte
by byte basis.

There have been several RAID-3 implementations in the industry,
including the Micropolis 1804 (now discontinued), Ciprico Rimfire
6600, Imprimis Array Master, and products from Maximum Strategy
and Storage Concepts.

Most array products up until now have been implemented using a
device level interface such as ESDI or IPI-2.  Since SCSI is the
most popular low cost disk interface, there is a lot of interest
in implementing RAID-3 products using SCSI disks.

There are a number of issues involved in implementing a RAID-3
device using SCSI disks.  This paper concentrates on problems
when running multiple SCSI disks on each SCSI bus in the array.
A SCSI to SCSI bridge controller is used as an example.

Please note that SCSI is not the ideal choice for a RAID-3
product for a number of reasons.  The big plus that SCSI has is
its low cost, both of disk drives (price equal to ESDI) and
controller hardware (SCSI protocol chips are much cheaper than
disk sequencers).
.bp
The basic structure of a SCSI to SCSI RAID-3 bridge controller
is shown below.


        PORT 1      PORT 2      PORT 3      PORT 4      PORT 5

       +------+    +------+    +------+    +------+    +------+
RANK 3 | SCSI |    | SCSI |    | SCSI |    | SCSI |    | SCSI |
       | DISK |    | DISK |    | DISK |    | DISK |    | DISK |
       +------+    +------+    +------+    +------+    +------+
           |           |           |           |           |
       +------+    +------+    +------+    +------+    +------+
RANK 2 | SCSI |    | SCSI |    | SCSI |    | SCSI |    | SCSI |
       | DISK |    | DISK |    | DISK |    | DISK |    | DISK |
       +------+    +------+    +------+    +------+    +------+
           |           |           |           |           |
       +------+    +------+    +------+    +------+    +------+
RANK 1 | SCSI |    | SCSI |    | SCSI |    | SCSI |    | SCSI |
       | DISK |    | DISK |    | DISK |    | DISK |    | DISK |
       +------+    +------+    +------+    +------+    +------+
           |           |           |           |           |
           |           |           |           |           |
         +---------------------------------------------------+
         |                   SCSI PORTS AND                  |
         |             PARITY / BYTE SPLIT LOGIC             |
         +---------------------------------------------------+
                                   |
                                   |
                                +------+
                                | HOST |
                                | SCSI |
                                | PORT |
                                +------+

The host side is a SCSI port, probably wide or fast SCSI in order
to handle the data rate.

The device side consists of a number of SCSI ports (5 in this
example).  Most disk operations sent from the host are executed
by sending the same operation to one disk on each SCSI port.  In
this example each physical disk would read and write blocks which
are 1/4 the size of host blocks, with the fifth disk providing
parity.
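
As a minimal sketch of the byte split described above, the Python
below divides a host block across four data ports and computes a
fifth parity stripe.  XOR parity and the assignment of byte n to
port (n mod 4) + 1 are assumptions made for illustration; the
function names are hypothetical and do not describe any product.

```python
from functools import reduce

DATA_PORTS = 4  # ports 1-4 carry data, port 5 carries parity


def xor_columns(stripes: list) -> bytes:
    """XOR the stripes together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripes))


def split_block(host_block: bytes) -> list:
    """Split a host block byte by byte and append a parity stripe.

    Assumption: byte n of the host block goes to data port n mod 4.
    """
    if len(host_block) % DATA_PORTS != 0:
        raise ValueError("block length must be a multiple of 4")
    stripes = [host_block[i::DATA_PORTS] for i in range(DATA_PORTS)]
    return stripes + [xor_columns(stripes)]


def rebuild_stripe(stripes: list, failed_port: int) -> bytes:
    """Recover one lost stripe (0-based index) from the other four."""
    survivors = [s for i, s in enumerate(stripes) if i != failed_port]
    return xor_columns(survivors)


def join_block(stripes: list) -> bytes:
    """Interleave the four data stripes back into the host block."""
    data = stripes[:DATA_PORTS]
    return bytes(b for col in zip(*data) for b in col)
```

Each stripe is 1/4 the size of the host block, and any single
missing stripe can be rebuilt from the remaining four.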

Each device side SCSI port may have one or more drives attached
to it.  Each group of drives is termed a "rank" for the rest of
this paper.

The RAID-3 controller may be implemented in a number of ways. 
The least expensive involves minimal buffering on the controller. 
In this implementation data is transferred between the host and
the disks by selecting the corresponding disk on each SCSI bus,
then reading or writing data.  The critical requirement is that
the corresponding disk from each SCSI port must be selected at
the same time.

A RAID-3 controller will enforce several restrictions on the
disks which are attached in order to ease implementation.  These
are briefly:

1.  All disks in a given rank must be identical.  This means
    identical manufacturer and model.

2.  A given rank must be fully populated (4 or 5 disks) in order
    to operate at all.

3.  Drives within a single rank are normally synchronized using
    spindle sync.  The controller may also be able to operate
    without spindle sync.

Implementation decisions used in this example are:

1.  Only issue one command at a time to drives in each rank (no
    tagged commands).  This avoids deadlocks if the drives
    execute commands in a different order.

2.  Multiple drives may be active on a given port (overlapped
    seeks).

3.  Multiple ranks are treated as multiple logical units.

4.  Performance features such as readahead are assumed to be
    implemented by the drive.


PROBLEMS USING SCSI DISKS:

If one SCSI disk is placed on each SCSI port in the RAID-3
controller there are no major implementation problems.  However,
a typical requirement is to allow multiple disks attached to each
SCSI port, thus increasing capacity without requiring an extra
controller.

When multiple SCSI disks are attached several problems exist
which don't exist when using device level interfaces.  These
problems occur because the SCSI target device controls
reselection (while in device level interfaces the controller
handles all selection).  Thus it is possible for drives from
different ranks to be selected on each SCSI port.  Since the
RAID-3 device requires that drives from all ports be selected in
order to operate, a method is needed to avoid deadlock.

Potential deadlock cases are outlined below.  The examples here
assume that one port does something different, but the same
arguments apply to two or more ports.

1.  The controller attempts to select a rank but on one port a
    different drive reselects.

2.  Drives from one rank attempt to reselect with read data.  On
    one port a drive in a different rank reselects with data from
    a different command.

3.  Drives from one rank attempt to reselect to request write
    data.  On one port a drive in a different rank reselects for
    a different command.

The problem to be solved is how to get out of the deadlock
situation.
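
The condition behind all three cases can be sketched in a few
lines of Python.  This is a simplified model, not controller
firmware: each port is represented only by the rank of the drive
currently connected on it (None for BUS FREE), and the function
names are hypothetical.

```python
from typing import Optional


def ready_rank(connected: list) -> Optional[int]:
    """Return the rank that can transfer data, or None.

    connected[i] is the rank of the drive connected on port i + 1,
    or None if that port is at BUS FREE.  A minimally buffered
    controller can move data only when every port holds a drive
    from the same rank.
    """
    ranks = set(connected)
    if len(ranks) == 1 and None not in ranks:
        return connected[0]
    return None


def mismatched_ports(connected: list, wanted_rank: int) -> list:
    """List the ports (1-based) held by a drive from another rank."""
    return [
        port
        for port, rank in enumerate(connected, start=1)
        if rank is not None and rank != wanted_rank
    ]
```

For example, with rank 1 connected on ports 1, 2, 4 and 5 and a
rank 2 drive connected on port 3, ready_rank returns None and
mismatched_ports reports port 3 as the one blocking the transfer.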



SOLUTIONS USING CURRENT SCSI (or very minor modifications):

The following are solutions which I can see using SCSI as
currently defined (proposed SCSI-2).  All provide a way for the
initiator to control selection using some algorithm to determine
which rank to service next.


A.  Direct Control Of Arbitration

An obvious solution is to directly control arbitration on all
ranks and only allow the drives of a given rank to win
arbitration if all drives are arbitrating at the same time.  I'm
not sure this can be implemented without custom hardware and
cabling (or perhaps not at all).  In general, an implementation
using off the shelf SCSI protocol chips is preferred.

This implementation also fails if a second controller is attached
to the SCSI bus (2 initiators for redundancy).


B.  Initiator DISCONNECT message

The initiator can send a disconnect message to drives which it
wants to wait until later.  This would be done by sending a
DISCONNECT to the target after receiving the IDENTIFY message if
the controller isn't ready for the drive.

Problems with this solution include:

1.  The DISCONNECT message causes the drive to go away for 200us. 
    This could delay reselection briefly.

2.  A given drive might repeatedly reselect and be disconnected
    if waiting for a slow response from another rank.  This also
    adds to controller overhead.
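
A rough Python sketch of this policy and of problem 2 follows.
It assumes a drive retries as soon as the 200us delay expires and
ignores arbitration and message time, so the count is an upper
bound rather than a measurement.  All names are hypothetical.

```python
DISCONNECT_DELAY_US = 200  # delay after initiator DISCONNECT (from text)


def respond_to_reselect(active_rank: int, reselecting_rank: int) -> str:
    """Choose the controller's reply to a reselecting drive."""
    if reselecting_rank == active_rank:
        return "ACCEPT"
    return "DISCONNECT"


def bounce_count(wait_us: int) -> int:
    """Count rejected reselections while the active rank is busy.

    Assumption: the drive first reselects at time 0 and then
    retries exactly every DISCONNECT_DELAY_US until wait_us has
    elapsed.
    """
    bounces = 0
    now_us = 0
    while now_us < wait_us:
        bounces += 1
        now_us += DISCONNECT_DELAY_US
    return bounces
```

Under these assumptions a drive that has to wait 10ms for another
rank would be turned away up to 50 times, each time costing the
controller a reselection and a message exchange.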


C.  Pre Fetch command

Problems associated with selection and read operations may be
reduced by implementing reads as Pre Fetch (with disconnect)
followed by Read without disconnect (with data assumed to come
from target cache).  Then any reselection will be assumed not to
have any data associated with it (a Pre Fetch completing).  The
controller would select all drives to either issue a Pre Fetch
with disconnect or a read without disconnect.

Write operations could be handled through one of:

1.  Don't disconnect during writes.  Possibly precede by a SEEK
    command to go to the location, however SEEK is currently a
    diagnostic operation.

2.  Don't disconnect between the CDB and write data transfer. 
    Only disconnect after all data is sent and waiting for
    completion.  This either means adding a requirement when
    selecting drives or adding a new DTDC value in the
    Disconnect / Reselect mode page.

Problems with this solution include:

1.  In general, two commands for every operation.  This adds
    significant overhead to the system.

2.  Long data operations (> drive buffer) must be done without
    disconnect.
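
The read side of this scheme can be sketched as a small planning
function in Python.  The command names are plain strings, the
per-drive buffer size is a parameter the controller is assumed to
know, and the function itself is hypothetical.

```python
def read_plan(length_bytes: int, drive_buffer_bytes: int) -> list:
    """Return (command, disconnect_allowed) pairs for one drive.

    If the data fits in the drive buffer, issue a Pre Fetch that
    may disconnect, then a Read that may not.  Otherwise the data
    cannot be staged in the drive, so a single Read is issued
    without disconnect.
    """
    if length_bytes <= drive_buffer_bytes:
        return [("PRE FETCH", True), ("READ", False)]
    return [("READ", False)]
```

The two entries in the first case are the doubled command count
noted in problem 1; the single non-disconnecting Read in the
second case is the long operation noted in problem 2.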


D.  Large Controller Buffer

Deadlocks from reselect aren't a problem if the controller has a
large enough buffer attached to each SCSI port.  Then it can
transfer data for each port independently, prefetching write data
and assembling read data when all is available.

Problems with this solution are:

1.  As a rule of thumb, the controller must have enough buffer to
    hold data for all outstanding commands.  If the controller is
    being used in 1MB reads and writes, the controller needs a 7MB
    buffer.  Since RAID-3 devices are used for high transfer rate,
    requests will tend to be long.

2.  If outstanding commands exceed buffer size, the controller
    must reduce drive activity to avoid deadlock.

3.  Buffers are very expensive.
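
The rule of thumb in problem 1 is simple arithmetic, sketched
here in Python.  The count of 7 outstanding commands is an
assumption used to reproduce the 7MB figure (for instance one
command per rank with 7 ranks); the text does not state where the
7 comes from.  Function names are hypothetical.

```python
MB = 1024 * 1024


def buffer_needed(outstanding_commands: int, transfer_bytes: int) -> int:
    """Buffer required to hold data for every outstanding command."""
    return outstanding_commands * transfer_bytes


def max_outstanding(buffer_bytes: int, transfer_bytes: int) -> int:
    """Commands that fit in the buffer before activity must be cut."""
    return buffer_bytes // transfer_bytes
```

With 1MB transfers, buffer_needed(7, MB) gives 7MB, while a
controller with only 4MB of buffer would have to limit itself to
max_outstanding(4 * MB, MB) = 4 commands, as in problem 2.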



Summary of solutions using current SCSI:

These solutions require one of the following strategies:

  - Custom hardware.
  - Initiator sent DISCONNECT
  - Never disconnect if a data phase may follow (basically ensure
    that all reselections are short and non-data).
  - Large buffers.

All of these solutions tend to reduce performance by either
increasing SCSI bus activity (lots of disconnects) or reducing
bus efficiency (by not disconnecting).


SOLUTIONS INVOLVING CHANGES TO SCSI:

The changes required must allow the initiator to control
reselection.  The changes are:

1.  Tell the target not to reselect.
    a.  Tell the target at initial selection not to reselect
        after disconnects.  This would be used if the initiator
        can determine when to have the target continue without
        information from the target.
    b.  On a reselection, tell the target to go away until the
        initiator wants it to continue.
2.  Allow the initiator to select the target while a command is
    in progress (select without aborting current I/O process).

Some possible solutions are outlined here.


A.  NCR proposal

The NCR proposal adds two messages:

1.  DO NOT CALL ME (DNCM) to prevent target reselection.
2.  CONTINUE after a selection to tell the target that this is

These two messages perform the functions listed above.



B.  Modified DISCONNECT and selection

The DISCONNECT message could be extended to mean "disconnect and
don't reselect again".  This could be done via a new parameter in
the Disconnect / Reselect mode page.  This extension would handle
case 1b above.  Case 1a might be handled by a sequence like:

  a.  Target sends DISCONNECT message.
  b.  Initiator responds with ATN and DISCONNECT message.
  c.  Target goes to BUS FREE.

This solution is rather inelegant since it changes the semantics
of the initiator sent DISCONNECT message.

Case 2 above could be handled by one of:

  a.  Add a new form of selection at the low level (select with
      either MSG or C/D asserted) to identify a continued
      selection.  However, this requires protocol chip changes.

  b.  Interpret a reselect from the initiator as a "continue I/O
      process".  Unfortunately, this reverses the initiator and
      target role at the low level, violating much of the
      philosophy of SCSI.

  c.  Have the initiator send a command (similar to AEN) telling
      the target to go ahead and reselect.  This is potentially
      high overhead.

  d.  Allow selection to an existing I/O process (perhaps only
      after initiator DISCONNECT message or under mode page
      control).  However, this may cause problems with RESET
      recovery.

  e.  Add a bit to IDENTIFY informing the target that this
      selection is continuing an I/O process.

  f.  Have the initiator select the existing I/O process then
      send an existing message which is re-interpreted as
      "continue". Candidates are DISCONNECT, INITIATE RECOVERY,
      or RESTORE POINTERS.

  g.  Add a message to continue an existing I/O process.




SUMMARY

This paper makes the case for initiator controlled reselection,
specifically for RAID-3 disk arrays.  It covers the reasons for
this feature, solutions possible using currently defined SCSI,
and some possible extensions to SCSI to handle this situation.

None of these solutions is particularly good or compatible with
the "philosophy" of SCSI.  Unfortunately, the marketplace is more
concerned with cost and buzzwords.  SCSI is low cost and it's the
current "hot" technology.  SCSI products solving the above
problems will be marketed; the question is whether they will use
vendor specific hooks, suffer performance problems, or
unnecessarily complicate the product.