3rd try on T10/97-155R1.TXT

Binford, Charles cbinford at ppdpost.ks.symbios.com
Wed Apr 2 14:51:00 PST 1997

* From the SCSI Reflector (scsi at symbios.com), posted by:
* "Binford, Charles" <cbinford at ppdpost.ks.symbios.com>

Doug,  I sprinkled a few comments below.  I marked them with ** in the first column.

Charles Binford
Symbios Logic

>From: scsi-owner
>To: scsi_world; disk-attach-world
>Cc: hagerman
>Subject: 3rd try on T10/97-155R1.TXT
>Date: Thursday, March 27, 1997 3:35PM
>* From the SCSI Reflector (scsi at symbios.com), posted by:
>* "Doug Hagerman, Digital Equipment, 508-841-2145, Flames to /dev/null
>27-Mar-1997 1542" <hagerman at starch.ENET.dec.com>
>"More Discussion of Tapes in PLDA"                      T10/97-155r1
>1. Introduction
>Here is another attempt at putting together a "what to do about
>tapes on Fibre Channel" proposal. This version is modified based
>on discussion that took place during the SSC meeting during the
>March 1997 T10 week.
>I had wanted to keep the introductory material brief, but it became
>clear at the meeting that most of the confusion is in that material,
>so it's longer in this case.
>Note that there are really two configurations where this topic
>i. Native Fibre Channel tape drives.
>ii. Native Fibre Channel subsystem controllers (e.g. RAID controllers)
>that have the capability of having a tape drive behind them. This
>drive could be a regular SCSI tape drive. Fibre Channel needs to have
>a tape-oriented protocol on the FC connection to the host
>even if no native Fibre Channel tape drive is ever built.
>2. Overview of the Fibre Channel Tape Problem
>Tape devices have different performance requirements than disks.
>The special characteristics of tapes in an FC-AL environment are
>summarized as follows:
>a. If a tape command or data transfer fails on the interconnect, the
>recovery requires more than simply the reissuance of the command.
>The operating system driver software must manage the position of
>the media by issuing a sequence of repositioning commands in addition
>to reissuing the failed I/O command. This code is in SCSI tape
>drivers now, but the mechanical process required to complete the
>recovery may be time consuming.
>Note the distinction between "the application" (user's FORTRAN
>program) and "the driver" (operating system device driver). The user's
>program is not supposed to worry about repositioning after an
>interconnect data error.
>Also note that the driver has two parts: "the class driver" (knows
>about tapes, not interconnects) and "the port driver" (knows about
>interconnects, not tapes). A goal is to keep these 100% distinct.

** The above two paragraphs are good.  This is an area of confusion because 
we all have our own vocabulary when it comes to drivers and such.

>b. Using the SCSI command timeout to detect errors is generally unacceptable
>because the timeout value must be set to a large number (e.g. 10 minutes)
>to enable normal tape device operation. The timeout method may be
>acceptable if the error rate at the physical level is low enough
>so that the timeout is only exercised once or twice a day.

** I'd be happier if the timeout was exercised once or twice a *month*, not a day.

>c. When devices are swapped on an FC-AL loop the loop signal is
>disrupted. It may not be possible to predict when this will occur,
>but in some environments many devices may be swapped in a day.
>d. The FC-AL loop may under normal conditions experience fairly
>frequent random bit errors. A normal parallel SCSI bus experiences
>errors at an extremely low rate--weeks may pass between parity errors.
>It is not known how frequently bit errors will occur on a normally
>operating FC-AL loop. Worst-case calculations indicate that
>hardware complying with the standards may deliver an error bit
>every 10 seconds.
>One may argue what the delivered error rate will be. However, in
>order to minimize risk at the system level, the PLDA profile must
>protect against the worst case. The following is based on
>that assumption.
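
The worst-case figure above can be sanity-checked with a little arithmetic.
The sketch below is illustrative only: it assumes full-speed signalling at
1.0625 Gbaud and a worst-case bit error rate of 1e-12 per link, and it treats
every link in the loop as an independent source of errors. None of these
inputs is stated in the proposal, and the proposal does not say how its own
figure was derived.

```python
# Assumptions (not taken from the proposal): full-speed Fibre Channel
# signalling rate and a worst-case bit error rate of 1e-12 per link.
LINE_RATE_BAUD = 1.0625e9
WORST_CASE_BER = 1e-12


def seconds_between_errors(links_in_loop: int) -> float:
    """Mean time between bit errors seen somewhere on the loop."""
    errors_per_second = LINE_RATE_BAUD * WORST_CASE_BER * links_in_loop
    return 1.0 / errors_per_second


if __name__ == "__main__":
    # Loop sizes are hypothetical examples.
    for links in (1, 16, 126):
        interval = seconds_between_errors(links)
        print(f"{links:3d} links: one bit error every {interval:7.1f} s")
```

Under these assumptions a single link sees an error roughly every 16 minutes,
while a fully populated loop of 126 links sees one in under 10 seconds, which
is the same order of magnitude as the figure quoted above.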
>A secondary goal is to avoid the introduction of Class 2 as a special
>case for tapes. This is particularly important in the case of
>subsystem controllers that must support both disk and tape device
>models. How is the driver to know whether to send a given
>INQUIRY command using Class 2 or Class 3? Must the driver handle
>INQUIRY commands differently from READ or WRITE commands?
>The best place to fix the tape problem is at the FCP level as
>described in PLDA. FC-PH and SCSI are long-established, and changes
>to SCSI driver software or FC-PH hardware are not desirable.
>Furthermore, it has already been agreed by the owner of FCP that FCP
>could be changed if a need can be demonstrated. Small changes to FCP
>and PLDA cause the minimum amount of disturbance to the status quo.
>3. Reliable Tape Transfers to Be Constrained in Size
>My previous contention was: It is widely agreed (not universally) that
>ALL tape transfers may be classified as one of:
>a. Transfers where data integrity is required, and where a maximum
>of 64kBytes will be transferred in any SCSI I/O command, or
>b. Transfers where bulk data is being moved and a data error should be
>ignored, and where the maximum transfer size may be greater than 64kB.
>This contention was rejected by the committee. Therefore any solution
>must handle the case of very long transfers done by a single SCSI command.
>4. Overview of Proposed Solution
>During the meeting the original proposal was modified so as to
>add, for READs, what amounts to an FCP-level acknowledgement for
>every sequence. This can be thought of as an "FCP ACK 1".
>(In FC terminology, ACK 1 is "acknowledge receipt of one frame".
>ACK 0 is "acknowledge receipt of all frames of a sequence".
>ACK n is "acknowledge receipt of n frames".)

** Why not an "FCP ACK 0"?  I don't think the FCP layer should be aware of 
the frame reassembly into sequences.  I don't think it hurts your approach. 
 See additional comments below the READ case for more info.

>A new FCP information unit FCP_CONF is needed to send this
>acknowledgement or confirmation. This allows the initiator to request
>retransmission of data if a transfer fails, and does not involve the
>user's application program in the retransmission.
>Under this proposal, transfers would look like this:
>WRITE: Transfer of "n" DATA sequences. Each DATA below is one sequence.
>Initiator          Target
>FCP_CMD ---------->
>        <---------- FCP_XFR_RDY
>                        The target tells the host how much data it can
>                        accept before another FCP_XFR_RDY will be needed
>                        Say it's two sequences in this example
>DATA  a ---------->     This DATA sequence transferred successfully
>DATA  b ---------->     This DATA sequence transferred successfully
>        <---------- FCP_XFR_RDY
>DATA  c ---------->
>DATA  d -----X.....     Error occurs at "X"
>                        Error is detected by target using sequence count
>                        All further frames are ignored
>                        Target waits RA_TOV to age any pending frames
>        <---------- FCP_XFR_RDY
>                        With offset set back to "c"
>DATA  c ---------->
>DATA  d ---------->
>        <---------- FCP_XFR_RDY
>        .
>        .
>DATA  n ---------->
>        <---------- FCP_RSP
>                        With SCSI status
>                        Target closes exchange
>                        (Small exposure here to lost FCP_RSP frame)
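
The WRITE flow above can be sketched as a small simulation. This is only a
toy model of the diagram, not an implementation of FCP: the function name,
the trace strings, and the idea of numbering sequences 0..n-1 are all
made up for illustration. It assumes, as in the diagram, that a lost sequence
makes the target wait RA_TOV and then reissue FCP_XFR_RDY with the offset set
back to the start of the failed burst.

```python
def simulate_write(num_sequences, burst, fail_once_at):
    """Return an event trace for a WRITE recovered via FCP_XFR_RDY.

    num_sequences: DATA sequences in the whole transfer (hypothetical unit).
    burst:         sequences the target accepts per FCP_XFR_RDY.
    fail_once_at:  sequence numbers whose first transmission is lost.
    """
    trace = ["FCP_CMD ->"]
    pending_failures = set(fail_once_at)
    offset = 0  # first sequence of the burst the target is asking for

    while offset < num_sequences:
        trace.append(f"<- FCP_XFR_RDY offset={offset}")
        end = min(offset + burst, num_sequences)
        burst_ok = True
        for seq in range(offset, end):
            if seq in pending_failures:
                pending_failures.discard(seq)
                trace.append(f"DATA {seq} -X (lost)")
                # Target ignores further frames and ages out stragglers.
                trace.append("target waits RA_TOV")
                burst_ok = False
                break
            trace.append(f"DATA {seq} ->")
        if burst_ok:
            offset = end  # burst accepted, move on
        # On failure the offset is unchanged, so the whole burst is re-sent.

    trace.append("<- FCP_RSP")
    return trace


if __name__ == "__main__":
    # Four sequences (a..d as 0..3), two per FCP_XFR_RDY, "d" lost once.
    for event in simulate_write(4, 2, {3}):
        print(event)
```

Running the example reproduces the shape of the diagram: sequence 2 ("c") is
sent twice because it shares a burst with the lost sequence 3 ("d").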
>READ: Transfer of "n" sequences.
>Initiator          Target
>FCP_CMD ---------->
>                        Host can accept all the data specified in command
>        <---------- a  DATA sequence
>                        Target keeps the sequence data in its buffer
>                        until it receives the confirmation
>                        Each sequence is confirmed by host using FCP_CONF
>        <---------- b  DATA
>        <---------- c  DATA
>        .....X----- d  DATA
>                        Error occurs on loop at "X"
>                        Error is detected by initiator using sequence count
>                        All further frames are ignored
>                        Initiator waits RA_TOV to age any pending frames
>FCP_CONF---------->     Initiator sends "confirm" information unit
>                        Confirm IU requests retransmission of
>                        sequence containing DATA "d"
>        <---------- d  DATA
>                        Target retransmits the data from its buffer--no
>                        extra tape motion required
>        <---------- e  DATA
>        .
>        .
>        <---------- n  DATA
>                        Initiator sends "confirm" information unit
>                        Confirm IU notifies successful receipt of data
>        <---------- FCP_RSP
>                        With SCSI Status
>                        Target closes exchange
>                        Target flushes data buffer
>                        Small exposure here to lost FCP_RSP frame
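
The READ flow above can be sketched the same way. Again this is a toy model
under stated assumptions, not FCP itself: the function name and trace strings
are hypothetical, and the FCP_CONF information unit is modelled as carrying
either "ok" or "retransmit" for one sequence, which is a simplification of
what the proposal describes. The target is assumed to keep each sequence
buffered until it has been confirmed.

```python
def simulate_read(num_sequences, fail_once_at):
    """Return an event trace for a READ confirmed per sequence by FCP_CONF.

    num_sequences: DATA sequences in the whole transfer (hypothetical unit).
    fail_once_at:  sequence numbers whose first transmission is lost.
    """
    trace = ["FCP_CMD ->"]
    pending_failures = set(fail_once_at)
    seq = 0

    while seq < num_sequences:
        # The target holds sequence `seq` in its buffer until confirmed.
        if seq in pending_failures:
            pending_failures.discard(seq)
            trace.append(f"X- DATA {seq} (lost)")
            # Initiator ignores further frames and ages out stragglers.
            trace.append("initiator waits RA_TOV")
            trace.append(f"FCP_CONF -> retransmit {seq}")
            continue  # target re-sends from its buffer, no tape motion
        trace.append(f"<- DATA {seq}")
        trace.append(f"FCP_CONF -> ok {seq}")
        seq += 1

    trace.append("<- FCP_RSP")
    return trace


if __name__ == "__main__":
    # Three sequences, the middle one lost on its first transmission.
    for event in simulate_read(3, {1}):
        print(event)
```

The trace makes the cost visible: every sequence adds one FCP_CONF on the
normal path, and a lost sequence adds one more plus the RA_TOV wait.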

**  Why not put the FCP_CONF on the FCP_RSP also??

>One may argue that this approach optimizes an error path at the cost
>of normal path performance. Whether this is worth it depends entirely
>on the expected rate of low-level errors in the system and the cost
>of managing those errors.
>Another possibility would be to have the drive tell the host how
>many sequences it can keep in its buffer. This would allow grouping
>multiple sequences together to reduce the number of acknowledgements
>while still allowing the drive to do any required retransmissions
>directly out of its buffer. It's more complicated, though...

**  If the FCP_CONF was an "FCP ACK 0", then the tape drive could choose 
sequence sizes based on its buffer requirements.  Instead of an FCP_CONF for 
every 2K, you may have an FCP_CONF for every 64K or 128K or 512K, it all 
depends on the size the target chooses to make the current sequence.  The 
target can change its sequence size dynamically based on current buffer 
usage.  The host just knows that it should send an FCP_CONF each time a 
complete sequence arrives.
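
To put rough numbers on the point about sequence sizes, here is a small
sketch that counts how many FCP_CONFs a single READ would need if one
confirmation is sent per complete sequence. The 1 MB transfer size is an
arbitrary example, and the helper name is made up for illustration.

```python
def confirmations_needed(transfer_bytes: int, sequence_bytes: int) -> int:
    """One FCP_CONF per complete sequence, rounding up for a short tail."""
    return -(-transfer_bytes // sequence_bytes)  # integer ceiling division


if __name__ == "__main__":
    KIB = 1024
    transfer = 1024 * KIB  # hypothetical 1 MB READ
    for seq_kib in (2, 64, 128, 512):
        count = confirmations_needed(transfer, seq_kib * KIB)
        print(f"{seq_kib:4d}K sequences: {count} FCP_CONFs")
```

With these example numbers, 2K sequences need 512 confirmations while 512K
sequences need 2, so the target's choice of sequence size directly sets the
acknowledgement overhead.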

In other words, I don't think it is more complicated (as you suggest above), 
but rather I think it is less complicated.  If I wanted to, I could 
implement the target side "FCP ACK 0" with silicon I am aware of today, but 
not the "FCP ACK 1" solution.  FC silicon usually transmits an entire 

On the host side either method is a bit sticky.  Consider the SCSI Assist of 
Tachyon for example.  It doesn't tell the port driver any data has arrived 
until it all arrives (or a detectable error, e.g. FCP_STATUS before all of 
the data).  If I choose not to use the SCSI Assist I could do "FCP ACK 0", 
but I would probably have to copy the data internally to the application 
buffer (vs. direct DMA in the SCSI Assist case).  I see no way to implement 
"FCP ACK 1" with a Tachyon.

>5. Changes needed to PLDA and FCP
>The following pages and clauses of PLDA contain text that is
>relevant to this proposal. The proposal has a substantial impact
>on PLDA, particularly because the "disk-ness" of PLDA is
>implicit in much of the organization and text of the profile.
>page    clause          item
>----    ------          ----
>25      table 10        Data Overlay Allowed change to Required.
>26      8.2.1           Method of accounting for data must take into
>                        consideration data overlay.
>27      8.2.2           Use of FCP_CONF required for READ.
>27         Relative offset may be managed by Initiator.
>27         "
>33      9.1             ABTS is not to be invoked until after
>                        recovery has been attempted.
>33      9.3ff           Disk behavior must be separated from tape
>                        behavior.
>The following new material is needed in FCP.
>The addition of a new information unit, FCP_CONF, that is used
>to either confirm the successful transfer of READ data or to
>request the re-transmission of failed READ data.
>* For SCSI Reflector information, send a message with
>* 'info scsi' (no quotes) in the message body to majordomo at symbios.com