FCL Error Recovery Policy Recommendations

Jim Mcgrath jmcgrath at QNTM.COM
Mon Sep 29 11:55:11 PDT 1997

* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* jmcgrath at qntm.com (Jim Mcgrath)
     In some private communications the issue has been raised of whether
     we should separate error recovery for active exchanges from that for
     inactive (or closed) exchanges.  It is the latter that gives rise to
     all of the concerns over resource usage.  For active exchanges,
     exchange status in some form is kept anyway.
     This would require a protocol to recover errors that occur when both
     end parties agree the exchange is active - like retransmitting data
     frames during data sequences - but not when one thinks it is closed -
     like status frames.
     Indeed, can we say something as simple as "in a queued environment,
     provide for sequence level recovery for command and data frames,
     but exchange level only recovery for status frames"?

______________________________ Reply Separator _________________________________
Subject: Re[2]: FCL Error Recovery Policy Recommendations
Author:  Jim Mcgrath at MIS2
Date:    9/25/97 3:09 PM

     Thanks for the clarification.  It would be great if we could
     get any response on this topic over the reflector before the T11 
     meeting.  I think it might represent a breakthrough that could 
     greatly simplify our work going forward.
______________________________ Reply Separator _________________________________
Subject: RE: FCL Error Recovery Policy Recommendations
Author:  "Binford; Charles" <cbinford at ppdpost.ks.symbios.com> at SMTP 
Date:    9/24/97 3:39 PM
See comments below.
Charles Binford
Symbios Logic
>From: t10-owner
>To: disk_attach; scsi; fc; jmcgrath
>Subject: FCL Error Recovery Policy Recommendations 
>Date: Wednesday, September 24, 1997 10:24AM
>* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by: 
>* jmcgrath at qntm.com (Jim Mcgrath)
.... stuff deleted .....
>     Second, we created a matrix for error recovery with two dimensions: 
>         the degree of threadedness between target and initiator
>             (no queueing, so only one command outstanding at a time, 
>             and queueing, so multiple commands (and thus errors) are 
>             outstanding at a time), and
>         the level of error recovery
>             (sequence, which could mean data retransmission without 
>             respect to data sequence boundaries, and exchange, or
>             simply retrying the command): 
>     Are IO's queued?   Sequence or Exchange        Comment 
>                        Level Error recovery
>     ----------------------------------------------------------------- 
>     No                 Exchange              FC-AL today
>     No                 Sequence              Tape today - Crossroads 
>                                                  class 3 proposal
>     Yes                Exchange              Extend Crossroads and/or 
>                                                  Doug's class 2
>     Yes                Sequence              NEVER DO 
This chart may be a bit misleading to those who were not on the call.
Taken literally, one could come to the conclusion that the SSWG believes
there is no queuing in FC-AL today.  That, of course, is not true.  Here is
my interpretation of what we said plus some additional commentary:
     Are IO's queued?   Sequence or Exchange        Comment
                        Level Error recovery
1)   No                 Exchange              Simple case of # 3) below 
2)   No                 Sequence              Tape today - Crossroads
                                                  class 3 proposal
3)   Yes                Exchange              Early Error Detect possible 
4)   Yes                Sequence              NEVER DO
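
For anyone who wants the four rows in a form a driver or test script could
consult, here is a minimal sketch.  The names RECOVERY_MATRIX and
is_recommended are made up for illustration, and treating "NEVER DO" as
"not recommended" is my reading of the chart, not wording from the call.

```python
# Keys are (IOs queued?, recovery level); values are the comment column
# from the four numbered rows of the chart.
RECOVERY_MATRIX = {
    (False, "exchange"): "Simple case of 3)",
    (False, "sequence"): "Tape today - Crossroads class 3 proposal",
    (True, "exchange"): "Early Error Detect possible",
    (True, "sequence"): "NEVER DO",
}


def is_recommended(queued, level):
    """Return False only for the combination the chart marks NEVER DO."""
    return RECOVERY_MATRIX[(queued, level)] != "NEVER DO"


if __name__ == "__main__":
    for (queued, level), comment in RECOVERY_MATRIX.items():
        print("queued" if queued else "unqueued", level, "->", comment)
```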
General Commentary:
(sorry if I got a little long winded, couldn't help myself :-) )
The backdrop for this entire discussion (at least for me) is developing a
mechanism to detect and recover from link related errors before a ULP timer
expires.  This issue was brought into focus by the tape guys because of the
length of the ULP timers required for tapes (minutes in some cases).  Beyond
link error detection before the ULP timer expires, tape also needs sequence
level recovery, because a simple retry of the IO (exchange) doesn't work
with a sequential model.  The Crossroads proposal is aimed at solving both
of these problems.  It works because tapes are (today) limited to a queue
depth of 1.
At various times during the discussion of the Crossroads proposal, attempts
have been made to generalize it so us disk and RAID guys could take
advantage of it also.  Even though our ULP timeouts are generally measured
in seconds instead of minutes, the need for link error detection and
recovery before a ULP timeout is still there (at least for some).  However,
the Crossroads proposal just didn't scale to a generalized queued
environment.  What came out of the SSWG call is that the reason it didn't
scale to queued IO was the sequence level recovery.  Targets are required
to hold on to exchange information AFTER completion of the IO in case the
FCP_RSP was lost.  Holding on to exchange info for 2xR_A_TOV is not
possible for a disk/RAID device doing 5,000 to 10,000 IOs/sec.  But if
disk/RAID devices are going to do exchange level recovery anyway, then
holding on to the completed exchange info IS NOT NECESSARY.  For disk/RAID,
"did the FCP_CMD or FCP_RSP get lost" is a don't care.  Just re-issue the
IO.
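
To put rough numbers on the resource concern, here is a back-of-the-envelope
sketch.  It assumes an R_A_TOV of 10 seconds (a commonly used default, not a
figure from this thread) and a hypothetical 64 bytes of saved state per
exchange; both constants are placeholders to adjust.

```python
# Estimate how many completed exchanges a target would have to remember
# if it kept each one for 2 x R_A_TOV after sending FCP_RSP.

R_A_TOV_SECONDS = 10       # assumption: common default, not from this thread
BYTES_PER_EXCHANGE = 64    # assumption: hypothetical saved-state size


def retained_exchanges(ios_per_second, r_a_tov_seconds=R_A_TOV_SECONDS):
    """Completed exchanges held at steady state over a 2 x R_A_TOV window."""
    return ios_per_second * 2 * r_a_tov_seconds


if __name__ == "__main__":
    for rate in (5_000, 10_000):
        count = retained_exchanges(rate)
        megabytes = count * BYTES_PER_EXCHANGE / 1e6
        print(f"{rate} IOs/sec -> {count} exchanges (~{megabytes:.1f} MB)")
```

Under those assumptions the device would be tracking on the order of 100,000
to 200,000 already-completed exchanges at any moment, on top of its active
ones.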
Therefore, # 3) above is suggesting an error detection mechanism similar to
the Crossroads proposal, but with the error recovery side the same as in
PLDA.  In other words, disk/RAID devices could implement the new REC link
service.  If a host adapter/driver wanted to, it could time exchanges at an
interval which would allow link error detection and recovery before a ULP
timeout.  If this new, shorter timer expired, it would issue the REC.  If
the drive answers "yes, I've still got the exchange", then wait.  If the
answer is "don't know about that one", then ABTS and re-issue the IO as in
PLDA.
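
The initiator-side decision described above can be sketched as follows.
This is only an illustration of the flow, not an implementation of REC: the
names RecResult, Action, on_short_timer_expiry and the send_rec callback are
all hypothetical, and the case where the REC itself gets no answer is not
modeled because it is not covered above.

```python
from enum import Enum


class RecResult(Enum):
    EXCHANGE_KNOWN = "known"        # "yes, I've still got the exchange"
    EXCHANGE_UNKNOWN = "unknown"    # "don't know about that one"


class Action(Enum):
    KEEP_WAITING = "wait"
    ABTS_AND_REISSUE = "abts_then_reissue"


def on_short_timer_expiry(send_rec, exchange_id):
    """Decide what to do when the short (pre-ULP-timeout) timer fires.

    send_rec is a caller-supplied function (hypothetical) that issues a REC
    for exchange_id and returns a RecResult.
    """
    result = send_rec(exchange_id)
    if result is RecResult.EXCHANGE_KNOWN:
        # Target is still working on the IO; let it continue.
        return Action.KEEP_WAITING
    # Target has no record of the exchange: abort it and re-issue the IO,
    # the same exchange level recovery as in PLDA.
    return Action.ABTS_AND_REISSUE


if __name__ == "__main__":
    def still_busy(xid):
        return RecResult.EXCHANGE_KNOWN

    def lost(xid):
        return RecResult.EXCHANGE_UNKNOWN

    print(on_short_timer_expiry(still_busy, 0x1234))
    print(on_short_timer_expiry(lost, 0x1234))
```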
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at symbios.com
