FCP-2 problem

Jim McGrath Jim.McGrath at quantum.com
Wed Jun 14 11:02:41 PDT 2000


* From the T10 Reflector (t10 at t10.org), posted by:
* Jim McGrath <Jim.McGrath at quantum.com>
*

Of course, one way is to limit the number of total outstanding commands to
everyone to one half of the maximum number supported by OX_ID (that way when
you wrap around you will get a different OX_ID - it is the "n+1" case I
cited earlier).  I know this "wastes" OX_IDs.  But while I forget how big
OX_ID is, unless we were very tight with the bits I think that approach
would give you enough IDs to keep enough stuff going for performance.
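
To make the idea concrete, here is a minimal sketch of such an allocator. It assumes a 16-bit OX_ID with FFFFh set aside as "unassigned" (both assumptions, not taken from this thread), and the class and method names are hypothetical. Freed IDs go to the back of a FIFO, and the outstanding count is capped at half the ID space, so a just-released OX_ID is never the next one handed out.

```python
from collections import deque

OX_ID_BITS = 16        # assumption: OX_ID is a 16-bit field
UNASSIGNED = 0xFFFF    # assumption: FFFFh means "unassigned" and is never issued


class OxIdPool:
    """Hypothetical initiator-side OX_ID allocator (illustration only)."""

    def __init__(self, id_space=1 << OX_ID_BITS):
        ids = [i for i in range(id_space) if i != UNASSIGNED]
        self.free = deque(ids)                 # FIFO: oldest-freed ID is reused first
        self.max_outstanding = len(ids) // 2   # cap at half of the usable IDs
        self.in_use = set()

    def allocate(self):
        """Return a fresh OX_ID, or None if the outstanding cap is reached."""
        if len(self.in_use) >= self.max_outstanding:
            return None
        ox_id = self.free.popleft()
        self.in_use.add(ox_id)
        return ox_id

    def release(self, ox_id):
        """Return an OX_ID to the back of the queue once its exchange ends."""
        self.in_use.remove(ox_id)
        self.free.append(ox_id)
```

The cost is the one already mentioned: half of the ID space is never outstanding at any moment, which only matters if the initiator really needs that many commands in flight.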

My other bias in these sorts of situations is to try and place the
corrective action on the party (initiator or target) that requests the
change - especially if the situation is one that only a few devices (in
this case initiators) will ever see anyway. Indeed, I'd probably do the
above solution if I were in the designer's shoes for the initiator, to make
sure the problem is fixed, and just see if there are any practical side
effects (like a performance drop).

Note that this bias has the benefit that no one has to convince anyone else
to do something to fix their problem - they can just fix it themselves.
That saves a lot of time in committee meetings :-).

Jim


-----Original Message-----
From: Baldwin, Dave [mailto:Dave.Baldwin at emulex.com]
Sent: Wednesday, June 14, 2000 10:42 AM
To: Jim McGrath
Cc: Binford, Charles; Fibre Reflector; T10 Reflector
Subject: Re: FCP-2 problem


Jim,

I'm sorry I confused you by putting two issues in the same email. I wasn't
proposing a solution to the first problem, only asking for input.

I had already considered an OX_ID reuse scheme, and a scheme to use CRN with
a modified REC and SRR payload. I suspect both solutions have vendors that
can't implement these behaviors.

The OX_ID reuse is not easily controlled. From the initiator's perspective,
it is not immediately reusing the OX_ID. You send OX_ID = 1 to a tape
device, then send thousands of commands to other disk devices on the
network, and when OX_ID = 1 comes up for reuse it gets sent to the tape
device causing the problem. So, you would need to keep track of OX_ID use
on a per LUN basis and put other target restrictions in (for multi-LUN
devices) like I suggested in the second part of my email. Doing the "n + 1"
reuse policy is even uglier from the initiator's perspective, but I can see
why a target implementation would vote for this solution (no work to do!).
I don't think the performance degradation in the initiator would be
acceptable.
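
The scenario above can be sketched as follows. This is only an illustration under stated assumptions: the tracker class, its method name, and the (target, LUN) key are hypothetical, and the sketch does not model which IDs are still in flight. It shows a single wrapping counter shared by all devices, plus the extra per-LUN bookkeeping needed so that the wrapped value is not sent back to the same LUN.

```python
class PerLunOxIdTracker:
    """Hypothetical sketch: global wrapping OX_ID counter with per-LUN memory."""

    def __init__(self, id_space=0xFFFF):
        self.id_space = id_space
        self.next_id = 0
        self.last_sent = {}   # (target, lun) -> OX_ID most recently sent there

    def next_for(self, target, lun):
        """Pick the next OX_ID, skipping the one last used with this LUN."""
        key = (target, lun)
        candidate = self.next_id
        if self.last_sent.get(key) == candidate:
            # Same OX_ID would alias the previous command to this LUN.
            candidate = (candidate + 1) % self.id_space
        self.next_id = (candidate + 1) % self.id_space
        self.last_sent[key] = candidate
        return candidate


# Tiny ID space so the wrap-around is easy to see.
tracker = PerLunOxIdTracker(id_space=4)
tape_first = tracker.next_for("tape", 0)                      # 0
disk_ids = [tracker.next_for("disk", 0) for _ in range(3)]    # 1, 2, 3
tape_second = tracker.next_for("tape", 0)                     # counter wrapped to 0, skipped
```

Without the `last_sent` check, `tape_second` would come out equal to `tape_first`, which is exactly the case that confuses the tape device.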

The CRN solution seems better to me, but requires changing the REC and SRR
ELS commands that have been implemented for a while. It requires driver,
firmware, and in some cases hardware changes to implement. Identifying the
exact command to perform recovery on seems very important to me.
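
As a rough illustration of why the extra fields help, here is a sketch of target-side lookup with and without CRN and LUN. It does not model the real REC or SRR payload layouts; the `Task` record, the table class, and both lookup methods are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    ox_id: int
    lun: int
    crn: int   # Command Reference Number carried with the command


class TargetTaskTable:
    """Hypothetical target-side table of commands actually received."""

    def __init__(self):
        self.by_ox_id = {}

    def command_received(self, task):
        self.by_ox_id[task.ox_id] = task

    def match_by_ox_id(self, ox_id):
        """Lookup on OX_ID alone: a reused OX_ID aliases the old task."""
        return self.by_ox_id.get(ox_id)

    def match_with_crn(self, ox_id, lun, crn):
        """Lookup on OX_ID plus LUN and CRN: a stale match is rejected."""
        task = self.by_ox_id.get(ox_id)
        if task is not None and (task.lun, task.crn) == (lun, crn):
            return task
        return None


table = TargetTaskTable()
table.command_received(Task(ox_id=1, lun=0, crn=7))   # first command arrives
# A second command (OX_ID 1, LUN 0, CRN 8) is lost and never recorded.
aliased = table.match_by_ox_id(1)                     # wrongly finds the first task
checked = table.match_with_crn(1, lun=0, crn=8)       # None: no such command seen
```

With OX_ID alone the recovery request lands on the wrong task; with CRN and LUN the target can at least tell that it never saw the command being asked about.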

Does anyone see a simpler solution?

Best regards,
Dave Baldwin
Emulex Corporation

"Binford, Charles" wrote:

> *
> * From the fc reflector, posted by:
> * "Binford, Charles" <cbinford at lsil.com>
> *
> I agree with Jim that immediate OX_ID reuse by the initiator is bad.
> However, as Dave said in his original posting, the OX_ID may have been
> reused for a wide variety of reasons.  I'd support the 'must use n+1
> OX_IDs' solution, but let me throw out another possibility in case
> others object to the OX_ID restriction.
>
> If my memory serves, SRR is FCP specific (i.e. an FC-4 link service, not
> a generic ELS).  As such, it is reasonable to put FCP specific hooks in.
> The root problem is incorrect identification of the SCSI task to
> retransmit data on behalf of.  It is being misinterpreted because of an
> alias of the task tag (i.e. the OX_ID).  We could add Command Reference
> Number and LUN to the SRR request payload to avoid the alias problem.
> Of course this has two drawbacks:
> - requires use of CRN (otherwise not needed if single threaded I/O)
> - changes payload of SRR which has been stable for quite a while.
>
> Charles Binford
> LSI Logic Storage Systems
> (316) 636-8566
>
> -----Original Message-----
> From: Jim McGrath [mailto:Jim.McGrath at quantum.com]
> Sent: Wednesday, June 14, 2000 10:41 AM
> To: 'Baldwin, Dave'; Fibre Reflector; T10 Reflector
> Cc: Robert Snively (Brocade)
> Subject: RE: FCP-2 problem
>
> * From the T10 Reflector (t10 at t10.org), posted by:
> * Jim McGrath <Jim.McGrath at quantum.com>
> *
>
> I don't think your solution works.  Specifically, how does the target
> know in this example that the initiator has reused OX_ID, since the
> target never received the frame containing the associated command?  As
> far as it knows, the REC ELS (which the initiator associates with the
> second, never received command) is associated still with the first
> command.
>
> Conversely, suppose the initiator sends the first command, the target
> completes and sends back status (which is then dropped, so the initiator
> never sees it), but then rather than sending the second command and then
> a REC ELS the initiator just sends a REC ELS?  The target cannot tell the
> difference between this and the first sequence of events.
>
> Maybe my concern is disallowed by another aspect of the error recovery
> protocol, or will be handled OK at a higher level of error recovery.  It
> just seems dangerous to make assumptions when frames are being dropped
> and so the recovery situation is very complicated.
>
> This is why I always advocate a simple, brute force error recovery
> whenever possible.  Specifically, the immediate reuse of OX_ID appears to
> be a very bad policy to follow.  Normally a window of values is
> established precisely to avoid these sorts of problems.  So another
> solution is to make sure you do not reuse an OX_ID until you are really
> sure that this sort of problem cannot occur.  In this case, using two
> values in alternation would work OK.  In general use N+1 values if you
> want N commands outstanding at a time (and in practice I, being paranoid,
> would use N+m, where m is >1 to add some extra margin for strange
> situations I am too stupid to foresee).
>
> Jim
>
*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at t10.org