FCP-2 problem

Robert Snively rsnively at Brocade.COM
Wed Jun 14 14:02:40 PDT 2000


* From the T10 Reflector (t10 at t10.org), posted by:
* Robert Snively <rsnively at Brocade.COM>
*
Proposed solutions:

BALDWIN:

>  I think a good solution is for the target to release all resources
>  associated with the old command with OX_ID = n (which the target
>  believes has been completed), when it gets a new OX_ID = n frame in
>  with a new command (R_CTL = 6). The reuse of the OX_ID by the
>  initiator is a confirmation that the old command has been completed.
>  Since the target and initiator both think the old exchange is
>  complete, this should be sufficient confirmation to get rid of the
>  old information in the target.
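
A minimal sketch of the behavior proposed in the quoted text, assuming the target keys its retained exchange state by (S_ID, OX_ID). The class names, fields, and the `released` list are hypothetical and only illustrate the idea; they are not taken from FCP-2.

```python
from dataclasses import dataclass, field

R_CTL_COMMAND = 0x06  # the "R_CTL = 6" new-command frame mentioned above


@dataclass
class ExchangeContext:
    """Hypothetical state a target keeps for one exchange."""
    s_id: int
    ox_id: int
    cdb: bytes
    completed: bool = False


@dataclass
class Target:
    exchanges: dict = field(default_factory=dict)  # (s_id, ox_id) -> context
    released: list = field(default_factory=list)   # contexts freed on reuse

    def complete(self, s_id, ox_id):
        # The target believes the command is done but keeps its state around.
        self.exchanges[(s_id, ox_id)].completed = True

    def on_frame(self, s_id, ox_id, r_ctl, cdb=b""):
        key = (s_id, ox_id)
        if r_ctl != R_CTL_COMMAND:
            # Not a new command: look up whatever exchange the key names.
            return self.exchanges.get(key)
        old = self.exchanges.get(key)
        if old is not None:
            if not old.completed:
                # Outside the proposal: it only covers completed exchanges.
                raise ValueError("OX_ID reused while exchange still open")
            # Reuse of the OX_ID is taken as the initiator's confirmation
            # that the old command completed, so its resources are released.
            self.released.append(old)
        self.exchanges[key] = ExchangeContext(s_id, ox_id, cdb)
        return self.exchanges[key]
```

With this in place, a later recovery request naming the reused OX_ID can only find the new command's context, since the old one has already been dropped.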

McGRATH:

>  
>  This is why I always advocate a simple, brute force error recovery
>  whenever possible.  Specifically, the immediate reuse of OX_ID
>  appears to be a very bad policy to follow.  Normally a window of
>  values is established precisely to avoid these sorts of problems.
>  So another solution is to make sure you do not reuse an OX_ID until
>  you are really sure that this sort of problem cannot occur.  In this
>  case, using two values in alternation would work OK.  In general use
>  N+1 values if you want N commands outstanding at a time (and in
>  practice I, being paranoid, would use N+m, where m is >1, to add
>  some extra margin for strange situations I am too stupid to foresee).
>  
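
The N+m window described above can be sketched as a first-in, first-out pool of OX_ID values. This is only an illustration under assumed names (`OxIdPool`, `max_outstanding`, `margin`); it is not an interface defined by any standard.

```python
from collections import deque


class OxIdPool:
    """Hypothetical initiator-side pool of N + m OX_ID values."""

    def __init__(self, max_outstanding, margin=1, base=0):
        if max_outstanding < 1 or margin < 1:
            raise ValueError("need at least one command and one spare value")
        self.max_outstanding = max_outstanding
        # N + m values, handed out oldest-released-first.
        self._free = deque(range(base, base + max_outstanding + margin))
        self._in_use = set()

    def allocate(self):
        if len(self._in_use) >= self.max_outstanding:
            raise RuntimeError("too many commands outstanding")
        ox_id = self._free.popleft()
        self._in_use.add(ox_id)
        return ox_id

    def release(self, ox_id):
        self._in_use.remove(ox_id)
        # Returned values go to the back of the queue, so at least
        # `margin` other values are issued before this one comes up again.
        self._free.append(ox_id)


# One command outstanding with one spare value: the two values alternate.
pool = OxIdPool(max_outstanding=1, margin=1)
sequence = []
for _ in range(4):
    ox_id = pool.allocate()
    sequence.append(ox_id)
    pool.release(ox_id)
```

Because released values rejoin the queue at the back, a larger `margin` directly lengthens the gap between two uses of the same OX_ID.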

BINFORD:

> If my memory serves, SRR is FCP specific (i.e. an FC-4 link service, 
> not a generic ELS).  As such, it is reasonable to put FCP specific  
> hooks in.  The root problem is incorrect identification of the SCSI  
> task to retransmit data on behalf of.  It is being misinterpreted  
> because of an alias of the task tag (i.e. the OX_ID).  We could add  
> Command Reference Number and LUN to the SRR request payload to avoid  
> the alias problem.  Of course this has two drawbacks:

>   - requires use of CRN (otherwise not needed if single threaded I/O) 
>   - changes payload of SRR which has been stable for quite a while. 
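
To illustrate the aliasing argument in the quoted text, here is a rough sketch of how a target might match an SRR against its tasks if the request also carried LUN and CRN. The record types and field names are hypothetical and do not represent the actual SRR payload layout.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical record of a SCSI task the target still remembers."""
    ox_id: int
    lun: int
    crn: int  # Command Reference Number


@dataclass(frozen=True)
class SrrRequest:
    """Hypothetical SRR contents; lun and crn are the proposed additions."""
    ox_id: int
    lun: int
    crn: int


def find_task_for_srr(tasks: Iterable[Task], srr: SrrRequest) -> Optional[Task]:
    # Matching on OX_ID alone would pick whichever task currently owns the
    # tag; adding LUN and CRN distinguishes the old task from a new one
    # that reuses the same OX_ID.
    for task in tasks:
        if (task.ox_id, task.lun, task.crn) == (srr.ox_id, srr.lun, srr.crn):
            return task
    return None


old_task = Task(ox_id=5, lun=0, crn=1)
new_task = Task(ox_id=5, lun=0, crn=2)  # same OX_ID, later command
match = find_task_for_srr([new_task, old_task], SrrRequest(ox_id=5, lun=0, crn=1))
```

Here `match` is the old task even though the new task shares its OX_ID, which is the disambiguation the extra fields are meant to provide.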


AND NOW SNIVELY:

While BS'ing about a similar problem, several things struck my
mind:

	This problem requires class 3 behavior (that is probably a
	good idea anyway, because the complexity of class 2 behavior
	during errors, including full recovery qualifier discarding,
	is pretty significant)

	It requires OX_ID re-use to be relatively frequent compared with
	RR_TOV (probably a bad idea anyway, as shown by Jim.)  
	Fortunately, OX_ID is qualified by D_ID/S_ID, so it is not
	a resource scarce in bits and extended periods between reuse are
	easy to achieve.

	This can really only occur on operations without data transfers.
	Writes trade RX_IDs during XFER_RDY.  Reads trade RX_IDs during
	read data transfer.  So it is really only in the no data case that
	you can get a case where an operation was successfully completed
	without getting hit by an REC to perform a recovery before 
	a new command reuses the OX_ID.

	As a side issue, it becomes a bit trickier if you have lots
	of logical units.  That increases the probability of
	encountering a rapid enough turn-over of OX_IDs to 
	create a re-use before RR_TOV.  

With this all in mind, one possible solution is to require FCP_CONF
on SCSI commands performing no data transfer in environments 
performing link-level recovery with rapid OX_ID turn-over.
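
The conditions above can be written down as a simple predicate. This is only a sketch: the function and parameter names are made up for illustration, and "rapid turn-over" is interpreted here as an OX_ID being reused sooner than RR_TOV.

```python
def requires_fcp_conf(data_length, link_level_recovery,
                      min_ox_id_reuse_interval_s, rr_tov_s):
    """Return True when the suggested rule would ask for FCP_CONF.

    All parameters are hypothetical inputs for illustration:
    data_length is the command's transfer length in bytes, and the two
    time values are in seconds.
    """
    no_data_transfer = data_length == 0
    rapid_turnover = min_ox_id_reuse_interval_s < rr_tov_s
    return no_data_transfer and link_level_recovery and rapid_turnover
```

Commands that move data are left out because, as noted above, the exchange of RX_IDs during the transfer already gives recovery a chance to run before the OX_ID is reused.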

	
*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at t10.org



