FCP-2 problem
Robert Snively
rsnively at Brocade.COM
Wed Jun 14 14:02:40 PDT 2000
* From the T10 Reflector (t10 at t10.org), posted by:
* Robert Snively <rsnively at Brocade.COM>
*
Proposed solutions:
BALDWIN:
> I think a good solution is for the target to release all resources
> associated with the old command with OX_ID = n (which the target
> believes has been completed), when it gets a new OX_ID = n frame in
> with a new command (R_CTL = 6). The reuse of the OX_ID by the
> initiator is a confirmation that the old command has been completed.
> Since the target and initiator both think the old exchange is
> complete, this should be sufficient confirmation to get rid of the
> old information in the target.
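
A minimal sketch of the behavior proposed above, assuming the target keeps
per-exchange state in a table keyed by (S_ID, OX_ID). The class name, method
name, and the shape of the stored state are hypothetical and only illustrate
the idea; they are not taken from FCP-2.

```python
from dataclasses import dataclass, field

R_CTL_COMMAND = 0x06  # new command frame, the "R_CTL = 6" case above


@dataclass
class TargetExchangeTable:
    """Hypothetical target-side table of retained exchange state."""

    # Keyed by (initiator S_ID, OX_ID).
    exchanges: dict = field(default_factory=dict)

    def on_frame(self, s_id, ox_id, r_ctl, payload):
        """Handle a frame; return any stale state released by OX_ID reuse."""
        key = (s_id, ox_id)
        if r_ctl != R_CTL_COMMAND:
            return None  # only new commands trigger the release
        # Reuse of the OX_ID is treated as confirmation that the old
        # exchange is finished, so the old information is dropped first.
        stale = self.exchanges.pop(key, None)
        self.exchanges[key] = {"command": payload}
        return stale
```

A second command arriving with the same S_ID and OX_ID returns the old entry
and leaves only the new command in the table.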
McGRATH:
>
> This is why I always advocate a simple, brute force error recovery
> whenever possible. Specifically, the immediate reuse of OX_ID appears
> to be a very bad policy to follow. Normally a window of values is
> established precisely to avoid these sorts of problems. So another
> solution is to make sure you do not reuse an OX_ID until you are
> really sure that this sort of problem cannot occur. In this case,
> using two values in alternation would work OK. In general use N+1
> values if you want N commands outstanding at a time (and in practice
> I, being paranoid, would use N+m, where m is >1, to add some extra
> margin for strange situations I am too stupid to foresee).
>
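
The N+m idea above can be sketched as a first-in, first-out pool on the
initiator side. This is only an illustration under assumed names (OxidPool,
max_outstanding, margin); a real initiator would also have to respect
whatever OX_ID range its hardware supports.

```python
from collections import deque


class OxidPool:
    """Hypothetical initiator-side pool of N + m OX_ID values."""

    def __init__(self, max_outstanding, margin=1):
        self.max_outstanding = max_outstanding
        # N + m values; the oldest released value is always reused first.
        self.free = deque(range(max_outstanding + margin))
        self.in_use = set()

    def allocate(self):
        if len(self.in_use) >= self.max_outstanding:
            raise RuntimeError("too many commands outstanding")
        ox_id = self.free.popleft()
        self.in_use.add(ox_id)
        return ox_id

    def release(self, ox_id):
        self.in_use.remove(ox_id)
        # Append at the tail so this value is the last one to be reused.
        self.free.append(ox_id)
```

With max_outstanding=1 and margin=1 the pool hands out 0, 1, 0, 1, which is
the "two values in alternation" case. A larger margin lengthens the gap
before any value comes around again.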
BINFORD:
> If my memory serves, SRR is FCP specific (i.e. an FC-4 link service,
> not a generic ELS). As such, it is reasonable to put FCP specific
> hooks in. The root problem is incorrect identification of the SCSI
> task to retransmit data on behalf of. It is being misinterpreted
> because of an alias of the task tag (i.e. the OX_ID). We could add
> Command Reference Number and LUN to the SRR request payload to avoid
> the alias problem. Of course this has two drawbacks:
> - requires use of CRN (otherwise not needed if single threaded I/O)
> - changes payload of SRR which has been stable for quite a while.
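
To illustrate why adding CRN and LUN removes the alias, here is a sketch in
which the target checks all three values before retransmitting. The
SrrRequest fields and the find_task helper are hypothetical; this is not the
SRR payload layout defined in FCP-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SrrRequest:
    """Hypothetical SRR contents: OX_ID plus the proposed CRN and LUN."""

    ox_id: int
    crn: int
    lun: int


def find_task(tasks, srr):
    """Return the OX_ID of the matching task, or None on an alias.

    `tasks` maps OX_ID -> (CRN, LUN) for the task the target remembers.
    """
    known = tasks.get(srr.ox_id)
    if known is None:
        return None
    # OX_ID alone is ambiguous after reuse; CRN and LUN must also match.
    return srr.ox_id if known == (srr.crn, srr.lun) else None
```

An SRR that names a reused OX_ID but carries the CRN of the newer command
does not match the older task, so the target does not retransmit on its
behalf.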
AND NOW SNIVELY:
While BS'ing about a similar problem, several things struck my
mind:
This problem requires Class 3 behavior. (That is probably a
good idea anyway, because the complexity of Class 2 behavior
during errors, including full recovery qualifier discarding,
is pretty significant.)
It requires the interval between reuses of an OX_ID to be short compared
with RR_TOV (probably a bad idea anyway, as shown by Jim).
Fortunately, OX_ID is qualified by D_ID/S_ID, so it is not
a resource scarce in bits, and extended periods between reuse are
easy to achieve.
This can really only occur on operations without data transfers.
Writes trade RX_IDs during XFER_RDY. Reads trade RX_IDs during
read data transfer. So it is really only in the no data case that
you can get a case where an operation was successfully completed
without getting hit by an REC to perform a recovery before
a new command reuses the OX_ID.
As a side issue, it becomes a bit trickier if you have lots
of logical units. That increases the probability of
encountering a rapid enough turn-over of OX_IDs to
create a re-use before RR_TOV.
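
As rough arithmetic on the points above: OX_ID is a 16-bit field, and the
sketch below assumes 0xFFFF is kept as the "unassigned" value and that an
initiator cycles through all remaining values in order for one D_ID/S_ID
pair. The command rate and the RR_TOV value passed in are assumptions
supplied by the caller, not figures from this thread.

```python
OX_ID_BITS = 16


def min_reuse_interval(commands_per_second, usable_ids=2**OX_ID_BITS - 1):
    """Seconds before an OX_ID comes around again under round-robin use."""
    return usable_ids / commands_per_second


def reuse_before_rr_tov(commands_per_second, rr_tov_s):
    """True if an OX_ID could be reused within RR_TOV at this command rate."""
    return min_reuse_interval(commands_per_second) < rr_tov_s
```

At an assumed 10,000 commands per second, the full range lasts about 6.5
seconds between reuses of any one value. An initiator that draws from a much
smaller pool shortens that interval proportionally.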
With this all in mind, one possible solution is to require FCP_CONF
on SCSI commands performing no data transfer in environments
performing link-level recovery with rapid OX_ID turn-over.
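
The proposed rule can be written as a simple predicate. This is a sketch
only: the function name and parameters are hypothetical, and "rapid OX_ID
turn-over" is interpreted here as a reuse interval shorter than RR_TOV.

```python
def needs_fcp_conf(data_length, link_level_recovery, reuse_interval_s, rr_tov_s):
    """Decide whether a command should request FCP_CONF under the proposal."""
    no_data = data_length == 0  # only commands with no data transfer
    rapid_turnover = reuse_interval_s < rr_tov_s  # assumed interpretation
    return no_data and link_level_recovery and rapid_turnover
```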