FCP-2 problem
Baldwin, Dave
Dave.Baldwin at emulex.com
Wed Jun 14 14:08:50 PDT 2000
* From the T10 Reflector (t10 at t10.org), posted by:
* "Baldwin, Dave" <Dave.Baldwin at emulex.com>
*
Robert,
Just to make it perfectly clear, the solution you listed under my name is NOT
a proposed solution to my first issue! It is a secondary suggestion to resolve
some multi-LUN issues.
See my previous email for why FCP_CONF doesn't fix everything. Also, we are
sending more than 30,000 I/Os per second, so it is very easy to burn through
all 64k OX_IDs within RR_TOV!
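As a rough check of that arithmetic, here is a small sketch. The 30,000 I/Os
per second and the 64k OX_ID space are the figures above. Everything else is
an assumption made for illustration: one OX_ID consumed per I/O, strictly
sequential allocation, and RR_TOV values that are hypothetical placeholders
rather than values taken from any standard or configuration.

```python
# Sketch: how long does it take to cycle through every OX_ID once?
OX_ID_SPACE = 1 << 16   # 64k values in a 16-bit OX_ID
IO_RATE = 30_000        # I/Os per second, the figure quoted in this message


def wrap_time_seconds(io_rate: float, id_space: int = OX_ID_SPACE) -> float:
    """Seconds until sequential allocation comes back to the first OX_ID."""
    return id_space / io_rate


def reuse_within(rr_tov_s: float, io_rate: float = IO_RATE) -> bool:
    """True if an OX_ID can come around again before rr_tov_s has elapsed."""
    return wrap_time_seconds(io_rate) < rr_tov_s


if __name__ == "__main__":
    print(f"wrap time: {wrap_time_seconds(IO_RATE):.2f} s")
    for rr_tov in (1.0, 2.0, 5.0, 10.0):  # hypothetical RR_TOV settings
        print(f"RR_TOV = {rr_tov:4.1f} s -> reuse possible: {reuse_within(rr_tov)}")
```

At this rate the whole OX_ID space turns over in a little over two seconds,
so any RR_TOV longer than that leaves room for an OX_ID to be reused inside
the window.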
Best regards,
Dave Baldwin
Robert Snively wrote:
> Proposed solutions:
>
> BALDWIN:
>
> > I think a good solution is for the target to release all resources
> > associated with the old command with OX_ID = n (which the target
> > believes has been completed), when it gets a new OX_ID = n frame in with
> > a new command (R_CTL = 6). The reuse of the OX_ID by the initiator is a
> > confirmation that the old command has been completed. Since the target
> > and initiator both think the old exchange is complete, this should be
> > sufficient confirmation to get rid of the old information in the target.
>
> McGRATH:
>
> >
> > This is why I always advocate a simple, brute force error recovery whenever
> > possible. Specifically, the immediate reuse of OX_ID appears to be a very
> > bad policy to follow. Normally a window of values is established precisely
> > to avoid these sorts of problems. So another solution is to make sure you
> > do not reuse an OX_ID until you are really sure that this sort of problem
> > cannot occur. In this case, using two values in alternation would work OK.
> > In general use N+1 values if you want N commands outstanding at a time (and
> > in practice I, being paranoid, would use N+m, where m > 1, to add some
> > extra margin for strange situations I am too stupid to foresee).
> >
>
> BINFORD:
>
> > If my memory serves, SRR is FCP specific (i.e. an FC-4 link service,
> > not a generic ELS). As such, it is reasonable to put FCP specific
> > hooks in. The root problem is incorrect identification of the SCSI
> > task to retransmit data on behalf of. It is being misinterpreted
> > because of an alias of the task tag (i.e. the OX_ID). We could add
> > Command Reference Number and LUN to the SRR request payload to avoid
> > the alias problem. Of course this has two drawbacks:
>
> > - requires use of CRN (otherwise not needed if single threaded I/O)
> > - changes payload of SRR which has been stable for quite a while.
>
> AND NOW SNIVELY:
>
> While BS'ing about a similar problem, several things struck my
> mind:
>
> This problem requires class 3 behavior (that is probably a
> good idea anyway, because the complexity of class 2 behavior
> during errors, including full recovery qualifier discarding,
> is pretty significant).
>
> It requires OX_ID reuse to be relatively frequent compared with
> RR_TOV (probably a bad idea anyway, as shown by Jim).
> Fortunately, OX_ID is qualified by D_ID/S_ID, so its bits are not
> a scarce resource and extended periods between reuse are
> easy to achieve.
>
> This can really only occur on operations without data transfers.
> Writes trade RX_IDs during XFER_RDY. Reads trade RX_IDs during
> read data transfer. So it is really only in the no data case that
> you can get a case where an operation was successfully completed
> without getting hit by an REC to perform a recovery before
> a new command reuses the OX_ID.
>
> As a side issue, it becomes a bit trickier if you have lots
> of logical units. That increases the probability of
> encountering a rapid enough turn-over of OX_IDs to
> create a re-use before RR_TOV.
>
> With this all in mind, one possible solution is to require FCP_CONF
> on SCSI commands performing no data transfer in environments
> performing link-level recovery with rapid OX_ID turn-over.
>
>
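The N+m window McGrath describes in the quoted text can be sketched roughly
as follows. This is only an illustration of the idea, not anything taken from
FCP-2 or from a real driver: the class name OxIdWindow, its methods, and the
choice of a FIFO free list are all hypothetical. The FIFO order is what keeps
a just-released OX_ID at the back of the queue, behind at least m - 1 other
free values.

```python
from collections import deque


class OxIdWindow:
    """Hypothetical initiator-side OX_ID pool (illustrative names only)."""

    def __init__(self, n_outstanding: int, margin: int = 1):
        if n_outstanding < 1 or margin < 1:
            raise ValueError("need at least one command and one spare OX_ID")
        self.limit = n_outstanding
        # N + m IDs in total. The free list is FIFO, so the most recently
        # released ID is always the last one to be handed out again.
        self.free = deque(range(n_outstanding + margin))
        self.in_use = set()

    def allocate(self) -> int:
        """Hand out the OX_ID that has been idle the longest."""
        if len(self.in_use) >= self.limit:
            raise RuntimeError("N commands already outstanding")
        ox_id = self.free.popleft()
        self.in_use.add(ox_id)
        return ox_id

    def release(self, ox_id: int) -> None:
        """Return a completed command's OX_ID to the back of the queue."""
        self.in_use.remove(ox_id)
        self.free.append(ox_id)


if __name__ == "__main__":
    window = OxIdWindow(n_outstanding=1, margin=1)
    for _ in range(4):
        ox_id = window.allocate()
        print("using OX_ID", ox_id)
        window.release(ox_id)
```

With N = 1 and m = 1 this reduces to the "two values in alternation" case:
the single outstanding command never gets the OX_ID that was just released.
A larger m only lengthens the gap before a value is reused. It does not by
itself guarantee that the gap exceeds RR_TOV at a given I/O rate.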