FCP-2 problem

Jim McGrath Jim.McGrath at quantum.com
Wed Jun 14 08:41:14 PDT 2000


* From the T10 Reflector (t10 at t10.org), posted by:
* Jim McGrath <Jim.McGrath at quantum.com>
*

I don't think your solution works.  Specifically, how does the target know
in this example that the initiator has reused OX_ID, since the target never
received the frame containing the associated command?  As far as it knows,
the REC ELS (which the initiator associates with the second, never received
command) is still associated with the first command.

Conversely, suppose the initiator sends the first command, the target
completes and sends back status (which is then dropped, so the initiator
never sees it), but then rather than sending the second command and then a
REC ELS the initiator just sends a REC ELS?  The target cannot tell the
difference between this and the first sequence of events.
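
To make the ambiguity concrete, here is a minimal Python sketch.  The Frame
type and the seen_by_target helper are hypothetical names used only for
illustration, not anything defined by FCP-2.  The model assumes the target
can act only on initiator frames that actually arrive and on the fields they
carry; a response lost on the way back to the initiator is invisible to it.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    """One initiator-to-target frame (hypothetical model, not a wire format)."""
    kind: str            # "CMD" or "REC"
    ox_id: int
    payload: str = ""
    dropped: bool = False  # True if the frame is lost before reaching the target


def seen_by_target(sent):
    """Return only what the target can observe: frames that arrived."""
    return [(f.kind, f.ox_id, f.payload) for f in sent if not f.dropped]


# First sequence: the second command reuses OX_ID 1 and is dropped,
# then the initiator sends REC for OX_ID 1.
scenario_a = [
    Frame("CMD", 1, "TEST UNIT READY"),
    Frame("CMD", 1, "SPACE", dropped=True),
    Frame("REC", 1),
]

# Second sequence: the status for the first command is lost on the way
# back (not modeled here, the target cannot see that loss), and the
# initiator sends REC for OX_ID 1 without issuing a second command.
scenario_b = [
    Frame("CMD", 1, "TEST UNIT READY"),
    Frame("REC", 1),
]

indistinguishable = seen_by_target(scenario_a) == seen_by_target(scenario_b)
print("target sees the same frames in both cases:", indistinguishable)
```

In both cases the target has seen one command and one REC for OX_ID 1, so
any rule keyed only on what arrives has to treat them identically.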

Maybe my concern is disallowed by another aspect of the error recovery
protocol, or will be handled OK at a higher level of error recovery.  It
just seems dangerous to make assumptions when frames are being dropped and
so the recovery situation is very complicated.

This is why I always advocate a simple, brute force error recovery whenever
possible.  Specifically, the immediate reuse of OX_ID appears to be a very
bad policy to follow.  Normally a window of values is established precisely
to avoid these sorts of problems.  So another solution is to make sure you
do not reuse an OX_ID until you are really sure that this sort of problem
cannot occur.  In this case, using two values in alternation would work OK.
In general use N+1 values if you want N commands outstanding at a time (and
in practice I, being paranoid, would use N+m, where m is >1 to add some
extra margin for strange situations I am too stupid to foresee).
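
A rough sketch of the N+m idea in Python.  The OxIdPool class and its
parameter names are made up for illustration; a real initiator would also
hold a released value back for some recovery timeout, which this sketch does
not model.  It only shows that handing out values in FIFO order from a pool
of N+m means the value just released is never the next one allocated.

```python
from collections import deque


class OxIdPool:
    """Hypothetical OX_ID allocator: N outstanding commands, N+m values."""

    def __init__(self, max_outstanding, margin=1):
        if max_outstanding < 1 or margin < 1:
            raise ValueError("need at least one command slot and one spare value")
        self.max_outstanding = max_outstanding
        # FIFO of free values; a released value goes to the back of the queue.
        self._free = deque(range(max_outstanding + margin))
        self._in_use = set()

    def allocate(self):
        """Take the least recently used free value."""
        if len(self._in_use) >= self.max_outstanding:
            raise RuntimeError("too many outstanding commands")
        ox_id = self._free.popleft()
        self._in_use.add(ox_id)
        return ox_id

    def release(self, ox_id):
        """Return a value to the pool once its exchange is finished."""
        self._in_use.remove(ox_id)
        self._free.append(ox_id)


# One command outstanding at a time (N = 1) with one spare value (m = 1):
pool = OxIdPool(max_outstanding=1, margin=1)
sequence = []
for _ in range(4):
    ox_id = pool.allocate()
    sequence.append(ox_id)
    pool.release(ox_id)
print(sequence)  # the two values are used in alternation
```

With N = 1 and m = 1 this reduces to the two-value alternation above; a
larger margin simply lengthens the time before any given value comes back.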

Jim




-----Original Message-----
From: Baldwin, Dave [mailto:Dave.Baldwin at emulex.com]
Sent: Tuesday, June 13, 2000 8:04 PM
To: Fibre Reflector; T10 Reflector
Cc: Robert Snively (Brocade)
Subject: FCP-2 problem


*
* From the fc reflector, posted by:
* "Baldwin, Dave" <Dave.Baldwin at emulex.com>
*
A serious hole in FCP-2 error recovery has been discovered. I would like
to solicit input on this issue from concerned parties. The problem can
occur in many forms with single or multi-LUN targets. Here is the basic
problem:

Initiator                                        Target

CMD ---------------------------->

1. A command (e.g. Test Unit Ready) is sent to the target with OX_ID =
1.

           <---------------------------     Response

2. A "good" response is sent back to the initiator. The initiator gets
the response and knows the TUR command has been completed, so the
exchange resources are freed. The target has sent the response, so it
saves the exchange information just in case the initiator needs to
recover a dropped response with REC/SRR.

CMD ---------------------------->  X (dropped frame)

3. A new command (e.g. SPACE forward 1 block) is sent to the target with
OX_ID = 1. This OX_ID reuse can occur for many reasons in various
systems. The command never makes it to the target because of a bit
error.

REC ------------------------------>

4. The initiator sends an REC ELS command to the target to make sure all
is well with OX_ID 1.

          <------------------------------   ACC

5. The target sends an ACC to the ELS saying that exchange 1 is complete
and the initiator has sequence initiative. Unfortunately, the target is
talking about the TUR command, while the initiator is talking about the
SPACE command.

SRR -------------------------------->

6. The initiator sends SRR to get the target to resend the response to
the SPACE command that it thinks has been dropped.

         <--------------------------------  ACC

7. The target says OK, I'll resend the response for the TUR command.

          <------------------------------- RSP

8. The target resends the TUR response. The initiator sees a "good"
response (it thinks for the SPACE command), and moves on to the next
command (maybe a WRITE).

The initiator can now write to the wrong block because it thinks the
tape has been properly positioned.
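
The eight steps can be replayed with a toy model in Python.  The Target
class, its method names, and the reply strings are placeholders invented for
this sketch, not FCP-2 payload formats; the only assumption carried over
from the description above is that the target keeps completed exchange
information keyed by OX_ID and answers REC and SRR from it.

```python
class Target:
    """Toy target that retains completed exchanges keyed by OX_ID."""

    def __init__(self):
        self.completed = {}  # ox_id -> (command, response) kept for REC/SRR

    def command(self, ox_id, command):
        """Execute a command that actually arrived and save its response."""
        response = f"GOOD status for {command}"
        self.completed[ox_id] = (command, response)
        return response

    def rec(self, ox_id):
        """Answer REC using only the OX_ID, which is all the target knows."""
        if ox_id in self.completed:
            return "ACC: exchange complete"
        return "REJECT: unknown exchange"  # placeholder reply

    def srr(self, ox_id):
        """Resend the saved response for this OX_ID."""
        return self.completed[ox_id][1]


target = Target()

# Steps 1-2: TUR on OX_ID 1 completes; the initiator frees the exchange.
target.command(1, "TEST UNIT READY")

# Step 3: SPACE reuses OX_ID 1 but the frame is dropped, so the target's
# command() is never called.  Only the initiator knows about it.
initiator_pending = {1: "SPACE"}

# Steps 4-5: REC for OX_ID 1 is answered from the stale TUR record.
rec_reply = target.rec(1)

# Steps 6-8: SRR makes the target resend the TUR response.
resent = target.srr(1)

believed_command = initiator_pending[1]
actual_command = target.completed[1][0]
print("initiator thinks the status is for:", believed_command)
print("target actually reported on:", actual_command)
```

The mismatch between the two printed commands is the hole: nothing in the
REC or SRR exchange, as modeled here, tells the two sides they mean
different commands.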

I have some preliminary thoughts on what might be done to solve this
issue, but none of them involve easy fixes. I was hoping someone might
come up with a simple solution. Any opinions?


I have a suggestion for improving a related FCP-2 behavior:

We need to guard against having several outstanding exchanges with the
same OX_ID from the target's point of view (e.g. 20 tape drives, each an
individual LUN within one target, whose most recently executed exchanges
just happen to have the same OX_ID within the timeout period). Otherwise,
we have recovery issues with REC/SRR because they are not LUN specific (yet
;-)).

I think a good solution is for the target to release all resources
associated with the old command with OX_ID = n (which the target
believes has been completed), when it gets a new OX_ID = n frame in with
a new command (R_CTL = 6). The reuse of the OX_ID by the initiator is a
confirmation that the old command has been completed. Since the target
and initiator both think the old exchange is complete, this should be
sufficient confirmation to get rid of the old information in the target.
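
A minimal sketch of this suggestion in Python. The TargetExchangeTable
class and its method names are hypothetical, and the table is assumed to be
kept per initiator and keyed by OX_ID only. Note that the release is
triggered by the arrival of the new command frame, so it does nothing when
that frame is dropped.

```python
class TargetExchangeTable:
    """Hypothetical per-initiator exchange table on the target."""

    def __init__(self):
        self.active = {}     # ox_id -> command currently executing
        self.completed = {}  # ox_id -> (command, response) kept for REC/SRR

    def on_command(self, ox_id, command):
        """New command frame arrived: drop any old record for this OX_ID."""
        # The initiator reusing the OX_ID is taken as confirmation that
        # the old exchange is finished, so its saved state can be freed.
        self.completed.pop(ox_id, None)
        self.active[ox_id] = command

    def on_response_sent(self, ox_id, response):
        """Response sent: keep the exchange in case REC/SRR is needed."""
        command = self.active.pop(ox_id)
        self.completed[ox_id] = (command, response)


table = TargetExchangeTable()
table.on_command(7, "TEST UNIT READY")
table.on_response_sent(7, "GOOD")
kept_after_response = 7 in table.completed

# A new command with the same OX_ID arrives: the old record is released.
table.on_command(7, "SPACE")
kept_after_reuse = 7 in table.completed
print(kept_after_response, kept_after_reuse)
```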

Best regards,
Dave Baldwin
Emulex Corporation


*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at t10.org



