FCP-2 problem

Baldwin, Dave Dave.Baldwin at emulex.com
Tue Jun 13 20:03:36 PDT 2000


* From the T10 Reflector (t10 at t10.org), posted by:
* "Baldwin, Dave" <Dave.Baldwin at emulex.com>
*
A serious hole in FCP-2 error recovery has been discovered. I would like
to solicit input on this issue from concerned parties. The problem can
occur in many forms with single or multi-LUN targets. Here is the basic
problem:

Initiator                                        Target

CMD ---------------------------->

1. A command (e.g. Test Unit Ready) is sent to the target with OX_ID = 1.

            <---------------------------     Response

2. A "good" response is sent back to the initiator. The initiator gets
the response and knows the TUR command has been completed, so the
exchange resources are freed. The target has sent the response, so it
saves the exchange information just in case the initiator needs to
recover a dropped response with REC/SRR.

CMD ---------------------------->  X (dropped frame)

3. A new command (e.g. SPACE forward 1 block) is sent to the target with
OX_ID = 1. This OX_ID reuse can occur for many reasons in various
systems. The command never makes it to the target because of a bit
error.

REC ------------------------------>

4. The initiator sends an REC ELS command to the target to make sure all
is well with OX_ID 1.

           <------------------------------   ACC

5. The target sends an ACC to the ELS saying that exchange 1 is complete
and the initiator has sequence initiative. Unfortunately, the target is
talking about the TUR command, while the initiator is talking about the
SPACE command.

SRR -------------------------------->

6. The initiator sends SRR to get the target to resend the response to
the SPACE command that it thinks has been dropped.

          <--------------------------------  ACC

7. The target says OK, I'll resend the response for the TUR command.

           <------------------------------- RSP

8. The target resends the TUR response. The initiator sees a "good"
response (it thinks for the SPACE command), and moves on to the next
command (maybe a WRITE).

The initiator can now write to the wrong block because it thinks the
tape has been properly positioned.
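
The sequence above can be sketched as a small toy model. This is only an
illustration under simplifying assumptions: the target keys its retained
exchange state on OX_ID alone (S_ID and RX_ID are ignored), and the names
ToyTarget and run_scenario are hypothetical, not taken from FCP-2.

```python
class ToyTarget:
    """Hypothetical target that retains completed exchanges by OX_ID only."""

    def __init__(self):
        # OX_ID -> record of the last completed command on that exchange
        self.retained = {}

    def command(self, ox_id, name):
        # Steps 1-2: execute the command, send a good response, and keep
        # the exchange information in case REC/SRR recovery is needed.
        self.retained[ox_id] = {"command": name, "status": "GOOD"}
        return self.retained[ox_id]

    def rec(self, ox_id):
        # Step 5: the target can only report on the exchange it knows.
        return "complete" if ox_id in self.retained else "unknown"

    def srr(self, ox_id):
        # Steps 7-8: resend the response for the exchange the target retained.
        return self.retained[ox_id]


def run_scenario():
    target = ToyTarget()
    target.command(1, "TEST UNIT READY")   # steps 1-2
    initiator_pending = {1: "SPACE"}       # step 3: frame dropped, target never sees it
    rec_state = target.rec(1)              # steps 4-5
    response = target.srr(1)               # steps 6-8
    # The initiator matches the response to its own pending command by OX_ID.
    return initiator_pending[1], rec_state, response["command"], response["status"]
```

In this model run_scenario() returns the initiator's pending command
(SPACE) alongside a "GOOD" status that the target generated for TEST UNIT
READY, which is the mismatch described in step 8.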

I have some preliminary thoughts on what might be done to solve this
issue, but none of them involve easy fixes. I was hoping someone might
come up with a simple solution. Any opinions?


I have a suggestion for improving a related FCP-2 behavior:

We need to guard against having several outstanding exchanges with the
same OX_ID from the target's point of view (for example, 20 tape drives,
each an individual LUN within one target, whose most recently executed
exchanges just happen to have the same OX_ID within the timeout period).
Otherwise, we have recovery issues with REC/SRR because they are not LUN
specific (yet ;-)).

I think a good solution is for the target to release all resources
associated with the old command with OX_ID = n (which the target
believes has been completed) when it receives a frame carrying a new
command (R_CTL = 6) with the same OX_ID = n. The reuse of the OX_ID by
the initiator is a confirmation that the old command has been completed.
Since the target and initiator both think the old exchange is complete,
this should be sufficient confirmation to get rid of the old information
in the target.
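
A minimal sketch of this release rule, assuming a single initiator and a
target that keys retained state on OX_ID alone. The class and method names
are hypothetical; only the R_CTL = 6 check comes from the suggestion above.

```python
FCP_CMND_R_CTL = 0x06  # R_CTL value for a new command frame, as cited above


class ReleasingTarget:
    """Hypothetical target that frees stale exchange state on OX_ID reuse."""

    def __init__(self):
        # OX_ID -> name of the completed command retained for REC/SRR
        self.retained = {}

    def complete(self, ox_id, name):
        # Response sent; keep the exchange information for possible recovery.
        self.retained[ox_id] = name

    def on_frame(self, r_ctl, ox_id):
        # A new command reusing OX_ID = n is treated as confirmation that
        # the old exchange n is finished, so its resources are released.
        if r_ctl == FCP_CMND_R_CTL:
            return self.retained.pop(ox_id, None)
        return None


def demo():
    target = ReleasingTarget()
    target.complete(1, "TEST UNIT READY")
    released = target.on_frame(FCP_CMND_R_CTL, 1)  # new command, same OX_ID
    return released, dict(target.retained)
```

Here demo() returns the released TEST UNIT READY record and an empty
retained table. Note that in this sketch the rule only fires when the new
command frame actually reaches the target.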

Best regards,
Dave Baldwin
Emulex Corporation



*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at t10.org



