[Fwd: FCP-2: Lost FCP_CMND, Unacknowledged classes.]
RogerR at exabyte.com
Thu Jul 11 10:07:20 PDT 2002
* From the T10 Reflector (t10 at t10.org), posted by:
* RogerR at exabyte.com
In addition to what Dave Peterson already answered, I might provide a little
help with the tape case. This probably isn't the answer you would prefer to
hear, but it is what we have to work with these days. (BTW, I'll retain the
T10 reflector on this reply, but I don't subscribe to that mailing list and
won't see replies limited to that side.)
Currently one way the two issues you describe below are distinguished is to
issue a READ POSITION and compare the results to what is expected based upon
the command. The problem with this is that somebody on the initiator side
of things needs to keep track of what the position is. Whoever is keeping
track of the current location pretty much needs to have exclusive control of
the tape device. This means not only device reservations, but reservations
on the logical representation of the device within the host (e.g. an
"exclusive open" from the OS layer).
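A rough sketch of the READ POSITION comparison, just to make the bookkeeping concrete. Everything here is hypothetical: `classify_by_position` is not a driver or OS API, positions are plain integers, the command is assumed to move the tape forward by `count` logical objects (blocks or filemarks, depending on how the device reports position), and the initiator is assumed to have had exclusive control, so nobody else moved the tape.

```python
from enum import Enum
from typing import Tuple


class Outcome(Enum):
    NOT_EXECUTED = "not executed: safe to reissue"
    COMPLETED = "completed: must not reissue"
    PARTIAL = "partially completed: reissue only the remainder"
    UNKNOWN = "position inconsistent with the command: escalate"


def classify_by_position(before: int, count: int, observed: int) -> Tuple[Outcome, int]:
    """Hypothetical helper: compare a READ POSITION result with the expected position.

    Assumes the initiator had exclusive control of the drive (so only this
    command could have moved the tape) and that the command moves forward
    by `count` logical objects (count > 0).
    Returns the outcome and how many objects appear to have been processed.
    """
    done = observed - before
    if done == 0:
        return Outcome.NOT_EXECUTED, 0
    if done == count:
        return Outcome.COMPLETED, count
    if 0 < done < count:
        return Outcome.PARTIAL, done
    return Outcome.UNKNOWN, done
```

For example, if the tracked position was 100 before a WRITE FILEMARKS of 3 and READ POSITION now reports 101, this returns `(Outcome.PARTIAL, 1)`. None of it holds if another initiator or application could have moved the tape in between, which is why the exclusive control matters.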
The weird catch is that even if the two conditions you describe could be
distinguished, some cases where FCP_RSP is lost would still require a READ
POSITION and initiator knowledge of the expected location (or some other way
of determining/establishing position). The reason for this is that some
tape commands can be partially completed, and the information on how much of
the command actually completed is in the lost FCP_RSP. (A facility to
request the status/sense to be resent would be helpful here, but properly
implementing such a facility probably cascades up into the SCSI SAM layer
and might get costly to implement, unless the number of commands to retain
can be significantly limited.)
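For the resend facility, the bounded-retention idea might look roughly like the following. This is purely a hypothetical sketch (no such service is being claimed for FCP-2 or SAM): `ResponseCache`, its methods, and keying entries by an exchange identifier are all invented for illustration. The point is that a fixed `capacity` caps how much status/sense the target has to hold on to.

```python
from collections import OrderedDict
from typing import Optional


class ResponseCache:
    """Hypothetical target-side store of the last `capacity` responses,
    keyed by an exchange identifier, so status/sense could be resent."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[int, bytes]" = OrderedDict()

    def record(self, exchange_id: int, response: bytes) -> None:
        # Store (or refresh) the response; evict the oldest once over capacity.
        self._entries[exchange_id] = response
        self._entries.move_to_end(exchange_id)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def resend(self, exchange_id: int) -> Optional[bytes]:
        # None means the response was already evicted: the initiator is
        # back to READ POSITION and its own position tracking.
        return self._entries.get(exchange_id)
```

A miss leaves the initiator exactly where it is today, so the facility only helps if the retention window comfortably covers the initiator's recovery timers.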
It's not unusual for a lot of this tape error recovery stuff to bubble up to
the higher application levels, simply because it avoids tracking the state
information within the driver. Tracking in the driver is a bit more
complicated, since the driver has to handle the general case, whereas an
application only has to track the state information it cares about.
(Application-specific drivers can help here, but application-specific
drivers frequently break other applications by changing the driver semantics
and refusing to get out the way when other applications are using the
Another common method for recovery on tape is a checkpoint/restart
mechanism. Every once in a while, the application requests unbuffered
filemarks or setmarks, then saves away checkpoint state information.
Whenever a failure occurs, the application backs up and recovers from the
last successful checkpoint. This is almost invariably implemented at
application level, since the application state information is needed for
restarting. (Some OS's provide tools to assist checkpointing.)
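A toy version of the checkpoint/restart loop, under heavy simplifying assumptions: `SimulatedTape` is an in-memory list rather than a real drive, `FILEMARK` stands in for an unbuffered WRITE FILEMARKS, and restarting is modeled as truncating back to the checkpoint position (on a real drive that would be a locate followed by overwriting). All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FILEMARK = object()  # sentinel standing in for a filemark on the medium


@dataclass
class SimulatedTape:
    """Toy in-memory medium; not a real tape interface."""
    objects: list = field(default_factory=list)

    def position(self) -> int:
        return len(self.objects)

    def write(self, obj) -> None:
        self.objects.append(obj)

    def truncate_to(self, pos: int) -> None:
        # Models locating back to `pos` and overwriting everything after it.
        del self.objects[pos:]


@dataclass
class Checkpoint:
    tape_position: int  # position just after the checkpoint filemark
    next_record: int    # first record not yet covered by a checkpoint


class CheckpointingWriter:
    def __init__(self, tape: SimulatedTape, records: List[str], every: int):
        self.tape, self.records, self.every = tape, records, every
        self.last = Checkpoint(tape.position(), 0)

    def run(self, fail_before: Optional[int] = None) -> None:
        for i in range(self.last.next_record, len(self.records)):
            if i == fail_before:
                raise RuntimeError(f"simulated failure before record {i}")
            self.tape.write(self.records[i])
            if (i + 1) % self.every == 0:
                # Unbuffered filemark first, then save the checkpoint state.
                self.tape.write(FILEMARK)
                self.last = Checkpoint(self.tape.position(), i + 1)

    def restart(self) -> None:
        # Back up to the last successful checkpoint and redo everything after it.
        self.tape.truncate_to(self.last.tape_position)
        self.run()


if __name__ == "__main__":
    tape = SimulatedTape()
    writer = CheckpointingWriter(tape, [f"r{i}" for i in range(7)], every=3)
    try:
        writer.run(fail_before=5)
    except RuntimeError:
        writer.restart()
```

The saved state is deliberately application-level (`next_record`): the driver alone has no way of knowing which application record a given tape position corresponds to.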
As I said, these probably aren't the answers you want, but it's kind'a where
things are today.
PS. I'll apologize now if I messed up the quoting levels below, but I
wanted to trim out the stuff I wasn't replying to, and I was actually
working off of Dave's reply email.
PPS. Due to buffering issues, the case you describe with the lost FCP_RSP is
a potential problem even if the command does involve data transfers.
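The timing in the quoted scenario can be modeled very crudely to show why the two cases look the same from the initiator's side. This is a simplified model, not FCP-2 behavior verbatim: transit delays are ignored, FCP_RSP is assumed to be sent the moment the command arrives, and the timer values are illustrative only (the only relation taken from the quoted message is RA_TOVels > RR_TOV). The function names are hypothetical.

```python
from dataclasses import dataclass

REJECT = "reject: Invalid OXID-RXID combination"


@dataclass
class Timers:
    # Illustrative values in seconds, not FCP-2 defaults.
    rec_tov: float
    ra_tov_els: float
    rr_tov: float


def second_rec_arrival(cmd_sent_at: float, t: Timers) -> float:
    # First REC goes out when REC_TOV expires; it times out after RA_TOVels,
    # and the second REC is sent then (transit delays ignored).
    return cmd_sent_at + t.rec_tov + t.ra_tov_els


def rec_reply(cmd_received: bool, rsp_sent_at: float, rec_at: float, t: Timers) -> str:
    """Target's answer to an REC in this simplified model."""
    if not cmd_received:
        # Lost FCP_CMND: the target never opened the exchange.
        return REJECT
    if rec_at <= rsp_sent_at + t.rr_tov:
        # Exchange state is still held for RR_TOV after FCP_RSP was sent.
        return "accept"
    # Lost FCP_RSP and RR_TOV has expired: the state has been discarded.
    return REJECT


if __name__ == "__main__":
    t = Timers(rec_tov=3.0, ra_tov_els=10.0, rr_tov=5.0)
    at = second_rec_arrival(0.0, t)
    # Both print the same reject, so the initiator cannot tell them apart.
    print(rec_reply(False, 0.0, at, t))
    print(rec_reply(True, 0.0, at, t))
```

Since RA_TOVels alone already exceeds RR_TOV, the second REC can never find the exchange in the lost-FCP_RSP case, regardless of the REC_TOV value.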
> Santosh Rao wrote:
> Issue 2
> How should the initiator differentiate between a FCP_RJT from a target due
> to a lost FCP_CMND (the scenario described in Annex C, Fig. C.2) and the
> case where the target has discarded exchange state due to the expiration
> of RR_TOV after sending FCP_RSP.
> In both the above cases, our interpretation of FCP-2 is that the
> initiator will see a FCP_RJT response to the REC with :
> rsn_code = "Logical Error" or "Unable to perform command request"
> rsn_expln = "Invalid OXID-RXID combination"
> In this case, the initiator cannot apply the same error recovery for the
> 2 cases. In the lost FCP_CMND case, the initiator may safely re-issue
> the command. The latter case could occur in the following manner :
> - Initiator issues a command which does not involve data xfer.
> - Target sends FCP_RSP, FCP_RSP is lost.
> - Initiator's REC_TOV timer pops and initiator sends REC.
> - REC times out after RA_TOVels (which is > RR_TOV, for fabric)
> - Initiator aborts REC and issues another REC
> - Target sends FCP_RJT response since it has discarded the exchange
> In the above case, the initiator MUST NOT re-issue the FCP_CMND, since
> this can potentially cause a data corruption with tape devices. (ex :
> re-issuing a SCSI command like SPACE or WRITE FILEMARKS when it had
> previously been executed successfully can cause tape data corruption.)
> Can someone clarify how FCP-2 differentiates these 2 cases? Without
> the ability to differentiate between these 2 cases, the use of SLER in a
> lost FCP_CMND scenario can result in potential data corruption with tape
> devices.
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at t10.org