[T11.3] Re: [Fwd: FCP-2: Lost FCP_CMND, Unacknowledged classes.]

Santosh Rao santoshr at cup.hp.com
Thu Jul 11 11:09:15 PDT 2002


INCITS T11.3 Mail Reflector
********************************
Roger,

Thanks for your response. A couple of comments below.

> PPS. Due to buffering issues, the case you describe with the lost
> FCP_RSP is a potential problem even if the command does involve data
> transfers.

The reason I didn't bring up the above case is that it is usually
possible for the initiator HBA driver to track (based on its SCSI
exchange state) whether some bytes were transferred as part of this
exchange for those I/Os that involve a data transfer. Thus, a lost
FCP_RSP for an exchange that involved data transfer can be distinguished
from a lost FCP_CMND case based on the fact that the SCSI exchange state
within the driver indicates some bytes were transferred.

However, if all the FCP_DATA IUs and the FCP_RSP IU were lost, then the
scenario discussed below does apply.
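
The driver-side check described above can be sketched roughly as follows.
This is only an illustration under stated assumptions: ExchangeState,
Verdict and classify_rejected_rec are hypothetical names, not part of FCP-2
or of any real HBA driver, and the sketch assumes the driver keeps a
per-exchange count of FCP_DATA bytes.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    RESPONSE_LOST = "response lost; do not re-issue"
    AMBIGUOUS = "cannot tell lost command from lost response"


@dataclass
class ExchangeState:
    # Hypothetical per-exchange bookkeeping kept by the initiator driver.
    expects_data: bool       # the command involves a data transfer
    bytes_transferred: int   # FCP_DATA bytes moved so far on this exchange


def classify_rejected_rec(state: ExchangeState) -> Verdict:
    """Classify an exchange whose REC drew an FCP_RJT with
    "Invalid OXID-RXID combination"."""
    if state.expects_data and state.bytes_transferred > 0:
        # Data moved, so the target must have received the command;
        # what went missing is the response.
        return Verdict.RESPONSE_LOST
    # No data phase, or every FCP_DATA IU was lost along with FCP_RSP:
    # the driver state looks the same as for a lost FCP_CMND.
    return Verdict.AMBIGUOUS
```

Only the first branch is safe to act on from driver state alone; the
AMBIGUOUS result is the case this thread is about.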

> The weird catch is that even if the two conditions you describe could
> be distinguished, some cases where FCP_RSP is lost would still require
> a READ POSITION and initiator knowledge of the expected location (or
> some other way of determining/establishing position).  The reason for
> this is that some tape commands can be partially completed, and the
> information on how much of the command actually completed is in the
> lost FCP_RSP. 

If we could distinguish between a lost FCP_CMND and a lost FCP_RSP (by
increasing RR_TOV or using another timer for the discard of exchange
state, such that the target would not discard state information), then
the REC response should indicate the exchange state, giving information
equivalent to that conveyed in FCP_RSP.

In the case where FCP-2 SLER is unsuccessful, the driver returns an
error on that I/O to the SCSI ULP and the error is propagated to the
tape application which may choose to issue READ POSITION or other
commands to determine the state of the tape device.
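
As a rough sketch of that application-level check, assuming the
application knows the position before the command and the position it
expected afterwards: the names Outcome and classify_by_position are
hypothetical, and positions are treated as plain integers (for example
logical block numbers reported by READ POSITION).

```python
from enum import Enum


class Outcome(Enum):
    NOT_EXECUTED = "not executed"
    COMPLETED = "completed"
    PARTIAL_OR_UNKNOWN = "partial or unknown"


def classify_by_position(before: int, expected_after: int, reported: int) -> Outcome:
    """Compare the position reported by READ POSITION with what the
    application expected for the command whose status was lost."""
    if before == expected_after:
        # The command was not expected to move the tape, so the
        # position cannot tell us whether it ran.
        return Outcome.PARTIAL_OR_UNKNOWN
    if reported == expected_after:
        return Outcome.COMPLETED
    if reported == before:
        return Outcome.NOT_EXECUTED
    # Anywhere else: the command may have completed only in part.
    return Outcome.PARTIAL_OR_UNKNOWN
```

This only works if the application has exclusive control of the device,
since any other user moving the tape invalidates the expected positions.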

Thanks,
Santosh


> RogerR at exabyte.com wrote:
> 
> In addition to what Dave Peterson already answered, I might provide a
> little help with the tape case.  This probably isn't the answer you
> would prefer to hear, but it is what we have to work with these days.
> (BTW. I'll retain the T10 reflector on this reply, but I don't
> subscribe to that mailing list and won't see replies limited to that
> side.)
> 
> Currently one way the two issues you describe below are distinguished
> is to issue a READ POSITION and compare the results to what is
> expected based upon the command.  The problem with this is that
> somebody on the initiator side of things needs to keep track of what
> the position is.  Whoever is keeping track of the current location
> pretty much needs to have exclusive control of the tape device.  This
> means not only device reservations, but reservations on the logical
> representation of the device within the host (e.g. an "exclusive open"
> from the OS layer).
> 
> The weird catch is that even if the two conditions you describe could
> be distinguished, some cases where FCP_RSP is lost would still require
> a READ POSITION and initiator knowledge of the expected location (or
> some other way of determining/establishing position).  The reason for
> this is that some tape commands can be partially completed, and the
> information on how much of the command actually completed is in the
> lost FCP_RSP.  (A facility to request the status/sense to be resent
> would be helpful here, but properly implementing such a facility probably
> cascades up into the SCSI SAM layer and might get costly to implement;
> unless, the number of commands to retain can be significantly
> limited.)
> 
> It's not unusual for a lot of this tape error recovery stuff to bubble
> up to the higher application levels, simply because it avoids tracking
> the state information within the driver.  Tracking in the driver is a
> bit more complicated, since the driver has to handle the general case;
> whereas, an application only has to track the state information it
> cares about.  (Application-specific drivers can help here, but
> application-specific drivers frequently break other applications by
> changing the driver semantics and refusing to get out of the way when
> other applications are using the device.)
> 
> Another common method for recovery on tape is a checkpoint/restart
> mechanism.  Every once in a while, the application requests unbuffered
> filemarks or setmarks, then saves away checkpoint state information.
> Whenever a failure occurs, the application backs up and recovers from
> the last successful checkpoint.  This is almost invariably implemented
> at application level, since the application state information is
> needed for restarting.  (Some OS's provide tools to assist
> checkpointing.)
> 
> As I said, these probably aren't the answers you want, but it's kind'a
> where things are today.
> 
> -roger
> 
> PS.  I'll apologize now if I messed up the quoting levels below, but I
> wanted to trim out the stuff I wasn't replying to, and I was actually
> working off of Dave's reply email.
> 
> PPS. Due to buffering issues, the case you describe with the lost
> FCP_RSP is a potential problem even if the command does involve data
> transfers.
> 
> > Santosh Rao wrote:
> ...
> > Issue 2
> > =======
> > How should the initiator differentiate b/n a FCP_RJT from a target due
> > to a lost FCP_CMND (the scenario described in Annexe C Fig C.2) and the
> > case where the target has discarded exchange state due to the expiration
> > of RR_TOV after sending FCP_RSP.
> >
> > In both the above cases, our interpretation of FCP-2 is that the
> > initiator will see a FCP_RJT response to the REC with :
> > rsn_code = "Logical Error" or "Unable to perform command request"
> > rsn_expln = "Invalid OXID-RXID combination"
> >
> > In this case, the initiator cannot apply the same error recovery for the
> > 2 cases. In the lost FCP_CMND case, the initiator may safely re-issue
> > the command. The latter case could occur in the following manner :
> >
> > - Initiator issues a command which does not involve data xfer.
> > - Target sends FCP_RSP, FCP_RSP is lost.
> > - Initiator REC_TOV timer pops and initiator sends REC.
> > - REC times out after RA_TOVels (which is > RR_TOV, for fabric)
> > - Initiator aborts REC and issues another REC
> > - Target sends FCP_RJT response since it has discarded the exchange
> >   state.
> >
> > In the above case, the initiator MUST NOT re-issue the FCP_CMND, since
> > this can potentially cause a data corruption with tape devices. (ex :
> > re-issuing a scsi command like SPACE, WRITE FILEMARKS when they had
> > previously been executed successfully can cause tape data corruption.)
> >
> > Can someone clarify on how FCP-2 differentiates these 2 cases? Without
> > the ability to differentiate between these 2 cases, the use of SLER in a
> > lost FCP_CMND scenario can result in potential data corruption with tape
> > devices.

-- 
Education is when you read the fine print. 
Experience is what you get if you don't.





