[T11.3] Re: [Fwd: FCP-2: Lost FCP_CMND, Unacknowledged classes.]

RogerR at exabyte.com
Thu Jul 11 10:07:20 PDT 2002


INCITS T11.3 Mail Reflector
********************************


In addition to what Dave Peterson already answered, I might provide a
little help with the tape case.  This probably isn't the answer you
would prefer to hear, but it is what we have to work with these days.
(BTW, I'll retain the T10 reflector on this reply, but I don't subscribe
to that mailing list and won't see replies limited to that side.)

Currently one way the two issues you describe below are distinguished is
to issue a READ POSITION and compare the results to what is expected
based upon the command.  The problem with this is that somebody on the
initiator side of things needs to keep track of what the position is.
Whoever is keeping track of the current location pretty much needs to
have exclusive control of the tape device.  This means not only device
reservations, but reservations on the logical representation of the
device within the host (e.g. an "exclusive open" from the OS layer).
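
A minimal sketch of that READ POSITION comparison, assuming the initiator
has exclusive control of the drive and already knows both the position
before the command and the position the command should leave behind.
`read_position` is a hypothetical callable standing in for issuing READ
POSITION and returning a logical block number; `classify` and `Outcome`
are illustrative names. The check only means something for commands that
move the tape.

```python
from enum import Enum, auto
from typing import Callable


class Outcome(Enum):
    NOT_EXECUTED = auto()  # position unchanged: command likely never ran, reissue is safe
    COMPLETED = auto()     # position is where the command should leave it: do NOT reissue
    UNKNOWN = auto()       # anything else (e.g. partial completion): escalate recovery


def classify(read_position: Callable[[], int], start: int, expected_end: int) -> Outcome:
    """Interpret an ambiguous FCP_RJT to REC by checking where the tape is.

    Hypothetical helper: read_position stands in for a READ POSITION command
    and returns the current logical block number.
    """
    if start == expected_end:
        # A command that does not move the tape cannot be diagnosed this way.
        return Outcome.UNKNOWN
    actual = read_position()
    if actual == start:
        return Outcome.NOT_EXECUTED
    if actual == expected_end:
        return Outcome.COMPLETED
    return Outcome.UNKNOWN


# Example: SPACE forward 5 blocks issued at block 100, then an FCP_RJT to REC.
print(classify(lambda: 105, start=100, expected_end=105))  # Outcome.COMPLETED
```

The answer is only trustworthy if nobody else can move the tape between
the original command and the READ POSITION, which is why exclusive
control matters.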

The weird catch is that even if the two conditions you describe could be
distinguished, some cases where FCP_RSP is lost would still require a
READ POSITION and initiator knowledge of the expected location (or some
other way of determining/establishing position).  The reason for this is
that some tape commands can be partially completed, and the information
on how much of the command actually completed is in the lost FCP_RSP.
(A facility to request that the status/sense be resent would be helpful
here, but properly implementing such a facility probably cascades up into
the SCSI SAM layer and might get costly to implement, unless the number
of commands to retain can be significantly limited.)
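
Under a deliberately simplified model, the missing count can sometimes be
estimated from READ POSITION instead of from the lost sense data. The
sketch assumes a forward, count-based command in which each completed
unit advances the reported position by exactly one; real drives break
this assumption in several ways (for example, a SPACE that stops on a
filemark), so treat it as illustrative only. `reconstruct_residue` is a
hypothetical name.

```python
def reconstruct_residue(start: int, requested: int, actual: int) -> int:
    """Estimate the residue (units NOT completed) the lost FCP_RSP would have reported.

    Simplified model (an assumption): forward motion only, and each completed
    unit advances the READ POSITION result by exactly one.
    """
    if requested < 0:
        raise ValueError("this sketch handles forward commands only")
    completed = actual - start
    if not 0 <= completed <= requested:
        raise ValueError("position is outside what this command could have produced")
    return requested - completed


# Example: WRITE FILEMARKS with a count of 3 issued at position 200; tape is now at 201.
remaining = reconstruct_residue(start=200, requested=3, actual=201)
print(remaining)  # 2: only two more filemarks are needed, not three
```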

It's not unusual for a lot of this tape error recovery stuff to bubble
up to the higher application levels, simply because it avoids tracking
the state information within the driver.  Tracking in the driver is a
bit more complicated, since the driver has to handle the general case;
whereas an application only has to track the state information it cares
about.  (Application-specific drivers can help here, but
application-specific drivers frequently break other applications by
changing the driver semantics and refusing to get out of the way when other
applications are using the device.)

Another common method for recovery on tape is a checkpoint/restart
mechanism.  Every once in a while, the application requests unbuffered
filemarks or setmarks, then saves away checkpoint state information.
Whenever a failure occurs, the application backs up and recovers from
the last successful checkpoint.  This is almost invariably implemented
at application level, since the application state information is needed
for restarting.  (Some OS's provide tools to assist checkpointing.)

As I said, these probably aren't the answers you want, but it's kind'a
where things are today. 

-roger 

PS.  I'll apologize now if I messed up the quoting levels below, but I
wanted to trim out the stuff I wasn't replying to, and I was actually
working off of Dave's reply email.

PPS. Due to buffering issues, the case you describe with the lost
FCP_RSP is a potential problem even if the command does involve data
transfers.


> Santosh Rao wrote: 
... 
> Issue 2 
> ======= 
> How should the initiator differentiate between an FCP_RJT from a target due
> to a lost FCP_CMND (the scenario described in Annex C Fig C.2) and the
> case where the target has discarded exchange state due to the expiration
> of RR_TOV after sending FCP_RSP?
> 
> In both the above cases, our interpretation of FCP-2 is that the 
> initiator will see a FCP_RJT response to the REC with : 
> rsn_code = "Logical Error" or "Unable to perform command request" 
> rsn_expln = "Invalid OXID-RXID combination" 
> 
> In this case, the initiator cannot apply the same error recovery for the
> 2 cases. In the lost FCP_CMND case, the initiator may safely re-issue
> the command. The latter case could occur in the following manner:
> 
> - Initiator issues a command which does not involve data xfer. 
> - Target sends FCP_RSP, FCP_RSP is lost. 
> - Initiator REC_TOV timer pops and initiator sends REC.
> - REC times out after RA_TOV_ELS (which is > RR_TOV, for fabric)
> - Initiator aborts REC and issues another REC 
> - Target sends FCP_RJT response since it has discarded the exchange 
> state. 
> 
> In the above case, the initiator MUST NOT re-issue the FCP_CMND, since
> this can potentially cause data corruption with tape devices. (ex:
> re-issuing a SCSI command like SPACE or WRITE FILEMARKS when it had
> previously been executed successfully can cause tape data corruption.)
> 
> Can someone clarify how FCP-2 differentiates these 2 cases? Without
> the ability to differentiate between these 2 cases, the use of SLER in a
> lost FCP_CMND scenario can result in potential data corruption with tape
> devices.
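
One conservative initiator policy that follows from the concern quoted
above is to reissue automatically only commands whose repetition cannot
move the tape or change the medium, and to check position before doing
anything else. The classification below is an illustrative assumption,
not something FCP-2 defines; the operation codes are the standard SSC
values, and `recovery_action` is a hypothetical name.

```python
# Illustrative classification of tape commands (an assumption, not from FCP-2).
IDEMPOTENT = {
    0x00: "TEST UNIT READY",
    0x01: "REWIND",           # absolute: always ends at the beginning of the partition
    0x12: "INQUIRY",
    0x2B: "LOCATE(10)",       # absolute: repeating it reaches the same position
    0x34: "READ POSITION",
}
NOT_IDEMPOTENT = {
    0x08: "READ(6)",          # relative: advances the position a second time
    0x0A: "WRITE(6)",         # writes a second copy of the data
    0x10: "WRITE FILEMARKS",  # writes extra filemarks
    0x11: "SPACE",            # relative: moves the tape a second time
}


def recovery_action(opcode: int) -> str:
    """Policy after an FCP_RJT to REC that cannot be attributed to a lost FCP_CMND."""
    if opcode in IDEMPOTENT:
        return f"reissue {IDEMPOTENT[opcode]}"
    # Unknown opcodes are treated as unsafe, the conservative default.
    name = NOT_IDEMPOTENT.get(opcode, f"opcode 0x{opcode:02X}")
    return f"do not reissue {name}; check READ POSITION first"


print(recovery_action(0x11))  # do not reissue SPACE; check READ POSITION first
```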






More information about the T10 mailing list