SCSI-3 FCP ACA/QErr abort process
Joseph Carl Nemeth
jnemeth at concentric.net
Sun Aug 24 23:04:31 PDT 1997
* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* Joseph Carl Nemeth <jnemeth at concentric.net>
*
Bob: thanks. Between you and Charles, I'm getting a much clearer picture of
this.
I know I'm coming to this party very late, and I don't want to sound as
though I'm complaining the beer is warm and the coffee cold, but I am very
curious as to why this was defined in this fashion. It seems to me that the
target should never, ever, EVER be allowed to abort a command under any
conditions. The initiator, of course, can do this -- it *has* to be able to
do this -- and even the wholesale destruction caused by things like a
Device Reset or a Clear Command Queue has its place. But it seems to me
that once an initiator has told the target to do anything, and doesn't tell
it otherwise, the target should be honor-bound (and standard-bound) to
either complete the operation, or report back that it didn't do it. Either
way, it reports. Period. Thus, technically, every catastrophic error
condition recognized *by the target* results in a command "termination"
rather than an "abort," and the worst that will happen is that hosts will
queue up a whole lot of commands, and then get a whole cascade of failures
with error codes that say something like "I tried, but I couldn't do it and
it was someone else's fault." That would resolve *all* of these evil
circumstances, and would be easier for the host to clean up and perform
error recovery, as well. It would certainly be easier to implement from a
target standpoint.
There must have been a reason for doing it otherwise, and I'm curious as to
what it was. If not, maybe this could be addressed somewhere down the road,
such as in SAM-2. ???
The reason I'm pushing for answers in such bizarre corner cases is that
we're designing around a high-level FCP chip that manages all the FCP stuff
for us. In target mode, this chip does not provide any capabilities to
allow me, as the target, to recover resources allocated within the chip --
that is, I can't issue the ABTS as the target. It seemed to me from the
SAM, et al., that I needed such a capability. I'm still not completely
clear on that -- any suggestions?
I absolutely agree with your "prejudices" -- they make a lot of sense. My
only concern here is that I just write the target code, and don't have a
lot of control (any at all, for that matter) over how the initiator driver
writers will do their thing.
I think what I will try to get away with is to set QErr to zero and make it
non-changeable, report NACA as unsupported in the Inquiry data, and simply
avoid this whole situation. That should work unless I have to accommodate
generic drivers that require one or both capabilities. Unfortunately, I've
run into far too many drivers that, instead of inquiring about device
capabilities and adapting, inquire about device capabilities and then
refuse to talk to devices they don't like -- even if they never use the
features they demand.
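
A rough sketch of what that plan might look like in target firmware, written in Python for brevity. The bit positions (QErr as bit 1 of byte 3 of the Control mode page, NormACA as bit 5 of byte 3 of the standard Inquiry data) and the 8-byte page layout are assumptions from memory and should be checked against the SPC revision in use; all function names are hypothetical.

```python
# Illustrative only: bit positions are assumptions to verify against SPC.
QERR_BIT = 0x02      # assumed: Control mode page (0Ah), byte 3, bit 1
NORMACA_BIT = 0x20   # assumed: standard INQUIRY data, byte 3, bit 5


def control_page(qerr: bool) -> bytearray:
    """Build a hypothetical 8-byte Control mode page with the given QErr value."""
    page = bytearray(8)
    page[0] = 0x0A   # page code
    page[1] = 0x06   # page length (bytes that follow)
    if qerr:
        page[3] |= QERR_BIT
    return page


def changeable_mask() -> bytearray:
    """Changeable-values page: a zero bit means the initiator may not alter it."""
    mask = bytearray(8)
    mask[0] = 0x0A
    mask[1] = 0x06
    # QErr is deliberately left 0 here, so MODE SELECT cannot turn it on.
    return mask


def mode_select_ok(requested: bytes, current: bytes, mask: bytes) -> bool:
    """Reject any MODE SELECT that flips a bit not marked as changeable."""
    return all(
        ((r ^ c) & ~m & 0xFF) == 0
        for r, c, m in zip(requested[2:], current[2:], mask[2:])
    )


def inquiry_byte3(response_format: int = 2) -> int:
    """INQUIRY byte 3 with NormACA clear: the target advertises no NACA support."""
    return response_format & 0x0F   # NORMACA_BIT intentionally not set
```

With this arrangement a MODE SELECT that tries to set QErr is refused, while one that leaves the page unchanged is accepted.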
By the way, I'm actually working with tapes and changers, rather than disks
and RAIDs. The tapes aren't going to cause much trouble, because the
commands are all Ordered (even if they aren't, they have to be treated that
way), and you simply don't have multiple initiators trying to share a tape
drive (anyone who does deserves almost anything he gets). The changers,
however, are a lot like RAIDs in that they will be shared by multiple
initiators and gain a lot of performance from reordering and concurrency.
Again, thanks for all the assistance -- Joe
----------
From: Bob Snively
Sent: Sunday, August 24, 1997 10:18 PM
To: jnemeth at concentric.net; t10 at symbios.com
Subject: Re: SCSI-3 FCP ACA/QErr abort process
This is a bit messier than has actually been addressed here, since you
are apparently considering both FCP and Parallel SCSI implementations.
Let me start from your original mail and second mail in the thread.
First allow me to expose my prejudices. Simple tagged queueing should
always be used for stateless devices like disks and RAIDs. NACA should
never be used for stateless devices like disks and RAIDs. Failures
should never cause the termination of other queued commands. Then your
scenarios collapse to very simple, well-behaved, high-performance operation
with trivial recovery algorithms.
But then, let me see if I can address the issues I think you raised:
A) Abort actions.
Yes, there are multiple types. One type is associated with
Task Control functions and is explicitly requested and acknowledged.
The other type is caused implicitly by such events as QErr and
Persistent Reservation Preempt and Clear service actions.
In parallel SCSI, the implicit events just happen and are not
labeled. In FCP, the implicit events may require recovery abort
operations (which look just like ABORT TASK functions) to recover
resources afterwards. In either case, Unit Attention may be
created for the initiators whose commands were deleted.
B) Overlapped commands
If I remember correctly (and this should be clearly laid out in
SCSI-2 and more obscurely laid out in SCSI-3), the occurrence of
a command to the same ITLQ nexus (having the same initiator, target,
LUN, and queue tag) is an overlapped command. Since this is
a software/firmware bug, the recovery is dramatic. You terminate
both commands asap, then provide a check condition for the
second, not the first. Fortunately, this is only a problem and
characteristic of parallel SCSI. FCP does not use the queue tag
and instead uses the Exchange value as the final qualifier for
the Fully Qualified Exchange Identifier. Overlapping FQE Identifiers
cause FC protocol failures like rejects, unrelated to SCSI, and
break the second command.
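
The parallel SCSI rule in B) can be sketched roughly as follows. This is a toy model, not taken from any standard's text: the class and method names are hypothetical, and statuses are plain strings rather than real status bytes.

```python
from typing import NamedTuple, Set


class Nexus(NamedTuple):
    """ITLQ nexus: initiator, target, LUN, and queue tag."""
    initiator: int
    target: int
    lun: int
    queue_tag: int


class TaskSet:
    """Tracks active commands and detects overlapped ones."""

    def __init__(self) -> None:
        self.active: Set[Nexus] = set()

    def submit(self, nexus: Nexus) -> str:
        if nexus in self.active:
            # Overlapped command: terminate the first one without status,
            # and report CHECK CONDITION for the second one only.
            self.active.discard(nexus)
            return "CHECK CONDITION"
        self.active.add(nexus)
        return "QUEUED"

    def complete(self, nexus: Nexus) -> None:
        self.active.discard(nexus)
```

For example, submitting `Nexus(1, 0, 0, 5)` twice without an intervening `complete` yields "QUEUED" and then "CHECK CONDITION", after which neither command remains active.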
C) QErr aborts
In parallel SCSI, they are silent. In FCP, they are optionally
silent, but 7.1.2.5 allows the recovery abort to be performed if
there is any question about resources being partially cleared up.
D) RAID scenario
Truly a very evil scenario. My strong suggestion is:
    No ACA
    QErr set to allow continuation of undamaged commands.
If you choose to follow your scenario instead of the recommendation,
you have my sympathy, but even then, things are not quite so bleak.
Recovery will typically be faster than you suggested. On alternate
ports/initiators, any new command will either run into ACA ACTIVE
while the ACA condition exists (indicating it is about to be
blown away) or it will run into Unit Attention (indicating that
the throwing away already occurred). If no new commands are
coming out (unlikely in your scenario), the commands will start
to generate ULP timeouts, which will invoke commands that will
subsequently clean things up. In FCP, you have the additional
clue that recovery aborts may have been performed to clear
exchange resources for the aborted commands.
E) Timing of QErr abort
It doesn't really matter much, and really depends on the
device type and protocol. I would expect that while ACA was
active, you would get ACA ACTIVE status on any new commands.
The old commands would not be proceeding and you would not
know whether or not they had been aborted. After ACA was cleared,
you would get Unit Attention on the first new commands. The
old commands would be aborted and you would not know when that
actually occurred, since they would have stopped forward progress
roughly when the error occurred. As a result, the different
implementations would be indistinguishable to the initiator.
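
The argument in E) can be illustrated with a small toy model. It is a sketch under simplifying assumptions: every new command is refused with ACA ACTIVE while the condition exists (tasks carrying the ACA attribute are ignored), Unit Attention is owed to every initiator other than the faulting one that had commands queued, and the `abort_when` knob and all names are hypothetical.

```python
class LogicalUnit:
    """Toy model of one task set with ACA and a QErr-style abort."""

    def __init__(self, abort_when: str) -> None:
        self.abort_when = abort_when   # "on_error" or "on_clear" (hypothetical knob)
        self.aca_active = False
        self.faulting = None
        self.queued = {}               # initiator -> number of queued commands
        self.pending_ua = set()        # initiators owed a Unit Attention

    def _abort_queue(self) -> None:
        # Discard all queued commands; remember whom to notify later.
        for initiator in self.queued:
            if initiator != self.faulting:
                self.pending_ua.add(initiator)
        self.queued.clear()

    def new_command(self, initiator: str) -> str:
        if self.aca_active:
            return "ACA ACTIVE"
        if initiator in self.pending_ua:
            self.pending_ua.discard(initiator)
            return "UNIT ATTENTION"
        self.queued[initiator] = self.queued.get(initiator, 0) + 1
        return "QUEUED"

    def error(self, faulting: str) -> None:
        self.aca_active = True
        self.faulting = faulting
        if self.abort_when == "on_error":
            self._abort_queue()

    def clear_aca(self) -> None:
        if self.abort_when == "on_clear":
            self._abort_queue()
        self.aca_active = False


def observed_by_b(abort_when: str) -> list:
    """Statuses initiator B sees when initiator A's command fails."""
    lu = LogicalUnit(abort_when)
    lu.new_command("A")
    lu.new_command("B")
    lu.error(faulting="A")
    seen = [lu.new_command("B")]       # during ACA
    lu.clear_aca()
    seen.append(lu.new_command("B"))   # first command after ACA is cleared
    seen.append(lu.new_command("B"))   # back to normal
    return seen
```

In this model `observed_by_b("on_error")` and `observed_by_b("on_clear")` return the same sequence, which is the sense in which the timing of the abort is invisible to the other initiator.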
Hope this is clear, mostly right, and helpful,
Bob
*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at symbios.com