SCSI-3 FCP ACA/QErr abort process
Joseph Carl Nemeth
jnemeth at concentric.net
Fri Aug 22 17:45:16 PDT 1997
* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* Joseph Carl Nemeth <jnemeth at concentric.net>
From: Charles Monia
Sent: Friday, August 22, 1997 4:57 PM
To: 'Joseph Carl Nemeth'; 'T10 Reflector'
Subject: RE: SCSI-3 FCP ACA/QErr abort process
Specifically, the target will not send any more FCP_XFER_RDY, FCP_DATA,
or FCP_RSP IUs to the initiator for any task that has been aborted, and
therefore cannot return any kind of status or autosense data for those
tasks.
[CAM] SAM defines behavior as seen by the "application client" --
analogous to NT's class driver, for example. With that in mind, I'd
restate the above to say that no further status or autosense data should be
sent to the "Application Client" or NT Class Driver (to continue the
example). To see what's expected at the transport layer, you've got to look
at the spec for the transport protocol -- FCP in this case.
Right - the FCP does the "recovery abort" thing, which involves some
chit-chat between the initiator and target at the transport layer, but I
was actually referring to the application client layer: so far as the
application client is concerned, the aborted exchange vanishes off the face
of the earth.
The second kind of "abort" is in response to certain classes of error that
are detected by the target. One example is the "overlapped command"
condition, in which an initiator sends two overlapped Untagged commands.
[CAM] There is a semantics problem here. As I interpret your scenario,
one untagged command (command 1) is sent followed by another (command 2).
The "duplicate command" error only applies to command 2. If QErr were
clear, command 1 would be allowed to complete normally (once the ACA
condition was cleared).
The resultant behavior with QErr set is:
Command 1 -- Terminated with CHECK CONDITION.
Command 2 -- Aborted.
The aborted command is treated as if it were aborted with any one of the
explicit task abort functions (CLEAR QUEUE, ABORT TASK, etc). In that
case, no residue or status from command 2 is returned to the application
client.
AHA! Thank you! This is exactly the *opposite* of what I thought SAM was
saying the FIRST time I read it, and it now makes perfect sense. SAM 5.6.2
- "A logical unit that detects an overlapped command shall abort all tasks
for the initiator in the task set and shall return CHECK CONDITION status
for [that] command." You just explained which one "that" command was, and
now this makes sense.
However, as I read this, QErr is irrelevant to this situation -- this
should happen even if the QErr bit is clear. ???
And yes, this is a big-time host driver bug!
How, then, is the QErr "abort" handled? If there are multiple Simple (or
Untagged) queue commands in the enabled state for a logical unit, and an
ACA condition occurs due to an error on one of them, they all go from the
enabled to the blocked state, and if the QErr flag is set, they must all be
"aborted." Is this a silent abort, as though a kind of Abort Task Set had
been issued from the host,
[CAM] Yes -- just as if they had been explicitly aborted as described
above.
or is it a noisy abort, in which each command
actually terminates and returns error status? If the former, how is
catastrophic data loss avoided?
[CAM] Depends on the device type. For disks, the device driver could
simply reissue all the unfinished commands. That's often how it's done.
In your scenario, a duplicate untagged command indicates a bug in the host
software, in which case I'd expect the O/S to crash the system with a bug
check before more damage is done.
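A minimal sketch of the "reissue all the unfinished commands" recovery
mentioned above, seen from the host driver. Everything here is
hypothetical and for illustration only: the HostDriver class, its method
names, and the command strings are not taken from any real driver or
from the standards. It assumes the driver keeps a table of commands
that have not yet returned status.

```python
class HostDriver:
    """Hypothetical initiator-side bookkeeping for outstanding disk commands."""

    def __init__(self, send):
        self.send = send        # callable that transmits (tag, command) to the target
        self.outstanding = {}   # tag -> command still waiting for status

    def issue(self, tag, command):
        self.outstanding[tag] = command
        self.send(tag, command)

    def complete(self, tag):
        # Status arrived for this tag, so it no longer needs tracking.
        self.outstanding.pop(tag, None)

    def reissue_unfinished(self):
        # After a silent abort, resend every command that never returned status.
        for tag, command in list(self.outstanding.items()):
            self.send(tag, command)
        return sorted(self.outstanding)


sent = []
driver = HostDriver(lambda tag, cmd: sent.append((tag, cmd)))
driver.issue(1, "WRITE lba=0")
driver.issue(2, "WRITE lba=8")
driver.issue(3, "READ lba=16")
driver.complete(2)                      # only command 2 returned status
reissued = driver.reissue_unfinished()  # commands 1 and 3 are sent again
```

Resending is only safe when repeating a command gives the same result,
as with reads and writes to fixed block addresses, which is presumably
why the answer above depends on the device type.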
I'm still very puzzled by the QErr thing. Let me describe a scenario.
Let's say I'm building a big RAID device. It has lots of spindles in it,
and I can get a whole lot of parallelism out of it -- that is, a whole lot
of Write commands could be issued to different logical sectors of a single
logical unit and all be executed concurrently.
Let's say multiple initiators send either Simple or Untagged commands
(properly, with no overlaps!) to this logical unit, and don't send any
Ordered or Head-of-queue commands. Let's say that the Mode Select
parameters are set to allow unrestricted reordering of commands, giving me
freedom to reorder these Simple commands and execute as many of them
concurrently as I can. Thus, in my understanding, each command joins the
set of enabled commands as soon as it is queued by the logical unit, and it
may begin executing at any time. Let's also say that NACA is supported, and
every one of these commands has the NACA bit set.
Now one of the spindles drops a bit, the faulting command reports Check
Condition, and the whole logical unit goes into the ACA condition. This
puts all of the enabled commands (which is *all* of them, even the ones
that are queued and have not yet started running) into the blocked state.
If QErr is clear, then when ACA is cleared, all of these blocked commands
return to the enabled state, and life goes on. However, if QErr is set,
clearing the ACA condition is supposed to abort all of the blocked
commands. Again, that's *all* of them, including commands from other
initiators.
If this is a silent abort on all these commands, these other initiators
won't even know anything happened -- they'll hang, waiting for the data
transfer to resume, and will (hopefully) eventually time out. At that
point, they'll issue a command to the device, and finally get their first
error indication: a Unit Attention condition, indicating that some other
initiator aborted their command. This doesn't sound right at all. ??? What
am I missing here?
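The scenario above can be sketched as a small state model. This follows
the reading given in this message (tasks are blocked when ACA is set,
and QErr takes effect when ACA is cleared); it is not a statement of
what SAM or SPC actually require. The class names, the
pending_unit_attention set, and the run() helper are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    ENABLED = "enabled"
    BLOCKED = "blocked"
    ABORTED = "aborted"
    CHECK_CONDITION = "check condition"


@dataclass
class Task:
    initiator: str
    tag: int
    state: TaskState = TaskState.ENABLED


@dataclass
class LogicalUnit:
    qerr: bool
    tasks: list = field(default_factory=list)
    aca_active: bool = False
    # Initiators that will see Unit Attention on their next command (hypothetical).
    pending_unit_attention: set = field(default_factory=set)

    def queue(self, initiator, tag):
        task = Task(initiator, tag)
        self.tasks.append(task)
        return task

    def fail(self, task):
        # The faulting task reports CHECK CONDITION; ACA blocks every enabled task.
        task.state = TaskState.CHECK_CONDITION
        self.aca_active = True
        for t in self.tasks:
            if t.state is TaskState.ENABLED:
                t.state = TaskState.BLOCKED

    def clear_aca(self, faulted_initiator):
        # Blocked tasks resume if QErr is clear; otherwise they are silently aborted.
        self.aca_active = False
        for t in self.tasks:
            if t.state is not TaskState.BLOCKED:
                continue
            if self.qerr:
                t.state = TaskState.ABORTED  # no status is returned for this task
                if t.initiator != faulted_initiator:
                    self.pending_unit_attention.add(t.initiator)
            else:
                t.state = TaskState.ENABLED


def run(qerr):
    # Initiator A has two tasks, initiator B has one; A's first task faults.
    lu = LogicalUnit(qerr=qerr)
    a1, a2, b1 = lu.queue("A", 1), lu.queue("A", 2), lu.queue("B", 1)
    lu.fail(a1)
    lu.clear_aca(faulted_initiator="A")
    return [t.state for t in (a1, a2, b1)], lu.pending_unit_attention
```

With QErr set, run(True) leaves initiator B's task aborted with nothing
sent back to B, which is exactly the hang-until-timeout case the
question is about.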
There is also a new apparent contradiction between SAM and the r11a version
of the SPC document where the QErr bit is described: SAM clearly implies in
several places that QErr comes into play when ACA is cleared, and an older
rev of SPC (r7) agreed with this. The r11a version of the SPC now says that
when QErr is set, tasks are aborted when Check Condition or Command
Terminated is *sent*, which is what I thought *set* the ACA condition. ???
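To make the apparent contradiction concrete, here is a rough sketch that
traces one non-faulting task under each reading. The two readings are
only as described in this message, not verified against the documents,
and the names QErrReading, EVENTS, and other_task_timeline are
hypothetical.

```python
from enum import Enum


class QErrReading(Enum):
    ON_ACA_CLEAR = "SAM and SPC r7, as read in this message"
    ON_STATUS_SENT = "SPC r11a, as read in this message"


EVENTS = ("CHECK CONDITION sent", "ACA cleared")


def other_task_timeline(qerr, reading):
    """Return the state of a non-faulting task after each event, in order."""
    state = "enabled"
    timeline = []
    for event in EVENTS:
        if event == "CHECK CONDITION sent":
            # Under the r11a reading, QErr aborts the task as the status is sent.
            if qerr and reading is QErrReading.ON_STATUS_SENT:
                state = "aborted"
            else:
                state = "blocked"
        elif event == "ACA cleared" and state == "blocked":
            # Under the SAM / r7 reading, QErr acts only when ACA is cleared.
            state = "aborted" if qerr else "enabled"
        timeline.append((event, state))
    return timeline
```

Both readings end in the same final state; they differ only in whether
the task ever spends time in the blocked state while ACA is in effect.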
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at symbios.com