SCSI-3 FCP ACA/QErr abort process

Joseph Carl Nemeth jnemeth at concentric.net
Fri Aug 22 17:45:16 PDT 1997


* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* Joseph Carl Nemeth <jnemeth at concentric.net>
*
----------
From: 	Charles Monia
Sent: 	Friday, August 22, 1997 4:57 PM
To: 	'Joseph Carl Nemeth'; 'T10 Reflector'
Subject: 	RE: SCSI-3 FCP ACA/QErr abort process
<text deleted>
Specifically, the target will not send any more FCP_XFER_RDY, FCP_DATA, or
FCP_RSP IUs to the initiator for any task that has been aborted, and
therefore cannot return any kind of status or autosense data for those tasks.
[CAM]  SAM defines behavior as seen by the "application client" -- 
analogous to NT's class driver, for example.  With that in mind, I'd 
restate the above to say that no further status or autosense data should be 
sent to the "Application Client" or NT Class Driver (to continue the 
example). To see what's expected at the transport layer, you've got to look 
at the spec for the transport protocol -- FCP in this case.
Right - the FCP does the "recovery abort" thing, which involves some 
chit-chat between the initiator and target at the transport layer, but I 
was actually referring to the application client layer: so far as the 
application client is concerned, the aborted exchange vanishes off the face 
of the earth.
The second kind of "abort" is in response to certain classes of error that
are detected by the target. One example is the "overlapped command"
condition, in which an initiator sends two overlapped Untagged commands.
<text deleted>
[CAM]  There is a semantics problem here.  As I interpret your scenario, 
one untagged command (command 1) is sent, followed by another (command 2). 
The "duplicate command" error applies only to command 2.  If QErr were 
clear, command 1 would be allowed to complete normally (once the ACA 
condition was cleared).
The resultant behavior with QErr set is:
Command 1 -- Terminated with CHECK CONDITION.
Command 2 -- Aborted.
The aborted command is treated as if it had been aborted with any one of the 
explicit task abort functions (CLEAR QUEUE, ABORT TASK, etc.). In that 
case, no residue or status from command 2 is returned to the application 
client/class driver.
AHA! Thank you! This is exactly the *opposite* of what I thought SAM was 
saying the FIRST time I read it, and it now makes perfect sense. SAM 5.6.2 
- "A logical unit that detects an overlapped command shall abort all tasks 
for the initiator in the task set and shall return CHECK CONDITION status 
for [that] command." You just explained which one "that" command was, and 
now this makes sense.
However, as I read this, QErr is irrelevant to this situation -- this 
should happen even if the QErr bit is clear. ???
And yes, this is a big-time host driver bug!
How, then, is the QErr "abort" handled? If there are multiple Simple (or
Untagged) queue commands in the enabled state for a logical unit, and an
ACA condition occurs due to an error on one of them, they all go from the
enabled to the blocked state, and if the QErr flag is set, they must all be 
"aborted." Is this a silent abort, as though a kind of Abort Task Set had
been issued from the host,
[CAM]  Yes -- just as if they had been explicitly aborted as described 
above.
or is it a noisy abort, in which each command
actually terminates and returns error status? If the former, how is
catastrophic data loss avoided?
[CAM]  Depends on the device type.  For disks, the device driver could 
simply reissue all the unfinished commands.  That's often how it's done. 
In your scenario, a duplicate untagged command indicates a bug in the host 
software, in which case I'd expect the O/S to crash the system with a bug 
check before more damage is done.
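A rough sketch of the "reissue all the unfinished commands" idea, seen from the
host side. The HostDriver class and its method names are hypothetical
bookkeeping invented for illustration, not any real driver's API, and the
sketch assumes the commands can safely be repeated (for example, disk writes
to fixed logical sectors).

```python
from dataclasses import dataclass, field


@dataclass
class HostDriver:
    """Hypothetical host-side bookkeeping of commands not yet completed."""
    outstanding: dict = field(default_factory=dict)  # tag -> command description
    next_tag: int = 0

    def issue(self, command):
        """Record a command as outstanding and return the tag assigned to it."""
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = command
        return tag

    def complete(self, tag):
        """Forget a command once the target has returned status for it."""
        self.outstanding.pop(tag, None)

    def reissue_unfinished(self):
        """After a silent abort, resend every command that never completed.

        Returns the new tags, in the order the commands were first issued.
        """
        pending = list(self.outstanding.values())
        self.outstanding.clear()
        return [self.issue(cmd) for cmd in pending]
```

The driver needs some trigger to call reissue_unfinished(), such as a timeout
or a Unit Attention on a later command. The sketch leaves that part out.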
I'm still very puzzled by the QErr thing. Let me describe a scenario.
Let's say I'm building a big RAID device. It has lots of spindles in it, 
and I can get a whole lot of parallelism out of it -- that is, a whole lot 
of Write commands could be issued to different logical sectors of a single 
logical unit and all be executed concurrently.
Let's say multiple initiators send either Simple or Untagged commands 
(properly, with no overlaps!) to this logical unit, and don't send any 
Ordered or Head-of-queue commands. Let's say that the Mode Select 
parameters are set to allow unrestricted reordering of commands, giving me 
freedom to reorder these Simple commands and execute as many of them 
concurrently as I can. Thus, in my understanding, each command joins the 
set of enabled commands as soon as it is queued by the logical unit, and it 
may begin executing at any time. Let's also say that NACA is supported, and 
every one of these commands has the NACA bit set.
Now one of the spindles drops a bit, the faulting command reports Check 
Condition, and the whole logical unit goes into the ACA condition. This 
puts all of the enabled commands (which is *all* of them, even the ones 
that are queued and have not yet started running) into the blocked state. 
If QErr is clear, then when ACA is cleared, all of these blocked commands 
return to the enabled state, and life goes on. However, if QErr is set, 
clearing the ACA condition is supposed to abort all of the blocked 
commands. Again, that's *all* of them, including commands from other 
initiators.
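The scenario above can be sketched as a small state model. This only
illustrates the behavior as described in this message (enabled tasks become
blocked when ACA is established; clearing ACA either re-enables or aborts
them depending on QErr). It is not text from SAM or SPC, and the names Task,
LogicalUnit, fail and clear_aca are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ENABLED = "enabled"
    BLOCKED = "blocked"
    ABORTED = "aborted"  # silently dropped, no status returned
    ENDED = "ended"      # finished with CHECK CONDITION


@dataclass
class Task:
    initiator: str
    tag: int
    state: State = State.ENABLED


@dataclass
class LogicalUnit:
    qerr: bool
    tasks: list = field(default_factory=list)
    aca_active: bool = False

    def queue(self, initiator, tag):
        """Simple/Untagged task with unrestricted reordering: enabled at once."""
        task = Task(initiator, tag)
        self.tasks.append(task)
        return task

    def fail(self, task):
        """Faulting task reports CHECK CONDITION; ACA blocks all enabled tasks."""
        task.state = State.ENDED
        self.aca_active = True
        for t in self.tasks:
            if t.state is State.ENABLED:
                t.state = State.BLOCKED

    def clear_aca(self):
        """QErr clear: blocked tasks resume. QErr set: blocked tasks are aborted."""
        self.aca_active = False
        for t in self.tasks:
            if t.state is State.BLOCKED:
                t.state = State.ABORTED if self.qerr else State.ENABLED
```

With qerr=True, a task queued by a second initiator ends up ABORTED after
clear_aca() even though that initiator never saw an error, which is the
situation the next paragraph asks about.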
If this is a silent abort on all these commands, these other initiators 
won't even know anything happened -- they'll hang, waiting for the data 
transfer to resume, and will (hopefully) eventually time out. At that 
point, they'll issue a command to the device, and finally get their first 
error indication: a Unit Attention condition, indicating that some other 
initiator aborted their command. This doesn't sound right at all. ??? What 
am I missing here?
There is also a new apparent contradiction between SAM and the r11a version 
of the SPC document where the QErr bit is described: SAM clearly implies in 
several places that QErr comes into play when ACA is cleared, and an older 
rev of SPC (r7) agreed with this. The r11a version of the SPC now says that 
when QErr is set, tasks are aborted when Check Condition or Command 
Terminated is *sent*, which is what I thought *set* the ACA condition. ???
*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at symbios.com



