SCSI-3 FCP ACA/QErr abort process

Bob Snively Bob.Snively at Eng.Sun.COM
Sun Aug 24 21:18:35 PDT 1997


* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* Bob Snively <Bob.Snively at Eng.Sun.COM>
*
This is a bit messier than has actually been addressed here, since you
are apparently considering both FCP and Parallel SCSI implementations.
Let me start from your original mail and second mail in the thread.
First allow me to expose my prejudices.  Simple tagged queueing should
always be used for stateless devices like disks and RAIDs.  NACA should
never be used for stateless devices like disks and RAIDs.  Failures
should never cause the termination of other queued commands.  Then your
scenarios collapse to very simple, well-behaved, high-performance operation
with trivial recovery algorithms.
But then, let me see if I can address the issues I think you raised:
A)	Abort actions.
	Yes, there are multiple types.  One type is associated with
	Task Control functions and is explicitly requested and acknowledged.
	The other type is caused implicitly by such events as QErr and
	Persistent Reservation Preempt and Clear service actions.
	In parallel SCSI, the implicit events just happen and are not
	labeled.  In FCP, the implicit events may require recovery abort
	operations (which look just like ABORT TASK functions) to recover
	resources afterwards.  In either case, Unit Attention may be
	created for the initiators whose commands were deleted.
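
The distinction in (A) can be sketched as a small model.  This is only an illustration under stated assumptions: the names `AbortCause`, `Task`, and `abort_tasks` are hypothetical, the transport is reduced to a string, and the rule that the initiator that triggered the abort gets no Unit Attention is an assumption rather than something stated above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class AbortCause(Enum):
    TASK_MANAGEMENT = auto()    # explicit: requested and acknowledged
    QERR = auto()               # implicit: queue cleared after an error
    PREEMPT_AND_CLEAR = auto()  # implicit: Persistent Reservation service action


IMPLICIT_CAUSES = {AbortCause.QERR, AbortCause.PREEMPT_AND_CLEAR}


@dataclass(frozen=True)
class Task:
    initiator: str
    tag: int


def abort_tasks(tasks, cause, requester, transport):
    """Return (tasks that may need a recovery abort, initiators given Unit Attention).

    `requester` is the initiator whose action or error triggered the abort
    (assumption: it is not given a Unit Attention for its own commands).
    """
    implicit = cause in IMPLICIT_CAUSES
    # In FCP, an implicit abort may leave exchange resources to be recovered;
    # in parallel SCSI the implicit event "just happens".
    recovery = [t for t in tasks if implicit and transport == "fcp"]
    # Unit Attention may be created for initiators whose commands were deleted.
    unit_attention = {t.initiator for t in tasks if t.initiator != requester}
    return recovery, unit_attention
```

For example, `abort_tasks([Task("A", 1), Task("B", 2)], AbortCause.QERR, "A", "fcp")` marks both tasks as candidates for a recovery abort and flags initiator "B" for Unit Attention.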
B)	Overlapped commands
	If I remember correctly (and this should be clearly laid out in
	SCSI-2 and more obscurely laid out in SCSI-3), a command sent to
	an ITLQ nexus that is already in use (the same initiator, target,
	LUN, and queue tag) is an overlapped command.  Since this is
	a software/firmware bug, the recovery is dramatic.  You terminate
	both commands as soon as possible, then provide a check condition
	for the second, not the first.  Fortunately, this problem is
	characteristic only of parallel SCSI.  FCP does not use the queue tag
	and instead uses the Exchange value as the final qualifier for
	the Fully Qualified Exchange Identifier.  Overlapping FQE Identifiers
	cause FC protocol failures like rejects, unrelated to SCSI, and
	break the second command.
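
The parallel SCSI overlapped-command check described in (B) might look roughly like the following.  It is a minimal sketch, not firmware: the class `LogicalUnitQueue` and its `submit` method are hypothetical names, and statuses are plain strings.  It follows the description above, in which both commands are terminated and only the second one reports a check condition.

```python
class LogicalUnitQueue:
    """Toy target-side queue keyed by the full ITLQ nexus."""

    def __init__(self):
        # (initiator, target, lun, queue_tag) -> command description
        self.active = {}

    def submit(self, initiator, target, lun, queue_tag, command):
        nexus = (initiator, target, lun, queue_tag)
        if nexus in self.active:
            # Overlapped command: terminate the first command silently and
            # report the error on the second command only.
            del self.active[nexus]
            return "CHECK CONDITION"
        self.active[nexus] = command
        return "QUEUED"
```

A second command with a different queue tag is accepted normally; only an exact repeat of initiator, target, LUN, and queue tag triggers the termination of both commands.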
C)	QErr aborts
	In parallel SCSI, they are silent.  In FCP, they are optionally
	silent, but 7.1.2.5 allows the recovery abort to be performed if
	there is any question about resources being partially cleared up.
D)	RAID scenario
	Truly a very evil scenario.  My strong suggestion is:
		No ACA
		QErr set to allow continuation of undamaged commands.
	If you choose to follow your scenario instead of the recommendation,
	you have my sympathy, but even then, things are not quite so bleak.
	Recovery will typically be faster than you suggested.  On alternate
	ports/initiators, any new command will either run into ACA ACTIVE
	while the ACA condition exists (indicating that the queued commands
	are about to be blown away) or it will run into Unit Attention
	(indicating that they have already been thrown away).  If no new commands are
	coming out (unlikely in your scenario), the commands will start
	to generate ULP timeouts, which will invoke commands that will
	subsequently clean things up.  In FCP, you have the additional
	clue that recovery aborts may have been performed to clear
	exchange resources for the aborted commands.
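
From the side of an initiator on an alternate port, the clues listed in (D) could be read as below.  This is a hedged sketch only: the function name, its parameters, and the returned strings are hypothetical, and a real driver would work from status bytes and sense data rather than strings.

```python
def fate_of_outstanding_commands(status=None, sense_key=None, timed_out=False):
    """Guess what happened to earlier queued commands from one new observation."""
    if status == "ACA ACTIVE":
        # The ACA condition still exists; queued commands are blocked.
        return "about to be aborted"
    if status == "CHECK CONDITION" and sense_key == "UNIT ATTENTION":
        # ACA has been cleared and the queue has been thrown away.
        return "already aborted"
    if timed_out:
        # No new command was sent; a ULP timeout has to trigger the probe.
        return "unknown: send a command to find out"
    return "no evidence of an abort"
```

In FCP, a recovery abort seen on one of the initiator's own exchanges would be an additional input to such a function; it is left out here to keep the sketch small.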
E)	Timing of QErr abort
	It doesn't really matter much, and really depends on the
	device type and protocol.  I would expect that while ACA was
	active, you would get ACA ACTIVE status on any new commands.
	The old commands would not be proceeding and you would not
	know whether or not they had been aborted.  After ACA was cleared,
	you would get Unit Attention on the first new commands.  The
	old commands would be aborted and you would not know when that
	actually occurred, since they would have stopped forward progress
	roughly when the error occurred.  As a result, implementations
	that abort at different times would be indistinguishable to the initiator.
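
The argument in (E) can be checked with a toy model of a logical unit with QErr set.  Everything here is an assumption made for illustration: the class `LogicalUnit`, the flag `abort_at_error`, the string statuses, and the simplification that every new command is rejected with ACA ACTIVE while the condition exists and that the faulting initiator gets no Unit Attention.

```python
class LogicalUnit:
    """Toy logical unit with QErr set; only the abort timing is configurable."""

    def __init__(self, abort_at_error):
        self.abort_at_error = abort_at_error  # True: abort when the error is reported
        self.aca_owner = None                 # initiator that received the error
        self.tasks = []                       # queued (initiator, tag) pairs
        self.pending_ua = set()               # initiators owed a Unit Attention

    def submit(self, initiator, tag):
        if self.aca_owner is not None:
            return "ACA ACTIVE"
        if initiator in self.pending_ua:
            self.pending_ua.remove(initiator)
            return "CHECK CONDITION / UNIT ATTENTION"
        self.tasks.append((initiator, tag))
        return "QUEUED"

    def _qerr_abort(self):
        # Silently drop every queued task and remember who lost commands.
        self.pending_ua |= {i for i, _ in self.tasks if i != self.aca_owner}
        self.tasks.clear()

    def fail(self, initiator, tag):
        self.tasks.remove((initiator, tag))
        self.aca_owner = initiator
        if self.abort_at_error:
            self._qerr_abort()
        return "CHECK CONDITION"

    def clear_aca(self):
        if not self.abort_at_error:
            self._qerr_abort()
        self.aca_owner = None


def observed_by_b(abort_at_error):
    """Statuses seen by initiator B when initiator A's command fails."""
    lu = LogicalUnit(abort_at_error)
    lu.submit("A", 1)
    lu.submit("B", 2)
    lu.fail("A", 1)
    seen = [lu.submit("B", 3)]      # sent while ACA exists
    lu.clear_aca()
    seen.append(lu.submit("B", 4))  # first command after ACA is cleared
    seen.append(lu.submit("B", 5))
    return seen
```

Calling `observed_by_b(True)` and `observed_by_b(False)` yields the same sequence of statuses, which is the sense in which the two timings cannot be told apart from outside.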
Hope this is clear, mostly right, and helpful,
Bob
First mail from Nemeth.
> Subject: SCSI-3 FCP ACA/QErr abort process
> *
> I am having a hard time determining exactly what it means for a target to 
> "abort" a task in the case of the ACA condition with the Mode Select QErr 
> bit set to 1. I would appreciate a response from anyone who knows exactly 
> how this is supposed to work.
> There appear to be two very different kinds of "abort" actions.
> The first kind of "abort" is in response to any Task Control Function that 
> aborts established tasks. In this case, it seems clear from the SCSI-3 SAM 
> that once the Function Complete response is made for the Task Control 
> Function itself, the aborted tasks are simply blown away -- they must not 
> have any further interactions with the initiator. Specifically, the target 
> will not send any more FCP_XFER_RDY, FCP_DATA, or FCP_RSP IUs to the 
> initiator for any task that has been aborted, and therefore, cannot return 
> any kind of status or autosense data for those tasks.
> The second kind of "abort" is in response to certain classes of error that 
> are detected by the target. One example is the "overlapped command" 
> condition, in which an initiator sends two overlapped Untagged commands. In 
> this case, it seems clear that both commands are actually "terminated" 
> rather than "aborted," completing (albeit prematurely) with Check 
> Condition/COMMAND ABORTED/Overlapped Commands Attempted error status -- the 
> implementor's note makes it clear that the aborted (first) command may need 
> to report a residue, and I don't know how it would do this in FCP without 
> being able to post its status and autosense data.
> How, then, is the QErr "abort" handled? If there are multiple Simple (or 
> Untagged) queue commands in the enabled state for a logical unit, and an 
> ACA condition occurs due to an error on one of them, they all go from the 
> enabled to the blocked state, and if the QErr flag is set, they must all be 
> "aborted." Is this a silent abort, as though a kind of Abort Task Set had 
> been issued from the host, or is it a noisy abort, in which each command 
> actually terminates and returns error status? If the former, how is 
> catastrophic data loss avoided? If the latter, exactly what status and 
> sense data is returned? And how does that status interact with the Unit 
> Attention condition for other (non-faulting) initiators?
> Any assistance in understanding this would be appreciated.
> *
> * For T10 Reflector information, send a message with
> * 'info t10' (no quotes) in the message body to majordomo at symbios.com
Second mail from Nemeth.
> I'm still very puzzled by the QErr thing. Let me describe a scenario.
> Let's say I'm building a big RAID device. It has lots of spindles in it, 
> and I can get a whole lot of parallelism out of it -- that is, a whole lot 
> of Write commands could be issued to different logical sectors of a single 
> logical unit and all be executed concurrently.
> Let's say multiple initiators send either Simple or Untagged commands 
> (properly, with no overlaps!) to this logical unit, and don't send any 
> Ordered or Head-of-queue commands. Let's say that the Mode Select 
> parameters are set to allow unrestricted reordering of commands, giving me 
> freedom to reorder these Simple commands and execute as many of them 
> concurrently as I can. Thus, in my understanding, each command joins the 
> set of enabled commands as soon as it is queued by the logical unit, and it 
> may begin executing at any time. Let's also say that NACA is supported, and 
> every one of these commands has the NACA bit set.
> Now one of the spindles drops a bit, the faulting command reports Check 
> Condition, and the whole logical unit goes into the ACA condition. This 
> puts all of the enabled commands (which is *all* of them, even the ones 
> that are queued and have not yet started running) into the blocked state. 
> If QErr is clear, then when ACA is cleared, all of these blocked commands 
> return to the enabled state, and life goes on. However, if QErr is set, 
> clearing the ACA condition is supposed to abort all of the blocked 
> commands. Again, that's *all* of them, including commands from other 
> initiators.
> If this is a silent abort on all these commands, these other initiators 
> won't even know anything happened -- they'll hang, waiting for the data 
> transfer to resume, and will (hopefully) eventually time out. At that 
> point, they'll issue a command to the device, and finally get their first 
> error indication: a Unit Attention condition, indicating that some other 
> initiator aborted their command. This doesn't sound right at all. ??? What 
> am I missing here?
> There is also a new apparent contradiction between SAM and the r11a version 
> of the SPC document where the QErr bit is described: SAM clearly implies in 
> several places that QErr comes into play when ACA is cleared, and an older 
> rev of SPC (r7) agreed with this. The r11a version of the SPC now says that 
> when QErr is set, tasks are aborted when Check Condition or Command
> Terminated is *sent*, which is what I thought *set* the ACA condition. ???
> *



