SCSI-3 FCP ACA/QErr abort process

Joseph Carl Nemeth jnemeth at concentric.net
Sun Aug 24 23:04:31 PDT 1997


* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* Joseph Carl Nemeth <jnemeth at concentric.net>
*
Bob: thanks. Between you and Charles, I'm getting a much clearer picture of 
this.

I know I'm coming to this party very late, and I don't want to sound as 
though I'm complaining the beer is warm and the coffee cold, but I am very 
curious as to why this was defined in this fashion. It seems to me that the 
target should never, ever, EVER be allowed to abort a command under any 
conditions. The initiator, of course, can do this -- it *has* to be able to 
do this -- and even the wholesale destruction caused by things like a 
Device Reset or a Clear Command Queue has its place. But it seems to me 
that once an initiator has told the target to do anything, and doesn't tell 
it otherwise, the target should be honor-bound (and standard-bound) to 
either complete the operation, or report back that it didn't do it. Either 
way, it reports. Period. Thus, technically, every catastrophic error 
condition recognized *by the target* results in a command "termination" 
rather than an "abort," and the worst that will happen is that hosts will 
queue up a whole lot of commands, and then get a whole cascade of failures 
with error codes that say something like "I tried, but I couldn't do it and 
it was someone else's fault." That would resolve *all* of these evil 
circumstances, and would be easier for the host to clean up and perform 
error recovery, as well. It would certainly be easier to implement from a 
target standpoint.
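
As a rough illustration of the "always report" policy argued for above, 
the sketch below models a target that answers every queued command with a 
status when a catastrophic error hits, instead of discarding any of them 
silently. The class, method, and status names are hypothetical and are not 
taken from SAM or FCP; a real target would return SCSI status and sense 
data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical status strings, used only for this illustration.
GOOD = "GOOD"
TERMINATED = "CHECK CONDITION (terminated: fault elsewhere in queue)"


@dataclass
class Command:
    initiator: str
    tag: int
    opcode: str


@dataclass
class AlwaysReportTarget:
    """Toy target that never drops a queued command silently."""
    queue: List[Command] = field(default_factory=list)
    reports: List[Tuple[str, int, str]] = field(default_factory=list)

    def enqueue(self, cmd: Command) -> None:
        self.queue.append(cmd)

    def complete(self, cmd: Command) -> None:
        # Normal completion: remove the command and report GOOD.
        self.queue.remove(cmd)
        self.reports.append((cmd.initiator, cmd.tag, GOOD))

    def catastrophic_error(self) -> None:
        # Terminate (not abort) everything still queued: each command
        # gets its own status report back to its initiator.
        while self.queue:
            cmd = self.queue.pop(0)
            self.reports.append((cmd.initiator, cmd.tag, TERMINATED))


target = AlwaysReportTarget()
first = Command("host-1", 1, "MOVE MEDIUM")
target.enqueue(first)
target.enqueue(Command("host-1", 2, "MOVE MEDIUM"))
target.enqueue(Command("host-2", 1, "READ ELEMENT STATUS"))
target.complete(first)
target.catastrophic_error()
for report in target.reports:
    print(report)
```

The point of the design is that the host sees one report per command it 
issued, so its cleanup reduces to handling a cascade of failure statuses.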
There must have been a reason for doing it otherwise, and I'm curious as to 
what it was. If not, maybe this could be addressed somewhere down the road, 
such as in SAM-2. ???

The reason I'm pushing for answers in such bizarre corner cases is that 
we're designing around a high-level FCP chip that manages all the FCP stuff 
for us. In target mode, this chip does not provide any capabilities to 
allow me, as the target, to recover resources allocated within the chip -- 
that is, I can't issue the ABTS as the target. It seemed to me from the 
SAM, et al., that I needed such a capability. I'm still not completely 
clear on that -- any suggestions?

I absolutely agree with your "prejudices" -- they make a lot of sense. My 
only concern here is that I just write the target code, and don't have a 
lot of control (or any at all, for that matter) over how the initiator 
driver writers will do their thing.

I think what I will try to get away with is to set QErr to zero and make it 
non-changeable, report NACA as unsupported in the Inquiry data (NormACA = 0), 
and simply avoid this whole situation. That should work unless I have to 
accommodate generic drivers that require one or both capabilities. 
Unfortunately, I've run into far too many drivers that, rather than 
inquiring about device capabilities and adapting, inquire about device 
capabilities and then refuse to talk to devices they don't like -- even if 
they never use the features they demand.
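
A minimal sketch of that configuration follows: QErr reported as zero and 
non-changeable in the Control mode page, and NACA reported as unsupported 
in the Inquiry data. The bit positions, the 8-byte page length, and all 
function names are assumptions made for illustration and should be checked 
against the SPC text before use.

```python
# Assumed bit positions; check them against the SPC text before relying on them.
CONTROL_PAGE_CODE = 0x0A        # Control mode page
QERR_BYTE, QERR_BIT = 3, 0x02   # assumed: page byte 3, bit 1
NORMACA_BIT = 0x20              # assumed: standard INQUIRY data byte 3, bit 5


def control_page(qerr: bool = False) -> bytes:
    """Current values of a simplified 8-byte Control mode page."""
    page = bytearray(8)           # length chosen for illustration only
    page[0] = CONTROL_PAGE_CODE
    page[1] = len(page) - 2       # page length excludes the 2-byte header
    if qerr:
        page[QERR_BYTE] |= QERR_BIT
    return bytes(page)


def changeable_mask() -> bytes:
    """Changeable-values mask: a 0 bit means the initiator may not alter it."""
    mask = bytearray(8)
    mask[0] = CONTROL_PAGE_CODE
    mask[1] = len(mask) - 2
    # The QErr bit is deliberately left at 0, so it is non-changeable.
    return bytes(mask)


def mode_select_ok(requested: bytes) -> bool:
    """Accept a MODE SELECT only if no non-changeable bit differs from current."""
    current, mask = control_page(), changeable_mask()
    if len(requested) != len(current):
        return False
    return all(
        ((req ^ cur) & ~msk & 0xFF) == 0
        for req, cur, msk in zip(requested[2:], current[2:], mask[2:])
    )


def inquiry_byte3(normaca: bool = False, response_format: int = 2) -> int:
    """Byte 3 of standard INQUIRY data, with NormACA cleared by default."""
    return (NORMACA_BIT if normaca else 0) | (response_format & 0x0F)


print(mode_select_ok(control_page(qerr=False)))  # request matches current values
print(mode_select_ok(control_page(qerr=True)))   # request tries to set QErr
print(hex(inquiry_byte3()))
```

A driver that insists on setting QErr would have its MODE SELECT rejected 
under this sketch, which is exactly the compatibility risk described above.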
By the way, I'm actually working with tapes and changers, rather than disks 
and RAIDs. The tapes aren't going to cause much trouble, because the 
commands are all Ordered (even if they aren't, they have to be treated that 
way), and you simply don't have multiple initiators trying to share a tape 
drive (anyone who does deserves almost anything he gets). The changers, 
however, are a lot like RAIDs in that they will be shared by multiple 
initiators and gain a lot of performance from reordering and concurrency.

Again, thanks for all the assistance -- Joe
----------
From: 	Bob Snively
Sent: 	Sunday, August 24, 1997 10:18 PM
To: 	jnemeth at concentric.net; t10 at symbios.com
Subject: 	Re: SCSI-3 FCP ACA/QErr abort process

This is a bit messier than has actually been addressed here, since you
are apparently considering both FCP and Parallel SCSI implementations.
Let me start from your original mail and second mail in the thread.

First allow me to expose my prejudices.  Simple tagged queueing should
always be used for stateless devices like disks and RAIDs.  NACA should
never be used for stateless devices like disks and RAIDs.  Failures
should never cause the termination of other queued commands.  Then your
scenarios collapse to very simple well-behaved high performance operation
with trivial recovery algorithms.

But then, let me see if I can address the issues I think you raised:

A)	Abort actions.
	Yes, there are multiple types.  One type is associated with
	Task Control functions and is explicitly requested and acknowledged.
	The other type is caused implicitly by such events as QErr and
	Persistent Reservation Preempt and Clear service actions.
	In parallel SCSI, the implicit events just happen and are not
	labeled.  In FCP, the implicit events may require recovery abort
	operations (which look just like ABORT TASK functions) to recover
	resources afterwards.  In either case, Unit Attention may be
	created for the initiators whose commands were deleted.
B)	Overlapped commands
	If I remember correctly (and this should be clearly laid out in
	SCSI-2 and more obtusely laid out in SCSI-3), the occurrence of
	a command to the same ITLQ nexus (having the same initiator, target,
	LUN, and queue tag) is an overlapped command.  Since this is
	a software/firmware bug, the recovery is dramatic.  You terminate
	both commands ASAP, then provide a check condition for the
	second, not the first.  Fortunately, this problem is characteristic
	only of parallel SCSI.  FCP does not use the queue tag
	and instead uses the Exchange value as the final qualifier for
	the Fully Qualified Exchange Identifier.  Overlapping FQE Identifiers
	cause FC protocol failures like rejects, unrelated to SCSI, and
	break the second command.
C)	QErr aborts
	In parallel SCSI, they are silent.  In FCP, they are optionally
	silent, but 7.1.2.5 allows the recovery abort to be performed if
	there is any question about resources being partially cleared up.
D)	RAID scenario
	Truly a very evil scenario.  My strong suggestion is:
		No ACA
		QErr set to allow continuation of undamaged commands.
	If you choose to follow your scenario instead of the recommendation,
	you have my sympathy, but even then, things are not quite so bleak.
	Recovery will typically be faster than you suggested.  On alternate
	ports/initiators, any new command will either run into ACA ACTIVE
	while the ACA condition exists (indicating it is about to be
	blown away) or it will run into Unit Attention (indicating that
	the throwing away has already occurred).  If no new commands are
	coming out (unlikely in your scenario), the commands will start
	to generate ULP timeouts, which will invoke commands that will
	subsequently clean things up.  In FCP, you have the additional
	clue that recovery aborts may have been performed to clear
	exchange resources for the aborted commands.
E)	Timing of QErr abort
	It doesn't really matter much, and really depends on the
	device type and protocol.  I would expect that while ACA was
	active, you would get ACA ACTIVE status on any new commands.
	The old commands would not be proceeding and you would not
	know whether or not they had been aborted.  After ACA was cleared,
	you would get Unit Attention on the first new commands.  The
	old commands would be aborted and you would not know when that
	actually occurred, since they would have stopped forward progress
	roughly when the error occurred.  As a result, implementations
	that abort at different times would be indistinguishable to the
	initiator.
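
The behavior described in items D and E, as seen from another initiator, 
can be sketched as a small state model. This is an illustration only: the 
class and method names are hypothetical, ACA-attribute tasks from the 
faulted initiator are ignored, and it assumes Unit Attention is raised only 
for other initiators that had commands queued.

```python
from enum import Enum
from typing import Dict, List, Set


class Status(Enum):
    GOOD = "GOOD"
    ACA_ACTIVE = "ACA ACTIVE"
    UNIT_ATTENTION = "CHECK CONDITION (UNIT ATTENTION)"


class LogicalUnit:
    """Toy model of what new commands see around a QErr abort under ACA."""

    def __init__(self) -> None:
        self.aca = False
        self.queued: Dict[str, List[int]] = {}
        self.pending_ua: Set[str] = set()

    def queue(self, initiator: str, tag: int) -> None:
        self.queued.setdefault(initiator, []).append(tag)

    def fault(self, faulted_initiator: str) -> None:
        # A command fails: ACA is established and QErr discards the queue.
        # From outside, the exact moment of the discard is not observable.
        self.aca = True
        for initiator, tags in self.queued.items():
            if tags and initiator != faulted_initiator:
                self.pending_ua.add(initiator)
        self.queued.clear()

    def clear_aca(self) -> None:
        self.aca = False

    def new_command(self, initiator: str) -> Status:
        if self.aca:
            return Status.ACA_ACTIVE
        if initiator in self.pending_ua:
            # Reported once, on the first new command after ACA is cleared.
            self.pending_ua.discard(initiator)
            return Status.UNIT_ATTENTION
        return Status.GOOD


lu = LogicalUnit()
lu.queue("A", 1)
lu.queue("B", 1)
lu.fault("A")
during_aca = lu.new_command("B")
lu.clear_aca()
after_clear = lu.new_command("B")
later = lu.new_command("B")
print(during_aca.value, "|", after_clear.value, "|", later.value)
```

Whether the queue is emptied at fault time or when ACA is cleared, the 
sequence of statuses seen by initiator B is the same in this model.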
Hope this is clear, mostly right, and helpful,
Bob
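
For reference, the overlapped-command handling described in item B can be 
sketched as follows. It models only the parallel SCSI case as recalled in 
that item (terminate both commands, report a check condition for the 
second). The names are hypothetical, and the FCP case, where the exchange 
identifier replaces the queue tag, is not modeled.

```python
from typing import Dict, List, NamedTuple


class Nexus(NamedTuple):
    """ITLQ nexus: initiator, target, LUN, and queue tag."""
    initiator: str
    target: int
    lun: int
    queue_tag: int


class ParallelScsiQueue:
    """Toy task set that detects a reused ITLQ nexus."""

    def __init__(self) -> None:
        self.active: Dict[Nexus, str] = {}
        self.log: List[str] = []

    def receive(self, nexus: Nexus, opcode: str) -> str:
        if nexus in self.active:
            # Overlapped command: terminate both, and report a check
            # condition for the second command only.
            first = self.active.pop(nexus)
            self.log.append(f"terminated {first} and {opcode}")
            return "CHECK CONDITION"
        self.active[nexus] = opcode
        return "QUEUED"


q = ParallelScsiQueue()
n = Nexus("host-1", 3, 0, 7)
first_result = q.receive(n, "READ")
second_result = q.receive(n, "WRITE")     # same tag while READ is active
other_result = q.receive(Nexus("host-1", 3, 0, 8), "READ")
print(first_result, second_result, other_result)
print(q.log)
```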
*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at symbios.com



