Queuing and ACA

Joseph C. Nemeth jnemeth at concentric.net
Fri May 8 14:02:26 PDT 1998


* From the T10 (formerly SCSI) Reflector (t10 at symbios.com), posted by:
* "Joseph C. Nemeth" <jnemeth at concentric.net>
*
-----Original Message-----
From: ROWEBER at acm.org <ROWEBER at acm.org>
To: T10 at Symbios.COM <T10 at Symbios.COM>
Date: Friday, May 08, 1998 5:36 AM
Subject: Re: RE: Queuing and ACA


>Gerry Houlder did an excellent job of explaining why ACA ACTIVE is a
>different status from BUSY.  My understanding of the reason commands
>are not allowed to pile up in the blocked queue is related to protocols
>such as Fibre Channel.
>
>In a protocol such as FCP, a series of commands can be "in flight" to the
>target.  For performance reasons, the commands may have been put "in
>flight" in a way that has some of the later commands depending on the
>successful completion of some of the earlier commands.  I.E., assume all
>will succeed, and get those commands to the drive as quickly as possible.
>
>If this model is followed (and there are those who call this practice
>insane or worse), then the downstream "in flight" commands need to be
>returned to the initiator for reconsideration whenever a command fails.
>This is the reason for the ACA behavior.
>

Well, my objective in posting was to get a better understanding of this, and
that has been at least partially successful -- I liked Gerry's explanation
of ACA ACTIVE as a fast BUSY, too. But the rest of it isn't hanging together
for me.

I'm not sure whether I think the practice you've described is insane or not,
but it certainly is if NACA is supported. The problem I see is that,
assuming you have a whole cluster of commands "in flight," you are suddenly
going to have a group of these bounced out of the queue, but the commands
following them may actually get into the queue. This could REALLY mess
things up.

Scenario: initiator A, with a deep initiator queue (or a lot of layers),
puts a cluster of disk write commands to the same sector, using the
ordered-queue attribute, "in flight." Initiator B issues a doomed NACA
command. The first N writes from initiator A will enter the queue normally,
before ACA occurs. The second M writes from initiator A are bounced with ACA
ACTIVE status, while initiator B is cleaning up its error. The final L
writes from initiator A enter the queue normally, after ACA is cleared by
initiator B. Now initiator A gets around to doing the recommended retry of
the commands bounced with ACA ACTIVE, so the M set of writes gets re-issued.
The original sequence of writes was NML, but the sequence that
ends up in the target queue is NLM, all because of a completely different
initiator on the interface. Since these writes are posited as all going to
the same sector on the disk, we now have bad data on the disk.
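
A minimal sketch of that scenario, to make the reordering concrete. It is a
toy model, not a SCSI implementation: the function names, the "W0, W1, ..."
labels, and the group sizes are made up for illustration. It assumes the ACA
window is defined by command position rather than by time, and that
initiator A's simple retry of the bounced commands arrives only after the
whole L group has been queued.

```python
def simulate(n, m, l):
    """Model initiator A's writes arriving while initiator B causes an ACA.

    Returns (intended_order, order_seen_by_target_queue).
    """
    # Intended order of ordered-queue writes to one sector: N, then M, then L.
    intended = [f"W{i}" for i in range(n + m + l)]

    target_queue = []  # commands the target accepted, in arrival order
    bounced = []       # commands returned with ACA ACTIVE status

    for i, write in enumerate(intended):
        # Assumption: ACA is in effect exactly while the M group arrives.
        aca_in_effect = n <= i < n + m
        if aca_in_effect:
            bounced.append(write)
        else:
            target_queue.append(write)

    # Assumption: the simple retry lands after the L group is already queued.
    target_queue.extend(bounced)
    return intended, target_queue


def last_write_wins(queue):
    """All writes hit the same sector, so the last one executed is what stays."""
    return queue[-1]


if __name__ == "__main__":
    intended, actual = simulate(2, 2, 2)
    print("intended:", intended)
    print("actual:  ", actual)
    print("sector should hold", last_write_wins(intended),
          "but holds", last_write_wins(actual))
```

With two writes in each group, the target queue ends up as W0, W1, W4, W5,
W2, W3, so the sector holds the data from W3 instead of W5.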

Even if initiator A decides to do something more extensive than a simple
retry, it may not be able to get to this before some or all of the commands
from group L start executing. The only hope is if initiator B -- the
faulting initiator -- performs the cleanup on initiator A's commands before
releasing ACA, while the queue is still frozen.

Actually, the problem can only occur for disk drives, because any sequential
device that has more than one initiator controlling it at a time is already
in trouble -- that's what device reservation is for, and in any
multi-initiator system, tape drives are going to be reserved and this won't
happen.

Bottom line as I understand it: don't do this! NACA and "in-flight" command
management can co-exist ONLY if ALL commands are guaranteed to be
order-independent, and there ain't no such beast -- sequential devices are
always order-sensitive, and disk devices are order-sensitive to repeated
writes to the same sector.

I still think continued queuing is the best policy. I can't buy Gerry's
timeout issue, because timeouts HAVE to be pretty generous -- they already
have to accommodate normal queue residence times, which are highly variable
because the queue may already contain ordered commands from other
initiators, or may take on head-of-queue commands at any time. If timeouts
are so tight that they can't handle the addition of the ACA cleanup to all
this, the timeouts are way too tight in the first place.

Nor am I seeing that ACA ACTIVE status tells me anything useful at all. If
I'm the faulting initiator, I don't need ACA ACTIVE, because I've already
received the Check Condition that is causing the ACA. If I'm NOT the
faulting initiator, I can't do anything about the problem, anyway (short of
pulling the plug with a Target Reset), so I'd rather just not know about it.
My command takes a little longer -- so what?

I'm thinking it will be a long time before I turn on the NormACA bit (the
one that advertises NACA support) in the Inquiry data.

Is anyone out there actually USING this feature? If so, what am I missing
here?



*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at symbios.com




