[t10] deferred errors in host & device
Pat LaVarre
LAVARRE at iomega.com
Tue May 13 09:35:39 PDT 2003
* From the T10 Reflector (t10 at t10.org), posted by:
* "Pat LaVarre" <LAVARRE at iomega.com>
*
[ BC handrews at apple.com; keiji_katata at post.pioneer.co.jp ]
> Subject: RE: [T10] rule of Recovered error
I hope my quoting you out of order, under a changed
Subject, helps more than it hurts.
> It seems that "device folk" are always trying to
> obfuscate errors.
I count myself among the device folk.
To my eye it is the host folk who are always trying to
obfuscate errors.
I suspect what's actually happening is our culture of
having host and device folk not listen to each other
enough. I'm not too clear what, if anything, I can do
to combat this phenomenon effectively, but I find
tilting at windmills educational, particularly in the
replies I receive offline.
> It seems that "device folk" are always trying to
> obfuscate errors.
Seemingly few host folk design in the capability to
inject extra write errors. Accordingly, when I go and
cause a write error on purpose, I find few host folk
respond sensibly.
Why is that ok? Why do we have no standard "loopback"
for injecting arbitrary errors of interest? I have no
idea.
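
To make the question concrete, here is a minimal sketch
of the kind of "loopback" I mean, written as a toy
in-memory block store rather than any real driver. The
names FaultyBlockDevice, inject_write_error and
InjectedWriteError are hypothetical, invented only for
this illustration; no standard defines them.

```python
class InjectedWriteError(IOError):
    """Raised when the test harness deliberately fails a write."""


class FaultyBlockDevice:
    """Toy in-memory block store that fails chosen writes on demand."""

    def __init__(self, block_count, block_size=512):
        self.block_size = block_size
        # Every block starts out zero-filled.
        self.blocks = [bytes(block_size)] * block_count
        # LBAs whose next write should fail (one-shot faults).
        self.fail_lbas = set()

    def inject_write_error(self, lba):
        """Arrange for the next write to this LBA to fail."""
        self.fail_lbas.add(lba)

    def write(self, lba, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        if lba in self.fail_lbas:
            # Consume the fault and leave the old data in place.
            self.fail_lbas.discard(lba)
            raise InjectedWriteError(f"injected write error at LBA {lba}")
        self.blocks[lba] = bytes(data)

    def read(self, lba):
        return self.blocks[lba]
```

A host test could call inject_write_error(lba) before a
write and then check that the host code notices the
failure, retries, and does not report success to the
customer.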
> your description of "deferred errors" is
> "massively unreal"
Thanks for speaking plainly.
I'm speaking of repeatable lab results, so discovering
why my reality doesn't match yours is likely to
educate me, if by chance you have the patience to
finish the job.
> You may translate "deferred error" to be identical
> to "corrupted data". In many systems, there simply
> is no way to recover.
Here we agree.
I have never yet seen a host respond to a deferred
error by explaining to the customer that the entire
disk should be distrusted.
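
For reference, a host can at least tell a deferred error
from a current one by the response code in the sense
data: 70h (current) and 71h (deferred) in fixed format,
72h and 73h in descriptor format. The sketch below
assumes those SPC sense formats; the function name
classify_sense is hypothetical, and what the host then
tells the customer is left out entirely.

```python
def classify_sense(sense):
    """Return (deferred, sense_key) for a buffer of sense data.

    Assumes SPC-style sense data: response codes 0x70/0x71 are
    fixed format, 0x72/0x73 are descriptor format.
    """
    if len(sense) < 3:
        raise ValueError("sense data too short")
    code = sense[0] & 0x7F  # mask off the VALID bit of fixed format
    if code in (0x70, 0x71):
        key = sense[2] & 0x0F  # fixed format: key in byte 2
    elif code in (0x72, 0x73):
        key = sense[1] & 0x0F  # descriptor format: key in byte 1
    else:
        raise ValueError(f"unrecognised response code 0x{code:02X}")
    deferred = code in (0x71, 0x73)
    return deferred, key
```

A deferred result means the failure belongs to some
earlier command, not the one that just returned CHECK
CONDITION, so retrying the current command alone does
not repair the damage.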
> As far as I know, the host DOES retry the entire
> command.
Good to hear of hosts responding conservatively to
current errors, thank you.
> The devices had BETTER NOT ALLOW DEFERRED ERRORS.
Failure happens, I trust that's not in dispute.
The best example I saw recently was a blurb re
"undetected soft errors". I guess that's a euphemism
for not reading what was written, often termed a
"miscompare", though I can't be sure.
> The devices had BETTER NOT ALLOW DEFERRED ERRORS.
Deferred rather than current failure is a choice, I
agree.
Hosts force devices to defer errors by forcing devices
to defer operations, most egregiously by setting an
Immed bit, but also in other ways.
A host that doesn't want to see deferred errors any
more often than the O.S. crashes anyway has to persuade
the device to defer no operations.
I have never yet seen such a host.
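
As one small illustration of the Immed choice, here is a
sketch that builds a SYNCHRONIZE CACHE (10) CDB with the
Immed bit set or clear. It assumes the usual 10-byte
layout (opcode 35h, Immed as bit 1 of byte 1); the
function name synchronize_cache_cdb is hypothetical, and
sending the CDB to a device is out of scope here.

```python
import struct

SYNCHRONIZE_CACHE_10 = 0x35  # operation code
IMMED_BIT = 0x02             # bit 1 of CDB byte 1


def synchronize_cache_cdb(lba=0, blocks=0, immed=False):
    """Build a 10-byte SYNCHRONIZE CACHE CDB.

    With immed=False the device should not return status until
    the flush has finished, so any failure arrives as a current
    error rather than a deferred one.
    """
    flags = IMMED_BIT if immed else 0
    # opcode, flags, 4-byte LBA, group number, 2-byte block count, control
    return struct.pack(">BBIBHB", SYNCHRONIZE_CACHE_10, flags,
                       lba, 0, blocks, 0)
```

A host that wants no deferred operations would pass
immed=False here and make the same choice on every
other command that offers an Immed bit.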
> I suppose I should be grateful that a recovered
> error was returned in the first place ;-)
I may have slipped off the topic of recovered errors,
hence my change to the Subject line.
I do think the host should be grateful if a device
bothers to report a deferred error. Since few hosts
in the field yet respond sensibly, any such device is
working in hope of encountering a more considerate
host someday.
> ... useful in the diagnostics to know the address
> of a REAL recovered error rather than simply the
> first block of the transfer.
Yes. But here now in 2003 we have no practical SCSI
standard for reporting a collection of deferred errors.
> I'm not proposing that partial retries are a good
> idea.
Good.
> I simply want more meaningful error LBAs.
Me too. But it's not simple.
> Locating the "true" address of the bad spot should
> NOT involve "searching" for it.
Wish it were true.
> The whole point of the recovered error is to point
> out where the trouble was detected.
Yes. So in MMC 4 we are inventing a scheme for
reporting every error that occurred, not just one of
them?
> It seems that "device folk" are always trying to
> obfuscate errors. Some seem to think a timeout is
> better than a reported error.
I've seen people hang rather than report a deferred
error, and then argue that deferred errors don't
happen often enough to matter.
> I find it particularly irksome that a read can take
> 30 seconds or more without completing.
Me too.
> If there's a non-recoverable error, then please
> report it.
Who can say how long recovery may take?
Macs in particular have a legacy of waiting more
patiently than Windows. Some flavours of Windows
claim to have seen an error after a delay as short as
7.5 +- 0.5 s.
> If there's a non-recoverable error, then please
> report it. A timeout is worse.
Yes, much worse. I like the MMC 4 scheme of letting
the device time itself out, because the device can do
a better job.
But I remember back in 1998 I proposed that a friend
include in a drive ASIC a documented way for the
device firmware to spontaneously cancel a write or
read in progress.
The reaction I got was who on Earth would ever want to
do that?
> For a timeout, the host has to attempt to "clean
> up" and needs to be very careful not to mess up
> other transfers or other drives.
Ugly aye.
> Eventually the "streaming" commands should help.
> But right now they're not well supported on device
> or host side.
Yes maybe someday.
> However, your description of "deferred errors" is
> "massively unreal".
How?
Any host designed to reward a device for write-behind
encourages that device to write-behind. To eliminate
write-behind, we have to eliminate such hosts. Device
folk can't do that work - device folk can only plead
with the host folk to do it. The most we can do as
device folk is give such a host a way to turn off
write-behind.
Trouble is, that doesn't work either. As soon as we
begin to standardise a way to turn off write-behind,
host folk start to abuse it. They turn off
write-behind without changing the host design to stop
rewarding write-behind.
Pretty soon we end up with a device that disconnects
the switch. Maybe the host flips the switch to turn
off write-behind, but write-behind still happens.
Pat LaVarre
*
* For T10 Reflector information, send a message with
* 'info t10' (no quotes) in the message body to majordomo at t10.org