serial bus tape issues
Glenn C. Everhart 603 881 1497
everhart at star.zko.dec.com
Thu Jul 11 09:43:30 PDT 1996
* From the SCSI Reflector, posted by:
* everhart at star.zko.dec.com (Glenn C. Everhart 603 881 1497)
I'm posting this at Doug Hagerman's urging; it is some bits of
a conversation about tape issues.
I'll add my recollections after ">>" to what preceded...
>> Discussions about deferred errors and the use of Class 3 for
>> tapes, and the notion that the "current model" will have the
>> devices buffer data, reply with SCSI status of success when
>> they have the data, and write it "sometime" after, using deferred
>> error reporting to tell about any problems.
From: US2RMC::STAR::EVERHART "Glenn C. Everhart 603 881 1497" 9-JUL-1996 16:11:40.33
Subj: RE: Re discussion of tape comments
The deferred errors will indeed be a nightmare. It would be far simpler
(at least for VMS) if there were an ack of command receipt by a tape
so new commands could be sent, and final I/O ack when a command
finished. We can deal with a delay of some number of writes in the
pipe OK, just treating the thing much as the SCSI queueing model
does now with some disks and doing spacing to recover position, but
the device BETTER support report position (else recovery will be REAL
slow via rewind/skip!) and should not lie about status. VMS is not
used to Russian roulette games with user data...we leave that to
other OSs. We do need to get a SCSI status, and we need to get it
reflecting a particular I/O. Otherwise we need to sit and wait, and
run class 3 REAL slowly. I'm told that fabric delays of 10 or more
seconds will be common. If that's anywhere near right, it's gonna
make toast out of any class 3 tapes. Trying to use deferred error
reporting for this is nightmarish indeed. It sort of maps to the
unix block data device buffer handling, which is susceptible to
deferred disk errors, but for user apps is an absolute and complete
nightmare. How long does one have to wait for a deferred error?
Can one make any statement about WHEN one can finally tell the user
app that his write got safely onto tape? When you get the SCSI status,
you supposedly have that kind of assurance (unless explicitly you
tell the system to do writeback cache and (hopefully) take
responsibility to guard the data some other way).
It comes down to this: you need a receipt ack to get performance
by keeping the data streaming. You need a status to tell when the
data is safe on media so the CPU can now forget about the I/O,
having completed it. Using Class 3 throws out the former, and making
devices lie about the latter so it can be used instead means that
the latter is not there for its intended and designed purpose. I
question that ANY serious OS can afford to do that.
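A minimal sketch of the host-side bookkeeping this two-acknowledgement
model implies, assuming a receipt ack and a later final status per
command. The class and method names (TapeWritePipeline, on_receipt_ack,
on_final_status) are hypothetical illustrations, not part of any SCSI
or Fibre Channel specification or of the VMS driver.

```python
from collections import OrderedDict


class TapeWritePipeline:
    """Hypothetical host-side tracking for tape writes with two acks.

    Receipt ack  -> device has the command and data; next command may go.
    Final status -> data is on media; host may forget the I/O.
    """

    def __init__(self):
        # tag -> {"data": bytes, "state": "sent" | "at_device"}, in issue order
        self.pending = OrderedDict()
        # tags whose completion has been reported to the application
        self.completed = []

    def issue(self, tag, data):
        # The host keeps its copy of the data until final status arrives.
        self.pending[tag] = {"data": data, "state": "sent"}

    def on_receipt_ack(self, tag):
        # Receipt ack only: the data is NOT yet known to be on media.
        self.pending[tag]["state"] = "at_device"

    def can_send_next(self):
        # Streaming may continue once every issued command has been received.
        return all(e["state"] != "sent" for e in self.pending.values())

    def on_final_status(self, tag, ok):
        if ok:
            # Only now is the I/O completed back to the application.
            del self.pending[tag]
            self.completed.append(tag)
            return None
        # On failure the host still holds the failed write and everything
        # issued after it, so all of it is available for retry.
        return [(t, e["data"]) for t, e in self.pending.items()]
```

The point of the sketch is that the application is told "done" only in
on_final_status, so a failure always arrives while the data is still held.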
>> Reply that once an error occurs, the tape is toast and cannot
>> be used past that point. (True of many types of tapes.)
From: US2RMC::STAR::EVERHART "Glenn C. Everhart 603 881 1497" 9-JUL-1996 16:15:23.05
Subj: RE: Re comments from Bill Martin
Some kinds of tape do indeed become toast once a command gets lost.
However, some do not, and usage patterns vary. (The Dutch navy has
just been giving some grief with tapes they find can't take any
more files after writing 27000 files on one, for example. Obviously
this is not ANSI format...)
No sense designing the protocol around some non-generic limitations IMO.
From: STARCH::HAGERMAN "Doug Hagerman, 508-841-2145, Flames to NL:" 9-JUL-1996 16:38:10.48
Subj: RE: Re discussion of tape comments
I encourage you to post your comments publicly. Internal discussions
are interesting, but there is a huge community of people out there who
need to buy into the final agreement and if they haven't heard all the
arguments, they won't buy in. Please keep making your point, but do
so in a way that more people can see your argument...
>The deferred errors will indeed be a nightmare. It would be far simpler
>(at least for VMS) if there were an ack of command receipt by a tape
>so new commands could be sent,
Under the current proposal, this one is the regular SCSI status.
>and final I/O ack when a command finished.
Under the current proposal, there isn't one. If the tape drive can't
write to the media, the media is broken. No amount of knowledge by the
operating system can help it recover from that situation. If recovery
to a known good completely written cartridge is the goal, the driver
can certainly keep track of media mounts and close out old commands
when the media is dismounted.
One thing to keep in mind is that a goal here is to move towards a more
abstract model of tapes. Ideally it would be like the TMSCP model, but
I don't think we can realistically get there. As a compromise, the assumption
currently on the table is that after the drive has the data, it has
responsibility for the data. If it can't get the data onto the tape,
there is a problem that is not going to be fixed by any amount of
repositioning or other low-level management of the device by the driver.
The goal is to push the problem into the drive, with the understanding
that we are talking about high-functionality drives with lots of new
microcode and hardware. (This is added to the cost of moving over to
Fibre Channel in the first place.)
>nightmare. How long does one have to wait for a deferred error?
>Can one make any statement about WHEN one can finally tell the user
>app that his write got safely onto tape? When you get the SCSI status,
>you supposedly have that kind of assurance (unless explicitly you
>tell the system to do writeback cache and (hopefully) take
>responsibility to guard the data some other way).
Until the media is dismounted and put into fire storage, anything
can happen. No amount of low-level control of the media helps. At
some point the driver needs to let go of the data and allow the drive
to take charge of it; why must this point be defined as "when the
data is written to the media"?
>It comes down to this: you need a receipt ack to get performance
>by keeping the data streaming. You need a status to tell when the
>data is safe on media so the CPU can now forget about the I/O,
>having completed it. Using Class 3 throws out the former, and making
>devices lie about the latter so it can be used instead means that
>the latter is not there for its intended and designed purpose. I
>question that ANY serious OS can afford to do that.
It is going to be a very, very big push to get Class 2 into PLDA. Any
arguments that support that view need to be made public.
Subj: RE: Re discussion of tape comments
ahh...but sometimes low level control CAN help (repositioning and
writing again, or skipping forward and writing again ... the kind
of drill, for example, that VMS Backup does at times). It doesn't
work on everything but it works on some tapes, and if we knew
that report position was supposed to be there, it could be used,
on an error path that doesn't need to be timecritical, to
try again. Backup of course has done some stuff like that, but
what VMS would call intercept drivers, and what Unix would call
streams drivers (same concept exactly) can be inserted to give
added reliability or the ability to recover. The problem this
kind of thing solves is not so much the write problem, to be
sure, but the read-2-years-later one. Still, it can take hours to
backup a large disk farm, and something that can do media handling
as part of that (even asking for a new tape in a pinch) is part
of the job. An OS can do that, IF it knows when I/O is done and
when it is not, since it is obviously silly to restart from the
beginning of backup or even of volume. But at any rate the OS
needs to know WHEN it can forget about an I/O finally. This
cannot reasonably be deferred to the tape drive.
Hope this clarifies things.
Subj: Ltr to Doug Hagerman
There is a mechanism that has been discussed that would allow you
to find out whether certain commands have completed. The idea is to have
a log in the device that records the completion status of all commands.
Then an initiator could periodically poll the device and find out what
the status is of previously issued commands. The difficulty is that this
is a moderately large perturbation to the current way things work.
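A minimal sketch of the polling idea just described, assuming the device
keeps a log keyed by command tag. DeviceCompletionLog and
poll_outstanding are hypothetical names; the status strings are only
illustrative, and no such log is defined by the current proposal.

```python
class DeviceCompletionLog:
    """Hypothetical device-side log: command tag -> completion status."""

    def __init__(self):
        self._log = {}

    def record(self, tag, status):
        # Called by the (simulated) device when a command really finishes.
        self._log[tag] = status

    def query(self, tags):
        # Return statuses only for commands that have actually completed.
        return {t: self._log[t] for t in tags if t in self._log}


def poll_outstanding(log, outstanding):
    """One poll by the initiator over its outstanding tags.

    Returns (done, failed, remaining), each in issue order.
    """
    results = log.query(outstanding)
    done = [t for t in outstanding if results.get(t) == "GOOD"]
    failed = [t for t in outstanding if t in results and results[t] != "GOOD"]
    remaining = [t for t in outstanding if t not in results]
    return done, failed, remaining
```

For example, with tags 1 and 2 logged and tag 3 still in flight, one
poll separates them:

```python
log = DeviceCompletionLog()
log.record(1, "GOOD")
log.record(2, "CHECK CONDITION")
done, failed, remaining = poll_outstanding(log, [1, 2, 3])
```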
However, I'm still confused about your exact difficulty. Suppose we had
the two-phase status you proposed, the first being the Class 2 ACK and
the second being the SCSI status. The ACK indicates that the command
and data were successfully moved to the device, and the good status indicates
that the data was successfully written to the media.
What this allows you to do is issue a command, send the data, wait for
the ACK, then send the next command. My question is, how is this better
than the current situation? Suppose the media write operation fails. At
that point you have a bunch of commands already at the device, so the
cleanup operation is complicated. Isn't this complication exactly the
same as you would get if a deferred error occurred in the current proposal?
Subj: RE: more on tapes etc.
The recovery from errors would be complex, and if a command fails and
several more are in the pipe, we would indeed have to wait till all had
finished so's to drain things, possibly sending command(s) to quiesce
the tape, read position, then figure what to do to move it...or
decide the tape was toast and maybe arrange some slightly higher layer
to change tapes. (I'm working on a failover scheme for VMS SCSI
now, which uses what amounts to a streams approach. Once that gets
accepted, it might be extended to other devices than disks and
perhaps turn into a slightly more generic interface that users might
be told about. ["slightly" because it's fairly generic as it is.])
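A minimal sketch of the "drain, read position, figure what to do" step,
assuming one tape record per write and a device that, once quiesced,
reports the logical block number of the next block to be written. The
function name and its parameters are hypothetical; real position
reporting has more cases than this covers.

```python
def records_to_rewrite(issued, start_block, reported_block):
    """Decide which held writes must be sent again after a failure.

    issued:         list of (tag, data) in issue order, one record each
    start_block:    block number where the first issued record was to go
    reported_block: block number the device reports as next to be written
    """
    on_media = reported_block - start_block
    if on_media < 0 or on_media > len(issued):
        # Position is outside what we sent: tape state is unknown, so a
        # higher layer has to decide (for example, change tapes).
        raise ValueError("reported position outside the issued range")
    # Everything before the reported position is on tape; resend the rest.
    return issued[on_media:]
```

This only works if the host still holds the data for every write whose
final status it has not yet seen.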
The advantage as I see it to having the early "I got the command"
ack is that we know we need to write the tape sequentially, and
can start that up faster if we get the class 2 ack. Also,
we'll keep the I/O processes running till we get a status to report
to the user. If a failure occurs before then, we still have the data
and can retry operations. If on the other hand the status is a bogus
"ok" and a deferred error occurs later, we have by then told the
application that things are OK, because that is what we got from
the device. A deferred error is too late in that the I/O is now
long gone and the app has been told things were OK (and may have
done Lord knows what to the only copy of the data left, thinking
it could get it from the tape again).
Yes, it is possible to just say we won't return the status to the user
or complete the I/O till we do a polling operation on the tape to
be sure the data really got back and then complete the operations. It
is a serious change though, and means that your I/O rate is now
dependent on the clock and on how many operations you can buffer,
and you'd better be able to get the extra commands turned around
even in tight-buffer situations. I also recoil somewhat instinctively
over the notion of having to poll I/O devices to know when I/O is
complete. Seems that you're taking the complexity out of the device
on one hand by using class 3, then putting it back in by requiring
both asynch notice of failure and this polling, and changing the meaning
of the SCSI status in the tape device model (and Lord knows what other
models) all at once. I find this seriously unpleasant. It is less of
an issue with the [gag] Seagate disks that can't generate status of
transmission so fast because it is at least the case that disk
operations are idempotent and can just be retried. (I still think
some error paths will get on average a lot longer than they might
otherwise be because we won't know as much about what might be at
the device and what might be in the fabric, and all we can do is
wait and hope things drain off. We keep track of what is sent and
what is complete, but can't track what's at the device and what's
in the pipe (for fabric, I'm thinking) if no status is available
ever (unless again one adds a polling scheme). Ack of packets is
after all not too important, but acks of sequences could let us
decide that the commands are all at the device, so it's now OK
to tell the device to flush 'em all for stuff like cluster transitions
or when doing some other processing that really needs to have one
initiator (yeah, I know the terminology changed, though I don't
see why this had to be so...) control things exclusively. If errors
never hit, it's not such an issue, but our experience is that they
do at times.)
I don't expect tape vendors to say much, since they are perhaps not
used to systems concerns where one tries to develop an abstraction
for the tape that includes the ability to perform error recovery
that may go much further than a single drive can. VMS Backup
scratches the surface of what is reasonable and increasingly needed
in really large shops. Some of the HSM folks get deeper. Key
is the notion that it is NOT OK to say "if the tape fails, the
cartridge is toast and we'll just tell the system that it failed
awhile ago." The data must be preserved and known to be so (recapping
some of my letter today 'cause I'm keeping a copy of this one) so that
more elaborate schemes can be used and errors recovered from to the
best ability of the tapes or the SYSTEMS. Drive vendors tend to be
selling most tapes to small systems (PC class) for tapes as well
as disks, and at that level, yeah, it's ok to junk a cartridge with
a bad spot. If you're running security logs, dbms logs, or the like
to tape, losing the logs can be real trouble. But it's the systems
folks who see that, not the drive vendors. I can imagine some moderately
ugly kludges that could be used if an underlying tape abstraction were
along the lines of "you get notified of failure within maybe 100 tape
records, or 1000, or ..., of the point where the data failure hit."
They'd be ugly, would hurt performance of the system by requiring
additional data streams and management thereof, and would require
some really grotesque rules for handling media (since the valid
length would have to be kept somewhere else). I just don't see why
anyone with a system of scale larger than a few PCs would want to
use such things. I'd rather avoid having to mess with 'em.
Anyhow, the foregoing are my concerns. I hope this all helps you see
where my head is at any rate.