Predictive Failure Code

Jim McGrath jmcgrath at mail.qntm.com
Wed Oct 12 10:42:32 PDT 1994


        Reply to:   RE>Predictive Failure Codes

George,

I would be interested to hear how your implementation handles the
cases of false negatives and false positives.  A false negative is the
lack of a PFA report when a drive really is about to fail.  A false
positive is a PFA report when the drive is really healthy.  Since you
are making predictions with some probability of error, you will always
have false reports.  In general, lowering the probability of one
type of false report increases the chances of the other (i.e., being
conservative and over-reporting the chances of failure lowers the
chances of a false negative, but raises them for a false positive).
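
As a rough illustration of that trade-off, here is a minimal sketch
in Python.  The health scores, the failure labels, and the
count_false_reports helper are all made up for the example; they are
not measured drive data or any vendor's algorithm.  A drive is
flagged when its score reaches the threshold, and moving the
threshold trades one kind of false report for the other.

```python
# Illustrative only: scores and labels are invented, not real drive data.
def count_false_reports(scores, will_fail, threshold):
    """Return (false_negatives, false_positives) for a given threshold.

    A drive is flagged (PFA reported) when its score >= threshold.
    """
    false_negatives = sum(
        1 for s, f in zip(scores, will_fail) if f and s < threshold
    )
    false_positives = sum(
        1 for s, f in zip(scores, will_fail) if not f and s >= threshold
    )
    return false_negatives, false_positives


# Hypothetical per-drive "degradation" scores and whether each drive
# actually goes on to fail.
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
will_fail = [False, False, True, False, False, True, True, True]

for threshold in (0.3, 0.5, 0.7):
    fn, fp = count_false_reports(scores, will_fail, threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

With this toy data, raising the threshold from 0.3 to 0.7 removes the
false positives but adds false negatives.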

Our problem has always been that for a general purpose solution
you need the environment (i.e., the system) to tell you about its
sensitivity to these types of false reports.  As an extreme
example, a home PC user may be very sensitive to false
positives (false alarms cost him money and in practice make
him disregard the reports over time - look at car alarms
as an example).  But an MIS guy with a server may be very
careful and willing to accept false positives - as long as that
reduces the chances of a false negative.  Or you could reverse the
logic - since a home PC user is less careful about backups, false
positives could be useful if taken as a "backup warning" (i.e., back
up right now, even though you do not usually do it).  But an
MIS guy with regular backups and on-line redundancy (e.g.,
some level of RAID) may not want a whole lot of false positives.

In any event, how can a drive tell the difference?  You have to be
able to set trigger points (as LOG SENSE/LOG SELECT allows)
for the device.  And this has always been difficult to do.
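
One way to picture the trigger point choice is as a cost trade-off
that only the host environment can supply.  The sketch below is a
host-side illustration in Python, not a description of how
LOG SENSE/LOG SELECT works: the pick_trigger_point helper, the
scores, and the cost weights are all hypothetical.  It picks the
candidate trigger point with the lowest total cost for a given
environment.

```python
# Illustrative only: data and cost weights are invented assumptions.
def pick_trigger_point(scores, will_fail, candidates, cost_fn, cost_fp):
    """Return the candidate threshold with the lowest weighted cost.

    cost_fn: assumed cost of one false negative (missed failure).
    cost_fp: assumed cost of one false positive (false alarm).
    """
    best_threshold = None
    best_cost = None
    for threshold in candidates:
        fn = sum(1 for s, f in zip(scores, will_fail) if f and s < threshold)
        fp = sum(1 for s, f in zip(scores, will_fail) if not f and s >= threshold)
        cost = cost_fn * fn + cost_fp * fp
        if best_cost is None or cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
will_fail = [False, False, True, False, False, True, True, True]
candidates = (0.3, 0.5, 0.7)

# Assumed weights: a home user who hates false alarms versus a server
# operator who fears a missed failure.  Swap them to "reverse the logic".
home_pc = pick_trigger_point(scores, will_fail, candidates, cost_fn=1, cost_fp=5)
server = pick_trigger_point(scores, will_fail, candidates, cost_fn=10, cost_fp=1)
print("home PC trigger point:", home_pc)
print("server trigger point:", server)
```

The same data gives different trigger points depending on the
weights, which is why a drive cannot choose one on its own.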

Note that if you are a single company shipping into a defined
market and using drives from one vendor, the problem is a lot
easier.  DEC used to do this stuff at the system level in their
minicomputers (and probably still does), and I assume some IBM
systems do as well.  But I do not think scaling up that
solution is very easy.  If you have data on false positives
and false negatives, that would help us, but our data indicates
this is a hard problem.


Jim




--------------------------------------
Date: 10/11/94 9:29 AM
To: Jim McGrath
From: George Penokie
Tom,

Many of IBM's 3.5" SCSI drives will predict their own failure.  This
feature has been in IBM drives for at least five years and it does work
in the real world.  I know this because I am a user of IBM drives and
have seen PFAs work as advertised.

We have also incorporated PFA functions in our RAID subsystems (9337, 3514,
and 7137).

The ASC/ASCQ that informs a system that the target has detected a PFA
is 5D00.  That code was voted to be incorporated into SCSI-3 around March
of 1991.  The document number is X3T9.2/91-027 Rev 1 and it is the current
version of the SPC working draft.
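
For a host that wants to recognize this code, a minimal sketch in
Python follows.  It assumes the sense data is in the fixed format
(response code 70h or 71h, with the ASC in byte 12 and the ASCQ in
byte 13); the is_pfa_report helper and the example buffer are
hypothetical, and the sense key in the example is chosen only for
illustration.

```python
PFA_ASC = 0x5D   # additional sense code for a predicted failure
PFA_ASCQ = 0x00  # additional sense code qualifier


def is_pfa_report(sense: bytes) -> bool:
    """Return True if fixed-format sense data carries ASC/ASCQ 5D/00."""
    if len(sense) < 14:
        return False  # too short to contain the ASC and ASCQ bytes
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        return False  # not fixed-format sense data (assumption of this sketch)
    return sense[12] == PFA_ASC and sense[13] == PFA_ASCQ


# Hand-built example buffer, not captured from a real drive.
example = bytearray(18)
example[0] = 0x70   # current error, fixed format
example[2] = 0x01   # sense key, chosen only for illustration
example[7] = 10     # additional sense length
example[12] = PFA_ASC
example[13] = PFA_ASCQ

print(is_pfa_report(bytes(example)))
```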

There is a group forming to further expand and define PFA.  There should
be a meeting notice coming across the reflector by Oct. 14.

Bye for now,





