                                                      X3T9.2/91-22 Rev 0

From:    James McGrath
         Quantum Corporation
         580 Cottonwood Ave
         Milpitas, CA 95035

To:      John Lohmeyer
         Chairman, X3T9.2
         3718 N. Rock Road
         Wichita, KS 67226

Date:    February 18, 1991

Subject: Requirements for a Diagnostic Command Set

In devising a "diagnostic" capability within the confines of the SCSI interface, Quantum feels that it is very important to first agree on both the functionality required by users and the feasibility of providing that functionality. This note explores the design space for a "diagnostic command set" (DCS) with respect to these constraints. Recommendations are then made as to future committee activity in this area.

In this document a "vendor" is the manufacturer of a SCSI device; a "customer" is the manufacturer of a computer system incorporating a SCSI device; an "end user" is the final user of the system produced by the customer.

Users of Diagnostic Functionality

Almost all SCSI devices have some provision for the access and test of individual components. The specific implementations are driven both by the technology used to implement the device and by the trade-offs the manufacturer makes between testing within the assembled device and testing of isolated components with the assistance of external test equipment. Although manufacturers frequently attempt to preserve this functionality across products, this is primarily due to a reluctance to change the manufacturing process. This level of functionality is inherently vendor-unique. The diagnostic functions change in response to evolving technologies (e.g. embedded servos, zone bit recording, caching, ECC on the fly). More fundamentally, the implementations are very dependent upon the manufacturing and failure analysis process used for the product. Some products will have extensive embedded hardware/software support for manufacturing. Others will have minimal embedded functionality, instead relying upon component testing and system testing with the assistance of external hardware. Given the rapid pace of change and the complex economic trade-offs made in implementing this functionality, a standard implementation is neither easy nor desirable.

A customer's requirements for diagnostic functionality fall into three major categories: product evaluation, system manufacturing, and field diagnostics.

In product evaluation, a dedicated set of engineers tests samples of the product for functionality and design margin. An extensive testing process is employed to accurately measure design margins (e.g. four-corner testing). This data is then used to confirm a manufacturer's manufacturability and reliability claims. Usually evaluation testing is performed by using the standard SCSI interface and varying the environmental conditions. Some customers have requested the ability to "get under the hood" of the SCSI interface and test device components themselves. My experience is that the specifics targeted by customers differ widely based upon the device technology, the vendor, and the customer. Usually device vendors either provide some of their own manufacturing commands to the customer or implement a diagnostic command set specified by the customer.

In system manufacturing, testing must be performed on each device assembled into a system. At a minimum, testing must detect problems in the assembly of the device (e.g. connector problems). At the extreme, the customer may require an elaborate incoming inspection process for the device. Here there is some potential for standardization.
Many disk drive manufacturers have developed self-testing capabilities that perform standard tests (e.g. butterfly seeking, drive data verification). Standardization does have some limits, since customers may have different requirements based upon their own manufacturing processes.

Finally, end users want to be able to test their systems, detecting whether any components (e.g. the disk drive) have failed or are about to fail. Many failure modes related to mass storage have nothing at all to do with the functioning of the disk drive (e.g. operator or system software failures). Some failure modes related to disk drive hardware cannot be adequately handled at the drive level (e.g. loose connectors, a bad drive PCB). Still other drive hardware failures can be detected by the drive itself, with the user notified either before or after the failure. Only this last category of failures can be standardized.

Minicomputer and mainframe companies have had extensive experience predicting drive failures. The models they use incorporate detailed assumptions regarding the technology used to implement the device. This implies that the intelligence to predict failures can be profitably incorporated into the disk drive itself. The liability is that false alarms by the drive can cause extensive end user and customer anguish. The high cost of losing data, the dedication of MIS personnel to insulating non-technical users, the high ratio of support personnel to systems, and the fact that field service is done by the original manufacturer all make this acceptable for current minicomputer and mainframe customers. It is not at all clear that this would be an acceptable liability in the personal computer market.

Severity of Customer Concerns

In general, customers are more concerned about field service problems than about manufacturing difficulties, and more concerned about manufacturing difficulties than about the evaluation process. To some degree this attitude is dictated by recurring costs. The potential for field service problems is present throughout the lifetime of the product. Manufacturing problems could occur for years and in every system produced. Evaluation difficulties are limited in time (six months) and extent (a few hundred devices).

This attitude is also due to the degree of control exercised by the customer. Since customers have the least control over end users, they are most concerned with failures in the field. In system manufacturing they have more control, but competing priorities can lead to manufacturing systems that are not optimized to detect device failures. Only in product evaluation do customers have the resources, the time, the dedication of purpose, and (in practice) the active support of the vendor. This degree of control inspires confidence that problems can be solved as they arise.

Finally, customers are much more sensitive to failures among end users, since these failures can give their product a bad reputation. Failures in manufacturing can reduce yield, which is a serious but lesser concern. Failures during evaluation are, realistically, expected; that is why evaluations are conducted. Customers protect themselves by considering multiple vendors.

Recommendations

Given these limitations and the concerns of customers, the current diagnostic command set effort is focused on precisely the wrong areas. Rather than standardizing the evaluation process, more attention should be focused upon incoming inspection, ongoing reliability testing, and field service detection and anticipation of device failures.
Concretely, self-testing for incoming inspection and ongoing reliability testing can build upon features already implemented by many manufacturers (e.g. Conner, Seagate, and Quantum). The object of standardization would be the method of test activation, the general degree of test coverage, and the method of reporting test results (a rough activation sketch appears at the end of this note). Since the primary need for standardization is to detect gross failures in a high-volume manufacturing environment, very low-level and product-specific failure mode data can remain vendor-unique; only the detection of a gross level of failure needs to be standardized. Drive manufacturers would thus still be able to implement the tests most applicable to their technology and level of drive intelligence. Note that this approach was previously recommended by Bob Snively.

A standard way of reporting possible device failures in the field could be built on the work already done for LOG SELECT/LOG SENSE. Some standard form of alert, specifying the anticipated severity, probability, and time to failure, must be provided. Some ability for the user to specify their tolerance for false alarms versus the lack of a timely failure alert should also be provided. Finally, the implementation mechanism - LOG SENSE/LOG SELECT - might be made easier to use in order to encourage its adoption.
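
To make the first recommendation concrete, the sketch below (in C) shows one possible activation mechanism for such a standardized self-test: the existing SEND DIAGNOSTIC command issued with its SelfTest bit set and no parameter data. The CDB layout follows my reading of the existing SEND DIAGNOSTIC definition; the helper name build_self_test_cdb and the surrounding program are invented for this illustration and are not proposed wording for the standard.

    /* Illustrative only: build a SEND DIAGNOSTIC CDB that asks the
     * target to run its default self-test.  With SelfTest = 1 and a
     * parameter list length of zero, the drive runs whatever internal
     * test it implements and reports pass/fail through command status
     * (GOOD or CHECK CONDITION). */

    #include <stdio.h>
    #include <string.h>

    #define SEND_DIAGNOSTIC  0x1D   /* operation code */
    #define SELF_TEST        0x04   /* byte 1, bit 2  */

    static void build_self_test_cdb(unsigned char cdb[6], unsigned int lun)
    {
        memset(cdb, 0, 6);
        cdb[0] = SEND_DIAGNOSTIC;
        cdb[1] = (unsigned char)(((lun & 0x7) << 5) | SELF_TEST);
        /* bytes 2-4 (reserved, parameter list length) remain zero:
         * no vendor-unique parameter data accompanies the command */
        /* byte 5 (control byte) remains zero */
    }

    int main(void)
    {
        unsigned char cdb[6];
        int i;

        build_self_test_cdb(cdb, 0);
        for (i = 0; i < 6; i++)
            printf("%02X ", cdb[i]);     /* prints: 1D 04 00 00 00 00 */
        printf("\n");
        return 0;
    }

The point of the example is only that activation can ride on a command every SCSI device already decodes; the degree of coverage and the result-reporting conventions would still have to be agreed upon by the committee.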
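
For the second recommendation, a failure-anticipation alert could be carried in a page returned by LOG SENSE. The structures below sketch what such a page might contain. Only the four-byte page header and the four-byte parameter header mirror the generic log page format; the page code used (3Eh), the field names, and the severity/probability/time-to-failure encodings are invented for this illustration and would have to be defined by the committee.

    /* Hypothetical "anticipated failure" log page, for discussion only.
     * Multi-byte fields are shown as byte arrays to emphasize that log
     * page data is transferred most-significant byte first. */

    struct log_page_header {
        unsigned char page_code;           /* 0x3E here; not an assigned code */
        unsigned char reserved;
        unsigned char page_length[2];      /* bytes following this header     */
    };

    struct anticipated_failure_param {
        unsigned char parameter_code[2];   /* identifies the component/alert  */
        unsigned char control;             /* standard log parameter control
                                              byte (DU, DS, TSD, ...)         */
        unsigned char parameter_length;    /* bytes following this header (4) */

        unsigned char severity;            /* 0 = advisory ... 255 = imminent
                                              loss of data                    */
        unsigned char probability;         /* estimated likelihood, percent   */
        unsigned char hours_to_failure[2]; /* FFFFh = unknown                 */
    };

A companion, host-writable parameter on the same page (set with LOG SELECT) could carry the user's false-alarm tolerance, giving the drive a threshold below which it suppresses alerts; that would address the sensitivity trade-off noted above.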