                                                      X3T9.2/91-22 Rev 0

From:    James McGrath
         Quantum Corporation
         580 Cottonwood Ave
         Milpitas, CA 95035

To:      John Lohmeyer
         Chairman, X3T9.2
         3718 N. Rock Road
         Wichita, KS 67226

Date:    February 18, 1991

Subject: Requirements for a Diagnostic Command Set

In devising a "diagnostic" capability within the confines of the SCSI interface, Quantum feels that it is very important to first agree on both the functionality required by users and the feasibility of providing that functionality. This note explores the design space for a "diagnostic command set" (DCS) with respect to these constraints. Recommendations are then made as to future committee activity in this area.

In this document a "vendor" is the manufacturer of a SCSI device; a "customer" is the manufacturer of a computer system incorporating a SCSI device; an "end user" is the final user of the system produced by the customer.

Users of Diagnostic Functionality

Almost all SCSI devices have some provision for the access and test of individual components. The specific implementations are driven both by the technology used to implement the device and by the trade-offs the manufacturer makes between testing within the assembled device and testing of isolated components with the assistance of external test equipment. Although manufacturers frequently attempt to preserve this functionality across products, this is primarily due to a reluctance to change the manufacturing process. This level of functionality is inherently vendor-unique. The diagnostic functions change in response to evolving technologies (e.g. embedded servos, zone bit recording, caching, ECC on the fly). More fundamentally, the implementations are very dependent upon the manufacturing and failure analysis process used for the product. Some products will have extensive embedded hardware/software support for manufacturing. Others will have minimal embedded functionality, instead relying upon component testing and system testing with the assistance of external hardware. Given the rapid pace of change and the complex economic trade-offs made in implementing this functionality, a standard implementation is neither easy nor desirable.

A customer's requirements for diagnostic functionality fall into three major categories: product evaluation, system manufacturing, and field diagnostics.

In product evaluation, a dedicated set of engineers tests samples of the product for functionality and design margin. An extensive testing process is employed to accurately measure design margins (e.g. four-corner testing). This data is then used to confirm a manufacturer's manufacturability and reliability claims. Usually evaluation testing is performed by using the standard SCSI interface and varying the environmental conditions. Some customers have requested the ability to "get under the hood" of the SCSI interface and test device components themselves. My experience is that the specifics targeted by customers differ widely based upon the device technology, the vendor, and the customer. Usually device vendors either provide some of their own manufacturing commands to the customer or implement a diagnostic command set specified by the customer.

In system manufacturing, testing must be performed on each device assembled into a system. At a minimum, testing must detect problems in the assembly of the device (e.g. connector problems). At the extreme, the customer may require an elaborate incoming inspection process for the device. Here there is some potential for standardization.
Many disk drive manufacturers have developed self-testing capabilities that perform standard tests (e.g. butterfly seeking, drive data verification). Standardization does have some limits, since customers may have different requirements based upon their own manufacturing processes.

Finally, end users want to be able to test their systems, detecting whether any components (e.g. the disk drive) have failed or are about to fail. Many failure modes related to mass storage have nothing at all to do with the functioning of the disk drive (e.g. operator or system software failures). Some failure modes related to disk drive hardware cannot be adequately handled at the drive level (e.g. loose connectors, a bad drive PCB). Still other drive hardware failures can be detected by the drive itself, with the user notified either before or after the failure. Only this last category of failures can be standardized.

Minicomputer and mainframe companies have had extensive experience predicting drive failures. The models they use incorporate detailed assumptions regarding the technology used to implement the device. This implies that the intelligence to predict failures can be profitably incorporated into the disk drive itself. The liability is that false alarms by the drive can cause extensive end user and customer anguish. The high cost of losing data, the dedication of MIS personnel to insulating non-technical users, the high ratio of support personnel to systems, and the fact that field service is done by the original manufacturer all make this acceptable for current minicomputer and mainframe customers. It is not at all clear that this would be an acceptable liability in the personal computer market.

Severity of Customer Concerns

In general, customers are more concerned about field service problems than about manufacturing difficulties, and more concerned about manufacturing difficulties than about the evaluation process. To some degree this attitude is dictated by recurring costs. The potential for field service problems is present throughout the lifetime of the product. Manufacturing problems could occur for years and in every system produced. Evaluation difficulties are limited in time (six months) and extent (a few hundred devices).

This attitude is also due to the degree of control exercised by the customer. Since customers have the least control over end users, they are most concerned with failures in the field. In system manufacturing they have more control, but competing priorities can lead to manufacturing systems that are not optimized to detect device failures. Only in product evaluation do customers have the resources, the time, the dedication of purpose, and (in practice) the active support of the vendor. This degree of control inspires confidence that problems can be solved as they arise.

Finally, customers are much more sensitive to failures among end users, since these failures can give their product a bad reputation. Failures in manufacturing can reduce yield, which is a serious but lesser concern. Failures during evaluation are, realistically, expected; that is why evaluations are conducted. Customers protect themselves by considering multiple vendors.

Recommendations

Given these limitations and the concerns of customers, the current diagnostic command set effort is focused on precisely the wrong areas. Rather than standardizing the evaluation process, more attention should be focused upon incoming inspection, ongoing reliability testing, and field service detection and anticipation of device failures.
Concretely, self-testing for incoming inspection and ongoing reliability testing can build upon features already implemented by many manufacturers (e.g. Conner, Seagate, and Quantum). The object of standardization would be the method of test activation, the general degree of test coverage, and the method of reporting test results (a rough activation sketch appears at the end of this note). Since the primary need for standardization is to detect gross failures in a high-volume manufacturing environment, very low-level and product-specific failure mode data can remain vendor-unique; only the detection of a gross level of failure needs to be standardized. Drive manufacturers would thus still be able to implement the tests most applicable to their technology and level of drive intelligence. Note that this approach was previously recommended by Bob Snively.

A standard way of reporting possible device failures in the field could be built on the work already done for LOG SELECT/LOG SENSE. Some standard form of alert, specifying the anticipated severity, probability, and time to failure, must be provided. Some ability for the user to specify their tolerance for false alarms versus the lack of a timely failure alert should also be provided. Finally, the implementation mechanism - LOG SENSE/LOG SELECT - might be made easier to use in order to encourage its adoption.
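
To make the first recommendation concrete, the sketch below (in C) shows one possible activation mechanism for such a standardized self-test: the existing SEND DIAGNOSTIC command issued with its SelfTest bit set and no parameter data. The CDB layout follows my reading of the existing SEND DIAGNOSTIC definition; the helper name build_self_test_cdb and the surrounding program are invented for this illustration and are not proposed wording for the standard.

    /* Illustrative only: build a SEND DIAGNOSTIC CDB that asks the
     * target to run its default self-test.  With SelfTest = 1 and a
     * parameter list length of zero, the drive runs whatever internal
     * test it implements and reports pass/fail through command status
     * (GOOD or CHECK CONDITION). */

    #include <stdio.h>
    #include <string.h>

    #define SEND_DIAGNOSTIC  0x1D   /* operation code */
    #define SELF_TEST        0x04   /* byte 1, bit 2  */

    static void build_self_test_cdb(unsigned char cdb[6], unsigned int lun)
    {
        memset(cdb, 0, 6);
        cdb[0] = SEND_DIAGNOSTIC;
        cdb[1] = (unsigned char)(((lun & 0x7) << 5) | SELF_TEST);
        /* bytes 2-4 (reserved, parameter list length) remain zero:
         * no vendor-unique parameter data accompanies the command */
        /* byte 5 (control byte) remains zero */
    }

    int main(void)
    {
        unsigned char cdb[6];
        int i;

        build_self_test_cdb(cdb, 0);
        for (i = 0; i < 6; i++)
            printf("%02X ", cdb[i]);     /* prints: 1D 04 00 00 00 00 */
        printf("\n");
        return 0;
    }

The point of the example is only that activation can ride on a command every SCSI device already decodes; the degree of coverage and the result-reporting conventions would still have to be agreed upon by the committee.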
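
For the second recommendation, a failure-anticipation alert could be carried in a page returned by LOG SENSE. The structures below sketch what such a page might contain. Only the four-byte page header and the four-byte parameter header mirror the generic log page format; the page code used (3Eh), the field names, and the severity/probability/time-to-failure encodings are invented for this illustration and would have to be defined by the committee.

    /* Hypothetical "anticipated failure" log page, for discussion only.
     * Multi-byte fields are shown as byte arrays to emphasize that log
     * page data is transferred most-significant byte first. */

    struct log_page_header {
        unsigned char page_code;           /* 0x3E here; not an assigned code */
        unsigned char reserved;
        unsigned char page_length[2];      /* bytes following this header     */
    };

    struct anticipated_failure_param {
        unsigned char parameter_code[2];   /* identifies the component/alert  */
        unsigned char control;             /* standard log parameter control
                                              byte (DU, DS, TSD, ...)         */
        unsigned char parameter_length;    /* bytes following this header (4) */

        unsigned char severity;            /* 0 = advisory ... 255 = imminent
                                              loss of data                    */
        unsigned char probability;         /* estimated likelihood, percent   */
        unsigned char hours_to_failure[2]; /* FFFFh = unknown                 */
    };

A companion, host-writable parameter on the same page (set with LOG SELECT) could carry the user's false-alarm tolerance, giving the drive a threshold below which it suppresses alerts; that would address the sensitivity trade-off noted above.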