Drive XOR -- proposal for distributed XOR

Gerry Houlder Gerry_Houlder at notes.seagate.com
Wed Mar 29 05:13:31 PST 1995


This is a comment on Paul Hodges' proposal to distribute XOR operations for a 
regenerate or rebuild operation across all drives involved in the operation. He 
sees some advantages in this, but I see some disadvantages that need to be 
addressed:

1) ERROR REPORTING PROBLEM: Each drive has to decode its command, parse the 
parameter list, do its operation, and send a new command to another drive. This 
is a lot of "third party" operations. What if the last drive in the chain 
detects an error? How should the error be reported back through the entire 
string of drives involved in the operation? We had a lot of difficulty 
accepting the concept of one "third party" drive and its error reporting 
problems. This situation is many times worse.

2) PERFORMANCE PROBLEM: Having each drive, in turn, do a part of the XOR 
operation forces a specific order on the XOR operations. Each operation will 
incur an average rotational latency delay as well as the command overhead. It 
doesn't seem that any of the work (the reads from disk, at least) can be done 
in parallel to improve performance. In the current proposal, the third 
party drive can issue all of the read commands (so all are outstanding at the 
same time) and process them in the order that they reconnect with their data. 
This allows the drives' rotational latencies to overlap, which has far more 
potential for improved performance.
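
To make the latency argument in point 2 concrete, here is a rough Python 
sketch. It is a toy model only: the function names, the per-command overhead, 
and the latency figures are assumptions made up for illustration, not 
measurements and not part of either proposal. It relies on XOR being 
commutative and associative, which is what lets the third party drive fold in 
data in whatever order the drives reconnect.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))

def regenerate(blocks_in_arrival_order):
    """Fold the blocks together in whatever order they arrive."""
    return reduce(xor_blocks, blocks_in_arrival_order)

def chained_time_ms(latencies_ms, overhead_ms):
    """Distributed chain (assumed model): each drive waits for the previous
    one, so every overhead and every rotational latency adds up."""
    return sum(overhead_ms + lat for lat in latencies_ms)

def overlapped_time_ms(latencies_ms, overhead_ms):
    """Single third party drive (assumed model): commands are issued back to
    back, then the latencies overlap and the slowest read dominates."""
    return overhead_ms * len(latencies_ms) + max(latencies_ms)

# Illustrative numbers only: four reads, 1 ms overhead per command.
latencies = [4.0, 6.0, 2.0, 8.0]
print(chained_time_ms(latencies, 1.0))     # 24.0
print(overlapped_time_ms(latencies, 1.0))  # 12.0
```

The model ignores bus contention and data transfer time, so the ratio should 
not be read as a prediction; it only shows why overlapping the latencies helps.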

3) REGENERATE WORKLOAD SHARING: For regenerate operations, the array controller 
can already force the work to be distributed across all of the drives. For 
example, in a 5 drive array with a failed drive (leaving 4 functional drives) 
the array controller can rotate between the 4 functional drives in choosing the 
"third party initiator" for each regenerate command. This distributes the 
workload so each drive does one fourth of the regenerate operations. This is 
safe and performs well as long as no two outstanding regenerate commands 
cover overlapping address ranges. This shares the workload without 
needing any changes to the existing proposal.
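
The rotation described in point 3 could be sketched from the array 
controller's side roughly as follows. This is a hypothetical illustration in 
Python: the function name, the drive labels, and the (start_lba, block_count) 
representation of a regenerate command are all assumptions, not anything 
defined by the proposal.

```python
from itertools import cycle

def assign_initiators(functional_drives, regenerate_ranges):
    """Pair each (start_lba, block_count) range with a drive, rotating
    through the functional drives so the work is spread evenly.

    Raises ValueError if any two ranges overlap, since the rotation is only
    safe when outstanding regenerate commands cover disjoint addresses.
    """
    ordered = sorted(regenerate_ranges)
    for (start, count), (next_start, _) in zip(ordered, ordered[1:]):
        if start + count > next_start:
            raise ValueError("outstanding regenerate commands overlap")
    drive_cycle = cycle(functional_drives)
    return [(next(drive_cycle), rng) for rng in regenerate_ranges]

# 5 drive array with one failed drive: 4 functional drives remain.
drives = ["drive0", "drive1", "drive2", "drive3"]
ranges = [(i * 128, 128) for i in range(8)]  # eight disjoint 128-block ranges
for drive, rng in assign_initiators(drives, ranges):
    print(drive, rng)
```

With eight disjoint ranges and four functional drives, each drive ends up as 
the third party initiator for two of the commands, i.e. one fourth of the 
work.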

4) REBUILD WORKLOAD SHARING: For rebuild operations, the drive doing the 
operation is always the replacement drive. This drive doesn't have valid data 
on it yet, so it is reasonable that it should spend all of its time rebuilding 
itself. If this activity were shared with other drives in the array, it would 
hurt the other drives' ability to perform their primary job, which is the 
delivery of customer data to the array controller. I think forcing most of the 
overhead (and responsibility) onto the replacement drive is an advantage, not a 
problem.

5) DETERMINING ERROR RESPONSIBILITY: With the distributed XOR operations 
proposed, each drive must accept a parameter list and data and send a different 
parameter list and data to the next drive. If an error crops up in the 
parameter list it could have been caused by any of the preceding drives because 
all of them manipulate and modify that list. The current proposal doesn't 
require drives to modify any parameter list or transmit a parameter list to 
another drive. If an error occurs in the list, only the array controller, the 
third party drive, and the bus can be blamed for any improper results.

I believe we need more discussion on why distributing the workload across all 
of the drives in this way is a good thing. I see a lot of disadvantages to the 
new proposal and ways to gain the same advantages with the current proposal.



