94-199R1 -- REVISED Minutes of the XOR Study Group Meeting

Milton Scritsmier milton at gastric.arraytech.com
Wed Nov 2 12:14:06 PST 1994


Paul Hodges wrote:
> 
> on September 12, 1994
> Reply-To: phodges at vnet.ibm.com
> 
> 
> In the reference, Gerry Houlder wrote
> >
> > (5) Multi-controller data validation problem - Paul Hodges (IBM) posed
> > the problem of one initiator doing an update write (XDWRITE with an
> > XPWRITE to another drive) while another initiator is doing a
> > regenerate on the same LBAs.  The regenerate operation could read new
> > data from the data drive (because XDWRITE is done or a cache hit on
> > new data occurs) and get old data from the parity drive (because
> > XPWRITE hasn't happened yet).
> >
> > Our conclusion is that this problem is not unique to XOR command
> > architectures and can only be solved by having RAID controllers
> > cooperate with each other on such activities.  We didn't identify any
> > particular implementation rules that should be added to the XOR
> > commands.
> >
> 
> Unfortunately, I obscured the real problem by presenting it in the
> context of multiple controllers.
> 
> It is indeed true that the problem of multiple RAID controllers is one
> that the controller design must solve.  In fact, most, if not all,
> subsystems with multiple RAID controllers apply restrictions so that
> the drives of a single parity group are controlled by a single RAID
> controller.
> 
> On the other hand, the Drive-XOR proposal is inherently one of
> multiple initiators that do not coordinate their operations.  The
> sequence of operations at a drive is not controlled by a single
> intelligence, and the condition described in the above example can be
> the result of successive operations from the same controller.
> Therefore, the data integrity exposure exists in the single controller
> environment.
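
To make the exposure concrete, here is a minimal Python sketch. It assumes a single-parity stripe in which parity is the byte-wise XOR of the data blocks. The 3+1 layout, the block values, and the helper names `xor_blocks` and `regenerate` are illustrative assumptions and do not come from the XOR proposal. The sketch regenerates a block twice. The first time falls in the window after XDWRITE has stored new data but before the matching XPWRITE has updated parity. The second time follows XPWRITE's completion.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def regenerate(missing: int, data: list, parity: bytes) -> bytes:
    """Rebuild data block `missing` from parity and the surviving data blocks."""
    survivors = [blk for i, blk in enumerate(data) if i != missing]
    return xor_blocks(parity, *survivors)

# Hypothetical 3+1 stripe: parity = D0 ^ D1 ^ D2.
d0_old, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity_old = xor_blocks(d0_old, d1, d2)

# Update write to D0: XDWRITE stores the new data and yields old ^ new...
d0_new = b"\xaa\xbb"
delta = xor_blocks(d0_old, d0_new)
# ...but the XPWRITE (parity ^= delta) has not reached the parity drive yet.

# Regenerate D1 in that window: new data from the data drive, old parity.
stale = regenerate(1, [d0_new, d1, d2], parity_old)

# Once XPWRITE completes, parity is consistent and regeneration is correct.
parity_new = xor_blocks(parity_old, delta)
fresh = regenerate(1, [d0_new, d1, d2], parity_new)
```

With stale parity, the regenerated block comes out as D1 XOR (old D0 XOR new D0) instead of D1. The result is therefore wrong whenever the update changes D0, until the XPWRITE lands.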

But the controller ultimately coordinates which XORs are being done by the
devices on the bus. It simply needs to make sure that outstanding XORs do
not trample on one another. I think you will find that most RAID
controllers today have to deal with the same issue even when they control
all XORs directly. After all, a RAID controller has no control over which
commands it receives from the host, and those commands, if executed without
regard to operations still outstanding, could cause the same data integrity
problems.

In the example above, suppose the controller has issued a number of XDWRITEs
during a regeneration and then discovers that it needs to do an update write
in the same area. It simply holds off issuing the XDWRITEs for the update
until the regeneration operations for that area have completed. (Obviously,
you don't want the regeneration operation to cover whole disks with one
set of XOR commands.) Our RAID controller, which uses normal writes, does
essentially the same thing.
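
The hold-off described above can be sketched as range bookkeeping on the controller. This is a minimal Python sketch under three stated assumptions. First, LBA ranges are half-open. Second, regeneration is split into fixed-size chunks. Third, the names `RegionScheduler`, `regen_chunks`, and their methods are hypothetical and are not taken from any real controller firmware. An update whose range overlaps an in-flight regeneration chunk is queued. It is released only when no overlapping chunk remains outstanding.

```python
class RegionScheduler:
    """Hypothetical controller bookkeeping: defer update writes (XDWRITEs)
    whose LBA range overlaps a regeneration chunk that is still in flight."""

    def __init__(self):
        self.in_flight_regen = []   # half-open (start, end) LBA ranges
        self.deferred_updates = []  # updates held off for now
        self.issued_updates = []    # updates whose XDWRITEs may be issued

    @staticmethod
    def _overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def _blocked(self, rng):
        return any(self._overlaps(rng, r) for r in self.in_flight_regen)

    def start_regen(self, start, end):
        self.in_flight_regen.append((start, end))

    def request_update(self, start, end):
        rng = (start, end)
        if self._blocked(rng):
            self.deferred_updates.append(rng)  # hold off the XDWRITE
        else:
            self.issued_updates.append(rng)    # safe to issue now

    def complete_regen(self, start, end):
        self.in_flight_regen.remove((start, end))
        still_blocked = []
        for rng in self.deferred_updates:
            if self._blocked(rng):
                still_blocked.append(rng)
            else:
                self.issued_updates.append(rng)  # no overlap left: release
        self.deferred_updates = still_blocked


def regen_chunks(first_lba, last_lba, chunk):
    """Split a regeneration into half-open LBA chunks instead of one whole-disk pass."""
    return [(s, min(s + chunk, last_lba)) for s in range(first_lba, last_lba, chunk)]


# Usage: two regeneration chunks in flight, then two update writes arrive.
sched = RegionScheduler()
chunks = regen_chunks(0, 4096, 1024)
for c in chunks[:2]:
    sched.start_regen(*c)
sched.request_update(1500, 1600)  # overlaps (1024, 2048): deferred
sched.request_update(3500, 3600)  # no overlap: issued immediately
sched.complete_regen(0, 1024)     # (1500, 1600) is still blocked
sched.complete_regen(1024, 2048)  # (1500, 1600) is now released
```

Splitting the regeneration into chunks keeps the wait short for any update that collides with it. This is why a single set of XOR commands should not regenerate a whole disk.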

> There may be scenarios other than the one identified above that can
> cause incorrect data to be sent to the application.
> 

I agree, particularly when it comes to handling drive errors that occur
during the XORs.

      Milton Scritsmier
      Array Technology, a division of EMC
      
      milton at arraytech.com
