Question about SCSI consistency model

Vladislav Bolkhovitin vst at vlnb.net
Wed Nov 13 00:34:44 PST 2013


* From the T10 Reflector (t10 at t10.org), posted by:
* Vladislav Bolkhovitin <vst at vlnb.net>
*
Black, David, on 11/11/2013 09:44 PM wrote:
>> Would it make sense if we submit a formal proposal to add a bit in the
>> Control mode page saying that if it's set, for not completed WRITE
>> commands after power loss different I_T nexuses can return mix of old
>> and new data for blocks belonging to the not completed WRITEs?
>
> You can propose what you like, but there are very important uses of shared
> SCSI storage that this will break.
Yes, sure, this is why an additional Control mode page bit is proposed. It 
doesn't have to be set by all SCSI devices.
> Also see the current work on the atomic write commands (13-064)  which is
> headed in the opposite direction based on a database motivation.
Thank you for pointing it out. But I don't think this proposal goes in 
the opposite direction. This proposal is about improving the write 
performance of distributed SCSI devices for some applications, including 
journaling applications, while the atomic writes proposal is about 
removing the need for journaling. Both proposals go in the same 
direction, improving performance, but in the means they use they are 
orthogonal rather than opposing.
>> Such behavior is fully OK for modern journaled database systems, which
>> on recovery either retry those not completed WRITEs after timeout right
>> away, or, if the device was disconnected or they crashed at the same
>> time as well, replay journal, i.e. also retry the not completed WRITEs.
>
> That statement assumes that the database block size is equal to the SCSI
> logical block size - needless to say, that is a bad assumption.
Thank you for commenting, but I can't see this assumption. The only 
assumption I can see is a typical journaling implementation: (1) the 
journal is written sequentially, (2) on replay it is read sequentially 
from the beginning only once, and (3) replay stops as soon as some "old" 
data is detected. It should work for any database block size, even if 
several paths are aggregated in a round-robin manner or a failover 
happens during journal replay.
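
To make the replay argument concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions, not an implementation
of any real database or SCSI target: the Record layout (an epoch and a
sequence number used to tell "old" from "new"), the modelling of a path as
a read function, and the helper names replay and make_path are all
hypothetical. Writes that completed read as new data on every path; writes
that were in flight at power loss may read as old or new depending on the
path.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List, Sequence, Set


@dataclass(frozen=True)
class Record:
    epoch: int    # journal pass that produced the record (hypothetical field)
    seq: int      # position of the record within that pass
    payload: str


# A path (I_T nexus) is modelled as a function: slot index -> record read.
PathRead = Callable[[int], Record]


def replay(paths: Sequence[PathRead], n_slots: int, epoch: int) -> List[Record]:
    """Read the journal once, front to back; stop at the first old record."""
    applied: List[Record] = []
    next_path = cycle(paths)             # round-robin over the paths
    for slot in range(n_slots):
        rec = next(next_path)(slot)      # each slot is read exactly once
        if rec.epoch != epoch or rec.seq != slot:
            break                        # "old" data detected: stop replay
        applied.append(rec)
    return applied


def make_path(old: List[Record], new: List[Record],
              completed: int, sees_new: Set[int]) -> PathRead:
    """Slots below `completed` always read as new. In-flight slots read as
    new only if listed in `sees_new` for this particular path."""
    def read(slot: int) -> Record:
        if slot < completed or slot in sees_new:
            return new[slot]
        return old[slot]
    return read


if __name__ == "__main__":
    old = [Record(1, i, "old") for i in range(6)]
    new = [Record(2, i, "new") for i in range(6)]
    # Slots 0-2 completed; slots 3-5 were in flight at power loss.
    path_a = make_path(old, new, completed=3, sees_new={3, 5})
    path_b = make_path(old, new, completed=3, sees_new={4})
    for order in ([path_a, path_b], [path_b, path_a]):
        print([r.seq for r in replay(order, 6, epoch=2)])
```

With the paths alternated as A, B the replay applies records 0-2; with
B, A it applies records 0-5. In both cases the applied records form a
prefix of the journal that contains every completed write, and the
records after the stop point are the ones the application retries anyway.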
Thanks,
Vlad
> Thanks,
> --David
>
>> -----Original Message-----
>> From: owner-t10 at t10.org [mailto:owner-t10 at t10.org] On Behalf Of Vladislav
>> Bolkhovitin
>> Sent: Monday, November 11, 2013 4:27 PM
>> To: Knight, Frederick
>> Cc: t10 at t10.org
>> Subject: Re: Question about SCSI consistency model
>>
>> I see, thank you for your reply and detailed explanation.
>>
>> But, apparently, there is demand for such a looser consistency model.
>> One colleague told me that when he was working at IBM they were
>> considering such an optimization as well.
>>
>> Would it make sense if we submit a formal proposal to add a bit in the
>> Control mode page saying that if it's set, for not completed WRITE
>> commands after power loss different I_T nexuses can return mix of old
>> and new data for blocks belonging to the not completed WRITEs?
>>
>> Such behavior is fully OK for modern journaled database systems, which
>> on recovery either retry those not completed WRITEs after timeout right
>> away, or, if the device was disconnected or they crashed at the same
>> time as well, replay journal, i.e. also retry the not completed WRITEs.
>>
>> Vlad
>>
>> Knight, Frederick, on 10/26/2013 09:07 AM wrote:
>>> I don't believe that interpretation is valid.  Notice that the text you
>>> quote mentions nothing about the path.  It only mentions the LBAs (which are
>>> just addresses), and the data contained at that address.
>>>
>>> If you process a WRITE command to an address (to LBAs), and some data is
>>> written into those LBAs, but not all of the data is written, then a READ (to
>>> that same address) may return some of the new data that was written, and some
>>> of the old data that didn't get replaced yet, or any combination.  Remember,
>>> it is the device server doing this processing, not the path.  There is no such
>>> thing as having multiple device servers.  Look at the SAM model, and you find
>>> lots of target ports, but a single logical unit, a single task router, and a
>>> single device server.
>>>
>>> Consider a WRITE to addresses 101-110 (LBAs 101-110).  Consider a failure
>>> where that write successfully puts data into LBAs 105-110, but encounters an
>>> error before any of the other data can be written to persistent storage.  For
>>> this failure case, I would expect a READ of LBA 101-110 to return the original
>>> old data from LBAs 101-104 and then return the new data from LBAs 105-110.
>>> But, the text you are quoting allows other behaviors as well.
>>>
>>> What I do NOT find in that text, is permission to return DIFFERENT data for
>>> different READ commands to the same LBA, just because those different READ
>>> commands happen to come into the logical unit/task router/device server via
>>> different target ports (again, look at the SAM model).  The LBA is the address
>>> of the data; and the data is the data (singular).  One address (one LBA) can't
>>> have multiple DIFFERENT data values.
>>>
>>>	Fred Knight
>>>
>>>
>>> -----Original Message-----
>>> From: owner-t10 at t10.org [mailto:owner-t10 at t10.org] On Behalf Of Vladislav Bolkhovitin
>>> Sent: Thursday, October 24, 2013 1:23 PM
>>> To: t10 at t10.org
>>> Subject: Question about SCSI consistency model
>>>
>>> Hello,
>>>
>>> We are working on creating a distributed SCSI device, when several nodes are
>>> combined together to create something, which looks as a single multipath SCSI
>>> device to initiators, where each path is the path to separate node. We figured
>>> out that fully exploiting SCSI consistency model would allow us to
>>> significantly improve performance, but we wonder if the SCSI consistency model
>>> is going AS far as we need.
>>>
>>> SPC-3 section "Write and unmap failures" says:
>>>
>>>
>>> ________________________________________________________________________
>>>
>>> If one or more write commands have not completed when a power loss
>>> occurs (e.g., resulting in a vendor specific command timeout by the
>>> application client) or a medium error or hardware error occurs (e.g., because
>>> a removable medium was incorrectly unmounted), then any data in the logical
>>> blocks referenced by the LBAs specified by any of those commands is
>>> indeterminate. Before sending a read command or verify command specifying any
>>> LBAs that were specified by one of the write commands that did not complete,
>>> the application client should resend that write command. If an application
>>> client sends a read command or verify command specifying any LBAs that were
>>> specified by one of the write commands that did not complete before resending
>>> that write command, then the device server may return old data, new data,
>>> vendor-specific data, or any combination thereof for the logical blocks
>>> referenced by the specified LBAs.
>>>
>>>
>>> ________________________________________________________________________
>>>
>>>
>>> The question is if the device server after a failure of a write command on
>>> block X starts returning on reads from this block from one path - old data and
>>> from another path - new data, would it still be in line with the above SCSI
>>> consistency model?
>>>
>>> Thanks,
>>> Vlad
>>>
>>> *
>>> * For T10 Reflector information, send a message with
>>> * 'info t10' (no quotes) in the message body to majordomo at t10.org
>>>
>>
>>
>
>


