Notes on OBSD protocol standard text

Richard Golding golding at panasas.com
Tue Sep 5 16:30:59 PDT 2000


This document summarizes some discussions that Jim Hafner, Randal Burns,
and I have had over the weeks since the Colorado Springs T10/OSD WG
meeting.  This document is structured as a list of the major topic areas
we have discussed.  I have tried to represent their positions; any
errors are my responsibility.

Our intent has been to define a small core of functionality, leaving
out the fancier or more complicated bits or at least presenting them
explicitly as separate extensions.  The goal in doing so is to help
reach consensus about the core without derailing discussion over
potentially more contentious features.

Topic areas

* Bidirectional data transfer (data in and out on one command)

  This is seen as an essential requirement for building a useful OSD
  standard.  We have continued the design assuming it is available.

* Open/close vs. sessions

  There are two separable issues: establishing sessions that are a
  (potential) reservation of storage device resources, and sessions
  that indicate usage patterns on an object.  The former should be
  called sessions; this is a separate discussion topic below.  The
  open and close operations should then be hints, indicating to cache
  management mechanisms which objects are likely/unlikely to be used
  in the near future.

  In the NSIC/NASD group, there was much debate over this.  Some
  wanted an open/close operation in the standard, so that an OSD would
  know what objects were active and hence needed resources allocated
  for them.  It could also pass some more detailed hints about
  expected access pattern for the specific object -- read only, write
  only, sequential, random, ... The CMU position was that open/close
  weren't needed; the CMU NASD system worked just fine without it.
  The compromise was that open and close should be hints that a client
  can send to an OSD to suggest that an object is about to see use.
  The open-as-hint compromise dissociates these operations from any
  notion of session: a client can access an object without opening it
  if it wants to.  Under ordinary circumstances each open should be
  matched by a close; the recommended implementation is that the OSD
  maintain an "open count" on each object, incremented on each open
  request and decremented on each close.  However, an OSD
  should be free to ignore an open (or close) if it chooses to, and to
  toss out any information it may have on an open any time it wants.
  A close that the OSD cannot match up to an open could give an
  informational return to the effect that the close didn't match an
  open, but this should not be taken to indicate an error condition.
  There might be value in including some kind of tag with an open that
  an initiator could send with a close to help match the close to the
  open.
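
  A minimal sketch, in Python, of the recommended open-count
  bookkeeping (all names here are illustrative, not taken from any
  standard):

      class OpenHintTable:
          """Open/close are hints; the OSD may drop this state at will."""

          def __init__(self):
              self.open_counts = {}  # object id -> count of unmatched opens

          def open(self, obj_id, tag=None):
              # Record the hint so cache management can prefer to keep
              # this object's state resident.  A tag could help match a
              # later close to this open; it is ignored in this sketch.
              self.open_counts[obj_id] = self.open_counts.get(obj_id, 0) + 1

          def close(self, obj_id):
              # Returns True if the close matched an open.  An unmatched
              # close is informational only, never an error.
              count = self.open_counts.get(obj_id, 0)
              if count == 0:
                  return False
              if count == 1:
                  del self.open_counts[obj_id]
              else:
                  self.open_counts[obj_id] = count - 1
              return True

          def forget(self, obj_id):
              # The OSD is free to toss out open state any time it wants.
              self.open_counts.pop(obj_id, None)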

* Lock/unlock

  We suggest that this is best left to an optional extension.

  Doing concurrency control well in a distributed system -- the
  environment in which I expect OSDs to see most use -- is a
  complicated issue.

* Object allocation size, logical size, and end-of-object

  There are two separate "size" values for an object:

  - allocation size: the amount of media resource the object consumes.

  - logical length: the number of bytes in the object's logical address
    space.

  The allocation size is affected by creating the object, writing to
  it, and truncating its logical address space.  It isn't set directly
  by the initiator; rather, it is a side-effect of the internal
  allocation actions that an OSD may take.  (On a side note,
  supporting a sparse address space, or "holes", is important.)  This
  number does not have to be exact and can change without an initiator
  doing anything (e.g. when an OSD reorganizes blocks on media,
  increasing extent size and reducing the amount of metadata stored
  for the object).

  The logical length is intended to be the value that would be
  reported for "end of object".  It is measured in bytes.  The logical
  length is not an indication of the physical space allocated to the
  object.  The following invariants should hold:

    logical length >= (offset of any byte written to the object) + 1

  When n bytes of data are written at offset f (so the last written
  byte is at offset f+n-1),

    logical length := max(logical length, f+n)

  Holes in the data are allowed, and all data in the object are
  initially zero.

  An append writes data beginning at logical length.

  A write past the logical length extends the logical length to cover
  the newly-written data.

  Add a "clear" command that can decrease the logical length of the
  object (or introduce holes in the data).  Any data in a cleared
  region become zero.

  When a read occurs, logical length doesn't come into play.  For the
  regions being read that have written data in them, those data are
  returned.  For the regions that don't have written data in them
  (holes, or anything beyond the logical length) all zero bytes are
  returned.

  When an object is created, two size values can be supplied:

  1.  An initial value for the logical length.  This would make an
      append performed immediately after creation write its data at
      some offset greater than zero.

  2.  A hint for the expected eventual logical length.  This could be
      used to guide an OSD's internal allocation mechanism (for
      example, by trying to allocate an extent in which the entire
      object could fit.)

  This is roughly Unix file semantics, and is what one would build a
  file system upon.
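
  A minimal sketch, in Python, of the size semantics above, using a
  byte-granular dict as a stand-in for sparse allocation (illustrative
  only; a real OSD would use extents):

      class SparseObject:
          def __init__(self, initial_length=0):
              self.bytes = {}          # offset -> byte value; holes absent
              self.logical_length = initial_length

          def write(self, f, payload):
              # Writing n bytes at offset f implies
              #   logical length := max(logical length, f+n)
              for i, b in enumerate(payload):
                  self.bytes[f + i] = b
              self.logical_length = max(self.logical_length,
                                        f + len(payload))

          def append(self, payload):
              # An append writes beginning at the current logical length.
              self.write(self.logical_length, payload)

          def read(self, f, n):
              # Logical length plays no role in reads: holes and anything
              # beyond the end read back as zero bytes.
              return bytes(self.bytes.get(f + i, 0) for i in range(n))

          def clear(self, f, n):
              # Clear introduces a hole; cleared data read back as zero.
              for off in range(f, f + n):
                  self.bytes.pop(off, None)

          def truncate(self, new_length):
              # A clear at the end of the object can also decrease the
              # logical length.
              for off in [o for o in self.bytes if o >= new_length]:
                  del self.bytes[off]
              self.logical_length = new_length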

* Delete of group

  This should include a flag to indicate whether a non-empty group
  should be deleted.  If a non-empty group is deleted, the effect is
  as if each object in the group were deleted, and then the group
  itself deleted.  There is probably an error case where some object
  can't be deleted and so a delete-nonempty-group command will fail.

* Scrubbing

  Any operation that deletes data should include a "scrubbing" option
  bit that will cause the drive to write zeros over the media that was
  associated with the data being deleted.  This won't affect the
  logical presentation of data, nor will it stop the truly determined,
  but some customers want the additional protection.

* Clear all

  Concept: there is an "empty" OSD configuration, consisting of just
  the drive with the device control object and its attributes, and no
  object groups or objects.

  A "clear all" operation deletes all objects and object groups in
  order to reach this state.  The attributes on the device control
  object are retained, though this operation should have a mechanism
  to support changing some attributes atomically with the clear all
  command (e.g. to change security keys.)

* Relationship of OSDs and BSDs

  This is outside the scope of this standard.  It might be nice to
  have a mechanism to convert a disk to an OSD and back again (for
  conservation and migration reasons).  But that can easily be
  vendor-specific.  One customer requirement I'd have (if I were going
  to buy one) is that if this transformation occurs, then the device
  immediately does a low-level format to zero out the physical device.
  That is exactly what you would want, but it needn't be in any
  standard; it can be customer driven.  It's not clear what all this
  means in the context
  of a controller, however.

* Compound commands

  This is outside the scope of this standard.  We don't see a
  compelling need for it now.

* Proximity hints

  This is on the borderline of suitability for the core versus being
  an extension.  The argument is that existing server-based file
  systems use a number of heuristics to try to place files together.
  The information that drives these heuristics is not available to an
  OSD, but the performance effects of nearby placement are still
  important.  Providing an attribute that hints at sets of objects
  that should be placed together seems like a valuable extension to
  the core function, but not core function itself.

* Attributes

  Requirements

    - support for inode-like data
    - must not inhibit fast internal access to a few common attributes
    - extensibility; support for revisions of standard
    - little or no overhead when attributes are not needed
    - support for vendor-unique attributes
    - identification and resolution of what attributes are supported
    - doesn't get in the way of parsing in hardware
    - can be serialized for communicating as a command argument

    Note that attributes have an effect on the structure of
    capabilities, and on commands such as LIST.

  Proposal

    - Basic ideas are to group attributes into pages, and associate
      with each page an identification of the standard that defines
      the page.  For example, there could be a timestamp page; upon
      inquiring, an initiator could find out that the timestamp page
      conforms to standard "XYZ-123 rev 29, 2 April 2005".  The
      standard would define the set of attributes on the page, and
      their identities, sizes, formats, and defaults.

    - The basic interface is a set attribute command (sometimes
      piggybacked on another command, such as create) and a get
      attribute command.  The data transferred in or out as a
      parameter on these commands has a general serialized form.  The
      internal structure of the attributes (on the drive) may be quite
      different from the serialized form.

    - The full serialization form is (defined in a regexp-like
      notation):

          serialization := number-of-attributes ( attrib-spec )*

          attrib-spec := attrib-key value

          attrib-key := page attrib-id

          value := value-length bytes[value-length]

          page := 16-bit integer

          attrib-id := 16-bit integer

          value-length := 16-bit integer

          number-of-attributes := 32-bit integer

      The full serialization form is used for sending data down on a
      set attribute command, and for getting data back on a get
      attribute command.  An open question: should this serialization
      form be registered as a standard MIME type?

      There is also a reduced serialization form, which is used to
      send down the list of attributes requested on a get attribute
      command.  It is:

          reduced-serialization := number-of-attributes ( reduced-attrib-spec )*

          reduced-attrib-spec := attrib-key

      (A sketch in code of both serialization forms appears after this
      list of proposal items.)

    - Attributes 0x0000 and 0xffff within a page are special.
      If the page is supported, attribute 0x0000 is the specification
      version that the page conforms to.  Attribute 0xffff is used as
      a wildcard in some cases.

    - Page 0x0000 is a special, mandatory page.  Attribute X on page
      0x0000 also indicates the protocol revision for page X (in
      addition to attribute 0x0000 on page X).  An initiator can list
      the pages that are supported on an object by doing a get
      attributes on page 0x0000, attribute 0xffff (see the next
      bullet).

    - The get attributes command takes as argument a
      reduced-serialization, which specifies the attributes that the
      initiator wants to know about.  If the attribute 0xffff is named
      within a page, all attributes are returned.  This means that the
      number of entries in the input serialization can be much smaller
      than the number of entries in the output serialization.
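
    A sketch, in Python, of both serialization forms.  Big-endian byte
    order is an assumption here; the proposal does not fix one.

      import struct

      def serialize_attributes(attrs):
          # Full form: number-of-attributes, then for each attribute the
          # attrib-key (page, attrib-id) followed by value-length and
          # the value bytes.  attrs: list of ((page, attrib_id), value).
          out = [struct.pack(">I", len(attrs))]
          for (page, attrib_id), value in attrs:
              out.append(struct.pack(">HHH", page, attrib_id, len(value)))
              out.append(value)
          return b"".join(out)

      def deserialize_attributes(buf):
          (count,) = struct.unpack_from(">I", buf, 0)
          pos, attrs = 4, []
          for _ in range(count):
              page, attrib_id, vlen = struct.unpack_from(">HHH", buf, pos)
              pos += 6
              attrs.append(((page, attrib_id), buf[pos:pos + vlen]))
              pos += vlen
          return attrs

      def serialize_request(keys):
          # Reduced form: just the attrib-keys.  For example, the key
          # (0x0000, 0xffff) asks for every attribute on page 0x0000,
          # i.e. the list of supported pages.
          out = [struct.pack(">I", len(keys))]
          out.extend(struct.pack(">HH", p, a) for p, a in keys)
          return b"".join(out)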

  Getting the protocol version support value right is important.  It
  should almost certainly have a fixed size, so that the data are easy
  to generate in hardware.  The identification of specifications
  should be based on some international standard, and should include
  space for a specification authority.  This allows both for
  international standards supported by major organizations, such as
  IETF, NCITS, or ISO, and for standards private to vendors or
  consortia (IBM, Panasas, SNIA, ...).

  Attributes should be specifiable on object/group create, device
  reinitialize, and get/set attributes.  The list operation on a
  device control object or group control object should support the
  identification of one attribute to sort on, and possibly a value for
  filtering on that attribute.  Some of the other commands
  (e.g. write, clear) will have side-effects on some attributes.

* Preventing orphaned/leaked objects

  The requirement is that it must be possible to create objects as
  part of a distributed system without leaking (orphaning) them if an
  external agent fails during the creation process.

  The solution approaches we know of so far:

  Externally-specified id: The external agent picks the id for the
      object to be created.  The external agent can log the id before
      sending the create; the transaction can be rolled back by
      removing the object if it exists, or rolled forward by
      completing any other operations (like writing data or setting
      attributes) if the object does exist.

      +: simple
      -: doesn't accommodate multiple external agents well

  Standard 2PC: The drive picks the object id, and returns it to the
      external agent, reserving the id but not making the object
      accessible.  The external agent includes a transaction ID
      attribute with the object creation request.  The external agent
      sends a commit to make the object permanent and accessible.  If
      the external agent fails after getting and logging the object
      id, there is enough information to roll the transaction forward
      later.  If the external agent fails before receiving and logging
      the object id, the transaction can be recovered by querying the
      object group for any uncommitted objects with a particular
      transaction id.

      +: clock-free
      -: means that external agents have to poll drives from time to
         time to find objects that might have been orphaned and for
         which log records have been lost

  Expiration timeout: The external agent sends a create request,
      specifying an expiration timeout for the object.  The drive picks
      the object id, and sends it to the external agent as
      acknowledgment.  The external agent then changes the timeout to
      infinity, making the object permanent. If the external agent
      fails before changing the timeout, the object will delete itself
      and in effect roll back the transaction.  If the external agent
      fails after changing the timeout, there is enough information to
      roll the transaction forward.

      +: free from periodic polling
      -: requires a clock; deletes objects without external agents
         initiating the deletion

  Drive-activated recovery: The external agent sends a create request,
      specifying that the object is not committed.  The drive picks
      the object id, and sends it to the external agent as
      acknowledgment.  The external agent then performs a commit on
      the object.  If the external agent fails before committing the
      object, the drive will eventually (after some timeout period)
      contact an outside agent and give it the identity of the
      uncommitted object.  If the external agent fails after committing
      the object, there is enough information at the external agent to
      roll the transaction forward.

      +: polling-free; no overhead until failure
      -: requires drives be able to initiate communication with
         somebody (and know who that somebody should be)

  Recommended solution: Implement two-phase creation using existing
      mechanisms.  The creator can set some (perhaps vendor-unique)
      attributes atomically as part of object creation, and the drive
      can presume successful creation.  Some external agent can poll
      the drive from time to time, performing a LIST for objects that
      match the given attribute pattern.  The agent can then determine
      how to roll the transaction backwards or forwards.
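
  A minimal sketch, in Python, of the recommended approach.  The drive
  interface and the vendor-unique attribute key are hypothetical; the
  in-memory drive below is only a stand-in:

      UNCOMMITTED = (0x8000, 0x0001)   # hypothetical vendor-unique key

      class FakeDrive:
          def __init__(self):
              self.objects = {}        # obj id -> {attr key: value}
              self.next_id = 1

          def create_object(self, attrs):
              # The creator's attributes are set atomically with the
              # create; the drive itself presumes success.
              obj_id, self.next_id = self.next_id, self.next_id + 1
              self.objects[obj_id] = dict(attrs)
              return obj_id

          def set_attribute(self, obj_id, key, value):
              self.objects[obj_id][key] = value

          def delete_object(self, obj_id):
              del self.objects[obj_id]

          def list_matching(self, key):
              # Stand-in for a LIST filtered on an attribute pattern.
              return [(oid, a[key]) for oid, a in self.objects.items()
                      if a.get(key)]

      def create_committed(drive, txn_id):
          # Phase 1: create with the uncommitted marker set atomically.
          obj_id = drive.create_object({UNCOMMITTED: txn_id})
          # ... write data, set further attributes ...
          # Phase 2: clearing the marker commits the object.
          drive.set_attribute(obj_id, UNCOMMITTED, b"")
          return obj_id

      def scrub_orphans(drive, txn_still_running):
          # An external agent polls from time to time, LISTing objects
          # whose marker is still set, and rolls each transaction
          # backwards (here, delete) or forwards as its own log dictates.
          for obj_id, txn_id in drive.list_matching(UNCOMMITTED):
              if not txn_still_running(txn_id):
                  drive.delete_object(obj_id)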

* Clock requirements and synchronization

  Requirements

  The clock model is driven by the following requirements.

  1.  Objects have timestamp attributes, which are updated at various
      events in an object's life cycle.  An application reading these
      timestamp values must be able to interpret them to obtain a wall
      clock time that is accurate to within a small delta.  At times,
      agents outside the OSD will read timestamps from different OSDs
      and compare them to determine an actual temporal ordering of
      creation or update.

  2.  (Extension) Security capabilities have expirations, and the
      actual duration of validity should match the duration intended
      when the capability was issued.

  3.  The interface from the OSD to the outside world, in all of these
      cases, should be in terms of wall clock time rather than a
      private time base.

  4.  OSDs that are part of a network may need accurate clocks in
      order to be good network citizens (e.g. to participate in DHCP).
      These protocols often require only that the OSD be able to
      accurately measure a duration, rather than accurately know the
      specific time.

  Clock model

  The OSD must have a clock that runs at a steady rate while the OSD
  is accepting commands.  It should be possible to map any timestamp
  obtained from the OSD clock to and from wall clock time using a
  simple linear transformation.  This is possible if the clock ticks
  at a steady rate, and there is some time at which the correspondence
  between the OSD clock and wall clock time is known.

  Note that an OSD that employs a clock synchronization mechanism
  meets these criteria -- it has a steady-state drift rate of zero,
  and the OSD clock always corresponds to wall clock time within some
  bound.

  A clock that only ticks while the OSD is powered requires special
  consideration.  After power-up, the OSD must not accept commands
  that would reference the clock -- which includes read and write
  commands -- until the clock has been synchronized.
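
  A small sketch, in Python, of the linear transformation this model
  implies: one synchronization point plus a steady tick rate suffice
  to convert in either direction (names are illustrative):

      class ClockMap:
          def __init__(self, tick0, wall0, seconds_per_tick):
              # (tick0, wall0) is a time at which the correspondence
              # between the OSD clock and wall clock time is known.
              self.tick0, self.wall0 = tick0, wall0
              self.seconds_per_tick = seconds_per_tick

          def to_wall(self, tick):
              return self.wall0 + (tick - self.tick0) * self.seconds_per_tick

          def to_tick(self, wall):
              return self.tick0 + (wall - self.wall0) / self.seconds_per_tick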

  Suggested standard

  The standard should only specify the above requirements, rather than
  mandate a specific implementation mechanism.

* Security

  The proposed approach is derived from the CMU NASD security
  approach, with certain simplifications and generalizations.  The
  security appendix in the current OSD document is roughly similar to
  this one; there are, however, a number of simplifications in this
  proposal.

  - The OSD protocol is only concerned with access control.  Privacy
    and integrity are left to the transport protocol on which the OSD
    protocol is layered.

  - The basic approach is to use signed capabilities.  A capability
    grants a certain set of rights on a certain set of objects within
    a single object group, or upon the group or drive as a whole.  A
    capability is valid until a particular time, as long as its
    capability version matches, and as long as its signature can be
    verified.

  - The drive has a hierarchy of keys.  The drive as a whole has a
    master and a working key.  Each object group has a master and two
    working keys (A and B).  The hierarchy order is drive master >
    drive working > objgroup master > (objgroup working A = objgroup
    working B).

    A key is changed by a SET KEY command, which requires a capability
    signed by either the previous key value at the same level or by
    any key higher in the hierarchy.  That is, the object group master
    can be changed by somebody who presents a capability signed by the
    previous object group master, or either of the drive keys, but not
    signed by either of the object group working keys.  The two object
    group working keys are considered equal in the hierarchy, so the A
    key can be changed by a cap signed by the B key.

    Typical usage is that the drive and object group master keys are
    changed very rarely.  Copies of the drive master keys are
    maintained outside the system in some secure fashion (e.g. on a
    piece of paper in a locked vault.)  Working keys are changed
    fairly regularly.

  - Capabilities have the following fields:

      the entities on which the capability is valid:

        target_id (unique identifier for the OSD)
        object_group (zero for whole-device capabilities)
        object_selector

      the actions authorized by the capability:

        action_mask (what actions are permitted by this capability)

      the limitations on and evidence of validity:

        capability_version
        expiration time (relative to the drive's clock)
        key_id
        hash

  - The object selector replaces (and generalizes) the former notion
    of flavors.  The basic notion is that the selector identifies
    whether the capability is valid on any object, on a particular
    named object, or on any object that matches a predicate on its
    attribute values.

  - The action mask does not explicitly name the commands that are
    approved using the capability; rather, it names a set of general
    privileges including read data, write data, delete data, read
    attribute, write attribute.  Each command will require some subset
    of these privileges.

  - The capability version field of the capability provides for fast
    revocation of all outstanding caps.  Each object (including the
    DCO and GCOs) has a "capability version" attribute.  A cap is
    only valid when the capability version attribute of the object
    addressed equals the capability version field of the cap.  To
    revoke all caps, just change the capability version attribute on
    the object.  (A sketch of capability validation and the key
    hierarchy rule follows this list.)
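
  A sketch, in Python, of capability validation and the SET KEY
  hierarchy rule.  The byte layout being signed and the use of
  HMAC-SHA1 are assumptions for illustration only:

      import hmac, hashlib
      from dataclasses import dataclass

      @dataclass
      class Capability:
          target_id: int
          object_group: int
          object_selector: bytes
          action_mask: int
          capability_version: int
          expiration: int            # in drive clock ticks
          key_id: int
          hash: bytes                # the signature field

          def signed_bytes(self):
              # Hypothetical canonical form of the fields covered by
              # the signature (everything except the hash itself).
              return b"|".join(str(f).encode() for f in (
                  self.target_id, self.object_group, self.object_selector,
                  self.action_mask, self.capability_version,
                  self.expiration, self.key_id))

      def capability_valid(cap, key, now, object_cap_version, needed):
          # Valid only while unexpired, while the object's capability
          # version attribute matches (changing that attribute revokes
          # all outstanding caps at once), when every needed privilege
          # is in the action mask, and when the signature verifies.
          expected = hmac.new(key, cap.signed_bytes(), hashlib.sha1).digest()
          return (now < cap.expiration
                  and cap.capability_version == object_cap_version
                  and needed & cap.action_mask == needed
                  and hmac.compare_digest(cap.hash, expected))

      # SET KEY: a key may be changed with a capability signed by the
      # previous value at the same level or by any higher key; the two
      # object group working keys rank equally.
      KEY_RANK = {"drive_master": 3, "drive_working": 2,
                  "group_master": 1, "group_working_a": 0,
                  "group_working_b": 0}

      def may_set_key(signing_key, target_key):
          return KEY_RANK[signing_key] >= KEY_RANK[target_key]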


* Sessions

  We view sessions as an essential feature for future development of
  OSDs, but they are going to require a lot of work to develop.
  Moreover, much of the semantics needed in sessions is not specific
  to an OSD; rather, some aspects are best made part of the transport
  while others apply to all storage devices.

  Our position is to define a skeleton of a model for sessions,
  sufficient to allow us to define reserved fields in this version of
  the protocol specification, but no more than that.

  Basic assumptions: the session or transport layer on which the OSD
  protocol is built will provide a session concept, which we refer to
  as a session-layer session or SLS.  An SLS is some kind of network
  connection between an initiator and a target (in SCSI terms), and is
  concerned only with the network communication aspects of QoS.  An
  SLS may include security qualities
  (privacy, integrity); performance qualities (throughput, latency,
  jitter); and reliability qualities (frequency and duration of
  disconnection, packet loss rate).  Things like communication buffer
  sizing are mechanisms to effect these qualities.

  The application-layer session (ALS hereafter) concept deals with the
  storage media aspects of the stream of commands and data coming over
  an SLS.  The things relevant here include the scoping and
  identification of access patterns (random, sequential, strided),
  separating sequences of commands (e.g. prefetching information
  collected separately for different ALSes), and fair usage of cache
  space and drive bandwidth.

  Multiple ALSes can be multiplexed over one SLS, representing the
  streams of commands from different applications running on one
  initiator.  Each command sent to an OSD will occur in the context of
  some ALS, and each ALS is within the context of an SLS.  The session
  layer defines all of the SLS concepts, which (with one exception
  that will come up later, on the copy commands) do not show up in the
  application-layer protocol.  There is always one ALS within an SLS,
  ALS zero, which represents the default best-effort session.

  For this version of the OSD protocol, we reserve an argument on all
  commands to specify the ALS in which the command occurs.  This
  argument must always be zero in this protocol version.  (In network
  protocol design terms, we are in effect wrapping a degenerate
  message header around the basic command messages.)

  This approach allows us to develop the details of the session
  mechanism without blocking development of the core OSD semantics.
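
  A tiny sketch, in Python, of the degenerate header (the four-byte
  field width is an assumption; only the always-zero rule comes from
  the text above):

      def wrap_command(command_bytes, als_id=0):
          # Every command carries an ALS identifier; in this protocol
          # version it must always be zero (the default best-effort ALS).
          if als_id != 0:
              raise ValueError("only ALS zero exists in this version")
          return als_id.to_bytes(4, "big") + command_bytes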


* Copying

  This is an area for extending the core OSD protocol.

  We have defined four different kinds of copying: (intra-
  vs. inter-device) X (single-object vs. object group).  All of these
  combinations have value, and all four should be supported.

  Single object, intra-device: this has been called the fast copy or
  clone operation.  Given the id of a regular (not control) object, it
  creates a new object in the same object group with identical content
  in both attributes and data.  The expectation is that this would be
  implemented using some kind of copy on write mechanism inside the
  drive.  The assumption is that the new object would be writable.  We
  view this as a base mechanism on which file versioning can be
  implemented.

  Object group, intra-device: this is a point-in-time snapshot of an
  entire object group.  It creates a new object group with a new group
  id, with all GCO attributes copied from the old object group's GCO.
  The new object group would contain the same number of objects as the
  old object group, numbered the same way, with the same contents.  It
  is an open question whether the new object group needs to be
  writable or not.  This is a base mechanism on which backup support
  can be implemented.

  Single object, inter-device: this is the import or export of one
  object from one OSD to another.  It is an open question whether this
  should be done as an import or an export -- that is, driven by the
  sender or the receiver of the copied object.  Assuming it is
  implemented as an object export, this operation would stream all of
  the contents of an object out to a target drive, which would create
  a new object.  At the end of the export, the new object would have
  data content identical to the original object; it is preferred that
  sparseness be preserved.  The attributes of the new object should be
  generally identical; there are details to work out with creation and
  access timestamps.  This is a base mechanism for implementing
  object-level migration and load balancing.

  Object group, inter-device: this would stream out the contents of an
  entire object group to a target device, creating a new object group
  on the target.  Object ids, attributes, and contents would be
  preserved through the copy, just as with an object group
  intra-device snapshot.  This is a base mechanism for implementing
  backup and restore.

  In all of these cases, the solution we adopt for handling orphaned
  objects on ordinary create should be used (especially when we could
  be orphaning entire object groups).

  It is attractive to define a serialized form of an object, so that
  we have an implementation-neutral way of representing sparse data
  and of resolving potentially variant attribute support.  The
  following is an abstract specification of that serialization (using
  the attribute serialization defined earlier):

          object-serialization := object-id attribute-serialization
                  data-serialization
          object-id := standard OSD object identifier (not including
                  group id)
          attribute-serialization := (as defined earlier)
          data-serialization := data-range* end-data-range
          data-range := "data" offset length data[length]
          end-data-range := "end"

          where "data" and "end" are some one-byte record type
          identifiers. The data ranges can be transmitted in any order
          but may not overlap.
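
  A sketch, in Python, of this object serialization.  The record type
  byte values and the field widths (64-bit object id, offsets, and
  lengths) are assumptions; the abstract form above leaves them open:

      import struct

      DATA, END = b"\x01", b"\x00"   # hypothetical one-byte record types

      def serialize_object(object_id, attrs, ranges):
          # attrs: list of ((page, attrib_id), value), serialized in the
          # attribute form defined earlier; ranges: (offset, payload)
          # pairs, transmitted in any order but never overlapping.
          out = [struct.pack(">Q", object_id),
                 struct.pack(">I", len(attrs))]
          for (page, attrib_id), value in attrs:
              out.append(struct.pack(">HHH", page, attrib_id, len(value)))
              out.append(value)
          for offset, payload in ranges:
              out.append(DATA + struct.pack(">QQ", offset, len(payload)))
              out.append(payload)
          out.append(END)
          return b"".join(out)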

  Copying and sessions

  The commands used to perform inter-device copies involve three
  players:

  - the initiator of the copy

  - the source of the data

  - the sink of the data

  The initiator must be able to send a single command to the source
  (if copying is modeled as an export) or sink (if modeled as an
  import) to cause the copying to occur.

  The data is transferred directly from the source to the sink,
  without going through the initiator.  This implies, however, that
  there is a transport connection between the source and sink.
  Setting up and selecting this connection requires exposing the
  notion of transport identification in the application-layer OSD
  protocol.

  The recommended solution is as follows.

  1.  The transport protocol must provide a mechanism for third-party
      session creation and manipulation.  This mechanism is outside
      the scope of the OSD protocol document.

  2.  The third-party session creation mechanism must provide an
      opaque identifier for a session.

  3.  The OSD protocol must provide a mechanism for third-party
      application-layer session creation and manipulation.  This
      mechanism is within the scope of the OSD protocol.

  4.  The OSD third-party ALS manipulation mechanism must provide an
      opaque identifier for an ALS.

  5.  The third-party copy command sent from the initiator to the
      source (or sink) must include the transport-layer and
      application-layer opaque session identifiers, so that the source
      (sink) knows what ALS/SLS to send (receive) data on.
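
  A sketch, in Python, of what an export-style third-party copy
  command might carry; every name here is hypothetical:

      from dataclasses import dataclass

      @dataclass
      class ExportObjectCommand:
          object_group: int
          object_id: int
          sls_id: bytes      # opaque transport-layer session identifier
          als_id: bytes      # opaque application-layer session identifier

      def handle_export(source, cmd):
          # The source streams the serialized object (see the earlier
          # serialization sketch) over the session the initiator named,
          # directly to the sink; the data never passes through the
          # initiator.  read_serialized and send_on_session are
          # stand-ins for drive internals.
          blob = source.read_serialized(cmd.object_group, cmd.object_id)
          source.send_on_session(cmd.sls_id, cmd.als_id, blob)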

  The alternative is to provide a mechanism for encapsulating a
  session description in the third-party copy command used by the
  initiator, so that the source would know what kind of session to
  create to send the data on.  The alternative is weaker than the
  proposal for three reasons: (a) it exposes details of
  transport-level sessions to the OSD protocol, potentially raising
  problems with hosting the OSD protocol on multiple transports;
  (b) it does not allow copies to re-use or share sessions; and
  (c) there are well-developed standards in some transport
  environments (especially IP, which has SIP/SDP/RSVP/... in place)
  that we would do well to interoperate with.