iscsi : changes involving tgt portal group tag.

Santosh Rao santoshr at cup.hp.com
Fri Mar 15 16:47:42 PST 2002


John Hufferd wrote:

> 
> 1. Not specifying a *port* in the Login dialogue explicitly
>     is something I am concerned could cause surprises down
>     the road.  Given that a Login is meant to establish an I_T
>     nexus to a port (not to a node), I am rather surprised to see
>     the opposition simply because the proposal is coming late.
> [Huff/]
>     based on my previous note, I do not buy this as a problem, since I do
>     not think this occurs without manual intervention and a significant
>     time interval (and most likely a power down).  This means that it would
>     seem to be a natural thing for the initiator to attempt to rediscover
>     the connection.  It seems that simple wordage that Jim Hafner has
>     suggested for the draft meets this issue.

John,

The procedure to re-config a target portal group is specific to each
product. While it may be reasonable for some product installation
manuals to recommend that all sessions be terminated and the target be
taken offline for a re-config, I don't believe the spec should base its
correctness upon this requirement.

After all, with the multi-connection session architecture, iscsi does
allow the target to continue to service active session traffic while
de-commissioning individual NICs and re-assigning them to other portal
groups. Consider also that such a network portal re-assign may only be a
logical admin operation and does not always require the target to be
taken offline or powered off.

Since the iscsi protocol specifies neither an async notification nor an
authentication mechanism that prevents connections from being
accidentally established to incorrect portal groups, there is a
possibility that high-end arrays advertising 24 x 7 support and online
re-config capabilities will cause initiators to accidentally log into
the wrong portal group during such re-configs.

This can be solved in 2 steps :

a) Have a new async pdu reason code that says "portal group
re-configured" which allows currently logged-in initiator sessions to be
notified and in turn, trigger re-discovery.

b) Send the TPGT as a part of the login and require the tgt port to
authenticate the port name/identifier upon login. 

I don't see these as major changes in the spec. They will block
initiators from accidentally logging into the wrong portal groups, which
needs to be protected against, since it can result in a number of side
effects. If we want to minimize the changes, the TPGT could perhaps be
introduced as a login key instead of being in the login pdu header,
thereby causing no change in the login pdu format.
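
To make step (b) concrete, here is a minimal sketch of the check a target
could perform if the TPGT were carried as a login key. The key name "TPGT",
the NetworkPortal class and the check_login function are hypothetical names
used only for illustration; none of them come from the draft.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NetworkPortal:
    address: str  # illustrative value, e.g. "10.0.0.1:3260"
    tpgt: int     # portal group tag this portal currently belongs to


def check_login(portal: NetworkPortal, login_keys: Dict[str, str]) -> str:
    """Return "accept" or "reject" for a login arriving on `portal`.

    Assumes the initiator sends the tag it discovered in a hypothetical
    "TPGT" login key, as proposed in step (b).
    """
    claimed = login_keys.get("TPGT")
    if claimed is None:
        return "reject"  # under the proposal the tag would be required
    if int(claimed) != portal.tpgt:
        return "reject"  # initiator holds stale portal group information
    return "accept"


portal = NetworkPortal("10.0.0.1:3260", tpgt=1)
before = check_login(portal, {"TPGT": "1"})

# An online re-config moves the portal to another portal group.
portal.tpgt = 2
after = check_login(portal, {"TPGT": "1"})
```

In this sketch the login that carries the stale tag is rejected after the
re-config, which is the point at which an initiator would go back and
re-discover.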

> 
>     One of the reasons that I am concerned about late proposals, is that
>     the full review of impacts tends not to be done adequately.  All my
>     experience has shown me that the largest number of errors and retrofits
>     occur with the last items added to a product, or spec.  In fact I
>     believe there can be a strong correlation between time of arrival of a
>     change, and the probability of unforeseen impacts.  So yes, I would
>     hate to make changes this late for a problem that I am not sure even
>     exist, and if it does, a rediscovery fixes the problem.

I agree with your risk assessment. However, we do have a correctness
issue in that the protocol does not authenticate port name/identifier
upon login and does not have an async notification scheme to existing
initiators which will prevent accidental [re-]login to incorrect portal
groups.

To depend on Unit Attentions to solve this problem is insufficient due
to the following reasons :

a) The "REPORTED LUNS DATA HAS CHANGED" UA can get cleared if the target
were to be power cycled, prior to I/O activity from the initiator.

b) UAs can get cleared if several other UA conditions occur that cause
the target to exceed the number of concurrent UAs it can queue and
deliver.

c) Requiring that the initiator's legacy SCSI ULP stacks be modified in
order to react to these UAs, to address an iscsi specific problem, is
not a good idea, since iscsi drivers must not require changes in the
O.S. SCSI ULPs. Further, iscsi driver writers may not control the O.S.
SCSI ULPs, so the change may not be theirs to make.
by the time the next I/O comes in from an initiator, and reacting to UAs
requires a change in the legacy SCSI ULPs of the O.S' that will run
iscsi, or requires all the iscsi initiators to be 

It is common for all other serial scsi transports (FCP, SRP) to perform
port name/identifier authentication upon login.

> [\Huff]
> 
> 2.  > manual reconfiguration (including a probable power down), that the
>      Target
>      > will maintain this key state ..
>     This and a lot of your other text below dwells on the unlikelihood of
>     target not maintaining the state - I agree with you.  My point is
>     *not* that a target would, but the need to design the quickest and
>     most reliable way to communicate the loss of state back to the
>     initiator.
>     I believe addition of TPGT to the Login Request PDU accomplishes that.

> 
>     [Huff/]
>     Since I feel this type of thing is rare if a problem at all,

This is debatable, since I can envision a field engineer using the
portal group re-config as a quick customer site workaround upon
detecting a bug in the multi-connection session implementation in a
target, or a bug in the co-operation of multiple network portal types in
supporting a multi-connection session. 

Without losing the connectivity of the target, it can be converted from
a (2 x 4) connectivity array to a (1 x 8) connectivity array, causing
minimal degradation in its performance and no downtime of the customer's
data.
(m x n => no. of portal groups x no. of network portals per group).
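
As a rough illustration of the (2 x 4) to (1 x 8) conversion, the sketch
below models portal groups as a mapping from tag to portal names. The
portal names, the tag values and the merge_groups/shape helpers are made
up for the example.

```python
from typing import Dict, List, Tuple

PortalGroups = Dict[int, List[str]]


def merge_groups(groups: PortalGroups, src: int, dst: int) -> PortalGroups:
    """Return a new layout with every portal of group `src` moved into `dst`."""
    merged = {tag: list(portals) for tag, portals in groups.items()}
    merged[dst].extend(merged.pop(src))
    return merged


def shape(groups: PortalGroups) -> Tuple[int, int]:
    """Return (m, n): number of groups and portals per group (equal sizes assumed)."""
    sizes = {len(portals) for portals in groups.values()}
    assert len(sizes) == 1, "groups must be equally sized for an m x n shape"
    return (len(groups), sizes.pop())


before = {
    1: ["nic0", "nic1", "nic2", "nic3"],
    2: ["nic4", "nic5", "nic6", "nic7"],
}
after = merge_groups(before, src=2, dst=1)

# Portals whose group tag changed. An initiator that cached tag 2 for any
# of these would now reach group 1 unless it is notified or re-discovers.
moved = before[2]
```

All eight portals stay reachable throughout; only the group membership of
the moved portals changes, which is why an initiator holding old discovery
data can end up in the wrong portal group.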

Initial implementations of a new protocol are not without their share of
bugs and it would be a useful feature to not have to bring down the
target to perform such re-configs.

>     I think
>     that documentation about not affecting the TPG if state is outstanding,
>     and a suggestion to the Initiator that if an unusual amount of time
>     goes by with the Session Down, that a Rediscovery should be done (as if
>     they would not do that anyway).  So, because of it being rare, if a
>     problem at all, I am not convinced that the right approach is to
>     optimize the response time to restart a session that has been down for
>     a long time anyway.  If it take an extra discovery, I do not think this
>     is a problem.
>     [\Huff]

We seem to be talking about different scenarios here! I have called out
an issue regarding the re-config of portal groups without requiring a
down-time in the storage (i.e. no disruption to existing sessions),
while you seem to be referring to a session that has been down for a
long time.

Again, I agree that a product installation guide can resolve this issue
by requiring all initiators to be quiesced and the storage to be taken
offline for any re-config. However, this limitation should not be
imposed on a scsi transport protocol to ensure its correctness, and it
should not limit an implementation's capability to provide 24x7 uptime.

Thanks in advance for considering all aspects of this issue.

Regards,
Santosh




-- 
##################################
Santosh Rao
Software Design Engineer,
HP-UX iSCSI Driver Team,
Hewlett Packard, Cupertino.
email : santoshr at cup.hp.com
Phone : 408-447-3751
##################################



