Andrew Arnott | 8 Jul 23:58

[OpenID] Canonical OpenID url form

What is the canonical form of an OpenID URL? One with the %AB%CD hex encoding for unicode chars in the URL or with the actual unicode chars? For the purposes of displaying to the user and storing in the RP's database.

The spec doesn't seem to have anything to say on this.  The reason I think it's not a simple automatic answer is that the unicode chars may be what the user typed in and what exists on the server, but in transit, these characters are translated to %AB%CD in order to be validly escaped URI strings.  So one could argue that the unicode characters are never part of the protocol and therefore should not be in the Claimed Identifier.  On the other hand, if I were Japanese and had to look at %AB%CD instead of my native characters whenever I saw my OpenID on a web page, I'd find it slightly annoying.

_______________________________________________
general mailing list
general <at> openid.net
http://openid.net/mailman/listinfo/general
Johnny Bufu | 9 Jul 06:41

Re: [OpenID] Canonical OpenID url form


On 08/07/08 03:01 PM, Andrew Arnott wrote:
> What is the canonical form of an OpenID URL? One with the %AB%CD hex 
> encoding for unicode chars in the URL or with the actual unicode chars? 
> For the purposes of displaying to the user and storing in the RP's database.
> 
> The spec doesn't seem to have anything to say on this.  

I believe it does say:

4.1.  Protocol Messages
The OpenID Authentication protocol messages are mappings of plain-text 
keys to plain-text values. The keys and values permit the full Unicode 
character set (UCS). When the keys and values need to be converted 
to/from bytes, they MUST be encoded using UTF-8 [RFC3629].

http://openid.net/specs/openid-authentication-2_0.html#anchor4

> The reason I 
> think it's not a simple automatic answer is the unicode chars may be 
> what the user typed in and what exists on the server, but in transit, 
> these characters are translated to %AB%CD in order to be validly escaped 
> URI strings.  

The receiving party must decode them to the original form when they are 
extracted from the transport layer.
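
A minimal sketch of that round trip, assuming Python and its standard urllib.parse module. The identifier path is made up for illustration and is not taken from the spec:

```python
from urllib.parse import quote, unquote

# Hypothetical identifier path containing a non-ASCII character.
display_form = "/users/Á"

# On the wire: the text is encoded as UTF-8 and each non-ASCII byte is percent-encoded.
wire_form = quote(display_form, safe="/")

# Receiving side: undo the percent-encoding and decode the bytes as UTF-8.
decoded = unquote(wire_form, encoding="utf-8")

print(wire_form)                # /users/%C3%81
print(decoded == display_form)  # True
```

The escaped form exists only for transport in this sketch; the decoded string is what the layers above it would work with.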

> So one could argue that the unicode characters are never 
> part of the protocol 

One would then be ignoring the parts of the protocol that do not deal 
with the transport layer directly.

Johnny
Andrew Arnott | 9 Jul 07:32

Re: [OpenID] Canonical OpenID url form

Thanks, Johnny.  I've had some conversations with a few other people who draw the opposite conclusion and believe that the %AB%CD notation is the canonical form.

You make a good point about having to unescape the characters from the URI just above the transport layer, but I believe you're applying section 4.1 to the URL when it should only be applied to the key/value pairs.  The OpenID ClaimedIdentifier, which by the spec is the last URL to respond without an HTTP redirect, cannot contain unicode characters under the URI specification, which does not allow them, whether they are encoded as UTF-8 or UTF-16.

Name/value pairs passed as part of a querystring may (and as the section you quote requires) be encoded as UTF-8, but they are subsequently URI encoded as %AB%CD hex characters (thus doubly encoded) so they are actually no longer UTF-8 at the transport layer.  The OpenID URL, around which all the identity of OpenID is focused (omitting XRIs, which don't suffer from this problem), is at the transport layer, because of the way the security requirements force the claimed identifier to be discovered.  Since it is all about the transport layer, I believe it would be a mistake to add semantics on top of that and call it canonical.

What I also realized from some other conversations is that this doesn't really matter.  As long as an OP or RP is consistent within itself in storing and comparing Claimed Identifiers, whether it stores and compares %AB%CD or the unicode equivalent character won't matter to anyone, since on the protocol/wire level it is always %AB%CD.  However, I think unescaping the URL and getting the original unicode characters back is very useful and should be done for purposes of displaying to the user.

I think for the security and guaranteed identity of the protocol, there is a meaningful side to this though.  It has not got to do with how the claimed identifier is stored, but rather how a unicode string is escaped for URI transport.  A given unicode string may be represented by more than just one series of bytes.  Unicode characters exist that in UTF-8 or UTF-16 have multiple byte sequences for the same character.  Therefore someone who is typing in their OpenID url to a site using one method during one visit, and then types it in to the same site using a different method on a subsequent visit, will only be identified by the RP as the same visitor if OpenID requires that the RP transforms whatever unicode string is given by the user to the canonical byte form as defined by the unicode standard before transit.  For example, the letter 'Á' can be encoded as a single character or using composition by adding an accent to the A character.  Both are legal, but the unicode standard defines one as canonical (I think).  But if a string containing this character is not canonicalized first, then although the character is equivalent to the user and to unicode, the encoded %AB%CD string will be different, resulting in security problems for OpenID because people could overload a single Identifier just by using different encodings at an OP, or fail to log into an RP depending on how they craft their string. By the way, I say 'unicode' in the strict sense, applying to UTF-8, UTF-16, etc.  Unicode is commonly used to refer to just UTF-16, but this problem applies to all unicode character sizes.

So I think OpenID should be more explicit about its unicode support for Identifiers, including mandating a canonical Unicode form. 
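
The 'Á' case can be reproduced with a short sketch, assuming Python's standard unicodedata and urllib.parse modules. The difference arises at the code point level (precomposed versus decomposed sequences) before any byte encoding is applied. Choosing NFC as the normalization form here is an assumption for illustration, not something the OpenID spec mandates:

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00C1"   # 'Á' as a single code point
decomposed = "A\u0301"   # 'A' followed by COMBINING ACUTE ACCENT

# Both render as the same letter, yet they escape differently.
escaped_precomposed = quote(precomposed)   # %C3%81
escaped_decomposed = quote(decomposed)     # A%CC%81
print(escaped_precomposed, escaped_decomposed)

# Normalizing to one form (NFC assumed here) before escaping removes the difference.
normalized = unicodedata.normalize("NFC", decomposed)
print(quote(normalized) == escaped_precomposed)   # True
```

Without the normalization step, an RP comparing the escaped strings would treat the two inputs as different identifiers.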

Johnny Bufu | 10 Jul 07:51

Re: [OpenID] Canonical OpenID url form

For the record, since this continued in an offline thread:

The issue is around the User-Supplied Identifiers. OpenID defines them
as a type of Identifier, which in turn are defined as HTTP(S) URIs or
XRIs. HTTP(S) URIs do not allow non-ASCII characters.

So, out of scope of OpenID, parties accepting IRIs (other than XRIs)
should follow the respective authoritative recommendations (i.e.
RFC3987) before presenting such strings to the OpenID layer as HTTP
URIs, and convert them back to IRI form later on when they need to be
displayed back to the users.
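
A rough sketch of that conversion, assuming Python's standard library. The helper names (iri_to_uri, uri_to_display) and the example IRI are hypothetical, and the mapping is simplified from RFC 3987: it assumes an absolute http(s) IRI with no userinfo, uses Python's built-in "idna" codec for the host, and escapes only non-ASCII characters elsewhere:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def _escape_non_ascii(text):
    # Percent-encode each non-ASCII character as its UTF-8 octets; leave ASCII alone.
    return "".join(c if ord(c) < 128 else quote(c) for c in text)

def iri_to_uri(iri):
    """Simplified IRI -> URI mapping (hypothetical helper, not a full RFC 3987 implementation)."""
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii")   # IDN labels -> Punycode
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc,
                       _escape_non_ascii(parts.path),
                       _escape_non_ascii(parts.query),
                       _escape_non_ascii(parts.fragment)))

def uri_to_display(uri):
    """Display-only reverse step: decode the host and unescape the remaining parts."""
    parts = urlsplit(uri)
    host = parts.hostname.encode("ascii").decode("idna")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, unquote(parts.path),
                       unquote(parts.query), unquote(parts.fragment)))

uri = iri_to_uri("http://bücher.example/Á")
print(uri)                  # http://xn--bcher-kva.example/%C3%81
print(uri_to_display(uri))  # http://bücher.example/Á
```

The reverse step is meant for display only: unquote also decodes escaped reserved characters such as %2F, so its output should not be fed back into the protocol as an identifier.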

Johnny

Drummond Reed | 10 Jul 08:33

Re: [OpenID] Canonical OpenID url form

Also for the record, XRIs (which use the IRI character set) have a very
simple defined transformation into IRIs. Thus when an XRI needs to be sent
over-the-wire in an HTTP(S) URI, it must first be transformed into an IRI,
then you follow the IRI spec (RFC 3987) to transform into a URI as Johnny
describes. Reverse the process to display back to the user.

See
http://docs.oasis-open.org/xri/xri-syntax/2.0/specs/cs01/xri-syntax-V2.0-cs.html
for all the gory details (and they are gory - Unicode is hard).

=Drummond 

Peter Williams | 10 Jul 08:36

Re: [OpenID] Canonical OpenID url form

So the short form of the story is: use xri for unicode (and then transform the xri into an https hxri).

It's been a month since I studied xri (and thus have forgotten 80 percent of it). I recall there was a syntax to
identify the address of the initial resolver. Is there a way that this became the domain name component of the hxri?

Martin Atkins | 10 Jul 20:10

Re: [OpenID] Canonical OpenID url form

Peter Williams wrote:
> So the short form of the story is: use xri for unicode (and then transform the xri into an https hxri).
> 
> It's been a month since I studied xri (and thus have forgotten 80 percent of it). I recall there was a syntax to
> identify the address of the initial resolver. Is there a way that this became the domain name component of the hxri?
> 

XRI is not required to use non-ASCII characters in your OpenID Identifier.

What I take from this discussion is that the canonical form is 
percent-encoded UTF-8, but when it comes to displaying identifiers to 
end-users it can be transformed back into the real unicode characters 
using the same rules as browsers use.

It'd probably be a good idea to test how the various RP and OP libraries 
deal with this, though. I expect that in practice some implementations 
will get this wrong. I just tested the Net::OpenID::Consumer Perl 
library and it only gets this right if the caller happens to pass it an 
already UTF-8-encoded string.
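
A sketch of the kind of check one could run against a library, written in Python rather than Perl. It does not reproduce Net::OpenID::Consumer's behaviour. The helper percent_encode_utf8 and the example.com identifier are hypothetical:

```python
from urllib.parse import quote

def percent_encode_utf8(identifier):
    """Hypothetical helper: accept text or UTF-8 bytes, return the escaped form."""
    if isinstance(identifier, bytes):
        identifier = identifier.decode("utf-8")   # caller passed raw UTF-8 bytes
    # Escape only non-ASCII characters, as UTF-8 octets.
    return "".join(c if ord(c) < 128 else quote(c) for c in identifier)

text = "http://example.com/Á"
from_text = percent_encode_utf8(text)
from_bytes = percent_encode_utf8(text.encode("utf-8"))

# One way to get it wrong: escaping the same character via a different byte encoding.
wrong = quote("Á", encoding="latin-1")

print(from_text)                # http://example.com/%C3%81
print(from_text == from_bytes)  # True
print(wrong)                    # %C1
```

An implementation that passes this kind of check yields the same escaped identifier whether the caller hands it decoded text or UTF-8 bytes.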
Drummond Reed | 10 Jul 20:31

Re: [OpenID] Canonical OpenID url form

Martin's right, Peter -- XRI is one option for Unicode. But you can also use
an internationalized domain name
(http://en.wikipedia.org/wiki/Internationalized_domain_name) in a regular
URL. It uses Punycode (http://en.wikipedia.org/wiki/Punycode).

You can also turn an XRI into a URL by adding an XRI proxy resolver prefix
(such as http://xri.net/ -- see my sig below for an example). In that
approach the proxy resolver prefix has nothing to do with the XRI itself, so
there's no need to internationalize the domain name.
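
The prefixing step can be sketched as follows, assuming Python. The helper name xri_to_hxri is hypothetical, and the sketch ignores the XRI-specific escaping rules for cross-references defined in the XRI Syntax 2.0 spec. It only prepends the proxy prefix and percent-encodes non-ASCII characters as UTF-8:

```python
from urllib.parse import quote

XRI_PROXY = "http://xri.net/"   # proxy resolver prefix mentioned above

def xri_to_hxri(xri, proxy=XRI_PROXY):
    """Hypothetical helper: turn an XRI into an HTTP URL via a proxy resolver prefix."""
    if xri.lower().startswith("xri://"):
        xri = xri[len("xri://"):]   # drop the optional scheme prefix
    # Escape only non-ASCII characters; ASCII symbols such as '=' and '@' stay as-is.
    escaped = "".join(c if ord(c) < 128 else quote(c) for c in xri)
    return proxy + escaped

print(xri_to_hxri("=drummond.reed"))   # http://xri.net/=drummond.reed
```

Because the host part comes from the proxy prefix, only the XRI portion ever carries non-ASCII characters in this sketch.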

=Drummond 
http://xri.net/=drummond.reed

> -----Original Message-----
> From: Peter Williams [mailto:pwilliams <at> rapattoni.com]
> Sent: Wednesday, July 09, 2008 11:40 PM
> To: Drummond Reed; 'Johnny Bufu'; 'Andrew Arnott'
> Cc: 'OpenID List'
> Subject: RE: [OpenID] Canonical OpenID url form
> 
> So the short form of the story is: use xri for unicode (and then transform
> the xri into an https hxri).
> 
> Its been a month since I studied xri (and thus have forgotten 80 percent
> of it). I recall there was a syntax to identify the address of the initial
> resolver. Is there a way tha this became the domain name componnt of the
> hxri
> 
> -----Original Message-----
> From: Drummond Reed <drummond.reed <at> cordance.net>
> Sent: Wednesday, July 09, 2008 11:34 PM
> To: 'Johnny Bufu' <johnny.bufu <at> gmail.com>; 'Andrew Arnott'
> <andrewarnott <at> gmail.com>
> Cc: 'OpenID List' <general <at> openid.net>
> Subject: Re: [OpenID] Canonical OpenID url form
> 
> 
> Also for the record, XRIs (which use the IRI character set) have a very
> simple defined transformation into IRIs. Thus when an XRI needs to be sent
> over-the-wire in an HTTP(S) URI, it must first be transformed into an IRI,
> then you follow the IRI spec (RFC 3987) to transform into a URI as Johnny
> describes below. Reverse the process to display back to the user.
> 
> See
> http://docs.oasis-open.org/xri/xri-syntax/2.0/specs/cs01/xri-syntax-V2.0-cs.html
> for all the gory details (and they are gory - Unicode is hard).
> 
> =Drummond
> 
> > -----Original Message-----
> > From: general-bounces <at> openid.net [mailto:general-bounces <at> openid.net] On
> > Behalf Of Johnny Bufu
> > Sent: Wednesday, July 09, 2008 10:52 PM
> > To: Andrew Arnott
> > Cc: OpenID List
> > Subject: Re: [OpenID] Canonical OpenID url form
> >
> > For the record, since this continued in an offline thread:
> >
> > The issue is around the User-Supplied Identifiers. OpenID defines them
> > as a type of Identifier, which in turn is defined as an HTTP(S) URI or an XRI.
> > HTTP(S) URIs do not allow non-ASCII characters.
> >
> > So, out of scope of OpenID, parties accepting IRIs (other than XRIs)
> > should follow the respective authoritative recommendations (i.e.
> > RFC3987) before presenting such strings to the OpenID layer as HTTP
> > URIs, and convert them back to IRI form later on when they need to be
> > displayed back to the users.
> >
> > Johnny
> >
> > On 08/07/08 10:32 PM, Andrew Arnott wrote:
> > > Thanks, Johnny.  I've had some conversations with a few other people
> > > who draw the opposite conclusion and believe that the %AB%CD notation
> > > is the canonical form.
> > >
> > > You make a good point about having to unescape the characters from
> > > the URI just above the transport layer, but I believe you're applying
> > >  section 4.1 to the URL when it should only be applied to the
> > > key/value pairs.  The OpenID ClaimedIdentifier, which by the spec is
> > > the last URL to respond without an HTTP redirect, cannot be in
> > > unicode by the URI specification because unicode characters are not
> > > allowed, whether that is UTF8 or UTF16.
> > >
> > > Name/value pairs passed as part of a querystring may (and as the
> > > section you quote requires) be encoded as UTF-8, but they are
> > > subsequently URI encoded as %AB%CD hex characters (thus doubly
> > > encoded) so they are actually no longer UTF-8 at the transport layer.
> > > >  Since the OpenID URL, around which all the identity of OpenID is
> > > > focused (omitting XRIs, which don't suffer from this problem), /is/ at
> > > > the transport layer, given the way the security requirements force the
> > > > claimed identifier to be discovered, I believe it would be a mistake
> > > > to add semantics on top of that and call it canonical.
> > >
> > > What I also realized from some other conversations is that this
> > > doesn't really matter.  As long as an OP or RP is consistent within
> > > itself in storing and comparing Claimed Identifiers, whether it
> > > stores and compares %AB%CD or the unicode equivalent character won't
> > > matter to anyone, since on the protocol/wire level it is always
> > > %AB%CD.  However, I think unescaping the URL and getting the original
> > >  unicode characters back is very useful and should be done for
> > > purposes of displaying to the user.
> > >
> > > I think for the security and guaranteed identity of the protocol,
> > > there is a meaningful side to this though.  It has not got to do with
> > >  how the claimed identifier is stored, but rather how a unicode
> > > string is escaped for URI transport.  A given unicode string may be
> > > represented by more than just one series of bytes.  Unicode
> > > characters exist that in UTF-8 or UTF-16 have multiple byte sequences
> > >  /for the same character/. Therefore someone who is typing in their
> > > OpenID url to a site using one method during one visit, and then
> > > types it in to the same site using a different method on a subsequent
> > >  visit, will only be identified by the RP as the same visitor if
> > > OpenID requires that the RP transforms whatever unicode string is
> > > given by the user to the canonical byte form as defined by the
> > > unicode standard before transit.  For example, the letter 'Á' can be
> > > encoded as a single character or using composition by adding an
> > > accent to the A character.  Both are legal, but the unicode standard
> > > defines one as canonical (I think).  But if a string containing this
> > > character is not canonicalized first, then although the character is
> > > equivalent to the user and to unicode, the encoded %AB%CD string will
> > > be different, resulting in security problems for OpenID because
> > > people could overload a single Identifier just by using different
> > > encodings at an OP, or fail to log into an RP depending on how they
> > > craft their string. By the way, I say 'unicode' in the strict sense,
> > > applying to UTF-8, UTF-16, etc.  Unicode is commonly used to refer to
> > > just UTF-16, but this problem applies to all unicode character sizes.
> > >
> > >
> > >
> > >
> > > So I think OpenID should be more explicit about its unicode support
> > > for Identifiers, including mandating a canonical Unicode form.
> > >
> > > On Tue, Jul 8, 2008 at 9:41 PM, Johnny Bufu <johnny.bufu <at> gmail.com
> > > <mailto:johnny.bufu <at> gmail.com>> wrote:
> > >
> > >
> > > On 08/07/08 03:01 PM, Andrew Arnott wrote:
> > >
> > > What is the canonical form of an OpenID URL? One with the %AB%CD hex
> > > encoding for unicode chars in the URL or with the actual unicode
> > > chars? For the purposes of displaying to the user and storing in the
> > > RP's database.
> > >
> > > The spec doesn't seem to have anything to say on this.
> > >
> > >
> > > I believe it does say:
> > >
> > > 4.1.  Protocol Messages The OpenID Authentication protocol messages
> > > are mappings of plain-text keys to plain-text values. The keys and
> > > values permit the full Unicode character set (UCS). When the keys and
> > >  values need to be converted to/from bytes, they MUST be encoded
> > > using UTF-8 [RFC3629].
> > >
> > > http://openid.net/specs/openid-authentication-2_0.html#anchor4
> > >
> > >
> > > The reason I think it's not a simple automatic answer is the unicode
> > > chars may be what the user typed in and what exists on the server,
> > > but in transit, these characters are translated to %AB%CD in order to
> > >  be validly escaped URI strings.
> > >
> > >
> > > The receiving party must decode them to the original form when they
> > > are extracted from the transport layer.
> > >
> > >
> > > So one could argue that the unicode characters are never part of the
> > > protocol
> > >
> > >
> > > One would then be ignoring the parts of the protocol that do not deal
> > >  with the transport layer directly.
> > >
> > >
> > > Johnny
> > >
> > >
Peter Williams | 10 Jul 21:07

Re: [OpenID] Canonical OpenID url form

I was thinking like the lazy programmer I am: use XRI libraries to address all advanced
culture/language/encoding issues. Then, as you say, prefix that with a constant
http://<int-domain>/. Then, per IRI conventions, rewrite that so, as in Arabic, one gets right to left
URL visuals (<Arabic-script>//:ptth) to suit the population that is not particularly enamored with
Roman culture.

Andrew Arnott | 10 Jul 22:02

Re: [OpenID] Canonical OpenID url form

If XRIs allow Unicode characters and URIs do not, then prefixing an XRI with http://xri.net/ does not guarantee a proper URI.  It merely makes it look like one.  If non-ASCII characters exist in the XRI, they must be properly percent-encoded for the result to be a proper URI.

Drummond Reed | 11 Jul 06:45

Re: [OpenID] Canonical OpenID url form

My apologies, I forgot to clarify that the XRI specifications require that when an XRI is transformed into an HTTP(S) URI (called an HXRI in the spec), it must be transformed into URI-normal form as defined in the XRI Syntax 2.0 spec [1]. That transformation (described earlier in this thread) involves a simple mechanical transformation into IRI-normal form, followed by the percent-encoding of Unicode characters as specified in the IRI spec (RFC 3987).
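
A minimal Python sketch of the last two steps (percent-encoding, then adding the proxy resolver prefix) might look like the following. It is an illustration under stated assumptions, not the algorithm from the XRI spec: the function name to_hxri is made up, the input is assumed to be in IRI-normal form already with no xri:// scheme prefix, and the XRI-specific escaping rules are skipped entirely. The non-ASCII i-name in the usage lines is hypothetical.

```python
from urllib.parse import quote

# Proxy resolver prefix mentioned earlier in this thread.
PROXY_PREFIX = "http://xri.net/"

# Assumption: leave XRI/URI delimiter characters and existing '%' escapes
# untouched, and percent-encode everything else that quote() would encode.
_SAFE = "=@+$!*()/?#[]:;,&'%"


def to_hxri(xri_iri_normal: str, proxy: str = PROXY_PREFIX) -> str:
    """Hypothetical helper: IRI-normal XRI -> ASCII-only HTTP URI.

    Assumes the XRI-to-IRI-normal step has already been applied; this
    function only performs the UTF-8 percent-encoding and prefixing.
    """
    return proxy + quote(xri_iri_normal, safe=_SAFE)


if __name__ == "__main__":
    print(to_hxri("=drummond.reed"))
    # Hypothetical i-name containing non-ASCII characters.
    print(to_hxri("=\u65e5\u672c"))
```

The result contains only ASCII characters, so it can be handed to the OpenID layer as an ordinary HTTP URI.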

 

=Drummond

[1] http://docs.oasis-open.org/xri/xri-syntax/2.0/specs/cs01/xri-syntax-V2.0-cs.html

 

> > On 08/07/08 10:32 PM, Andrew Arnott wrote:
> > > Thanks, Johnny.  I've had some conversations with a few other people
> > > who draw the opposite conclusion and believe that the %AB%CD notation
> > > is the canonical form.
> > >
> > > You make a good point about having to unescape the characters from
> > > the URI just above the transport layer, but I believe you're applying
> > >  section 4.1 to the URL when it should only be applied to the
> > > key/value pairs.  The OpenID ClaimedIdentifier, which by the spec is
> > > the last URL to respond without an HTTP redirect, cannot be in
> > > unicode by the URI specification because unicode characters are not
> > > allowed, whether that is UTF8 or UTF16.
> > >
> > > Name/value pairs passed as part of a querystring may (and as the
> > > section you quote requires) be encoded as UTF-8, but they are
> > > subsequently URI encoded as %AB%CD hex characters (thus doubly
> > > encoded) so they are actually no longer UTF-8 at the transport layer.
> > >  Since the OpenID URL, around which all the identity of OpenID is
> > > focused (omiting XRIs which don't suffer from this problem) /is/ at
> > > the transport layer of the way the security requirements force the
> > > claimed identifier to be discovered, is all about the transport
> > > layer, I believe it would be a mistake to add semantics on top of
> > > that and call it canonical.
> > >
> > > What I also realized from some other conversations is that this
> > > doesn't really matter.  As long as an OP or RP is consistent within
> > > itself in storing and comparing Claimed Identifiers, whether it
> > > stores and compares %AB%CD or the unicode equivalent character won't
> > > matter to anyone, since on the protocol/wire level it is always
> > > %AB%CD.  However, I think unescaping the URL and getting the original
> > >  unicode characters back is very useful and should be done for
> > > purposes of displaying to the user.
> > >
> > > I think there is a side to this that does matter for the security and
> > > guaranteed identity of the protocol, though.  It has nothing to do
> > > with how the claimed identifier is stored, but rather with how a
> > > unicode string is escaped for URI transport.  A given unicode string
> > > may be represented by more than one series of bytes: some characters
> > > can be written as more than one sequence of code points, and so have
> > > multiple byte sequences in UTF-8 or UTF-16 /for the same character/.
> > > Therefore someone who types their OpenID URL into a site using one
> > > method during one visit, and into the same site using a different
> > > method on a subsequent visit, will only be identified by the RP as
> > > the same visitor if OpenID requires the RP to transform whatever
> > > unicode string the user gives into the canonical form defined by the
> > > unicode standard before transit.  For example, the letter 'Á' can be
> > > encoded as a single character or by composition, adding an accent to
> > > the A character.  Both are legal, but the unicode standard defines
> > > one as canonical (I think).  If a string containing this character is
> > > not canonicalized first, then although the two forms are equivalent
> > > to the user and to unicode, the encoded %AB%CD strings will differ.
> > > That results in security problems for OpenID, because people could
> > > overload a single Identifier just by using different encodings at an
> > > OP, or fail to log into an RP depending on how they craft their
> > > string.  By the way, I say 'unicode' in the strict sense, applying to
> > > UTF-8, UTF-16, etc.  Unicode is commonly used to refer to just
> > > UTF-16, but this problem applies to all unicode encodings.
> > >
> > > So I think OpenID should be more explicit about its unicode support
> > > for Identifiers, including mandating a canonical Unicode form.
> > >
> > > On Tue, Jul 8, 2008 at 9:41 PM, Johnny Bufu <johnny.bufu <at> gmail.com
> > > <mailto:johnny.bufu <at> gmail.com>> wrote:
> > >
> > >
> > > On 08/07/08 03:01 PM, Andrew Arnott wrote:
> > >
> > > What is the canonical form of an OpenID URL? One with the %AB%CD hex
> > > encoding for unicode chars in the URL or with the actual unicode
> > > chars? For the purposes of displaying to the user and storing in the
> > > RP's database.
> > >
> > > The spec doesn't seem to have anything to say on this.
> > >
> > >
> > > I believe it does say:
> > >
> > > 4.1.  Protocol Messages
> > > The OpenID Authentication protocol messages are mappings of
> > > plain-text keys to plain-text values. The keys and values permit the
> > > full Unicode character set (UCS). When the keys and values need to be
> > > converted to/from bytes, they MUST be encoded using UTF-8 [RFC3629].
> > >
> > > http://openid.net/specs/openid-authentication-2_0.html#anchor4
> > >
> > >
> > > The reason I think it's not a simple automatic answer is the unicode
> > > chars may be what the user typed in and what exists on the server,
> > > but in transit, these characters are translated to %AB%CD in order to
> > >  be validly escaped URI strings.
> > >
> > >
> > > The receiving party must decode them to the original form when they
> > > are extracted from the transport layer.
> > >
> > >
> > > So one could argue that the unicode characters are never part of the
> > > protocol
> > >
> > >
> > > One would then be ignoring the parts of the protocol that do not deal
> > >  with the transport layer directly.
> > >
> > >
> > > Johnny
> > >
> > >

 

_______________________________________________
general mailing list
general <at> openid.net
http://openid.net/mailman/listinfo/general
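
The normalization concern in the quoted message above can be illustrated with a minimal Python sketch. It is not taken from the OpenID spec or from any OpenID library: the helper name escape_path is hypothetical, and the choice of NFC as the canonical form is an assumption made only for illustration. It shows that the two spellings of 'Á' percent-encode differently unless the string is normalized first.

```python
import unicodedata
from typing import Optional
from urllib.parse import quote, unquote

# Two ways to write the same visible letter.
precomposed = "\u00C1"   # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"   # 'A' followed by COMBINING ACUTE ACCENT


def escape_path(segment: str, form: Optional[str] = None) -> str:
    """Percent-encode a path segment as UTF-8 (hypothetical helper).

    If `form` is given (for example "NFC"), the segment is normalized
    to that Unicode normalization form before it is escaped.
    """
    if form is not None:
        segment = unicodedata.normalize(form, segment)
    return quote(segment, safe="")


# Without normalization the two spellings escape differently.
print(escape_path(precomposed))        # %C3%81
print(escape_path(decomposed))         # A%CC%81

# After NFC normalization (an assumed policy) they agree.
print(escape_path(decomposed, "NFC"))  # %C3%81

# Unescaping recovers the character for display to the user.
print(unquote("%C3%81"))
```

An RP that compared the escaped strings without normalizing would treat the two spellings as different identifiers, which is the failure the message describes.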
Martin Atkins | 11 Jul 09:20

Re: [OpenID] Canonical OpenID url form

Drummond Reed wrote:
> Martin's right, Peter -- XRI is one option for Unicode. But you can also use
> an internationalized domain name
> (http://en.wikipedia.org/wiki/Internationalized_domain_name) in a regular
> URL. It uses Punycode (http://en.wikipedia.org/wiki/Punycode).
> 

I hadn't thought of punycode. Certainly I think many of the existing 
implementations would struggle with unicode characters in the domain 
part of the URL. The spec doesn't really seem to say anything about this.

Should libraries be applying the mapping set out in RFC3987[1] section 
3.1 to incoming URLs? What about legacy servers that rely on their URLs 
not being encoded in UTF-8? The spec should probably say something about 
this, so that different implementations treat non-ASCII characters in an 
interoperable fashion. If it does already and I've missed it, then 
please point me to it!

Cheers,
Martin

[1] http://www.ietf.org/rfc/rfc3987
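
As a rough illustration of the kind of mapping asked about above, here is a minimal Python sketch in the spirit of RFC 3987 section 3.1. It is a simplification under stated assumptions, not a complete implementation and not what any OpenID library is known to do: the function name iri_to_uri is hypothetical, the host is converted with Python's built-in "idna" codec (Punycode), userinfo is ignored, and the server is assumed to expect UTF-8 in the path, which is exactly what a legacy server might not do.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when escaping: URI reserved and unreserved
# characters, plus '%' so existing escapes are not encoded twice.
_SAFE = "/?%:@!$&'()*+,;=~-._"


def iri_to_uri(iri: str) -> str:
    """Map an http(s) IRI to an ASCII-only URI (hypothetical helper).

    Assumptions: no userinfo component, and non-ASCII path/query
    characters are meant to be sent as percent-encoded UTF-8.
    """
    parts = urlsplit(iri)

    # Host: convert each label to its ASCII (Punycode) form.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Path, query, fragment: percent-encode non-ASCII characters as UTF-8.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/jos\u00e9"))
# http://xn--bcher-kva.example/jos%C3%A9
```

A URL that is already plain ASCII passes through unchanged, so applying the mapping to every incoming identifier should not disturb existing ones.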
Drummond Reed | 11 Jul 18:25

Re: [OpenID] Canonical OpenID url form

> -----Original Message-----
> From: general-bounces <at> openid.net [mailto:general-bounces <at> openid.net] On
> Behalf Of Martin Atkins
> Sent: Friday, July 11, 2008 12:21 AM
> Cc: 'OpenID List'
> Subject: Re: [OpenID] Canonical OpenID url form
> 
> Drummond Reed wrote:
> > Martin's right, Peter -- XRI is one option for Unicode. But you can also
> > use an internationalized domain name
> > (http://en.wikipedia.org/wiki/Internationalized_domain_name) in a
> > regular URL. It uses Punycode (http://en.wikipedia.org/wiki/Punycode).
> >
> Martin Atkins wrote:
> I hadn't thought of punycode. Certainly I think many of the existing
> implementations would struggle with unicode characters in the domain
> part of the URL. The spec doesn't really seem to say anything about this.
> 
> Should libraries be applying the mapping set out in RFC3987[1] section
> 3.1 to incoming URLs? What about legacy servers that rely on their URLs
> not being encoded in UTF-8? The spec should probably say something about
> this, so that different implementations treat non-ASCII characters in an
> interoperable fashion. If it does already and I've missed it, then
> please point me to it!
> 
> Cheers,
> Martin
> 
> [1] http://www.ietf.org/rfc/rfc3987

Mart, I agree that: a) it's an important issue, and b) it's not adequately
covered in the spec yet.

=Drummond 
