Martin J. Duerst | 26 May 1999 07:38

Special characters in URIs

This is a question for background information from the URI/URL
community.

The fact that URIs (RFC 2396) don't define the character semantics
of the byte values they encode has been discussed on various occasions.

To alleviate the problem, various URL schemes have started to base
themselves on UTF-8, and some formats that carry URIs have defined
error behaviour based on UTF-8.

The second case basically works by saying that if in these formats
(e.g. HTML), an URI contains a non-ASCII character, this character
is converted to a byte sequence using UTF-8 and then %-encoded to
produce a legal URI.
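
As an illustration of the conversion just described, here is a minimal Python sketch. The function name `to_legal_uri` is hypothetical and not taken from any specification. The sketch assumes the input is already a Unicode string and that every ASCII character, including '#' and '%', is left exactly as found.

```python
def to_legal_uri(reference: str) -> str:
    """Percent-encode only the non-ASCII characters of `reference`.

    Each non-ASCII character is converted to its UTF-8 byte sequence and
    every byte is written as %XX. ASCII characters are copied unchanged
    (an assumption of this sketch), so existing escapes are not re-escaped.
    """
    out = []
    for ch in reference:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII: leave untouched
        else:
            # Non-ASCII: UTF-8 bytes, each one %-encoded in uppercase hex
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(to_legal_uri("xyz/\u00e9.html"))
```

Because this is a one-way conversion applied when the value is taken out of the carrying format, it does not need to know which ASCII characters are reserved in any particular scheme.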

The question has now come up whether this behaviour can be extended
to characters in the ASCII range, i.e. any of:

 control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
 space       = <US-ASCII coded character 20 hexadecimal>
 delims      = "<" | ">" | "#" | "%" | <">
 unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

'#' and '%' of course have to stay excluded. For the formats in
question (mostly XML), control characters are not allowed anyway.
"<", <">, ... would only appear as &lt;, &quot;, ... anyway.
Space would have to be used with caution because collapsing
rules might apply.

So the question is mainly about the rest:

Larry Masinter | 27 May 1999 23:28

RE: Special characters in URIs

URL character escaping normally should only be done at the
time the URL is constructed from its component pieces, and
normally should only be undone (unescaped) when the URL
is decomposed into its internal pieces.  Your description
of the process of either applying or removing %XX escaping
seems to be based on having the escapes applied or removed
when the URL is removed from or embedded in some context
such as XML. In general, you cannot change an arbitrary
%XX into the character the XX byte sequence represents in
ASCII without some risk of changing the meaning of the URL,
and so you should not recommend this process at all.
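
A small Python sketch of why unescaping an arbitrary %XX can change the meaning of a URL. The path `/docs/a%2Fb/index` and the helper `path_segments` are made up for illustration. The sketch assumes a hierarchical path in which "/" separates segments.

```python
from typing import List
from urllib.parse import unquote


def path_segments(path: str) -> List[str]:
    """Decompose a path into segments first, then unescape each segment."""
    return [unquote(segment) for segment in path.split("/")]


# Hypothetical path whose third segment contains an escaped slash.
escaped = "/docs/a%2Fb/index"

# Unescaping at decomposition time keeps "a/b" as a single segment.
decomposed = path_segments(escaped)

# Unescaping the whole string beforehand turns %2F into a real separator,
# so the same decomposition now yields one more segment than before.
prematurely_unescaped = path_segments(unquote(escaped))

if __name__ == "__main__":
    print(decomposed)
    print(prematurely_unescaped)
```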

Larry
-- 
http://www.parc.xerox.com/masinter

> The second case basically works by saying that if in these formats
> (e.g. HTML), an URI contains a non-ASCII character, this character
> is converted to a byte sequence using UTF-8 and then %-encoded to
> produce a legal URI.

I think "works" is ambitious. It "works" because most
HTTP servers are forgiving about this kind of transliteration
and most URLs are HTTP.

Dan Connolly | 28 May 1999 00:41

Re: Special characters in URIs

Larry Masinter wrote:
> 
> URL character escaping normally should only be done at the
> time the URL is constructed from its component pieces, and
> normally should only be undone (unescaped) when the URL
> is decomposed into its internal pieces.

True.

>  Your description
> of the process of either applying or removing %XX escaping
> seems to be based on having the escapes applied or removed
> when the URL is removed from or embedded in some context
> such as XML.

only when it's removed

> In general, you cannot change an arbitrary
> %XX into the character the XX byte sequence represents in
> ASCII without some risk of changing the meaning of the URL,

true.

> and so you should not recommend this process at all.

The excerpt below doesn't mention unescaping. Only how
to take an XML attribute value and turn it into a URL
in the case that it's not already a URL (because it
has non-URL characters).


Larry Masinter | 29 May 1999 00:17

RE: Special characters in URIs

(I'm hoping that uri@bunyip.com will migrate to uri@w3.org,
although I've not gotten an acknowledgement. I suppose people
should look for news at http://www.ics.uci.edu/pub/ietf/uri )

> It "works" in the case that, for example, a user copies
> a filename from a desktop filebrowser into an XML document
> 	href="xyz__"
> where __ is some non-URL character.

This works for me if you say that what's in the XML document
attribute isn't really a "URI" but rather something else.
For example, we could use the "IURI" draft to define what
appears in XML, and note that in order to turn it into a URI,
it needs to be escaped. I don't have a problem with that.

> Meanwhile, the HTTP server, when it exports the xyz__ file,
> uses the same convention: UTF-8 encoding, %XX escaped.
> 
> That doesn't mean the HTTP server should grab xyz%XX%XX off
> the tcp socket and unescape it; it means the HTTP server
> should (do something equivalent to) enumerate each file
> in the directory and escape it, and compare the resulting URI path
> to xyz%XX%XX.

Right.
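
The server-side comparison quoted above could be sketched roughly as follows in Python. The names `escape_name` and `find_file` are hypothetical, and the sketch assumes file names are available as Unicode strings and that the request carries a single path segment. The point is that the server escapes its own names and compares, rather than unescaping what arrived on the socket.

```python
from typing import Iterable, Optional
from urllib.parse import quote


def escape_name(name: str) -> str:
    """Escape a file name the way it would be exported: UTF-8, then %XX."""
    # safe="" also escapes "/" so a name can never span two path segments.
    return quote(name, safe="", encoding="utf-8")


def find_file(names: Iterable[str], requested_segment: str) -> Optional[str]:
    """Return the file whose escaped name equals the requested segment."""
    for name in names:
        if escape_name(name) == requested_segment:
            return name
    return None  # no file exports to this segment


if __name__ == "__main__":
    directory = ["abc", "xyz\u00e9"]  # stand-in for a directory listing
    print(find_file(directory, "xyz%C3%A9"))
```

One caveat of this sketch: the comparison is exact, so a request using lowercase hex digits (`%c3%a9`) would not match unless both sides are normalized first.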

> It's a bit of a kludge; the cleaner thing to do would
> be to say "don't put things other than URIs in those
> XML attribute values." But we haven't had any luck doing that.
> And this "kludge" just so happens to be consistent with

