RE: HTML5 and Unicode Normalization Form C
Leif Halvard Silli <xn--mlform-iua <at> xn--mlform-iua.no>
2011-06-01 01:26:09 GMT
( Adding www-validator <at> again. )
Phillips, Addison, Tue, 31 May 2011 09:34:23 -0700:
>> No problem. And it is true that my main focus is on linking.
> Linking is a special case. The IRI WG is also discussing
> normalization. That's the best place to deal with that issue, I
> think. Other comparisons in HTML (attributes and text values) do not
> have externally provided requirements and thus HTML (or CSS or...)
> need to define them.
Thanks for the tip w.r.t IRI WG - I've just subscribed.
Some more words on the HTML5 validator, though: Its current behaviour,
where non-NFC is stamped as an error, means that the HTML5 validator
does not perform - or display - IRI syntax warnings whenever decomposed
characters are used. Instead of giving an IRI-relevant warning message,
the validator stamps the character as an outright error, regardless of
where it occurs (@href or in "content").
By contrast, if one inserts a U+FF74 (a halfwidth Katakana letter that
is in NFC) into @href, then the HTML5 validator gives a proper,
IRI-related
Warning: Bad value #ｴ for attribute href on element a:
Compatibility character in fragment component. [ snip ]
Syntax of IRI reference: [ snip ] Characters should be
represented in NFC and spaces should be escaped as %20.
If the HTML5 validator is going to issue a warning for use of decomposed
characters in content, then it should at least make sure to treat IRIs
(in @href) separately from "content" - they should not be conflated.
There could be a general warning against use of decomposed characters.
But separate from that, there should be an IRI warning as well.
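
The distinction between the two cases can be sketched in Python. This is only an illustration using the standard unicodedata module; the helper name classify is made up for this example and has nothing to do with how the validator is actually implemented.

```python
import unicodedata

def classify(text: str) -> str:
    """Hypothetical helper: sort a string into the cases discussed above."""
    # Decomposed input changes when normalized to NFC.
    if unicodedata.normalize("NFC", text) != text:
        return "not NFC"
    # Compatibility characters survive NFC but change under NFKC.
    if unicodedata.normalize("NFKC", text) != text:
        return "NFC, but compatibility character"
    return "NFC"

decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
halfwidth = "\uff74"    # HALFWIDTH KATAKANA LETTER E
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE

print(classify(decomposed))  # not NFC
print(classify(halfwidth))   # NFC, but compatibility character
print(classify(composed))    # NFC
```

A checker built along these lines could report "not NFC" once as a general notice and, separately, as an IRI warning when the string comes from @href.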
>> HTML5 supports IRIs, which:  "Allows native representation of Unicode in
>> resources without % escaping".
> While this is a general way of defining IRIs, it's also misleading.
That excellent quote stems from one of the authors behind the IRI spec
- Michel Suignard. I like very much that it, in such plain and direct
English, explains the purpose of IRIs.
> While IRIs represent the vast preponderance of Unicode code points
> without escaping, percent escaping is still required in a number of
I accept this as your view of what needs to be communicated. From my
perspective, what the quote says is important to communicate.
The IRI RFC is much duller than that quote. Coming from HTML4, where
non-ASCII inside @href and @id is forbidden, but where it is still
possible to use percent encoding (and the @name attribute in place of
@id) to represent non-ASCII, I want to see it explicitly stated that
directly typed non-ASCII characters are allowed, and not merely when
they are escaped!
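
To make the two spellings concrete, here is a small Python sketch. The fragment value is invented for illustration, and urllib.parse.quote only percent-encodes a single component as UTF-8; it is not a full implementation of the IRI-to-URI mapping in the IRI RFC.

```python
from urllib.parse import quote, unquote

# Hypothetical fragment identifier, typed directly as in an IRI.
fragment = "bl\u00e5b\u00e6r"  # "blåbær"

# HTML4-style spelling: UTF-8 bytes, each percent-escaped.
escaped = quote(fragment)
print(escaped)  # bl%C3%A5b%C3%A6r

# Both spellings denote the same characters.
print(unquote(escaped) == fragment)  # True
```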
Btw, the section "Converting URIs to IRIs" in the IRI RFC points
to 3 other sections which define restrictions, including the section
'Limitations on UCS Characters Allowed in IRIs'. Despite the
restrictions, the purpose of IRIs is nevertheless to allow non-ASCII
characters in URLs. (I suppose some of the restrictions, such as the
restriction on using halfwidth Katakana, are not technical restrictions
but "philosophical" ones, related to the need to avoid visual
look-alikes. As is the recommendation to use NFC.)
[ snip ]
>>>> As it has turned out, however, it was an error of the HTML5 validator
>>>> to show an error for use of NFC. But *that* only increases the
>>>> importance of offer helpful recommendations w.r.t. links.
>>> Thank you for the explanation of the background I wasn't aware of.
>> I should have pointed it out when I CC-ed this list. Sorry.
> If you have concerns about links/web addresses, the best place to
> discuss it is on public-iri <at> w3.org (the IETF IRI WG's mailing list).
> The IRI effort needs all the help it can get.
> As I mentioned before, my impression is that IRI is headed down the
> path of *not* requiring any particular normalization form, although
> NFC is recommended ("SHOULD") and early uniform normalization is
> explicitly assumed.
As noted above, the HTML5 validator does implement that "SHOULD" with
regard to non-NFC in IRIs.
At least, it is my interpretation that, as long as it gets rid of the
general error message (and also does not introduce a similar,
indiscriminate *warning*) for *any* use of decomposed letters, the
HTML5 validator would still warn against use of non-NFC inside IRIs.
> Comparison of IRIs in the current draft addresses
> comparison by defining equivalence at the code point level. See:
It seems this is the most recent variant:
That section defines "character normalization" as part of "syntax-based
normalization". But none of the user agents of the dominant Web
browser families include character/Unicode normalization when they
compare an IRI with @id. That they don't can indeed lead to "false
negatives". So it would be good if they did what the bis draft
I think we need to start by stating that two @id attributes in HTML5
are not to be considered valid "unique identifiers" if the only
difference between them is the normalization form. Filed as a bug:
(Because, unless there is a requirement that no two @id-s can
differ only with regard to normalization, the recommendation
of the IRI bis spec would mean that only the first occurring @id would
(BIS variant of :
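
As a rough illustration of the proposed @id rule, the following Python sketch groups id values by their NFC form and reports groups with more than one spelling. The function name and the sample ids are made up for this example; it is not taken from the HTML5 spec, the bug report or any validator.

```python
import unicodedata
from collections import defaultdict

def ids_differing_only_by_normalization(ids):
    """Hypothetical check: group id values that share one NFC form."""
    groups = defaultdict(set)
    for value in ids:
        groups[unicodedata.normalize("NFC", value)].add(value)
    # Keep only the groups where distinct spellings collapse together.
    return [sorted(group) for group in groups.values() if len(group) > 1]

ids = ["caf\u00e9", "cafe\u0301", "intro"]  # composed, decomposed, plain ASCII

# Code point comparison sees two different ids (the "false negative").
print("caf\u00e9" == "cafe\u0301")  # False

# After normalization the two spellings clash.
clashes = ids_differing_only_by_normalization(ids)
print(len(clashes))  # 1
```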
leif halvard silli