Andreas Prilop | 26 May 17:41 2011

HTML5 and Unicode Normalization Form C

Validating
 http://www.user.uni-hannover.de/nhtcapri/temp/yerushalayim.html
results in error
 Text run is not in Unicode Normalization Form C.

Is Unicode Normalization Form C actually required by HTML5
or is this a validator bug?

Michael[tm] Smith | 26 May 18:46 2011

Re: HTML5 and Unicode Normalization Form C

Andreas Prilop <aprilop <at> freenet.de>, 2011-05-26 17:41 +0200:

> Validating
>  http://www.user.uni-hannover.de/nhtcapri/temp/yerushalayim.html
> results in error
>  Text run is not in Unicode Normalization Form C.
> 
> Is Unicode Normalization Form C actually required by HTML5
> or is this a validator bug?

It's not a validator bug, it's a feature -- in that it's intentional
behavior in the validator, and it's an attempt to provide you with
information about what could be seen as a potential portability problem in
your document. That said, if you're aware of it and you don't consider it a
real problem, you can ignore the error.
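For illustration, whether a text run is in NFC can be checked mechanically; a minimal sketch using Python's standard unicodedata module (not what the validator itself runs):

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if text is already in Unicode Normalization Form C."""
    return unicodedata.is_normalized("NFC", text)

# "é" as one precomposed code point (U+00E9) is in NFC...
print(is_nfc("\u00e9"))
# ...while "e" + combining acute accent (U+0065 U+0301) is not.
print(is_nfc("e\u0301"))
```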

As far as whether Unicode Normalization Form C is actually required by
HTML5: the HTML5 spec does not directly state a requirement on Unicode
Normalization Form C, but I think that requirement is implicit in some
requirement that the spec does state explicitly. I can't right now point
you to what the actual requirement is (because I don't know myself), but
I'll find out and post a follow-up message.

  --Mike

-- 
Michael[tm] Smith
http://people.w3.org/mike

[...]

Andreas Prilop | 27 May 16:35 2011

Re: HTML5 and Unicode Normalization Form C

On Fri, 27 May 2011, Michael[tm] Smith wrote:

>> Is Unicode Normalization Form C actually required by HTML5
>> or is this a validator bug?
>
> it's intentional behavior in the validator, and it's an attempt
> to provide you with information about what could be seen as a
> potential portability problem in your document.

None of your business!

And it is even wrong because Unicode NFC will cause problems even
on Windows XP as shown by
 http://www.user.uni-hannover.de/nhtcapri/temp/yerushalayim.html

The HTML5 validator does not complain about charset=ISO-8859-15.
Are you going to tell us that ISO-8859-15 is "better" than
non-NFC Unicode?

> you can ignore the error

You confessed that there is no error.

Michael[tm] Smith | 27 May 16:58 2011

Re: HTML5 and Unicode Normalization Form C

Andreas Prilop <aprilop <at> freenet.de>, 2011-05-27 16:35 +0200:

> On Fri, 27 May 2011, Michael[tm] Smith wrote:
> > you can ignore the error
> 
> You confessed that there is no error.

Indeed. Which is why I plan to switch it to being a warning. But if you
think it's wrong to even have it emit a warning, then let me know and I'll
talk to Henri and to the internationalization folks about whether it
should. But from what I have been told by the internationalization folks
so far, I think they would like it to generate a warning here.

  --Mike

-- 
Michael[tm] Smith
http://people.w3.org/mike

Andreas Prilop | 27 May 17:33 2011

Re: HTML5 and Unicode Normalization Form C

On Fri, 27 May 2011, Michael[tm] Smith wrote:

> But if you think it's wrong to even have it emit a warning,
> then let me know and I'll talk to Henri and to the internationalization
> folks about whether it should. But from what I have been
> told by the internationalization folks so far, I think they would
> like it to generate a warning here.

Thank you for the clarification!
In my opinion, you should not even emit a warning since Unicode
itself does not require NFC to be used everywhere.
It is the choice of the author to take any character encoding
and any valid Unicode representation. This has nothing to do
with "valid HTML" and should therefore not be reported by
an HTML validator.

But this question is not so important as long as there is
no error. What matters most is that such a page (non-NFC UTF-8)
is still marked as "valid HTML5".

Please see also
 http://www.unicode.org/mail-arch/unicode-ml/y2011-m05/0075.html

Leif Halvard Silli | 29 May 19:21 2011

Re: HTML5 and Unicode Normalization Form C

Andreas Prilop, Fri, 27 May 2011 17:33:35 +0200 (CEST):
> On Fri, 27 May 2011, Michael[tm] Smith wrote:

>> But if you think it's wrong to even have it emit a warning,
>> then let me know and I'll talk to Henri and to the internationalization
>> folks about whether it should. But from what I have been
>> told by the internationalization folks so far, I think they would
>> like it to generate a warning here.

> In my opinion, you should not even emit a warning since Unicode
> itself does not require NFC to be used everywhere.
> It is the choice of the author to take any character encoding
> and any valid Unicode representation. This has nothing to do
> with "valid HTML" and should therefore not be reported by
> an HTML validator.

Actually, as discussed on www-international in February, use of non-NFC 
is likely to be a surprising and hard-to-debug result of interaction 
with a tool or a file system that does not use/convert to NFC, rather 
than a conscious choice. [1]

Use of non-NFC in file names is a problem in itself: unless the URL 
uses the same (de)composition, the file name and the link don't 
match. And even when e.g. a link and a file name both use non-NFC, 
there might be interaction problems related to CSS in some user agents 
(:visited and :link styling).
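The mismatch is easy to reproduce: a composed and a decomposed "å" are canonically equivalent but compare unequal code point by code point, so a file name in one form and a link in the other won't match. A small Python sketch (illustrative only):

```python
import unicodedata

composed = "\u00e5"     # "å" as a single code point (U+00E5)
decomposed = "a\u030a"  # "a" + combining ring above (U+0061 U+030A)

# Canonically equivalent, yet a plain string comparison fails,
# which is exactly why a link and a file name in different
# normalization forms fail to match.
print(composed == decomposed)

# Normalizing both sides to the same form fixes the comparison.
print(unicodedata.normalize("NFC", decomposed) == composed)
```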

HTML5 already warns against use of non-UTF-8 with the justification that 
it can cause problems, quote: [2] "form submission and URL encodings". And 
hence, because non-NFC could cause the same kind of problems, a warning 
[...]

Koji Ishii | 29 May 21:15 2011

RE: HTML5 and Unicode Normalization Form C

I agree that applying NFC/NFD to strings that are to be compared helps a lot. URIs and idrefs are good examples of such strings.

However, I'm against applying NFC to displayable contents. If you read XML 1.0 5th Edition carefully, it
suggests using NFC only for XML Names[1].

Unless Unicode resolves the issues where NFC/NFD changes some glyphs, I believe that NFC/NFD are like
ignore-case: they're good for comparing strings, but you don't want to lowercase whole contents.

My preference is for web servers to apply NFC/NFD as they receive URLs from browsers, just as they do
for case-insensitive matching; but if that's too difficult for some reason, I can live with applying it to
attributes of specific data types. I don't think applying NFC/NFD to whole contents is the right way to go.

[1] http://www.w3.org/TR/xml/#sec-suggested-names

Regards,
Koji

John Cowan | 29 May 22:05 2011

Re: HTML5 and Unicode Normalization Form C

Koji Ishii scripsit:

> However, I'm against applying NFC to displayable contents. If you
> read XML 1.0 5th Edition carefully, it suggests using NFC only for
> XML Names[1].

Actually, it suggests not using compatibility characters.  It's neutral
about precomposed (NFC) vs. decomposed (NFD).

-- 
John Cowan          http://www.ccil.org/~cowan        cowan <at> ccil.org
To say that Bilbo's breath was taken away is no description at all.  There are
no words left to express his staggerment, since Men changed the language that
they learned of elves in the days when all the world was wonderful. --The Hobbit

Phillips, Addison | 29 May 22:14 2011

RE: HTML5 and Unicode Normalization Form C

> 
> > However, I'm against applying NFC to displayable contents. If you read
> > XML 1.0 5th Edition carefully, it suggests using NFC only for XML
> > Names[1].
> 
> Actually, it suggests not using compatibility characters.  It's neutral about
> precomposed (NFC) vs. decomposed (NFD).

No, in Appendix J, item #3 says:

--
Characters in names should be expressed using Normalization Form C as defined in [UnicodeNormal].
--

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.



Leif Halvard Silli | 29 May 22:16 2011

Re: HTML5 and Unicode Normalization Form C

John Cowan, Sun, 29 May 2011 16:05:09 -0400:
> Koji Ishii scripsit:
> 
>> However, I'm against applying NFC to displayable contents. If you
>> read XML 1.0 5th Edition carefully, it suggests using NFC only for
>> XML Names[1].
> 
> Actually, it suggests not using compatibility characters.  It's neutral
> about precomposed (NFC) vs. decomposed (NFD).

Well, it doesn't sound neutral when XML 1.0 says: 

]] 
3.	Characters in names should be expressed using Normalization Form C 
as defined in [UnicodeNormal].
[[

The [UnicodeNormal] reference leads to 'Unicode normalization forms' 
[1]. However, it appears a bit circular when it claims that "other W3C 
Specifications (such as XML 1.0 5th Edition) recommend using 
Normalization Form C for all content".  (XML 1.0 points to the report 
and the report points to XML 1.0.) And it doesn't seem that XML 1.0 
specifically recommends NFC "for all content".

[1] http://unicode.org/reports/tr15/
-- 
Leif Halvard Silli

Leif Halvard Silli | 30 May 02:06 2011

RE: HTML5 and Unicode Normalization Form C

Koji Ishii, Sun, 29 May 2011 15:15:24 -0400:
> I agree that NFC/NFD against strings to be compared helps a lot. URI 
> and idref are good examples of such strings.
  [ snip ]
> Unless Unicode resolves issues where NFC/NFD changes some glyphs, I 
> believe that NFC/NFD are like ignore-case; they're good to compare 
> strings, but you don't want to lowercase whole contents.

So, is your proposal that validators should warn against non-NFC in 
links and identifiers, but not elsewhere?

Clearly, HTML5 and the HTML5 validator should help authors avoid 
gotchas. But, when thinking through some scenarios, it seems difficult 
to give the right kind of warning/advice in a validator.

Example: 

* For the Apache2 version that comes with Mac OS X, one might in 
principle use composed as well as decomposed links even if the file 
names are decomposed. In Apache on Mac OS X there is, however, a 
single problem: cool, composed IRIs. E.g. 
	<http://example.com/%C3%A5.html> works, while 
	<http://example.com/%C3%A5> does not work. Maybe this is an Apache 
bug.
* In order to fix the above problem, which also led customers to react 
when files were placed online, I started to use decomposed links:
    <http://example.com/a%CC%8A>

To say that I SHOULD use a composed link rather than a decomposed link 
in that situation perhaps would not be wise. OTOH, if the 
[...]

Leif Halvard Silli | 30 May 03:00 2011

RE: HTML5 and Unicode Normalization Form C

Leif Halvard Silli, Mon, 30 May 2011 02:06:19 +0200:

> Clearly, HTML5 and the HTML5 validator should help authors avoid 
> gotchas. But, when thinking through some scenarios, it seems to be 
> difficult to give the right kind of warning/advice in a validator.
> 
> Example: 
> 
> * For the Apache2 version that comes with Mac OS X, one might in 
> principle use composed as well as decomposed links even if the file 
> names are decomposed. In Apache on Mac OS X there is, however, a 
> single problem: cool, composed IRIs. E.g. 
> 	<http://example.com/%C3%A5.html> works, while 
> 	<http://example.com/%C3%A5> does not work. Maybe this is an Apache 
>   bug.
> * In order to fix the above problem, which also led customers to react 
> when files were placed online, I started to use decomposed links:
>     <http://example.com/a%CC%8A>

Just discovered, though, that Safari on Windows (but not on Mac) handles 
decomposed values in a unique way:

* in the case of a decomposed fragment link, Safari on Windows will 
target the composed identifier. If there is no composed identifier, 
then it will target nothing. Chrome, Safari-on-Mac and "all other" 
browsers treat them differently.

* in the case of a decomposed cool IRI, it will not work at all. This is 
probably because Safari for Windows normalizes the cool IRI first: as 
already noted, cool IRIs do not seem to work (whenever they contain 
[...]

Bjoern Hoehrmann | 30 May 03:16 2011

Re: HTML5 and Unicode Normalization Form C

* Leif Halvard Silli wrote:
>Just discovered, though, that Safari on Windows (but not on Mac) handles 
>decomposed values in a unique way:
>
>* in the case of a decomposed fragment link, Safari on Windows will 
>target the composed identifier. If there is no composed identifier, 
>then it will target nothing. Chrome, Safari-on-Mac and "all other" 
>browsers treat them differently.

There has never been much consistency in this area as far as I am aware,
e.g. <http://lists.w3.org/Archives/Public/www-html/2002Oct/0002.html>.
-- 
Björn Höhrmann · mailto:bjoern <at> hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Koji Ishii | 30 May 04:10 2011

RE: HTML5 and Unicode Normalization Form C

It looks like all Leif cares is URL. Shouldn't it be covered in URL/IRI spec rather than in HTML 5 spec? I
haven't read in depth but RFC 3987[1] mentions normalizations in IRI.

I think it'd make sense for HTML5 spec and validator to follow URL/IRI spec for attributes that contain URL/IRI.

Whether to apply NFC/NFD to whole contents or not seems to be a little separate issue to me.

[1] http://www.ietf.org/rfc/rfc3987.txt

Regards,
Koji

Leif Halvard Silli | 30 May 04:57 2011

RE: HTML5 and Unicode Normalization Form C

Koji Ishii, Sun, 29 May 2011 22:10:29 -0400:
> It looks like all Leif cares is URL.

All? As in "nothing more than"? 

On the contrary, to squarely look at URLs could mean that one loses 
compatibility. (E.g. if example.org/fönt is encoded in a way that is 
incompatible with the way your keyboard etc. works.)

> Shouldn't it be covered in 
> URL/IRI spec rather than in HTML 5 spec? I haven't read in depth but 
> RFC 3987[1] mentions normalizations in IRI.

HTML5 does try to define its own subset of that. See: 
http://www.w3.org/html/wg/href/draft (And HTML5 itself.)

> I think it'd make sense for HTML5 spec and validator to follow 
> URL/IRI spec for attributes that contain URL/IRI.

Do you expect text editors to encode content of attributes differently 
from content of other parts of the text file?

> Whether to apply NFC/NFD to whole contents or not seems to be a 
> little separate issue to me.

This thread started on www-validator <at>  and did not speak about "whole 
contents" or not - it only dealt with the fact that the HTML5 validator 
issued an error for non-NFC content. I have also seen that same error, 
and I thought - then - that it was based on HTML5.

[...]

Koji Ishii | 30 May 10:21 2011

RE: HTML5 and Unicode Normalization Form C

> Koji Ishii, Sun, 29 May 2011 22:10:29 -0400:
> > It looks like all Leif cares is URL.
> 
> All? As in "nothing more than"?

Ah... I apologize if it sounded offensive; I wanted to say that the examples you raised were related to URLs.
I'm still not sure whether my wording was offensive, probably due to my English skills, but if you felt anything
bad, that wasn't my intention. I apologize for that.


> > I think it'd make sense for HTML5 spec and validator to follow
> > URL/IRI spec for attributes that contain URL/IRI.
> 
> Do you expect text editors to encode content of attributes differently
> from content of other parts of the text file?

Yes, for validators. URL/IRI has syntax such as encoding using "%", so validating attribute values against
their data types makes sense to me. If that wasn't the goal of the HTML5 validator, or if I'm asking too much, I'm
sorry for that.

But you're right that it could be a hard requirement for editors. If we take it seriously, I guess we have to
wait for Unicode to fix the NFC problems (I heard an effort is going on) or to ask web browsers/servers to
normalize on the fly. All the options we have today have trade-offs, and I just wanted you to be aware that
normalizing whole contents today can harm some scripts.


> > Whether to apply NFC/NFD to whole contents or not seems to be a
> > little separate issue to me.
> 
> This thread started on www-validator <at>  and did not speak about "whole
[...]

Leif Halvard Silli | 30 May 16:15 2011

RE: HTML5 and Unicode Normalization Form C

Koji Ishii, Mon, 30 May 2011 04:21:45 -0400:
>> Koji Ishii, Sun, 29 May 2011 22:10:29 -0400:
>>> It looks like all Leif cares is URL.
>> 
>> All? As in "nothing more than"?
> 
> Ah...I apologize  [ snip ]

No problem. And it is true that my main focus is on linking.

>>> I think it'd make sense for HTML5 spec and validator to follow
>>> URL/IRI spec for attributes that contain URL/IRI.
>> 
>> Do you expect text editors to encode content of attributes differently
>> from content of other parts of the text file?
> 
> Yes for validators. URL/IRI has syntax like encoding using "%", so 
> validation of attribute values using its data type makes sense to me. 
> If it wasn't the goal of the HTML5 validator, or if I'm asking too 
> much, I'm sorry for that.

HTML5 supports IRIs, which: [1] "Allows native representation of 
Unicode in resources without % escaping". Or put differently: [2] "the 
desired Web address is stored in a document link or typed into the 
client's address bar using the relevant native characters".

> But you're right that it could be a hard requirement for editors. If 
> we take it seriously, I guess we have to wait for Unicode to fix the NFC 
> problems (I heard an effort is going on) or to ask web 
> browsers/servers to normalize on the fly. All the options we have today 
[...]

Koji Ishii | 30 May 19:04 2011

RE: HTML5 and Unicode Normalization Form C

Thank you for the understanding and I still feel sorry for my English skills. I've been wishing to learn
better but it never happened. Sigh.

> Which scripts could such a thing harm?

One I know is CJK Compatibility Block (U+F900-FAFF) I wrote before. The other I found on the web is in the
picture of this page[1] (text is in Japanese, sorry.) NFC transforms "U+1E0A U+0323" to "U+1E0A U+0307",
and you see the upper dot is painted at different position. It must be a bug in Word, and I don't know how bad it
is though.

I discussed the problem with Ken Lunde before. He's aware of the problem and was thinking about how to solve it.
So the hope is that we might have a better solution in the future, but right now we unfortunately don't have a
good tool that solves linking problems without changing glyphs.

[1] http://blog.antenna.co.jp/PDFTool/archives/2006/02/pdf_41.html

Regards,
Koji


Michel Suignard | 30 May 20:53 2011

RE: HTML5 and Unicode Normalization Form C

> One I know is CJK Compatibility Block (U+F900-FAFF) I wrote before. The other I found on the web is in the
> picture of this page[1] (text is in Japanese, sorry.) NFC transforms "U+1E0A U+0323" to "U+1E0A U+0307",
> and you see the upper dot is painted at different position. It must be a bug in Word, and I don't know how bad it
> is though.

Please be careful, it is transformed from "U+1E0A U+0323" to "U+1E0C U+0307" (not "U+1E0A U+0307"). A
very different transform.

The rendering issue has nothing to do with Word. It just depends on how the font renders either sequence, which
may be slightly different. In a good Latin font they should be rendered the same. On my machine, Win7 with
Office 10, the rendering looks identical in Arial and Times New Roman, which are designed to work well
with Latin combining marks; not as well in Calibri, which is not designed that way.
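Michel's transform can be checked mechanically, for instance with Python's unicodedata:

```python
import unicodedata

# D-with-dot-above (U+1E0A) followed by combining dot below (U+0323)
seq = "\u1e0a\u0323"

# NFC decomposes, reorders the combining marks canonically, then
# recomposes: the result is U+1E0C (D with dot below) followed by
# U+0307 (combining dot above), i.e. "U+1E0C U+0307".
nfc = unicodedata.normalize("NFC", seq)
print([f"U+{ord(c):04X}" for c in nfc])
```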

The CJK Compatibility Block is altogether a different issue, resulting from the earlier design decision to
make those characters canonically equivalent to their unified counterparts, which created the problem later
when normalization was introduced.
It calls for another way to encode them, which is probably what Ken is alluding to.

Michel

Leif Halvard Silli | 31 May 00:37 2011

RE: HTML5 and Unicode Normalization Form C

Koji Ishii, Mon, 30 May 2011 13:04:34 -0400:

>> Which scripts could such a thing harm?
> 
> One I know is CJK Compatibility Block (U+F900-FAFF) I wrote before. 
> The other I found on the web is in the picture of this page[1] (text 
> is in Japanese, sorry.) NFC transforms "U+1E0A U+0323" to "U+1E0A 
> U+0307", and you see the upper dot is painted at different position. 
> It must be a bug in Word, and I don't know how bad it is though.
> 
> I discussed the problem with Ken Lunde before. He's aware of the 
> problem and he was thinking how to solve it. So the hope is we might 
> have better solution in future, but right now, we don't have a good 
> tool that solves linking problems without changing glyphs 
> unfortunately.
> 
> [1] http://blog.antenna.co.jp/PDFTool/archives/2006/02/pdf_41.html


That article is about PDF, no? Normalization problems related to PDFs are 
something I often see: often, when I copy the letter "å" from some 
PDF document, it turns out that the PDF stored it decomposed. When 
I paste it into an editor, this can lead to funny problems. Now and 
then I have had to use a tool to convert it to NFC. 

I don't know if this is because PDF prefers decomposed letters, or 
what it is.

Unfortunately, I don't 100% understand the issues that you take up on 
your web page. But it seems from Michel's comment that it is also a 
font issue. It is a very real problem that there are many fonts that do 
[...]

Koji Ishii | 31 May 14:42 2011

RE: HTML5 and Unicode Normalization Form C

Thank you Michel and Leif; yeah, I confirmed that it was a font issue too. I'm sorry for posting information
without enough verification. The author was seeing an issue in his own PDF tool and saw it reproduced in
Word 2003 (way too old).

So, spec-wise, the real issue is only in the CJK Compatibility Block as far as I know. The other issues are
implementation bugs in fonts. I'm sorry for Microsoft.


Regards,
Koji


Phillips, Addison | 31 May 18:34 2011

RE: HTML5 and Unicode Normalization Form C

> 
> No problem. And it is true that my main focus is on linking.

Linking is a special case. The IRI WG is also discussing normalization. That's the best place to deal with
that issue, I think. Other comparisons in HTML (attributes and text values) do not have externally
provided requirements, and thus HTML (or CSS or...) needs to define them.

> 
> HTML5 supports IRIs, which: [1] "Allows native representation of Unicode in
> resources without % escaping". 

While this is a general way of defining IRIs, it's also misleading: IRIs represent the vast
preponderance of Unicode code points without escaping, but percent escaping is still required in a number of cases.

> 
> > But you're right that it could be a hard requirement for editors. If
> > we take it seriously, I guess we have to wait Unicode to fix NFC
> > problems (I heard the effort is going on) or to ask web
> > browsers/servers to normalize on the fly. 

Normalization is subject to Unicode's stability policy. I don't know what you think qualifies as "fixed",
but it will not take the form of changing either the definition of NFC or the properties of specific
characters. See: http://unicode.org/policies/stability_policy.html 

> >>
> >> As it has turned out, however, it was an error of the HTML5 validator
> >> to show an error for use of NFC. But *that* only increases the
> >> importance of offer helpful recommendations w.r.t. links.
> >
> > Thank you for the explanation of the background I wasn't aware of.
[...]

Mark Davis ☕ | 1 Jun 00:08 2011

Re: HTML5 and Unicode Normalization Form C


Mark

— Il meglio è l’inimico del bene —


On Tue, May 31, 2011 at 09:34, Phillips, Addison <addison <at> lab126.com> wrote:
> >
> > No problem. And it is true that my main focus is on linking.
>
> Linking is a special case. The IRI WG is also discussing normalization. That's the best place to deal with that issue, I think. Other comparisons in HTML (attributes and text values) do not have externally provided requirements and thus HTML (or CSS or...) need to define them.
>
> >
> > HTML5 supports IRIs, which: [1] "Allows native representation of Unicode in
> > resources without % escaping".
>
> While this is a general way of defining IRIs, it's also misleading. While IRIs represent the vast preponderance of Unicode code points without escaping, percent escaping is still required in a number of cases.
>
> >
> > > But you're right that it could be a hard requirement for editors. If
> > > we take it seriously, I guess we have to wait Unicode to fix NFC
> > > problems (I heard the effort is going on) or to ask web
> > > browsers/servers to normalize on the fly.
>
> Normalization is subject to Unicode's stability policy. I don't know what you think qualifies as "fixed", but it will not take the form of changing either the definition of NFC or the properties of specific characters. See: http://unicode.org/policies/stability_policy.html

What this might be referring to is that we are looking at the use of IVSs for CJK compatibility characters. This would not change NFC, but would give people a way to maintain glyphic variants across NFC. For more info on IVSs, see http://unicode.org/ivd/, http://unicode.org/reports/tr37/.

(To make a very long story short, the CJK compatibility characters are a small fraction of those where people want to be able to have glyphic variants. By using IVSs instead of the CJK compatibility characters, people can ensure that their glyphic variants are correctly encoded — and in a way that is not affected by NFC. We're still in the process of looking at this, so stay tuned.)

> > >>
> > >> As it has turned out, however, it was an error of the HTML5 validator
> > >> to show an error for use of NFC. But *that* only increases the
> > >> importance of offer helpful recommendations w.r.t. links.
> > >
> > > Thank you for the explanation of the background I wasn't aware of.
> >
> > I should have pointed it out when I CC-ed this list. Sorry.
>
> If you have concerns about links/web addresses, the best place to discuss it is on public-iri <at> w3.org (the IETF IRI WG's mailing list). The IRI effort needs all the help it can get.
>
> As I mentioned before, my impression is that IRI is headed down the path of *not* requiring any particular normalization form, although NFC is recommended ("SHOULD") and early uniform normalization is explicitly assumed. Comparison of IRIs in the current draft addresses comparison by defining equivalence at the code point level. See: http://tools.ietf.org/html/draft-duerst-iri-bis-07#section-5.3.2
>
> Addison
>
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
>
> Internationalization is not a feature.
> It is an architecture.




Leif Halvard Silli | 1 Jun 03:26 2011

RE: HTML5 and Unicode Normalization Form C

( Adding www-validator <at>  again. )

Phillips, Addison, Tue, 31 May 2011 09:34:23 -0700:
>> 
>> No problem. And it is true that my main focus is on linking.
> 
> Linking is a special case. The IRI WG is also discussing 
> normalization. That's the best place to deal with that issue, I 
> think. Other comparisons in HTML (attributes and text values) do not 
> have externally provided requirements and thus HTML (or CSS or...) 
> need to define them.

Thanks for the tip w.r.t IRI WG - I've just subscribed.

Some more words on the HTML5 validator, though: Its current behaviour, 
where non-NFC is stamped as an error, means that the HTML5 validator 
does not perform - or display - IRI syntax warnings whenever decomposed 
characters are used. Instead of giving an IRI-relevant warning message, 
the validator stamps the character as an outright error, regardless of 
where it occurs ( <at> href or in "content").

By contrast, if one inserts a U+FF74 (an NFC, halfwidth Katakana letter) 
into  <at> href, then the HTML5 validator gives a proper, IRI-related 
warning:

]]
Warning: Bad value #エ for attribute href on element a: 
Compatibility character in fragment component.  [ snip ]
Syntax of IRI reference: [ snip ] Characters should be 
represented in NFC and spaces should be escaped as %20.
[[

If the HTML5 validator is to issue a warning for use of decomposed 
characters in content, then it should at least make sure to treat IRIs 
(in  <at> href) separately from "content" - they should not be conflated. 
There could be a general warning against use of decomposed characters. 
But separate from that, there should be an IRI warning as well. 

>> HTML5 supports IRIs, which: [1] "Allows native representation of Unicode in
>> resources without % escaping". 
> 
> While this is a general way of defining IRIs, it's also misleading. 

That excellent quote stems from one of the authors behind the IRI spec 
- Michel Suignard. I like very much that it, in such plain and direct 
English, explains the purpose of IRIs.

> While IRIs represent the vast preponderance of Unicode code points 
> without escaping, percent escaping is still required in a number of 
> cases.

I accept this as your view of what needs to be communicated. From my 
perspective, what the quote says is important to communicate. 

The IRI RFC is much duller than that quote. Coming from HTML4, where 
non-ASCII inside  <at> href and  <at> id is forbidden, but where it is still 
possible to use percent encoding (and the  <at> name attribute in place of 
 <at> id) to represent non-ASCII, I want to see it explicitly stated that 
directly typed non-ASCII characters are allowed - not merely allowed in 
escaped form!
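Composed and decomposed forms also remain distinct after percent-encoding, which is where link mismatches become visible; a quick illustration with Python's urllib (an illustrative sketch):

```python
import unicodedata
from urllib.parse import quote

decomposed = "a\u030a"  # decomposed "å": a + combining ring above

# Percent-encoding the decomposed form yields a%CC%8A...
print(quote(decomposed))
# ...while the composed form yields %C3%A5: two different URLs
# for what a reader sees as the same letter.
print(quote(unicodedata.normalize("NFC", decomposed)))
```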

Btw, the section "Converting URIs to IRIs" in the IRI RFC [1] points 
to 3 other sections which define restrictions, including the section 
'Limitations on UCS Characters Allowed in IRIs'. [2] Despite the 
restrictions, the purpose of IRIs is nevertheless to allow non-ASCII 
characters in URLs. (I suppose some of the restrictions, such as the 
restriction on using halfwidth Katakana, are not technical restrictions 
but "philosophical" ones, related to the need to avoid visual 
look-alikes. As is the recommendation to use NFC.)

 [ snip ]

>>>> As it has turned out, however, it was an error of the HTML5 validator
>>>> to show an error for use of NFC. But *that* only increases the
>>>> importance of offer helpful recommendations w.r.t. links.
>>> 
>>> Thank you for the explanation of the background I wasn't aware of.
>> 
>> I should have pointed it out when I CC-ed this list. Sorry.
> 
> If you have concerns about links/web addresses, the best place to 
> discuss it is on public-iri <at> w3.org (the IETF IRI WG's mailing list). 
> The IRI effort needs all the help it can get.
> 
> As I mentioned before, my impression is that IRI is headed down the 
> path of *not* requiring any particular normalization form, although 
> NFC is recommended ("SHOULD") and early uniform normalization is 
> explicitly assumed.

As noted above, the HTML5 validator does implement that "SHOULD" with 
regard to non-NFC in IRIs. 

At least, it is my interpretation that, as long as it gets rid of the 
general error message (and does not introduce a similar, 
indistinguishable *warning*) for *any* use of decomposed letters, then 
the HTML5 validator would still warn against use of non-NFC inside IRIs.

> Comparison of IRIs in the current draft addresses 
> comparison by defining equivalence at the code point level. See: 
> http://tools.ietf.org/html/draft-duerst-iri-bis-07#section-5.3.2 

It seems this is the most recent variant:
http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-5.3.2 

That section defines "character normalization" as part of "syntax-based 
normalization".  But none of the user agents of the dominant Web 
browser families include character/Unicode normalization when they 
compare an IRI with @id. That they don't can indeed lead to "false 
negatives". So it would be good if they did what the bis draft 
recommends. 

I think we need to start by stating that two @id attributes in HTML5 
are not to be considered valid, "unique identifiers" if the only 
difference between them is the normalization form. Filed as a bug: 
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12839


(Because, unless there is such a requirement that no two @id-s can 
differ only with regard to normalization, the recommendation 
of the IRI bis spec would mean that only the first occurring @id would 
be found.)
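
A minimal sketch of the comparison in question (assuming Python's 
unicodedata; not an implementation taken from any spec):

```python
import unicodedata

composed = "\u00e5"      # "å" as one precomposed code point (NFC)
decomposed = "a\u030a"   # "a" + COMBINING RING ABOVE (NFD)

# Code-point comparison, as today's browsers do: the values differ.
print(composed == decomposed)  # False

# Comparison after NFC normalization, as the IRI bis draft recommends:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```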

[1] http://tools.ietf.org/html/rfc3987#section-3.2 
    (BIS variant of [1]: 
http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.7 )
[2] http://tools.ietf.org/html/rfc3987#section-6.1

-- 
leif halvard silli
John Cowan | 1 Jun 05:51 2011

Re: HTML5 and Unicode Normalization Form C

Leif Halvard Silli scripsit:

> Warning: Bad value #エ for attribute href on element a:
> Compatibility character in fragment component.  [ snip ] Syntax of
> IRI reference: [ snip ] Characters should be represented in NFC and
> spaces should be escaped as %20.

This warning is no good anyway, since U+FF74 is permitted in NFC texts
(though not in NFKC texts, where it must be replaced by U+30A8).
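
For the record, this is easy to check (a sketch with Python's 
unicodedata):

```python
import unicodedata

halfwidth_e = "\uff74"  # HALFWIDTH KATAKANA LETTER E

# NFC leaves compatibility characters such as U+FF74 untouched ...
print(unicodedata.normalize("NFC", halfwidth_e) == halfwidth_e)  # True

# ... while NFKC replaces U+FF74 with U+30A8, KATAKANA LETTER E.
print(unicodedata.normalize("NFKC", halfwidth_e) == "\u30a8")    # True
```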

--

-- 
John Cowan    http://ccil.org/~cowan    cowan <at> ccil.org
SAXParserFactory [is] a hideous, evil monstrosity of a class that should
be hung, shot, beheaded, drawn and quartered, burned at the stake,
buried in unconsecrated ground, dug up, cremated, and the ashes tossed
in the Tiber while the complete cast of Wicked sings "Ding dong, the
witch is dead."  --Elliotte Rusty Harold on xml-dev

Leif Halvard Silli | 30 May 04:38 2011

Re: HTML5 and Unicode Normalization Form C

Bjoern Hoehrmann, Mon, 30 May 2011 03:16:02 +0200:
> * Leif Halvard Silli wrote:
>> Just discoverd, though, that Safari on Windows (but not on Mac) handles 
>> decomposed values in a unique way:
>> 
>> * in case of a de-composed fragment link, then Safari on Windows will 
>> target the composed identifier. If there is no composed identifier, 
>> then it will target nothing. Chrome, Safari-on-Mac and "all other" 
>> browsers treat them differently.
> 
> There has never been much consistency in this area as far as I am aware,
> e.g. <http://lists.w3.org/Archives/Public/www-html/2002Oct/0002.html>.

Those test results are of course interesting. A retest where you looked 
at today's browsers would be interesting too!

But nevertheless, that test is not completely relevant to the issue at 
hand. HTML5 says that URLs are to be resolved in accordance with the 
encoding. And HTML5 does not say that de-composed and composed values 
should be treated equally either.

The tests I have performed are not experiments where I try to test 
whether values that should (or should not) be seen as equal really are 
seen as equal, with the goal of discovering as many bugs as possible. 
Instead, I accept how UAs treat NFC and NFD differently. (It all started 
when I tried to discover why on earth links that worked fine on my Mac 
did not work elsewhere.)

Safari on Windows is an odd exception, even in the Webkit family. I 
mentioned it only because (I think) it sheds some light on the other 
issues discussed here. It is, as well, an example of how using NFC 
really *can* improve browser compatibility.
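
To make the difference concrete, here is a hypothetical sketch of the 
two matching behaviours (the ids, the helper and its name are my own 
illustration, not any browser's actual code):

```python
import unicodedata

ids = ["\u00e5"]  # the document contains id="å" in composed form

def find_target(fragment, normalize=False):
    """Code-point lookup; optionally NFC-normalize first, which
    resembles the Safari-on-Windows behaviour described above."""
    if normalize:
        fragment = unicodedata.normalize("NFC", fragment)
    return fragment in ids

print(find_target("a\u030a"))                  # False: most browsers
print(find_target("a\u030a", normalize=True))  # True: composed id found
```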
--

-- 
Leif H Silli

Phillips, Addison | 29 May 22:54 2011

RE: HTML5 and Unicode Normalization Form C

> 
> As for using non-NFC outside attributes, then I don't know if there are issues
> which can justify a warning. But according to Unicode technical report 15, then
> the "W3C Character Model for the World Wide Web [ snip ] and other W3C
> Specifications (such as XML 1.0 5th Edition) recommend using Normalization
> Form C for all content." [4]
> 

There has been some confusion about what Charmod-Norm says (and what the Internationalization WG thought
it meant when it said it). I'd like to clarify somewhat. Please note that this is a *personal* email, with my
chair hat off.

The normative bits of Charmod-Norm live at [1]. Items C300 and C301 use the RFC 2119 keyword "SHOULD" in
requiring that content and specifications be fully-normalized or include-normalized. These
requirements used to say "MUST" because the original intent was that "early uniform normalization"
(EUN) would be required by the Character Model.

In 2004/2005, the Internationalization Working Group decided that early uniform normalization was dead
and that requiring normalization of content (such that applications could assume that content was
already normalized) was no longer a reasonable position for Charmod. The debate was whether to relax the
"MUST" requirement to "SHOULD", to "MAY", or whether it should be removed altogether. The WG felt, at that
time, that normalized content was desirable even if applications and formats could not count on
normalization having been applied. Therefore, the recommendation was kept at "SHOULD" rather than the
weaker "MAY" (or removed altogether). Further, it was felt that new formats might wish to require
normalization even if existing formats did not.

It would be unreasonable, in my opinion, to treat HTML5 as a *new* format, so I think any expectations for
adding a normalization requirement to HTML are unrealistic.

Having dropped EUN, other requirements were added or modified to deal with the fact that content would not
be ensured to be in a normalized form. The 2119 keyword "SHOULD" has a very strong normative meaning (only a
little bit less strong than "MUST"), but the WG's intent was significantly less strong. Once you cannot
assume that content is normalized, you must perform normalization-sensitive operations carefully or
suffer the consequences.

Charmod-Norm was not intended to be advanced in its current form for precisely the reasons we are
discussing on this thread. Removing EUN means additional complexity, since specifications and formats
must then deal with normalization independently, especially when it comes to things such as
identifiers. The I18N Core WG has recently agreed to work on normalization guidelines again. There is
(and always has been) little enthusiasm for working on the Character Model, but having read the
normalization document again this weekend, I suspect that Charmod-Norm will probably have to be
replaced, rather than just worked around.

HTH,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

[1] http://www.w3.org/TR/charmod-norm/#sec-NormalizationApplication

Leif Halvard Silli | 30 May 03:43 2011

RE: HTML5 and Unicode Normalization Form C

Phillips, Addison, Sun, 29 May 2011 13:54:34 -0700:
>> 
>> As for using non-NFC outside attributes, then I don't know if 
>> there are issues which can justify a warning. But according
>> to Unicode technical report 15, then the "W3C Character Model
>> for the World Wide Web [ snip ] and other W3C Specifications
>> (such as XML 1.0 5th Edition) recommend using Normalization
>> Form C for all content." [4]
  [...]
> The normative bits of Charmod-Norm live at [1]. Items C300 and C301 
> use the RFC 2119 keyword "SHOULD" in requiring that content and 
> specifications be fully-normalized or include-normalized.
  [...]
> It would be unreasonable, in my opinion, to treat HTML5 as a *new* 
> format, so I think any expectations for adding a normalization 
> requirement to HTML are unrealistic.

However, HTML5 warns against not using UTF-8 because of the 
"unexpected results" in form submissions and links that can follow 
from not doing so. It would seem in tune with this spirit to, if 
possible, let HTML5/validators point to how to eliminate the problems 
that can cause unexpected results even with UTF-8, no?
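
A quick sketch of the kind of "unexpected results" meant here, using 
Python's urllib.parse: the same character percent-encodes to different 
bytes depending on the document encoding:

```python
from urllib.parse import quote

# Link/form values are percent-encoded using the document's encoding:
print(quote("\u00e5", encoding="utf-8"))        # %C3%A5
print(quote("\u00e5", encoding="iso-8859-15"))  # %E5
```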

Btw, it seems to be unclear, from HTML5, whether two @id attributes 
that only differ with regard to their normalization are to be 
considered unique. All HTML5 says is that @id attributes must be 
unique, but it is not said what actually makes them unique. [1]

Related to the uniqueness: 
  * On the Mac, when serving a file on the preinstalled Apache2, 
normalized link values (provided they are not cool IRIs with decomposed 
letters) do target files with non-normalized file names. How come? Is 
it because Apache performs a normalization of the HTTP request? 
  * Inside a document, however (with the exception of Safari on Windows 
[2]), composed and decomposed identifiers are treated by browsers 
as distinct identifiers. 
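
The Mac oddity can be illustrated (a sketch with Python's stdlib; the 
file name is made up): HFS+ stores file names in a variant of 
decomposed form, and NFC and NFD percent-encode to different bytes on 
the wire:

```python
from urllib.parse import quote
import unicodedata

name = "bl\u00e5.html"                     # "blå.html", composed (NFC)
nfd = unicodedata.normalize("NFD", name)   # roughly as HFS+ stores it

print(quote(name))  # bl%C3%A5.html
print(quote(nfd))   # bla%CC%8A.html - different bytes, same "name"
```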

  [...]
> The I18N Core WG has recently agreed 
> to work on normalization guidelines again. There is (and has ever 
> been) little enthusiasm for working on the Character Model, but 
> having read the normalization document again this weekend, I suspect 
> that Charmod-Norm will probably have to be replaced, rather than just 
> worked around.

Good to hear you are looking at it!

> [1] http://www.w3.org/TR/charmod-norm/#sec-NormalizationApplication

[1] http://dev.w3.org/html5/spec/elements.html#the-id-attribute
[2] http://lists.w3.org/Archives/Public/www-validator/2011May/0052
--

-- 
Leif H Silli

Leif Halvard Silli | 29 May 18:53 2011

Re: HTML5 and Unicode Normalization Form C

Andreas Prilop, Fri, 27 May 2011 16:35:11 +0200 (CEST):

> The HTML5 validator does not complain about charset=ISO-8859-15.
> Are you going to tell us that ISO-8859-15 is "better" than
> non-NFC Unicode?

The HTML5 validator could very well show a warning for use of 
ISO-8859-15. HTML5 explicitly allows such a warning. Mike, would you 
add such a warning? Consider it a feature request on my part!

This is what HTML5 says: [1]

]]
Authors are encouraged to use UTF-8. Conformance checkers may advise 
authors against using legacy encodings. [RFC3629]

Authoring tools should default to using UTF-8 for newly-created 
documents. [RFC3629]
  [ snip ]
Using non-UTF-8 encodings can have unexpected results on form 
submission and URL encodings, which use the document's character 
encoding by default.
[[

[1] http://www.w3.org/TR/html5/semantics.html#charset
--

-- 
Leif Halvard Silli

Michael[tm] Smith | 27 May 16:42 2011

Re: HTML5 and Unicode Normalization Form C

"Michael[tm] Smith" <mike <at> w3.org>, 2011-05-27 01:46 +0900:

> As far as whether Unicode Normalization Form C is actually required by
> HTML5: the HTML5 spec does not directly state a requirement on Unicode
> Normalization Form C, but I think that requirement is implicit in some
> direct requirement that the spec does explicitly. I can't right now point
> you to what the actual requirement is (because I don't know myself), but
> I'll find out and post a follow-up message.

So this is the follow-up message.

First thing I should say is that I was wrong: The HTML5 spec does not have
any requirement for NFC at all, not even indirectly.

Second thing: So because of the first thing, the validator should not be
emitting an error for non-NFC stuff. But I think it is useful to
instead have it emit a warning, and some others I've talked with who are
more knowledgeable than me about NFC agree, so what I'm going to do is flip
the validator code to make it emit a warning instead of an error. Then I'll
update the W3C backends for the HTML5 facet and re-deploy them, by some
time early next week (which will also pull in a bunch of new and useful
changes to the backend that Henri Sivonen recently checked in upstream).

  --Mike

--

-- 
Michael[tm] Smith
http://people.w3.org/mike

Leif Halvard Silli | 1 Jun 03:46 2011

Re: HTML5 and Unicode Normalization Form C

Michael[tm] Smith, Fri, 27 May 2011 23:42:24 +0900:
> But I think it is useful to
> instead have it emit a warning, and some others I've talked with who are
> more knowledgeable than me about NFC agree, so what I'm going to do is flip
> the validator code to make it emit a warning instead of an error. Then I'll
> update the W3C backends for the HTML5 facet and re-deploy them, by some
> time early next week (which will also pull in a bunch of new and useful
> changes to the backend that Henri Sivonen recently checked in upstream).

The current behaviour makes the validator hide some issues that perhaps 
are more important than the use of decomposed characters in "content":

1) In case of the following document, the validator - erroneously 
- does point out the use of decomposed values, but does *not* point out 
that the two @id attributes are actually equal (because they only differ 
with regard to their normalization), and thus should be considered not 
unique and thus invalid:

  <!DOCTYPE html><title></title><p id="a&#x30a;"><p id="&#xe5;">

 (Related bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12839 )

  It seems more important to point out that the two @id-s have the same 
value than it is to point out that one of them uses decomposed form. 
So, if you decide to have a warning against use of non-NFC, then you 
must take care not to let it "suffocate" the error message that the 
above markup should create.

2) The same can be said about the value of @href - any warning message 
that you choose to show with regard to the use of decomposed values as 
such should not (as it currently does) cause IRI errors/warnings 
to go undisplayed.

This is yet another reason not to show a general error for use of 
non-NFC, and a reason to be careful that the warning you plan does 
not cause other necessary checks to be skipped (or not reported to 
the validator user).
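
A sketch of the validator check being asked for (the function and its 
name are my own illustration, assuming Python's unicodedata):

```python
import unicodedata

def duplicate_ids_after_nfc(ids):
    """Return pairs of ids that collide once NFC-normalized."""
    seen, dups = {}, []
    for i in ids:
        key = unicodedata.normalize("NFC", i)
        if key in seen:
            dups.append((seen[key], i))
        else:
            seen[key] = i
    return dups

# The two ids from the sample document in point 1):
print(duplicate_ids_after_nfc(["a\u030a", "\u00e5"]))  # one colliding pair
```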
--

-- 
leif halvard silli

