Ned Freed | 6 Sep 22:30

Re: Volunteer needed to serve as IANA charset reviewer

> (IETF list removed, since this is about to become specialized)

> --On Wednesday, 06 September, 2006 11:04 -0700 Ted Hardie
> <hardie <at> qualcomm.com> wrote:

> > The Applications Area is soliciting volunteers willing to
> > serve as the IANA charset reviewer.  This position entails
> > reviewing charset registrations submitted to IANA in
> > accordance with the procedures set out in RFC 2978.  It
> > requires the reviewer to monitor discussion on the
> > ietf-charsets mailing list (moderating it, if necessary); it
> > also requires that the reviewer interact with the registrants
> > and IANA on the details of the registration.  There is
> > currently a small backlog, and it will be necessary to work to
> > resolve that backlog during the initial period of the
> > appointment.
> >...

> Perhaps the need for a new volunteer in this area is the time to
> ask a broader question:

> At the time 2978 (and its predecessor, 2278) were defined, there
> were a large number of charsets in heavy use and there was some
> general feeling in the implementer community that, despite the
> provisions of RFC 2277, Unicode/ISO 10646 were not quite ready.
> Although we probably still have some distance to go (the issues
> with my net-Unicode draft may be illustrative), I wonder if we
> are reaching the point at which a stronger "use Unicode on the
> wire" recommendation would be in order.   The implications of
> such a recommendation would presumably include a 2978bis that

Mark Davis | 7 Sep 01:44

Re: Volunteer needed to serve as IANA charset reviewer

If the registry provided an unambiguous, stable definition of each charset identifier in terms of an explicit, available mapping to Unicode/10646 (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just a difference in format, not content), it would indeed be useful. However, I suspect quite strongly that it is a futile task. There are a number of problems with the current registry.

1. Poor registrations (minor)
There are some registered charset names that do not comply syntactically with the spec.

2. Incomplete (more important)
There are many charsets (such as some Windows charsets) that are not in the registry, but that are in *far* more widespread use than the majority of the charsets in the registry. Attempted registrations have just been left hanging; cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html

3. Ill-defined registrations (crucial)
  a) There are registered names that have useless (inaccessible or unstable) references; there is no practical way to figure out what the charset definition is.
  b) There are other registrations that are defined by reference to an available chart, but when you actually test what the vendor's APIs map to, they *use* a different definition: for example, the chart may say that 0x80 is undefined, but the API actually maps it to U+0080.
  c) The RFC itself does not settle important issues of identity among charsets. If a new mapping is added to a charset converter, is that a different charset (and thus in need of a different registration) or not? Does that go for any superset? Etc. We've raised these issues before, but with no resolution (or even an attempt at one). Cf. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html
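
The chart-versus-API mismatch in item b) can be seen in miniature within a single runtime. What follows is only a minimal sketch: Python's bundled codecs stand in for a vendor API (an assumption made for illustration, not one of the platforms tested above), and probe() is a hypothetical helper name. It asks what actually happens to a byte that a chart might list as undefined.

```python
# Sketch: what does a runtime actually do with a byte that a chart may
# list as undefined? Python's bundled codecs stand in for a vendor API.

def probe(byte_value, codec):
    """Return the code point the codec maps the byte to, or None if it refuses."""
    try:
        return ord(bytes([byte_value]).decode(codec))
    except UnicodeDecodeError:
        return None

# Python's ISO-8859-1 codec passes 0x80 straight through to U+0080.
latin1_0x80 = probe(0x80, "latin-1")
# Python's cp1252 codec treats 0x81 as undefined and raises instead of mapping it.
cp1252_0x81 = probe(0x81, "cp1252")

print(f"latin-1 0x80 -> {latin1_0x80!r}")
print(f"cp1252  0x81 -> {cp1252_0x81!r}")
```

Other converters are known to map that same 0x81 byte to U+0081 under the same charset name, which is exactly the kind of disagreement a registry entry would need to settle.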

As a product of the above problems, the actual results obtained by using the IANA charset names on any given platform* may vary wildly. For example, among the charsets named in the IANA registry, there were in total over a million mapping differences between Sun's and IBM's Java.

* "platform" speaking broadly -- the results may vary by OS (Mac vs Windows vs Linux...), by programming language (Java), by version of programming language runtime (IBM's vs Sun's Java), or even by product (database version).
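
To make "mapping differences" concrete, here is a rough sketch of how such a count could be taken. It is not the comparison described above: instead of two Java runtimes it diffs two single-byte codecs inside one Python runtime, and the function names are hypothetical.

```python
def mapping_table(codec):
    """Scrape byte -> code point for a single-byte codec (None = unmapped)."""
    table = {}
    for b in range(256):
        try:
            table[b] = ord(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            table[b] = None
    return table

def mapping_differences(codec_a, codec_b):
    """List every byte value on which the two observed mappings disagree."""
    a, b = mapping_table(codec_a), mapping_table(codec_b)
    return [byte for byte in range(256) if a[byte] != b[byte]]

diffs = mapping_differences("latin-1", "cp1252")
print(len(diffs), [hex(d) for d in diffs[:4]])
```

Run over every registered name on two platforms, including the multi-byte charsets, a tally of this kind is how a total in the millions can accumulate.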

In ICU, for example, our requirement was to be able to reproduce the actual, observable character conversions in effect on any platform. With that goal, we basically had to give up trying to use the IANA registry at all. We compose mappings by scraping: calling the APIs on those platforms to do conversions, collecting the results, and providing a different internal identifier for any differing mapping. We then have a separate name mapping that goes from each platform's name (the name according to that platform) for each charset to the unique identifier. Cf. http://icu.sourceforge.net/charts/charset/.
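
The scrape-then-identify approach can be sketched roughly as follows. This is not ICU's code: it scrapes Python's own single-byte codecs, and the "map-" identifier scheme and function names are invented for the example. The idea it illustrates is that the identifier is derived from the observed mapping, so two names share an identifier only if they convert identically.

```python
import hashlib

def scrape(codec):
    """Collect the observed byte -> code point results for a single-byte codec."""
    out = []
    for b in range(256):
        try:
            out.append(ord(bytes([b]).decode(codec)))
        except UnicodeDecodeError:
            out.append(-1)  # the converter refused this byte
    return tuple(out)

def build_name_map(platform_names):
    """Map each platform name to an identifier derived from its observed mapping."""
    name_to_id = {}
    for name in platform_names:
        digest = hashlib.sha256(repr(scrape(name)).encode("ascii")).hexdigest()
        name_to_id[name] = "map-" + digest[:12]  # hypothetical identifier format
    return name_to_id

names = ["latin-1", "iso-8859-1", "cp1252", "iso8859_15"]
name_map = build_name_map(names)
for name, ident in name_map.items():
    print(f"{name:12} -> {ident}")
```

Here "latin-1" and "iso-8859-1" come out with the same identifier because they behave identically, while "cp1252" and "iso8859_15" each get their own.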

And based on work here at Google, it is pretty clear that -- at least in terms of web pages -- little reliance can be placed on the charset information. As imprecise as heuristic charset detection is, it is more accurate than relying on the charset tags in the HTML meta element (and what is in the HTML meta element is more accurate than what is communicated by the HTTP protocol).
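
As a toy illustration of that order of trust (bytes first, then the meta element, then the HTTP header), consider the sketch below. It is emphatically not the detector referred to above: the only heuristic is "non-ASCII bytes that decode cleanly as UTF-8 are probably UTF-8", and choose_charset() and its windows-1252 fallback are assumptions made for the example.

```python
def choose_charset(data, http_charset=None, meta_charset=None):
    """Toy heuristic: trust the bytes first, then <meta>, then the HTTP header."""
    # Non-ASCII bytes rarely form valid UTF-8 by accident in legacy
    # single-byte text, so a clean strict decode is strong evidence.
    try:
        data.decode("utf-8", errors="strict")
        if any(b > 0x7F for b in data):
            return "utf-8"
    except UnicodeDecodeError:
        pass
    for label in (meta_charset, http_charset):
        if label is None:
            continue
        try:
            data.decode(label)
            return label
        except (LookupError, UnicodeDecodeError):
            continue  # unknown label, or the label contradicts the bytes
    return "windows-1252"  # assumed last-resort default for this sketch

page = "caf\u00e9".encode("utf-8")
print(choose_charset(page, http_charset="iso-8859-1", meta_charset="us-ascii"))
```

Both labels on that page are wrong, yet the bytes themselves give the right answer; a label is only consulted when the bytes are inconclusive.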

So while I applaud your goal, I suspect that it would be a huge amount of effort for very little return.

Mark


> I agree that we've reached a point where "use UTF-8" is what we need to be
> pushing for in new protocol development. (Note that I said UTF-8 and not
> Unicode - given the existence of gb18030 [*] I don't regard a recommendation of
> "use Unicode" as even close to sufficient. The last thing we want is to see the
> development of specialized Unicode CESes for Korean, Japanese, Arabic, Hebrew,
> Thai, and who knows what else.) And if the reason there are new charset
> registrations was because of the perceived need to have new charsets for use in
> new protocols, I would be in total agreement that a change in focus for charset
> registration is in order.
>
> But that's not why we're seeing new registrations. The new registrations we're
> seeing are of legacy charsets used in legacy applications and protocols that
> for whatever reason never got registered previously. Given that these things
> are in use in various nooks and crannies around the world, it is critically
> important that when they are used they are labelled accurately and
> consistently.
>
> The plain fact of the matter is that we have done a miserable job of producing
> an accurate and useful charset registry, and considerable work needs to be done
> both to register various missing charsets as well as to clean up the existing
> registry, which contains many errors. I've seen no interest whatsoever in
> registering new charsets for new protocols, so to my mind pushing back on, say,
> the recent registration of iso-8859-11, is an overreaction to a non-problem.
> [**]
>
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.   More options
> > and possibilities for local codings that are not generally known
> > and supported do not help with interoperability; perhaps it is
> > time to start pushing back.
>
> Well, I have to say that to the extent we've pushed back on registrations, what
> we've ended up with is an ad-hoc mess of unregistered usage. I am therefore quite
> skeptical of any belief that pushing back on registrations is a useful tactic.
>
> > And that, of course, would dramatically change the work of the
> > charset reviewer by reducing the volume but increasing the
> > amount of evaluation to be done.
>
> Even if we closed the registry completely there is still a bunch of work to do
> in terms of registry cleanup.
>
> Now, having said all this, I'm willing to take on the role of charset reviewer,
> but with the understanding that one of the things I will do is conduct a
> complete overhaul of the existing registry. [***] Such a substantive change will
> of course require some degree of oversight, which in turn means I'd like to see
> some commitment from the IESG of support for the effort.
>
> As for qualifications, I did write the charset registration specification, and
> I also wrote and continue to maintain a fairly full-featured charset conversion
> library. I can provide more detail if anyone cares.
>
>                                 Ned
>
> [*] - For those not fully up to speed on this stuff, gb18030 can be seen as an
> encoding of Unicode that is backwards compatible with the previous simplified
> Chinese charsets gb2312 and gbk.
>
> [**] - The less recent attempt to register ISO-2022-JP-2004 is a more
> interesting case. I believe this one needed to be pushed on, but not
> because of potential use in new applications or protocols.
>
> [***] - I have the advantage of being close enough to IANA that I can drive
> over there and have F2F meetings should the need arise - and I suspect
> it will.
>
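
The gb18030 footnote quoted above can be checked with a short sketch, assuming Python's bundled gb2312, gbk and gb18030 codecs behave as the registered charsets do (itself the kind of assumption this thread is questioning). The sample strings are arbitrary.

```python
text = "\u4e2d\u6587"        # two common simplified Chinese characters
emoji = "\U0001F600"         # a code point outside the BMP

# Text that gb2312 already covered keeps the same bytes in all three codecs.
legacy_bytes = text.encode("gb2312")
same_in_gbk = text.encode("gbk") == legacy_bytes
same_in_gb18030 = text.encode("gb18030") == legacy_bytes

# gb18030 reaches code points that gbk cannot represent.
try:
    emoji.encode("gbk")
    gbk_can_encode = True
except UnicodeEncodeError:
    gbk_can_encode = False
four_byte_form = emoji.encode("gb18030")
round_trips = four_byte_form.decode("gb18030") == emoji

print(same_in_gbk, same_in_gb18030, gbk_can_encode, len(four_byte_form), round_trips)
```

In that sense gb18030 is a second complete encoding form for Unicode, which is why "use Unicode" alone does not pin down what goes on the wire.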

Mark Davis | 7 Sep 03:47

Re: Volunteer needed to serve as IANA charset reviewer

This appears to have bounced from ietf-charsets <at> iana.org on first send. -- MD


John C Klensin | 7 Sep 15:58

Re: Volunteer needed to serve as IANA charset reviewer

Ned,

Several observations...

The first is that my note was intended as "is it time to review
RFC 2978 and the definition of the charset reviewer job".  Just
a question.  I had no expectation of discontinuing the current
registry, nor any realistic one of banning future registrations.
I think your comments, Mark's, and those of others are
consistent with my goal in asking the question.  What should be
done is another matter -- see below.

Second, while I agree with your concern about GB 18030 and its
ilk, what I learned in trying to put a network-Unicode
definition together (see draft-klensin-net-utf8-01.txt) is that,
for practical use, just specifying "UTF-8" may not be good
enough either.  For example, for at least most purposes other
than pure rendering, one probably wants to specify the
normalization form (ideally a "stable" one(++)) for text going
on the wire, so "Unicode, in Stable NFC, encoded in UTF-8" is
probably the level of specification we are looking for, not
"UTF-8".   I deliberately said "Unicode" in my note, not because
I thought it was adequate, but because I was certain that it
would expose this issue if we got this far.
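
A small sketch of why the normalization form matters on the wire, using Python's unicodedata module. It applies plain NFC rather than the "stable" variant mentioned above, and to_wire() is a hypothetical helper name.

```python
import unicodedata

composed = "caf\u00e9"       # e-acute as one code point (U+00E9)
decomposed = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT (U+0301)

def to_wire(text):
    """Normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

# Both strings render identically, yet their raw UTF-8 bytes differ.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
# After normalization the wire forms are byte-for-byte identical.
wire_equal = to_wire(composed) == to_wire(decomposed)
print(raw_equal, wire_equal)
```

Any protocol that compares or hashes strings byte for byte would treat the two un-normalized forms as different strings, even though both are valid UTF-8.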

If we really need to be pushing toward a specific encoding and
either the required specification of the normalization applied
or, preferably, a specific normalization, then RFC 2978 isn't
our only issue -- we need to review, and possibly reopen RFC
2277 and 3629 and might need to look at some other
specifications.  Realizing this was what caused me to
temporarily put the network-Unicode draft on hold.

I am delighted that you would be willing to take this on -- I
think you have just exactly the right combination of skill and
experience with both character sets and Internet applications
protocols.

Your ability to do the currently-defined job, or a slightly
different one, is largely independent of whether the
specifications for new additions to the registry are what we
should have today.  Clearly, the registry serves the purpose of
reducing the odds of the same name being used, inadvertently, to
describe different things and that is a benefit in itself.  Mark
suggests that the definitions are not sufficiently consistent
and of high quality to be used for anything else.    I think we
need to figure out what we need (does the current quality of
registrations meet your criteria for "accurately and
consistently"?) and then respecify things so that we get it on
future registrations (and maybe can ask IANA to send out requests
for clarification to relevant existing ones).  Certainly your
notion of overhauling the current registry is consistent with
this... it even goes beyond what I had hoped there was energy
for.

You wrote...

> The plain fact of the matter is that we have done a miserable
> job of producing an accurate and useful charset registry, and
> considerable work needs to be done both to register various
> missing charsets as well as to clean up the existing registry,
> which contains many errors. I've seen no interest whatsoever in
> registering new charsets for new protocols, so to my mind
> pushing back on, say, the recent registration of iso-8859-11,
> is an overreaction to a non-problem. [**]

Speaking personally, we are in complete agreement.  

> Well, I have to say that to the extent we've pushed back on
> registrations, what we've ended up with is an ad-hoc mess of
> unregistered usage. I am therefore quite skeptical of any
> belief that pushing back on registrations is a useful tactic.

Also agree, regardless of what my note appeared to say (in the
interest of opening up exactly this discussion).

    john

++ For those who have not been following that particular piece
of work, the Unicode Consortium now has a proposal for "Stable
Normalization Process" under public review (see
http://www.unicode.org/review/pr-95.html).  It differs from the
existing normalization forms by applying additional prohibitions
on unassigned code points and problematic sequences and
originated from discussions about the conditions under which
IDNA and Stringprep could be migrated from Unicode 3.2 to
contemporary versions.  I would encourage those in IETF who are
interested in these issues to review that proposal carefully and
comment on it as appropriate.
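
To give a feel for the "prohibition on unassigned code points" part only, here is a partial sketch. It is not an implementation of the proposal under review: it ignores the problematic-sequence rules entirely, it relies on whatever Unicode version the local Python ships, and the function name is made up.

```python
import unicodedata

def nfc_rejecting_unassigned(text):
    """NFC-normalize, refusing code points in general category Cn.

    Cn means "unassigned" in this Python's Unicode data; it also covers
    noncharacters, so those are refused as well.
    """
    for ch in text:
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return unicodedata.normalize("NFC", text)

print(unicodedata.unidata_version)
print(nfc_rejecting_unassigned("cafe\u0301") == "caf\u00e9")
```

The point of refusing unassigned code points is that their normalization behaviour is not yet fixed, so text containing them cannot be guaranteed to stay normalized under a later Unicode version.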

Jefsey_Morfin | 7 Sep 01:02

Re: Volunteer needed to serve as IANA charset reviewer

At 22:30 06/09/2006, Ned Freed wrote:
>Now, having said all this, I'm willing to take on the role of
>charset reviewer,
>but with the understanding that one of the things I will do is conduct a
>complete overhaul of the existing registry. [***] Such a substantive
>change will
>of course require some degree of oversight, which in turn means I'd
>like to see
>some commitment from the IESG of support for the effort.

+1

>As for qualifications, I did write the charset registration specification, and
>I also wrote and continue to maintain a fairly full-featured charset
>conversion
>library. I can provide more detail if anyone cares.

I care.
thank you.
jfc



Terje Bless | 7 Sep 00:03

Re: Volunteer needed to serve as IANA charset reviewer

[ My apologies for replying to a reply ]

Ned Freed <ned.freed <at> mrochek.com> wrote:

>>I wonder if we are reaching the point at which a stronger "use Unicode on
>>the wire" recommendation would be in order.   The implications of such a
>>recommendation would presumably include a 2978bis that made the requirements
>>for registration of a new charset _much_ tougher, e.g., requiring a
>>demonstration that the then-current version of Unicode cannot do the
>>relevant job and/or evidence that the newly-proposed charset is needed in
>>deployed applications.

The time is, IMO, certainly ripe for pushing UTF-8 much more strongly, but the
place to do so is *not* at IANA -- the registry of assigned names and numbers;
protocol values -- but rather in the development of new specifications.

Not even the Unicode Consortium envisions a mass conversion of all legacy
content into, say, UTF-8. The IANA registry's documentary function is quite
orthogonal to the desire to avoid defining new charsets or mandating or even
just enabling legacy charsets in new specifications.

If a charset exists it should, modulo other factors, be registered with IANA.

-- 
Everytime I write a rhyme these people thinks its a crime
I tell `em what's on my mind. I guess I'm a CRIMINAL!
I don't gotta say a word I just flip `em the bird and keep goin,
I don't take shit from no one. I'm a CRIMINAL!

Keith Moore | 6 Sep 23:45

Re: Volunteer needed to serve as IANA charset reviewer

I concur with the need to maintain the current charset registry to
support legacy apps that use it.  

And I think Ned would be an excellent choice for reviewer, though it
wouldn't bother me if he could have the assistance of people with
specialized expertise in Asian writing schemes.

As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
specifying Unicode isn't sufficient given the potential for
incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
pessimizes storage and transmission of text written in certain
languages.  IMHO it's unreasonable to exclude the potential for a
Unicode based CES that has more-or-less equivalent information
density across a wide variety of languages.  But I do think that use of
multiple CESs in a new protocol should require substantial
justification, and that UTF-8 should be presumed to be the CES of
choice for any new protocol that requires ASCII compatibility for its
character representation.

Keith

Martin Duerst | 8 Sep 12:02

Re: Volunteer needed to serve as IANA charset reviewer

At 06:45 06/09/07, Keith Moore wrote:
>I concur with the need to maintain the current charset registry to
>support legacy apps that use it.  

I concur with Keith (and it seems almost everybody else) that we
still need a charset registry.

>And I think Ned would be an excellent choice for reviewer, though it
>wouldn't bother me if he could have the assistance of people with
>specialized expertise in Asian writing schemes.

He would certainly have my assistance, for whatever it's worth.

>As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
>specifying Unicode isn't sufficient given the potential for
>incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
>pessimizes storage and transmission of text written in certain
>languages.

True. The most affected languages are not CJK (Chinese, Japanese, Korean),
but all the scripts that have most of their characters beyond
U+0800 but don't need two bytes to encode the particular script,
i.e. all the Indian Scripts, and so on. A serious part of the
overhead is often (but not always) compensated by the fact that
protocol or markup information is usually heavily ascii-biased.
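
To put rough numbers on this, here is a minimal Python sketch that counts encoded sizes. The sample strings are illustrative assumptions, not taken from the thread. Devanagari sits at U+0900 to U+097F, so each code point takes three bytes in UTF-8, where a legacy single-byte encoding for the script would take one.

```python
# Compare encoded sizes of short samples in UTF-8 and UTF-16 (no BOM).
# The sample strings are illustrative assumptions only.
SAMPLES = {
    "English": "charset",
    "Hindi (Devanagari)": "\u0935\u0930\u094d\u0923",  # four Devanagari code points
    "Japanese": "\u6587\u5b57",                        # two CJK ideographs
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for name, text in SAMPLES.items():
    chars, utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {chars} code points, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

Wrapping the same samples in ASCII markup would shrink the relative overhead, which is the compensation effect described above.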

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     

Jefsey_Morfin | 8 Sep 14:45

Re: Volunteer needed to serve as IANA charset reviewer

At 12:02 08/09/2006, Martin Duerst wrote:
>True. The most affected languages are not CJK (Chinese, Japanese,
>Korean), but all the scripts that have most of their characters beyond
>U+0800 but don't need two bytes to encode the particular script,
>i.e. all the Indian Scripts, and so on. A serious part of the
>overhead is often (but not always) compensated by the fact that
>protocol or markup information is usually heavily ascii-biased.

Correct, this is one part of their problem; the other is the
difference between graphemes and characters. However, the first
question is: is the charset registry meant to register existing
charsets, or is the IETF meant to standardise the new charsets and
keyboards that languages need?  Or do you mean that new charsets should
be designed somewhere else and then registered by the IETF with IANA?
jfc  

Ned Freed | 7 Sep 00:58

Re: Volunteer needed to serve as IANA charset reviewer

> I concur with the need to maintain the current charset registry to
> support legacy apps that use it.

> And I think Ned would be an excellent choice for reviewer, though it
> wouldn't bother me if he could have the assistance of people with
> specialized expertise in Asian writing schemes.

Any such assistance would be hugely welcome. As an aside, it would also be nice
if more people would post comments to the list...

> As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages.  IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages.  But I do think that use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

This is pretty much where I'm at as well. I have no problem with UTF-16 or
UTF-32 if there is a compelling reason to allow them, but I really want to at
least try and close the door to additional CESes to the greatest extent
possible. Of course this is really an issue for the IAB and not the charset
reviewer - thank goodness.

				Ned

Bruce Lilly | 7 Sep 12:33

Re: Volunteer needed to serve as IANA charset reviewer

On Wed September 6 2006 18:58, Ned Freed wrote:
> > I concur with the need to maintain the current charset registry to
> > support legacy apps that use it.
> 
> > And I think Ned would be an excellent choice for reviewer, though it
> > wouldn't bother me if he could have the assistance of people with
> > specialized expertise in Asian writing schemes.
> 
> Any such assistance would be hugely welcome. As an aside, it would also be nice
> if more people would post comments to the list...

OK.  I concur with most of what has already been said by others, specifically
that if a charset (i.e. something meeting the definition of charset) is in
use, it ought to be registered; using the registry as a way to force some
agenda is a very bad idea.  Also that Ned would be an excellent choice for
reviewer, and I would add that I fully support his stated plan to overhaul
the existing registry, which has long been in need of such an overhaul (e.g.
the registration procedure has long said that "ASCII" is disallowed, yet it
is in fact registered as an alias).

A few differences of opinion:
Keith Moore wrote:
> > But I do think that use of
> > multiple CESs in a new protocol should require substantial
> > justification, and that UTF-8 should be presumed to be the CES of
> > choice for any new protocol that requires ASCII compatibility for its
> > character representation.

There may well be areas of application for new protocols that cannot fully
support Unicode, which underlies utf-8, due to character set size, the huge
tables needed for normalization, etc. (see sections 3.1 (paying particular
attention to "memory-starved microprocessors") and 3.4 of RFC 1958).  Not all
protocols need to fully support utf-8 directly; the highly successful mail
system, for example, supports only a subset of ANSI X3.4 in message header
fields, yet it allows pass-through of utf-8 and other charsets via RFC 2047
mechanisms as amended by RFC 2231 and errata.
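
As a rough illustration of that pass-through, the sketch below uses Python's standard `email.header` module to wrap non-ASCII text in an RFC 2047 encoded-word and recover it again. The helper names `to_encoded_word` and `from_encoded_word` are made up for this example, and the RFC 2231 extensions are not covered.

```python
from email.header import Header, decode_header, make_header


def to_encoded_word(text: str) -> str:
    """Encode text as an RFC 2047 encoded-word labeled utf-8 (ASCII-only result)."""
    return Header(text, charset="utf-8").encode()


def from_encoded_word(value: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded-words."""
    return str(make_header(decode_header(value)))


wire = to_encoded_word("Zo\u00eb")
print(wire)                     # only ASCII octets appear in the header field
print(from_encoded_word(wire))  # the original text comes back
```

The header field itself stays within ASCII; the utf-8 label travels inside the encoded-word, which is why a registered charset name still matters here.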

Ted Hardie wrote:
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.

To be precise, RFC 2277 says:
"   Protocols MUST be able to use the UTF-8 charset, which consists of
   the ISO 10646 coded character set combined with the UTF-8 character
   encoding scheme, as defined in [10646] Annex R (published in
   Amendment 2), for all text.

   Protocols MAY specify, in addition, how to use other charsets or
   other character encoding schemes for ISO 10646, such as UTF-16, but
   lack of an ability to use UTF-8 is a violation of this policy; such a
   violation would need a variance procedure ([BCP9] section 9) with
   clear and solid justification in the protocol specification document
   before being entered into or advanced upon the standards track.

   For existing protocols or protocols that move data from existing
   datastores, support of other charsets, or even using a default other
   than UTF-8, may be a requirement. This is acceptable, but UTF-8
   support MUST be possible.

   When using other charsets than UTF-8, these MUST be registered in the
   IANA charset registry, if necessary by registering them when the
   protocol is published.
"
Several points:
1. "MUST be able to use" is a bit different from "requires" (see the above
   example of the mail system, which is able to use utf-8 by the mechanisms
   noted, but which does not require and in fact cannot directly accommodate
   raw utf-8).
2. The explicitly stated policy of allowing alternative charsets is important.
3. Most important, note that 2277 explicitly requires registration.

> I have no problem with UTF-16 or
> UTF-32 if there is a compelling reason to allow them,

Well neither (as well as their "BE" and "LE" variants) is suitable for use
with MIME text types, which precludes their use in a number of important
applications.  And one thing the charset registry sorely needs is a more
explicit indication of which charsets are/are not suitable for such use
(heck, some registrations have lacked the required statement of
[un]suitability, so even groping through all of the registrations is of
no use (and don't get me started on RFC 1345 issues)).


Re: Volunteer needed to serve as IANA charset reviewer

On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> On Wed September 6 2006 18:58, Ned Freed wrote:
> > > I concur with the need to maintain the current charset registry to
> > > support legacy apps that use it.
> > 
> > > And I think Ned would be an excellent choice for reviewer, though it
> > > wouldn't bother me if he could have the assistance of people with
> > > specialized expertise in Asian writing schemes.
> > 
> > Any such assistance would be hugely welcome. As an aside, it would also be nice
> > if more people would post comments to the list...
> 
> OK.  I concur with most of what has already been said by others, specifically
> that if a charset (i.e. something meeting the definition of charset) is in
> use, it ought to be registered; using the registry as a way to force some
> agenda is a very bad idea.  Also that Ned would be an excellent choice for
> reviewer, and I would add that I fully support his stated plan to overhaul
> the existing registry, which has long been in need of such an overhaul (e.g.
> the registration procedure has long said that "ASCII" is disallowed, yet it
> is in fact registered as an alias).

There seems to be a problem here, but maybe it should then be the
procedures that are revised, as ASCII is a well-known name for a specific
character set.

I can agree that "ASCII" should not be the recommended name for that
charset.

Best regards
keld

Bruce Lilly | 7 Sep 23:17

Re: Volunteer needed to serve as IANA charset reviewer

[cc's trimmed]
On Thu September 7 2006 11:56, Keld Jørn Simonsen wrote:
> On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> > the registration procedure has long said that "ASCII" is disallowed, yet it
> > is in fact registered as an alias).
> 
> There seems to be a problem here, but maybe it should then be the
> procedures that be revised, as ASCII is a well known name for a specific
> character set.

Quoting RFC 2046:
"  The character set name "ASCII" is reserved and must not
   be used for any purpose.
"

The same text appeared in RFC 1521 and in RFC 1341, dated 1992.

[corrigenda]  The current registration procedures per se do not
forbid "ASCII", but MIME initially established charset registrations,
and the "for any purpose" certainly seems clear.  Quoting RFC 1341:
"           Several other MIME fields, notably
            including character set names, are likely to have new values
            defined  over time.  In order to ensure that the set of such
            values is  developed  in  an  orderly,  well-specified,  and
            public  manner,  MIME  defines  a registration process which
            uses the Internet Assigned Numbers  Authority  (IANA)  as  a
            central  registry  for  such  values.
"
And RFC 1341 Appendix F section 2 is, as far as I know, the initial
(abbreviated) character set registration procedure.

So at least as far as MIME is concerned, "ASCII" has always been
forbidden; the default and preferred MIME name for ANSI X3.4 is
"US-ASCII".

One problem is that "ASCII" has been [mis]used for things other than
one specific character set and is therefore not unambiguous.

Also, we should distinguish informal usage from registered names
used in protocols.

As with most IANA registries, it would be quite unwise to remove something
once registered.  So I wouldn't want to simply remove "ASCII" leaving no
trace in case there is some archived content which used that alias in spite
of the prohibition against such use.  I would support a mechanism to mark
(clearly, and in the registry) a name as deprecated, along with a
"MUST NOT generate" rule applicable to deprecated names.

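A minimal sketch of how such a deprecation marker could behave in software. Everything here is hypothetical: the `REGISTRY` table, its contents, and both function names are invented for illustration, and the thread only proposes such a flag. The idea is to accept a deprecated name on input (for archived content) but refuse to emit it.

```python
# Hypothetical registry table: lowercase label -> (preferred name, deprecated?)
REGISTRY = {
    "us-ascii": ("US-ASCII", False),
    "ascii": ("US-ASCII", True),   # assumed alias, marked deprecated
    "utf-8": ("UTF-8", False),
}


def resolve_incoming(label: str) -> str:
    """Accept any registered label, deprecated or not, and return the preferred name."""
    preferred, _deprecated = REGISTRY[label.lower()]
    return preferred


def label_for_output(label: str) -> str:
    """Return the label to generate; refuse deprecated labels (MUST NOT generate)."""
    preferred, deprecated = REGISTRY[label.lower()]
    if deprecated:
        raise ValueError(f"{label!r} is deprecated; generate {preferred!r} instead")
    return preferred


print(resolve_incoming("ASCII"))     # old content still resolves
print(label_for_output("us-ascii"))  # preferred name may be generated
```
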
------------------

Another footnote: By noting that Ned would make a fine charset reviewer,
I am not indicating any fault with Paul Hoffman, who is still listed as
charset reviewer on the IANA site (http://www.iana.org/numbers.html#C)
and who has done a fine job as evidenced by his past participation on this
mailing list. 

Claus Färber | 16 Sep 03:44

Re: Volunteer needed to serve as IANA charset reviewer

Bruce Lilly schrieb:
> One problem is that "ASCII" has been [mis]used for things other than
> one specific character set and is therefore not unambiguous.

Maybe it should be possible to register charsets that _are_ ambiguous. Of
course, there should be a warning not to use them if at all possible.
Still, some applications which don't know the charset (converters from
other formats) might make use of them if it's not feasible to detect the
true charset.

"ASCII" could be an alias for "UNKNOWN-7BIT" (or "UNKNOWN-ISO-646", 
"UNKNOWN-ASCII"?)

Other possible ambiguous charsets could be:

"UNKNOWN-8BIT" (already used by some mail transport agents when 
MIMEifying messages).
"UNKNOWN-EBCDIC"
"UNKNOWN-UTF16" with alias "UNICODE".
"UNKNOWN-ISO-8859" with alias "ANSI".
"UNKNOWN-IBMPC" with alias "OEM".

Claus

Frank Ellermann | 17 Sep 14:50

Re: Volunteer needed to serve as IANA charset reviewer

Claus Färber wrote:

> "UNKNOWN-8BIT" (already used by some mail transport agents

First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
already registered.

> "UNKNOWN-UTF16"

What's the difference from UTF-16 ?

> with alias "UNICODE".

Ugh, thanks, but no thanks.

> "UNKNOWN-ISO-8859" with alias "ANSI".
> "UNKNOWN-IBMPC" with alias "OEM".

One of those could do, "unknown-ascii-8bit", alias "oem".

Frank

Claus Färber | 1 Oct 20:18

Re: Volunteer needed to serve as IANA charset reviewer

Frank Ellermann schrieb:
> Claus Färber wrote:
>> "UNKNOWN-8BIT" (already used by some mail transport agents
> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
> already registered.

Oops.

>> "UNKNOWN-UTF16"
> What's the difference from UTF-16 ?

UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM, 
RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a fall back.
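
The fallback in question can be sketched in a few lines of Python. This is an illustrative decoder for octets labeled "UTF-16", not a statement of how any particular product behaves, and the function name `decode_utf16` is made up.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode octets labeled "UTF-16": honor a BOM, otherwise assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: RFC 2781 section 4.3 says the text SHOULD be read as big-endian.
    return data.decode("utf-16-be")


print(decode_utf16(b"\xfe\xff\x00A"))  # BOM says big-endian
print(decode_utf16(b"\xff\xfeA\x00"))  # BOM says little-endian
print(decode_utf16(b"\x00A"))          # no BOM, big-endian fallback
```

A label without that fallback would leave the last branch undefined, so the receiver would have to guess the byte order from the content.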

>> with alias "UNICODE".
> Ugh, thanks, but no thanks.

The idea is to deprecate the label "UNICODE" by tying it to an 
incompletely specified charset.

>> "UNKNOWN-ISO-8859" with alias "ANSI".
>> "UNKNOWN-IBMPC" with alias "OEM".
> 
> One of those could do, "unknown-ascii-8bit", alias "oem".

We already have UNKNOWN-8BIT.

When you convert legacy data, you often DO know that something is in a 
DOSish (IBMPC-based) or Windowsish (ANSI-based) charset. Having charset 
labels to carry this information (instead of the unspecified 
UNKNOWN-8BIT) is a good idea.

Claus

Martin Duerst | 2 Oct 11:38

Re: Volunteer needed to serve as IANA charset reviewer

Ned and I, as newly appointed charset reviewers,
plan to first address pending registrations and, once
they are dealt with, to look at ways to clean up the
registry.

At 03:18 06/10/02, Claus Färber wrote:
>Frank Ellermann schrieb:
>> Claus Färber wrote:
>>> "UNKNOWN-8BIT" (already used by some mail transport agents
>> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
>> already registered.
>
>Oops.

From a purely personal viewpoint, this one actually occasionally
came in handy.

>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?
>
>UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM, RFC 2781, 4.3. UNKNOWN-UTF16 would
>not have such a fall back.

Has UNKNOWN-UTF16 been proposed formally, or is this just an
idea floated in an email? As a reviewer, I'd prefer to deal
with "really existing charsets" first.

>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.
>
>The idea is to deprecate the label "UNICODE" by tying it to an incompletely specified charset.

Personally, I agree with the idea of deprecating "Unicode".
As a charset reviewer, I think this should be done by just
noting the entry as DEPRECATED or OBSOLETE or some such,
rather than by registering additional aliases.

>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWN-IBMPC" with alias "OEM".
>> One of those could do, "unknown-ascii-8bit", alias "oem".
>
>We already have UNKNOWN-8BIT.
>
>When you convert legacy data, you often DO know that something is in a DOSish (IBMPC-based) or Windowsish
>(ANSI-based) charset. Having charset labels to carry this information (instead of the unspecified
>UNKNOWN-8BIT) is a good idea.

To repeat, as a reviewer, I'd prefer to deal with "really existing
charsets" first. We may be able to consider ideas such as these
later, if we look at more and less precise labels for encodings
(e.g. labels to indicate various variants of Shift_JIS).

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     

Mark Davis | 2 Oct 17:22

Re: Volunteer needed to serve as IANA charset reviewer

I'd suggest taking a look at the ICU charset data. This was gathered by calling APIs on different platforms, instead of going by the documentation, which was often false.

http://icu.sourceforge.net/charts/charset/
http://icu.sourceforge.net/charts/charset/roundtripIndex.html

The other thing that needs to be done is establish criteria for identity. If two mappings are identical except that one adds an additional mapping from bytes to Unicode, which gets registered? Both? The subset? The superset?

There are literally hundreds of such cases, so without clarity it doesn't help to propose registrations.
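
To make the identity question concrete, here is a small Python sketch that classifies how two byte-to-Unicode tables relate. The tables and the category names are hypothetical, invented for this example; real mapping data (such as the ICU charts) is far larger.

```python
def relation(a: dict, b: dict) -> str:
    """Classify table a relative to table b (both map byte values to characters)."""
    shared = a.keys() & b.keys()
    if any(a[k] != b[k] for k in shared):
        return "conflicting"   # same byte, different character
    if a.keys() == b.keys():
        return "identical"
    if a.keys() < b.keys():
        return "subset"        # b only adds mappings
    if a.keys() > b.keys():
        return "superset"      # a only adds mappings
    return "overlapping"       # each defines bytes the other lacks


# Tiny hypothetical tables, for illustration only.
base = {0x41: "A", 0x42: "B"}
extended = {**base, 0x80: "\u20ac"}
variant = {**base, 0x80: "\u0080"}

print(relation(base, extended))     # subset
print(relation(extended, variant))  # conflicting
```

Whether "subset" and "superset" pairs deserve one registration or two is exactly the policy question; the code can only report which case applies.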

Mark


Martin Duerst | 3 Oct 07:17

Re: Volunteer needed to serve as IANA charset reviewer

Hello Mark,

We should definitely start to look at such issues once we have
processed the backlog of requests and have cleaned up some of
the garbage in the current registry.

On the side, I think it would be great if
<http://icu.sourceforge.net/charts/charset/roundtripIndex.html>
could be split up into some smaller pages. It's really huge.

Regards,    Martin.

At 00:22 06/10/03, Mark Davis wrote:
>I'd suggest taking a look at the ICU charset data. This was gathered by calling APIs on different
>platforms, instead of going by the documentation, which was often false.
>
>http://icu.sourceforge.net/charts/charset/
>http://icu.sourceforge.net/charts/charset/roundtripIndex.html
>
>The other thing that needs to be done is establish criteria for identity. If two mappings are identical
>except that one adds an additional mapping from bytes to Unicode, which gets registered? Both? The
>subset? The superset?
>
>There are literally hundreds of such cases, so without clarity it doesn't help to propose registrations.
>
>Mark

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     

Frank Ellermann | 2 Oct 02:59

unknown-xyz (was: Volunteer needed to serve as IANA charset reviewer)

Claus Färber wrote:

>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?

> UTF-16 "SHOULD be interpreted as being big-endian" if there's
> no BOM, RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a
> fall back.

Okay, but with a good excuse violating a SHOULD is possible...

>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.

> The idea is to deprecate the label "UNICODE" by tying it to
> an incompletely specified charset.

...sneaky <g>

In reality that boils down to "any even number of octets not
including 0xfeff or 0xfffe", or am I missing something ?  Who
could be interested in that difference from "unknown-8bit" ?

---
>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWN-IBMPC" with alias "OEM".

>> One of those could do, "unknown-ascii-8bit", alias "oem".

> We already have UNKNOWN-8BIT.
> When you convert legacy data, you often DO know that 
> something is in a DOSish (IBMPC-based) or Windowsish
> (ANSI-based) charset. Having charset labels to carry
> this information (instead of the unspecified UNKNOWN-8BIT)
> is a good idea.

Yes, but why the difference, who's supposed to guess what's
what, and who's interested in the dubious outcome of such
guesses ?

If I screw up, what you get is a bogus "Latin-1", and you can
correctly guess that it must be bogus as soon as you find any
C1 octets.  But without human intervention you don't know how
I screwed up: it could be windows-1252, pc-multilingual-850+euro,
or worse (cp437, wild mixtures, who knows).
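
The C1 check is easy to sketch. ISO-8859-1 assigns octets 0x80 to 0x9F to C1 control codes, which rarely occur in real text, so finding them in data labeled Latin-1 is a strong hint of mislabeling. The function names below are made up for illustration, and the check says nothing about which charset was actually used.

```python
def c1_octets(data: bytes) -> list[int]:
    """Return the offsets of octets in the C1 range 0x80-0x9F."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


def plausible_latin1(data: bytes) -> bool:
    """True if the data contains no C1 octets, i.e. the Latin-1 label may be honest."""
    return not c1_octets(data)


print(plausible_latin1(b"caf\xe9"))       # True: 0xE9 is a normal Latin-1 letter
print(plausible_latin1(b"\x93quoted\x94"))  # False: 0x93 and 0x94 are C1 octets
```
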

An "unknown-ascii-8bit" => neither ISO-8859-x nor UTF-8, but
at least MIME compatible (one hopes).

The W3C validator could make use of that "unknown-ascii-8bit",
one error for that (if it's only a guess), but then continue
to report unrelated interesting errors.

Frank
--

-- 
Honk for 4234 to STD


Re: Volunteer needed to serve as IANA charset reviewer

On Thu, Sep 07, 2006 at 05:17:18PM -0400, Bruce Lilly wrote:
> [cc's trimmed]
> On Thu September 7 2006 11:56, Keld Jørn Simonsen wrote:
> > On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> > > the registration procedure has long said that "ASCII" is disallowed, yet it
> > > is in fact registered as an alias).
> > 
> > There seems to be a problem here, but maybe it should then be the
> > procedures that be revised, as ASCII is a well known name for a specific
> > character set.
> 
> Quoting RFC 2046:
> "  The character set name "ASCII" is reserved and must not
>    be used for any purpose.
> "

Well, that is fine with me; we can have the name registered but not used
for any purpose in IETF specs. I think this is what we meant by this
statement when we wrote it.

> So at least as far as MIME is concerned, "ASCII" has always been
> forbidden; the default and preferred MIME name for ANSI X3.4 is
> "US-ASCII".

Agree

> One problem is that "ASCII" has been [mis]used for things other than
> one specific character set and is therefore not unambiguous.

Agree

> Also, we should distinguish informal usage from registered names
> used in protocols.
> 
> As with most IANA registries, it would be quite unwise to remove something
> once registered.  So I wouldn't want to simply remove "ASCII" leaving no
> trace in case there is some archived content which used that alias in spite
> of the prohibition against such use.  I would support a mechanism to mark
> (clearly, and in the registry) a name as deprecated, along with a
> "MUST NOT generate" rule applicable to deprecated names.

Also agree with you here.

Best regards
Keld

Tim Bray | 7 Sep 01:27

Re: Volunteer needed to serve as IANA charset reviewer

On Sep 6, 2006, at 2:45 PM, Keith Moore wrote:

> As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages.  IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages.  But I do think that  
> use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

Agreed on all counts.  Section 5.1 of RFC3470 (aka BCP70) says smart  
things about this, referencing 2277.  Basically, if you're going to  
use XML, there's probably no point trying to legislate against UTF-16  
since any conformant reader is required to accept it, and in practice  
all known XML software can handle 8859 and Shift-JIS and EUC.   But  
if you're not doing XML, compulsory UTF-8 removes a lot of failure  
points without costing much.

   -Tim

John C Klensin | 7 Sep 15:58

Re: Volunteer needed to serve as IANA charset reviewer

Ned,

Several observations...

The first is that my note was intended as "is it time to review
RFC 2978 and the definition of the charset reviewer job".  Just
a question.  I had no expectation of discontinuing the current
registry, nor any realistic one of banning future registrations.
I think your comments, Mark's, and those of others are
consistent with my goal in asking the question.  What should be
done is another matter -- see below.

Second, while I agree with your concern about GB 18030 and its
ilk, what I learned in trying to put a network-Unicode
definition together (see draft-klensin-net-utf8-01.txt) is that,
for practical use, just specifying "UTF-8" may not be good
enough either.  For example, for at least most purposes other
than pure rendering, one probably wants to specify the
normalization form (ideally a "stable" one(++)) for text going
on the wire, so "Unicode, in Stable NFC, encoded in UTF-8" is
probably the level of specification we are looking for, not
"UTF-8".   I deliberately said "Unicode" in my note, not because
I thought it was adequate, but because I was certain that it
would expose this issue if we got this far.
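
A short Python sketch of why "UTF-8" alone underspecifies the wire form. It uses the standard library's plain NFC, not the "Stable" variant mentioned above (which adds further restrictions), and the function name `to_wire` is made up for this example.

```python
import unicodedata


def to_wire(text: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


composed = "\u00e9"     # e-acute as one precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Both are valid UTF-8 for the same visible text, yet the octets differ.
print(composed.encode("utf-8"), decomposed.encode("utf-8"))

# After normalization the two spellings give identical octets.
print(to_wire(composed) == to_wire(decomposed))
```

Without an agreed normalization form, two conforming senders can emit different octets for the same text, and byte-wise comparison at the receiver fails.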

If we really need to be pushing toward a specific encoding and
either the required specification of the normalization applied
or, preferably, a specific normalization, then RFC 2978 isn't
our only issue -- we need to review, and possibly reopen, RFC
2277 and 3629 and might need to look at some other
specifications.  Realizing this was what caused me to
temporarily put the network-Unicode draft on hold.

I am delighted that you would be willing to take this on -- I
think you have just exactly the right combination of skill and
experience with both character sets and Internet applications
protocols.

Your ability to do the currently-defined job, or a slightly
different one, is largely independent of whether the
specifications for new additions to the registry are what we
should have today.  Clearly, the registry serves the purpose of
reducing the odds of the same name being used, inadvertently, to
describe different things and that is a benefit in itself.  Mark
suggests that the definitions are not sufficiently consistent
and of high quality to be used for anything else.    I think we
need to figure out what we need (does the current quality of
registrations meet your criteria for "accurately and
consistently"?) and then respecify things so that we get it on
future reservations (and maybe can ask IANA to send out requests
for clarification to relevant existing ones).  Certainly your
notion of overhauling the current registry is consistent with
this... it even goes beyond what I had hoped there were energy
for.

You wrote...

> The plain fact of the matter is that we have done a miserable
> job of producing an accurate and useful charset registry, and
> considerable work needs to be done both to register various
> missing charsets as well as to clean up the existing registry,
> which contains many errors. I've seen no interest whatsoever in
> registering new charsets for new protocols, so to my mind
> pushing back on, say, the recent registration of iso-8859-11,
> is an overreaction to a non-problem. [**]

Speaking personally, we are in complete agreement.  

> Well, I have to say that to the extent we've pushed back on
> registrations, what we've ended up with is an ad-hoc mess of
> unregistered usage. I am therefore quite skeptical of any
> belief that pushing back on registrations is a useful tactic.

Also agree, regardless of what my note appeared to say (in the
interest of opening up exactly this discussion).

    john

++ For those who have not been following that particular piece
of work, the Unicode Consortium now has a proposal for "Stable
Normalization Process" under public review (see
http://www.unicode.org/review/pr-95.html).  It differs from the
existing normalization forms by applying additional prohibitions
on unassigned code points and problematic sequences and
originated from discussions about the conditions under which
IDNA and Stringprep could be migrated from Unicode 3.2 to
contemporary versions.  I would encourage those in IETF who are
interested in these issues to review that proposal carefully and
comment on it as appropriate.

Mark Davis | 7 Sep 01:44

Re: Volunteer needed to serve as IANA charset reviewer

If the registry provided an unambiguous, stable definition of each charset identifier in terms of an explicit, available mapping to Unicode/10646 (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just a difference in format, not content), it would indeed be useful. However, I suspect quite strongly that it is a futile task. There are a number of problems with the current registry.

1. Poor registrations (minor)
There are some registered charset names that are not syntactically compliant with the spec.

2. Incomplete (more important)
There are many charsets (such as some Windows charsets) that are not in the registry, but that are in *far* more widespread use than the majority of the charsets in the registry. Attempted registrations have just been left hanging; cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html

3. Ill-defined registrations (crucial)
  a) There are registered names that have useless (inaccessible or unstable) references; there is no practical way to figure out what the charset definition is.
  b) There are other registrations that are defined by reference to an available chart, but when you test what the vendor's APIs map to, they actually *use* a different definition: for example, the chart may say that 0x80 is undefined, but the API actually maps it to U+0080.
  c) The RFC itself does not settle important issues of identity among charsets. If a new mapping is added to a charset converter, is that a different charset (and thus needs a different registration) or not? Does that go for any superset? etc. We've raised these issues before, but with no resolution (or even an attempt at one). Cf. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html
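
Point (b) can be illustrated with a minimal Python sketch. It is an illustration only: it uses Python's bundled codecs as stand-ins, not any of the vendor APIs discussed here, and the helper name `decode_byte` is made up for this example. Python's cp1252 table leaves byte 0x81 undefined, while latin-1 maps every byte to the code point with the same value, so the result for one byte depends on which table the converter really uses.

```python
def decode_byte(byte_value, codec):
    """Return the code point a codec assigns to one byte, or None if undefined."""
    try:
        return ord(bytes([byte_value]).decode(codec))
    except UnicodeDecodeError:
        return None  # this converter has no mapping for the byte


# Compare what two converters actually do for the same two bytes.
for codec in ("latin-1", "cp1252"):
    for byte_value in (0x80, 0x81):
        cp = decode_byte(byte_value, codec)
        shown = "undefined" if cp is None else "U+%04X" % cp
        print("%-8s 0x%02X -> %s" % (codec, byte_value, shown))
```

A chart alone would not tell you which of these behaviours a given implementation has; only testing the converter does.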

As a product of the above problems, the actual results obtained by using the IANA charset names on any given platform* may vary wildly. For example, among the IANA-registry-named charsets, there were over a million mapping differences in total between Sun's and IBM's Java.

* "platform" speaking broadly -- the results may vary by OS (Mac vs Windows vs Linux...), by programming language (Java), by version of programming language runtime (IBM's vs Sun's Java), or even by product (database version).
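
As a rough sketch of how such mapping differences can be counted, the Python below compares two converters byte by byte. It assumes single-byte charsets only and uses Python's own codecs in place of the Sun and IBM converters; the function names are illustrative, not taken from any tool mentioned in this thread.

```python
def mapping_table(codec):
    """Collect what a converter does with each of the 256 single-byte values."""
    table = []
    for byte_value in range(256):
        try:
            table.append(bytes([byte_value]).decode(codec))
        except UnicodeDecodeError:
            table.append(None)  # undefined in this converter
    return table


def count_differences(codec_a, codec_b):
    """Count byte values that the two converters map differently."""
    table_a = mapping_table(codec_a)
    table_b = mapping_table(codec_b)
    return sum(1 for a, b in zip(table_a, table_b) if a != b)


print(count_differences("iso8859-1", "iso8859-15"))
print(count_differences("latin-1", "cp1252"))
```

Multi-byte charsets would need the same comparison over every valid byte sequence, which is where totals in the millions become plausible.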

In ICU, for example, our requirement was to be able to reproduce the actual, observable character conversions in effect on any platform. With that goal, we basically had to give up trying to use the IANA registry at all. We compose mappings by scraping: calling the APIs on those platforms to do conversions, collecting the results, and providing a different internal identifier for any differing mapping. We then have a separate name mapping that goes from each platform's name for each charset (the name according to that platform) to the unique identifier. Cf. http://icu.sourceforge.net/charts/charset/.
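
The scraping approach described above might be sketched as follows. This is not ICU's code or data format: the platform labels, the hash-based identifier, and the helper names are all assumptions made for illustration, and Python's codecs stand in for the platform converters.

```python
import hashlib


def observe(codec):
    """Stand-in for scraping: ask a converter what each byte maps to."""
    table = []
    for byte_value in range(256):
        try:
            table.append(ord(bytes([byte_value]).decode(codec)))
        except UnicodeDecodeError:
            table.append(None)  # byte has no mapping in this converter
    return tuple(table)


def mapping_id(table):
    """Derive an identifier from the observed mapping, not from its name."""
    canonical = ",".join("-" if cp is None else format(cp, "04X") for cp in table)
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()[:12]


# Hypothetical (platform, platform-specific name) pairs.
scraped = {
    ("platform-A", "ISO-8859-1"): observe("iso8859-1"),
    ("platform-B", "latin1"): observe("latin-1"),
    ("platform-B", "windows-1252"): observe("cp1252"),
}

# Separate name mapping: each platform's name points at a mapping identifier.
name_map = {key: mapping_id(table) for key, table in scraped.items()}

for key, ident in sorted(name_map.items()):
    print(key, ident)
```

Two names share an identifier only when the observed conversions are identical, so a name by itself never decides what a charset means.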

And based on work here at Google, it is pretty clear that -- at least in terms of web pages -- little reliance can be placed on the charset information. As imprecise as heuristic charset detection is, it is more accurate than relying on the charset tags in the HTML meta element (and what is in the HTML meta element is more accurate than what is communicated by the HTTP protocol).
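
The ordering of trust described above (heuristics first, then the meta element, then the HTTP header) could be sketched like this in Python. The heuristic here is deliberately trivial (non-ASCII bytes that form valid UTF-8 are treated as UTF-8) and is an assumption of this sketch, as are the function names and the `windows-1252` fallback; a real detector is far more involved.

```python
def decodes_cleanly(body, label):
    """True if the label names a known codec and the bytes decode without error."""
    try:
        body.decode(label)
        return True
    except (LookupError, UnicodeDecodeError):
        return False


def choose_charset(body, meta_charset=None, http_charset=None,
                   fallback="windows-1252"):
    """Pick a charset, trusting the bytes over the meta label over the HTTP label."""
    # Heuristic step: only meaningful when the body has non-ASCII bytes.
    if any(b >= 0x80 for b in body) and decodes_cleanly(body, "utf-8"):
        return "utf-8"
    # Declared labels, in the order of reliability claimed above.
    for label in (meta_charset, http_charset):
        if label and decodes_cleanly(body, label):
            return label
    return fallback


page = "caf\u00e9".encode("utf-8")
print(choose_charset(page, meta_charset="iso-8859-1", http_charset="iso-8859-1"))
```

Here both declared labels say iso-8859-1, yet the bytes are valid UTF-8, so the sketch overrides the labels.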

So while I applaud your goal, I would suspect that it would be a huge amount of effort for very little return.

Mark


> I agree that we've reached a point where "use UTF-8" is what we need to be
> pushing for in new protocol development. (Note that I said UTF-8 and not
> Unicode - given the existance of gb18030 [*] I don't regard a recommendation of
> "use Unicode" as even close to sufficient. The last thing we want is to see the
> development of specializesd Unicode CESes for Korean, Japanese, Arabic, Hebrew,
> Thai, and who knows what else.) And if the new charset registrations were
> driven by a perceived need to have new charsets for use in
> new protocols, I would be in total agreement that a change in focus for charset
> registration is in order.
>
> But that's not why we're seeing new registrations. The new registrations we're
> seeing are of legacy charsets used in legacy applications and protocols that
> for whatever reason never got registered previously. Given that these things
> are in use in various nooks and crannies around the world, it is critically
> important that when they are used they are labelled accurately and
> consistently.
>
> The plain fact of the matter is that we have done a miserable job of producing
> an accurate and useful charset registry, and considerable work needs to be done
> both to register various missing charsets as well as to clean up the existing
> registry, which contains many errors. I've seen no interest whatsoever in
> registering new charsets for new protocols, so to my mind pushing back on, say,
> the recent registration of iso-8859-11, is an overreaction to a non-problem.
> [**]
>
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.   More options
> > and possibilities for local codings that are not generally known
> > and supported do not help with interoperability; perhaps it is
> > time to start pushing back.
>
> Well, I have to say that to the extent we've pushed back on registrations, what
> we've ended up with is an ad-hoc mess of unregistered usage. I am therefore quite
> skeptical of any belief that pushing back on registrations is a useful tactic.
>
> > And that, of course, would dramatically change the work of the
> > charset reviewer by reducing the volume but increasing the
> > amount of evaluation to be done.
>
> Even if we closed the registry completely there is still a bunch of work to do
> in terms of registry cleanup.
>
> Now, having said all this, I'm willing to take on the role of charset reviewer,
> but with the understanding that one of the things I will do is conduct a
> complete overhaul of the existing registry. [***] Such a substantive change will
> of course require some degree of oversight, which in turn means I'd like to see
> some commitment from the IESG of support for the effort.
>
> As for qualifications, I did write the charset registration specification, and
> I also wrote and continue to maintain a fairly full-featured charset conversion
> library. I can provide more detail if anyone cares.
>
>                                 Ned
>
> [*] - For those not fully up to speed on this stuff, gb18030 can be seen as an
> encoding of Unicode that is backwards compatible with the previous simplified
> Chinese charsets gb2312 and gbk.
>
> [**] - The less recent attempt to register ISO-2022-JP-2004 is a more
> interesting case. I believe this one needed to be pushed on, but not
> because of potential use in new applications or protocols.
>
> [***] - I have the advantage of being close enough to IANA that I can drive
> over there and have F2F meetings should the need arise - and I suspect
> it will.
>
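
The gb18030 footnote quoted above can be checked with a short Python sketch. It relies only on Python's bundled `gb2312`, `gbk`, and `gb18030` codecs; the helper names and sample strings are chosen for this illustration. Text that gb2312 can encode yields the same bytes in all three charsets, while gb18030 also covers code points that the older charsets cannot represent.

```python
def same_bytes_everywhere(text):
    """True if gb2312, gbk and gb18030 all produce identical bytes for text."""
    legacy = text.encode("gb2312")
    return text.encode("gbk") == legacy and text.encode("gb18030") == legacy


def can_encode(text, codec):
    """True if the codec has a representation for every character in text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "\u4e2d\u6587"   # two common simplified Chinese characters
beyond_bmp = "\U0001F600"     # a code point outside the Basic Multilingual Plane

print(same_bytes_everywhere(simplified))
print(can_encode(beyond_bmp, "gbk"), can_encode(beyond_bmp, "gb18030"))
```

This is what makes gb18030 a full encoding of Unicode rather than just another legacy charset, which is the reason "use Unicode" alone does not pin down the bytes on the wire.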

