Keith Moore | 6 Sep 23:45

Re: Volunteer needed to serve as IANA charset reviewer

I concur with the need to maintain the current charset registry to
support legacy apps that use it.  

And I think Ned would be an excellent choice for reviewer, though it
wouldn't bother me if he could have the assistance of people with
specialized expertise in Asian writing schemes.

As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
specifying Unicode isn't sufficient given the potential for
incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
pessimizes storage and transmission of text written in certain
languages.  IMHO it's unreasonable to exclude the potential for a
Unicode based CES that has more-or-less equivalent information
density across a wide variety of languages.  But I do think that use of
multiple CESs in a new protocol should require substantial
justification, and that UTF-8 should be presumed to be the CES of
choice for any new protocol that requires ASCII compatibility for its
character representation.

Keith

Martin Duerst | 8 Sep 12:02

Re: Volunteer needed to serve as IANA charset reviewer

At 06:45 06/09/07, Keith Moore wrote:
>I concur with the need to maintain the current charset registry to
>support legacy apps that use it.  

I concur with Keith (and it seems almost everybody else) that we
still need a charset registry.

>And I think Ned would be an excellent choice for reviewer, though it
>wouldn't bother me if he could have the assistance of people with
>specialized expertise in Asian writing schemes.

He would certainly have my assistance, for whatever it's worth.

>As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
>specifying Unicode isn't sufficient given the potential for
>incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
>pessimizes storage and transmission of text written in certain
>languages.

True. The most affected languages are not CJK (Chinese, Japanese, Korean),
but all the scripts that have most of their characters beyond
U+0800 but don't need two bytes to encode the particular script,
i.e. all the Indian Scripts, and so on. A serious part of the
overhead is often (but not always) compensated by the fact that
protocol or markup information is usually heavily ascii-biased.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     
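
To put rough numbers on the overhead Martin describes, here is a minimal Python sketch. The sample strings, and the choice of Thai (which has a one-byte TIS-620 codec in Python's standard library) as a stand-in for "scripts beyond U+0800 that don't need two bytes", are assumptions made for illustration and do not come from the thread.

```python
def overhead(text: str, legacy_codec: str) -> float:
    """Ratio of UTF-8 size to the size in a script-specific legacy encoding."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_codec))

# Illustrative sample strings (assumptions, not taken from the thread).
thai = "สวัสดี"          # code points above U+0800; one byte each in TIS-620
japanese = "文字コード"    # two bytes each in Shift_JIS

print("Thai vs TIS-620:      ", overhead(thai, "tis_620"))
print("Japanese vs Shift_JIS:", overhead(japanese, "shift_jis"))

# ASCII-heavy markup dilutes the overhead, because ASCII characters take
# one byte in UTF-8 and in the legacy encoding alike.
page = '<p lang="th">' + thai + "</p>"
print("Thai inside markup:   ", round(overhead(page, "tis_620"), 2))
```

The Thai text triples in size while the Japanese text grows by half, which matches the observation that CJK is not the worst case. The last line shows how surrounding ASCII markup pulls the ratio back down.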

Jefsey_Morfin | 8 Sep 14:45

Re: Volunteer needed to serve as IANA charset reviewer

At 12:02 08/09/2006, Martin Duerst wrote:
>True. The most affected languages are not CJK (Chinese, Japanese,
>Korean), but all the scripts that have most of their characters beyond
>U+0800 but don't need two bytes to encode the particular script,
>i.e. all the Indian Scripts, and so on. A serious part of the
>overhead is often (but not always) compensated by the fact that
>protocol or markup information is usually heavily ascii-biased.

Correct, this is one part of their problem, the other is the
difference between graphemes and characters. However, the first
question is: is the charset registry meant to register existing
charsets, or is the IETF meant to standardise the new charsets and keyboards
that languages need?  Or are you suggesting that new charsets be designed
somewhere else and then registered by the IETF with IANA?
jfc  

Ned Freed | 7 Sep 00:58

Re: Volunteer needed to serve as IANA charset reviewer

> I concur with the need to maintain the current charset registry to
> support legacy apps that use it.

> And I think Ned would be an excellent choice for reviewer, though it
> wouldn't bother me if he could have the assistance of people with
> specialized expertise in Asian writing schemes.

Any such assistance would be hugely welcome. As an aside, it would also be nice
if more people would post comments to the list...

> As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages.  IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages.  But I do think that use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

This is pretty much where I'm at as well. I have no problem with UTF-16 or
UTF-32 if there is a compelling reason to allow them, but I really want to at
least try and close the door to additional CESes to the greatest extent
possible. Of course this is really an issue for the IAB and not the charset
reviewer - thank goodness.

				Ned

Bruce Lilly | 7 Sep 12:33

Re: Volunteer needed to serve as IANA charset reviewer

On Wed September 6 2006 18:58, Ned Freed wrote:
> > I concur with the need to maintain the current charset registry to
> > support legacy apps that use it.
> 
> > And I think Ned would be an excellent choice for reviewer, though it
> > wouldn't bother me if he could have the assistance of people with
> > specialized expertise in Asian writing schemes.
> 
> Any such assistance would be hugely welcome. As an aside, it would also be nice
> if more people would post comments to the list...

OK.  I concur with most of what has already been said by others, specifically
that if a charset (i.e. something meeting the definition of charset) is in
use, it ought to be registered; using the registry as a way to force some
agenda is a very bad idea.  Also that Ned would be an excellent choice for
reviewer, and I would add that I fully support his stated plan to overhaul
the existing registry, which has long been in need of such an overhaul (e.g.
the registration procedure has long said that "ASCII" is disallowed, yet it
is in fact registered as an alias).

A few differences of opinion:
Keith Moore wrote:
> > But I do think that use of
> > multiple CESs in a new protocol should require substantial
> > justification, and that UTF-8 should be presumed to be the CES of
> > choice for any new protocol that requires ASCII compatibility for its
> > character representation.

There may well be areas of application for new protocols which cannot fully
support Unicode, which underlies use of utf-8, due to character set size,
huge tables needed for normalization, etc. (see sections 3.1 (paying particular
attention to "memory-starved microprocessors") and 3.4 of RFC 1958).  Not all
protocols need to fully support utf-8 directly; the highly successful mail
system, for example, supports only a subset of ANSI X3.4 in message header
fields, yet it allows pass-through of utf-8 and other charsets via RFC 2047
mechanisms as amended by RFC 2231 and errata.
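
As a rough illustration of that pass-through, the following Python sketch uses the standard library's `email.header` module to carry a non-ASCII name through an ASCII-only header field as an RFC 2047 encoded-word. The sample name is only an example; the exact encoded form (Q or B encoding) is chosen by the library and is not asserted here.

```python
from email.header import Header, decode_header

name = "Claus Färber"

# Encode for transport: the result contains only ASCII characters.
encoded = Header(name, charset="utf-8").encode()
print(encoded)

# Decode on receipt: recover the raw bytes and the charset label,
# then convert back to text.
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))
```

The header field itself never contains raw utf-8 octets; the charset label travels inside the encoded-word, which is why the registry still matters for mail.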

Ted Hardie wrote:
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.

To be precise, RFC 2277 says:
"   Protocols MUST be able to use the UTF-8 charset, which consists of
   the ISO 10646 coded character set combined with the UTF-8 character
   encoding scheme, as defined in [10646] Annex R (published in
   Amendment 2), for all text.

   Protocols MAY specify, in addition, how to use other charsets or
   other character encoding schemes for ISO 10646, such as UTF-16, but
   lack of an ability to use UTF-8 is a violation of this policy; such a
   violation would need a variance procedure ([BCP9] section 9) with
   clear and solid justification in the protocol specification document
   before being entered into or advanced upon the standards track.

   For existing protocols or protocols that move data from existing
   datastores, support of other charsets, or even using a default other
   than UTF-8, may be a requirement. This is acceptable, but UTF-8
   support MUST be possible.

   When using other charsets than UTF-8, these MUST be registered in the
   IANA charset registry, if necessary by registering them when the
   protocol is published.
"
Several points:
1. "MUST be able to use" is a bit different from "requires" (see the above
   example of the mail system, which is able to use utf-8 by the mechanisms
   noted, but which does not require and in fact cannot directly accommodate
   raw utf-8).
2. The explicitly stated policy of allowing alternative charsets is important.
3. Most important, note that 2277 explicitly requires registration.

> I have no problem with UTF-16 or
> UTF-32 if there is a compelling reason to allow them,

Well neither (as well as their "BE" and "LE" variants) is suitable for use
with MIME text types, which precludes their use in a number of important
applications.  And one thing the charset registry sorely needs is a more
explicit indication of which charsets are/are not suitable for such use
(heck, some registrations have lacked the required statement of
[un]suitability, so even groping through all of the registrations is of
no use (and don't get me started on RFC 1345 issues)).
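
The message does not spell out why the UTF-16 family is unsuitable. One reason usually given is that MIME text types require line breaks to appear as the octet pair CR LF (RFC 2046, section 4.1.1). A small Python sketch, with a made-up sample line and a hypothetical helper name, shows that UTF-16 cannot meet that requirement:

```python
def has_mime_line_break(data: bytes) -> bool:
    """True if the octet sequence CR LF appears in the encoded data."""
    return b"\r\n" in data

line = "Hello, world\r\n"  # illustrative sample text

for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    data = line.encode(codec)
    print(codec, "CRLF octets:", has_mime_line_break(data),
          "NUL octets:", b"\x00" in data)
```

In both UTF-16 byte orders the CR and LF code units are separated by a NUL octet, so software that scans for the CR LF octet pair will not find the line breaks.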

Keld Jørn Simonsen | 7 Sep

Re: Volunteer needed to serve as IANA charset reviewer

On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> On Wed September 6 2006 18:58, Ned Freed wrote:
> > > I concur with the need to maintain the current charset registry to
> > > support legacy apps that use it.
> > 
> > > And I think Ned would be an excellent choice for reviewer, though it
> > > wouldn't bother me if he could have the assistance of people with
> > > specialized expertise in Asian writing schemes.
> > 
> > Any such assistance would be hugely welcome. As an aside, it would also be nice
> > if more people would post comments to the list...
> 
> OK.  I concur with most of what has already been said by others, specifically
> that if a charset (i.e. something meeting the definition of charset) is in
> use, it ought to be registered; using the registry as a way to force some
> agenda is a very bad idea.  Also that Ned would be an excellent choice for
> reviewer, and I would add that I fully support his stated plan to overhaul
> the existing registry, which has long been in need of such an overhaul (e.g.
> the registration procedure has long said that "ASCII" is disallowed, yet it
> is in fact registered as an alias).

There seems to be a problem here, but maybe it should then be the
procedures that are revised, as ASCII is a well-known name for a specific
character set.

I can agree that "ASCII" should not be the recommended name for that
charset.

Best regards
keld

Bruce Lilly | 7 Sep 23:17

Re: Volunteer needed to serve as IANA charset reviewer

[cc's trimmed]
On Thu September 7 2006 11:56, Keld Jørn Simonsen wrote:
> On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> > the registration procedure has long said that "ASCII" is disallowed, yet it
> > is in fact registered as an alias).
> 
> There seems to be a problem here, but maybe it should then be the
> procedures that are revised, as ASCII is a well-known name for a specific
> character set.

Quoting RFC 2046:
"  The character set name "ASCII" is reserved and must not
   be used for any purpose.
"

The same text appeared in RFC 1521 and in RFC 1341, dated 1992.

[corrigenda]  The current registration procedures per se do not
forbid "ASCII", but MIME initially established charset registrations,
and the "for any purpose" certainly seems clear.  Quoting RFC 1341:
"           Several other MIME fields, notably
            including character set names, are likely to have new values
            defined  over time.  In order to ensure that the set of such
            values is  developed  in  an  orderly,  well-specified,  and
            public  manner,  MIME  defines  a registration process which
            uses the Internet Assigned Numbers  Authority  (IANA)  as  a
            central  registry  for  such  values.
"
And RFC 1341 Appendix F section 2 is, as far as I know, the initial
(abbreviated) character set registration procedure.

So at least as far as MIME is concerned, "ASCII" has always been
forbidden; the default and preferred MIME name for ANSI X3.4 is
"US-ASCII".

One problem is that "ASCII" has been [mis]used for things other than
one specific character set and is therefore not unambiguous.

Also, we should distinguish informal usage from registered names
used in protocols.

As with most IANA registries, it would be quite unwise to remove something
once registered.  So I wouldn't want to simply remove "ASCII" leaving no
trace in case there is some archived content which used that alias in spite
of the prohibition against such use.  I would support a mechanism to mark
(clearly, and in the registry) a name as deprecated, along with a
"MUST NOT generate" rule applicable to deprecated names.
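
The "accept but MUST NOT generate" idea can be sketched in a few lines of Python. The data structure, function names, and the single registry entry below are hypothetical and exist only to illustrate the rule; they do not reflect how the IANA registry is actually stored.

```python
# Hypothetical in-memory registry (an assumption for illustration only).
REGISTRY = {
    "US-ASCII": {
        "aliases": {"ANSI_X3.4-1968"},
        "deprecated_aliases": {"ASCII"},
    },
}

def resolve(label: str) -> str:
    """When reading, accept any registered label, deprecated or not."""
    wanted = label.upper()
    for name, entry in REGISTRY.items():
        labels = {name} | entry["aliases"] | entry["deprecated_aliases"]
        if wanted in {item.upper() for item in labels}:
            return name
    raise LookupError(f"unregistered charset label: {label!r}")

def label_for_output(label: str) -> str:
    """When writing, refuse deprecated labels (the MUST NOT generate rule)."""
    wanted = label.upper()
    for entry in REGISTRY.values():
        if wanted in {item.upper() for item in entry["deprecated_aliases"]}:
            raise ValueError(f"{label!r} is deprecated and must not be generated")
    return resolve(label)

print(resolve("ascii"))              # archived content remains readable
print(label_for_output("us-ascii"))  # new content uses the preferred name
```

Keeping the deprecated alias in the table, rather than deleting it, is what lets archived content that used "ASCII" remain interpretable.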

------------------

Another footnote: By noting that Ned would make a fine charset reviewer,
I am not indicating any fault with Paul Hoffman, who is still listed as
charset reviewer on the IANA site (http://www.iana.org/numbers.html#C)
and who has done a fine job as evidenced by his past participation on this
mailing list. 

Claus Färber | 16 Sep 03:44

Re: Volunteer needed to serve as IANA charset reviewer

Bruce Lilly schrieb:
> One problem is that "ASCII" has been [mis]used for things other than
> one specific character set and is therefore not unambiguous.

Maybe it should be possible to register charsets that _are_ ambiguous. Of
course, there should be a warning not to use them if at all possible.
Still, some applications which don't know the charset (converters from
other formats) might make use of them if it's not feasible to detect the
true charset.

"ASCII" could be an alias for "UNKNOWN-7BIT" (or "UNKNOWN-ISO-646", 
"UNKNOWN-ASCII"?)

Other possible ambiguous charsets could be:

"UNKNOWN-8BIT" (already used by some mail transport agents when
MIMEifying messages).
"UNKNOWN-EBCDIC"
"UNKNOWN-UTF16" with alias "UNICODE".
"UNKNOWN-ISO-8859" with alias "ANSI".
"UNKNOWN-IBMPC" with alias "OEM".

Claus

Frank Ellermann | 17 Sep 14:50

Re: Volunteer needed to serve as IANA charset reviewer

Claus Färber wrote:

> "UNKNOWN-8BIT" (already used by some mail transport agents

First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
already registered.

> "UNKNOWN-UTF16"

What's the difference from UTF-16 ?

> with alias "UNICODE".

Ugh, thanks, but no thanks.

> "UNKNOWN-ISO-8859" with alias "ANSI".
> "UNKNOWN-IBMPC" with alias "OEM".

One of those could do, "unknown-ascii-8bit", alias "oem".

Frank

Claus Färber | 1 Oct 20:18

Re: Volunteer needed to serve as IANA charset reviewer

Frank Ellermann schrieb:
> Claus Färber wrote:
>> "UNKNOWN-8BIT" (already used by some mail transport agents
> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
> already registered.

Oops.

>> "UNKNOWN-UTF16"
> What's the difference from UTF-16 ?

UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM, 
RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a fall back.
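
A minimal Python sketch of the fallback rule quoted above: honour a byte order mark if one is present, otherwise assume big-endian. The function name is hypothetical, and the sketch ignores error handling for malformed input.

```python
import codecs

def decode_utf16_with_fallback(data: bytes) -> str:
    """Decode text labelled UTF-16: use the BOM if present, else big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: the SHOULD in RFC 2781, section 4.3, says to read as big-endian.
    return data.decode("utf-16-be")

print(decode_utf16_with_fallback(b"\xfe\xff\x00A"))  # BOM, big-endian
print(decode_utf16_with_fallback(b"\xff\xfeA\x00"))  # BOM, little-endian
print(decode_utf16_with_fallback(b"\x00A"))          # no BOM, fallback
```

A label like the proposed UNKNOWN-UTF16 would correspond to dropping the last branch: without a BOM, the decoder would have no defined byte order to fall back on.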

>> with alias "UNICODE".
> Ugh, thanks, but no thanks.

The idea is to deprecate the label "UNICODE" by tying it to an 
incompletely specified charset.

>> "UNKNOWN-ISO-8859" with alias "ANSI".
>> "UNKNOWN-IBMPC" with alias "OEM".
> 
> One of those could do, "unknown-ascii-8bit", alias "oem".

We already have UNKNOWN-8BIT.

When you convert legacy data, you often DO know that something is in a 
DOSish (IBMPC-based) or Windowsish (ANSI-based) charset. Having charset 
labels to carry this information (instead of the unspecified 
UNKNOWN-8BIT) is a good idea.

Claus

Martin Duerst | 2 Oct 11:38

Re: Volunteer needed to serve as IANA charset reviewer

Ned and I, as newly appointed charset reviewers,
plan to first address pending registrations and, once
they are dealt with, to look at ways to clean up the
registry.

At 03:18 06/10/02, Claus Färber wrote:
>Frank Ellermann schrieb:
>> Claus Färber wrote:
>>> "UNKNOWN-8BIT" (already used by some mail transport agents
>> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
>> already registered.
>
>Oops.

 From a purely personal viewpoint, this one actually occasionally
came in handy.

>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?
>
>UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM,
>RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a fall back.

Has UNKNOWN-UTF-16 been proposed formally, or is this just an
idea floated in an email? As a reviewer, I'd prefer to deal
with "really existing charsets" first.

>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.
>
>The idea is to deprecate the label "UNICODE" by tying it to an
>incompletely specified charset.

Personally, I agree with the idea of deprecating "Unicode".
As a charset reviewer, I think this should be done by just
noting the entry as DEPRECATED or OBSOLETE or some such,
rather than by registering additional aliases.

>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWN-IBMPC" with alias "OEM".
>> One of those could do, "unknown-ascii-8bit", alias "oem".
>
>We already have UNKNOWN-8BIT.
>
>When you convert legacy data, you often DO know that something is in a
>DOSish (IBMPC-based) or Windowsish (ANSI-based) charset. Having charset
>labels to carry this information (instead of the unspecified
>UNKNOWN-8BIT) is a good idea.

To repeat, as a reviewer, I'd prefer to deal with "really existing
charsets" first. We may be able to consider ideas such as these
later, if we look at more and less precise labels for encodings
(e.g. labels to indicate various variants of Shift_JIS).

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     

Mark Davis | 2 Oct 17:22

Re: Volunteer needed to serve as IANA charset reviewer

I'd suggest taking a look at the ICU charset data. This was gathered by calling APIs on different platforms, instead of going by the documentation, which was often false.

http://icu.sourceforge.net/charts/charset/
http://icu.sourceforge.net/charts/charset/roundtripIndex.html

The other thing that needs to be done is establish criteria for identity. If two mappings are identical except that one adds an additional mapping from bytes to Unicode, which gets registered? Both? The subset? The superset?

There are literally hundreds of such cases, so without clarity it doesn't help to propose registrations.
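
One way to make the identity question concrete is to classify the relationship between two byte-to-Unicode mapping tables. The Python sketch below is only an illustration: the function name, the category names, and the tiny sample tables are assumptions and are not taken from the ICU data.

```python
from typing import Mapping

def compare_mappings(a: Mapping[int, str], b: Mapping[int, str]) -> str:
    """Classify how mapping table a relates to mapping table b."""
    shared = a.keys() & b.keys()
    if any(a[key] != b[key] for key in shared):
        return "conflicting"   # same byte, different character
    if a.keys() == b.keys():
        return "identical"
    if a.keys() < b.keys():
        return "subset"        # b only adds mappings
    if a.keys() > b.keys():
        return "superset"      # a only adds mappings
    return "overlapping"       # each has mappings the other lacks

# Illustrative tables: "extended" adds one byte-to-Unicode mapping to "base".
base = {0x41: "A", 0x80: "\u20ac"}
extended = {**base, 0x81: "\u0081"}

print(compare_mappings(base, extended))
print(compare_mappings(extended, base))
print(compare_mappings(base, dict(base)))
```

A classification like this does not answer which table deserves a registration, but it separates the pure subset and superset cases raised above from genuine conflicts.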

Mark


Martin Duerst | 3 Oct 07:17

Re: Volunteer needed to serve as IANA charset reviewer

Hello Mark,

We should definitely start to look at such issues once we have
processed the backlog of requests and have cleaned up some of
the garbage in the current registry.

On the side, I think it would be great if
http://icu.sourceforge.net/charts/charset/roundtripIndex.html
could be split up into some smaller pages. It's really huge.

Regards,    Martin.

At 00:22 06/10/03, Mark Davis wrote:
>I'd suggest taking a look at the ICU charset data. This was gathered by
>calling APIs on different platforms, instead of going by the
>documentation, which was often false.
>
>http://icu.sourceforge.net/charts/charset/
>http://icu.sourceforge.net/charts/charset/roundtripIndex.html
>
>The other thing that needs to be done is establish criteria for identity.
>If two mappings are identical except that one adds an additional mapping
>from bytes to Unicode, which gets registered? Both? The subset? The
>superset?
>
>There are literally hundreds of such cases, so without clarity it doesn't
>help to propose registrations.
>
>Mark

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst <at> it.aoyama.ac.jp     

Frank Ellermann | 2 Oct 02:59

unknown-xyz (was: Volunteer needed to serve as IANA charset reviewer)

Claus Färber wrote:

>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?

> UTF-16 "SHOULD be interpreted as being big-endian" if there's
> no BOM, RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a
> fall back.

Okay, but with a good excuse violating a SHOULD is possible...

>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.

> The idea is to deprecate the label "UNICODE" by tying it to
> an incompletely specified charset.

...sneaky <g>

In reality that boils down to "any even number of octets not
including 0xfeff or 0xfffe", or do I miss something ?  Who
could be interested in that difference from "unknown-8bit" ?

---
>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWN-IBMPC" with alias "OEM".

>> One of those could do, "unknown-ascii-8bit", alias "oem".

> We already have UNKNOWN-8BIT.
> When you convert legacy data, you often DO know that 
> something is in a DOSish (IBMPC-based) or Windowsish
> (ANSI-based) charset. Having charset labels to carry
> this information (instead of the unspecified UNKNOWN-8BIT)
> is a good idea.

Yes, but why the difference, who's supposed to guess what's
what, and who's interested in the dubious outcome of such
guesses ?

If I screw-up what you get is a bogus "Latin-1", and you can
correctly guess that it must be bogus as soon as you find any
C1 octets.  But without human intervention you don't know how
I screwed up, it's windows-1252, pc-multilingual-850+euro, or
worse (cp437, wild mixtures, who knows).
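
The C1 check described above is easy to sketch in Python. The helper name and the sample strings are assumptions for illustration. The check only says that a "Latin-1" label is suspect; as noted, it cannot say which charset was really used.

```python
def has_c1_octets(data: bytes) -> bool:
    """True if any octet falls in the C1 control range 0x80-0x9F."""
    return any(0x80 <= octet <= 0x9F for octet in data)

# Illustrative samples: genuine ISO-8859-1 text, and windows-1252 text
# (where the euro sign is octet 0x80) that might be mislabelled as Latin-1.
genuine = "café".encode("iso-8859-1")
mislabelled = "price: 5 €".encode("cp1252")

print(has_c1_octets(genuine))
print(has_c1_octets(mislabelled))
```

Printable ISO-8859-1 text has no reason to contain octets in that range, so a hit is a strong hint that the label is wrong.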

An "unknown-ascii-8bit" => neither ISO-8859-x nor UTF-8, but
at least MIME compatible (one hopes).

The W3C validator could make use of that "unknown-ascii-8bit",
one error for that (if it's only a guess), but then continue
to report unrelated interesting errors.

Frank
--

-- 
Honk for 4234 to STD

Keld Jørn Simonsen

Re: Volunteer needed to serve as IANA charset reviewer

On Thu, Sep 07, 2006 at 05:17:18PM -0400, Bruce Lilly wrote:
> [cc's trimmed]
> On Thu September 7 2006 11:56, Keld Jørn Simonsen wrote:
> > On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> > > the registration procedure has long said that "ASCII" is disallowed, yet it
> > > is in fact registered as an alias).
> > 
> > There seems to be a problem here, but maybe it should then be the
> > procedures that are revised, as ASCII is a well-known name for a specific
> > character set.
> 
> Quoting RFC 2046:
> "  The character set name "ASCII" is reserved and must not
>    be used for any purpose.
> "

Well, that is fine with me: we can have the name registered but not used
for any purpose in IETF specs. I think this is what we meant by this
statement when we wrote it.

> So at least as far as MIME is concerned, "ASCII" has always been
> forbidden; the default and preferred MIME name for ANSI X3.4 is
> "US-ASCII".

Agree

> One problem is that "ASCII" has been [mis]used for things other than
> one specific character set and is therefore not unambiguous.

Agree

> Also, we should distinguish informal usage from registered names
> used in protocols.
> 
> As with most IANA registries, it would be quite unwise to remove something
> once registered.  So I wouldn't want to simply remove "ASCII" leaving no
> trace in case there is some archived content which used that alias in spite
> of the prohibition against such use.  I would support a mechanism to mark
> (clearly, and in the registry) a name as deprecated, along with a
> "MUST NOT generate" rule applicable to deprecated names.

Also agree with you here.

Best regards
Keld

Tim Bray | 7 Sep 01:27

Re: Volunteer needed to serve as IANA charset reviewer

On Sep 6, 2006, at 2:45 PM, Keith Moore wrote:

> As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs.  And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages.  IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages.  But I do think that use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

Agreed on all counts.  Section 5.1 of RFC3470 (aka BCP70) says smart  
things about this, referencing 2277.  Basically, if you're going to  
use XML, there's probably no point trying to legislate against UTF-16  
since any conformant reader is required to accept it, and in practice  
all known XML software can handle 8859 and Shift-JIS and EUC.   But  
if you're not doing XML, compulsory UTF-8 removes a lot of failure  
points without costing much.

   -Tim
