Shawn Steele | 9 Jul 2008 23:38

Body parts

(That subject doesn't sound quite right out of context :)

I was looking at the drafts and trying to find a recommendation for the body part.  I see the requirement for 8
bit body parts, however I would like to see that the body SHOULD be encoded in UTF-8 for UTF8SMTP.  Is there
any such language and I just missed it?

Thanks,

- Shawn
Charles Lindsey | 10 Jul 2008 11:29

Re: Body parts

On Wed, 09 Jul 2008 22:38:52 +0100, Shawn Steele  
<Shawn.Steele <at> microsoft.com> wrote:

> (That subject doesn't sound quite right out of context :)
>
> I was looking at the drafts and trying to find a recommendation for the  
> body part.  I see the requirement for 8 bit body parts, however I would  
> like to see that the body SHOULD be encoded in UTF-8 for UTF8SMTP.  Is  
> there any such language and I just missed it?

Actually, there may be good reasons for using other charsets in body  
parts. Charsets in the higher reaches of Unicode require rather long  
strings of bytes per character in UTF-8. That is not a serious issue for  
headers, but body parts might well be significantly shorter in charsets  
more suited to the language in use.

Can anyone provide data on the efficiency of UTF-8 and BIG5, for example?

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131                       Web: http://www.cs.man.ac.uk/~chl
Email: chl <at> clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
John C Klensin | 10 Jul 2008 20:03

Re: Body parts


--On Thursday, 10 July, 2008 10:29 +0100 Charles Lindsey
<chl <at> clerew.man.ac.uk> wrote:

> Actually, there may be good reasons for using other charsets
> in body parts. Charsets in the higher reaches of Unicode
> require rather long strings of bytes per character in UTF-8.
> That is not a serious issue for headers, but body parts might
> well be significantly shorter in charsets more suited to the
> language in use.

While that clearly would have been a very big deal in the slow
and expensive Internet connections of, say, 20-odd years ago, I
imagine we could have an endless, and pointless, debate about
this subject in today's environment.

> Can anyone provide data on the efficiency of UTF-8 and BIG5,
> for example?

I'm sure someone will correct me if I get this wrong but, as I
understand it, Big5 is a strictly two-byte (16 bit) character
set.  UTF-8 goes to three octets above U+07FF and the first CJK
character is U+3000 (that character and the ones close to it are
invalid for IDNA, but that is irrelevant here) and stays at
three octets until one gets out of plane 0.
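The octet boundaries described above can be checked with a short Python sketch. The sample code points and the helper name `utf8_len` are illustrative choices, not something taken from the drafts or from this thread.

```python
# Illustrative sketch: UTF-8 length for a few sample code points.
# The samples are arbitrary picks around the boundaries mentioned above.
samples = {
    "U+0041 (ASCII letter)": "\u0041",
    "U+07FF (last two-octet code point)": "\u07ff",
    "U+0800 (first three-octet code point)": "\u0800",
    "U+3000 (IDEOGRAPHIC SPACE)": "\u3000",
    "U+4E2D (a common CJK ideograph)": "\u4e2d",
    "U+FFFD (near the top of plane 0)": "\ufffd",
    "U+20000 (first code point of plane 2)": "\U00020000",
}


def utf8_len(ch: str) -> int:
    """Return the number of octets needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} octets")
```

Everything from U+0800 to the end of plane 0 comes out at three octets, and the plane 2 sample at four.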

So, one answer to your question is that for CJK characters that
are coded in the Unicode BMP (Plane 0), the storage ratio is
going to be strictly two octets of Big5 to three octets of
UTF-8.  If one starts mixing in characters from Plane 2, then

Martin Duerst | 11 Jul 2008 04:33

Re: Body parts

At 03:03 08/07/11, John C Klensin wrote:

>So, one answer to your question is that for CJK characters that
>are coded in the Unicode BMP (Plane 0), the storage ratio is
>going to be strictly two octets of Big5 to three octets of
>UTF-8.

Yes. That's the basic ratio.
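For anyone who wants to measure this on their own text, a minimal Python sketch follows. It assumes Python's built-in `big5` codec is an acceptable stand-in for whatever Big5 variant a mail client would use; the sample strings and the helper name `sizes` are made up for illustration.

```python
# Illustrative sketch: compare encoded sizes of the same text in Big5 and UTF-8.
# Uses Python's standard "big5" codec; vendor variants of Big5 may differ.

def sizes(text: str) -> tuple[int, int]:
    """Return (Big5 octets, UTF-8 octets) for text."""
    return len(text.encode("big5")), len(text.encode("utf-8"))


pure_cjk = "中文電子郵件"   # six Traditional Chinese characters
mixed = "Hello 中文"        # ASCII plus two CJK characters

for text in (pure_cjk, mixed):
    big5_octets, utf8_octets = sizes(text)
    print(f"{text!r}: Big5 {big5_octets}, UTF-8 {utf8_octets}, "
          f"ratio {big5_octets / utf8_octets:.2f}")
```

With this codec, ASCII characters take one octet in both encodings, so the 2:3 ratio applies to the CJK characters themselves and a mixed text lands somewhere between 2:3 and 1:1.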

>If one starts mixing in characters from Plane 2, then
>UTF-8 starts occupying four octets.  I don't know whether there
>are any such characters in Big5 but, if there are, the
>efficiency ratio would depend on the percentage of Plane 2
>characters and that, in turn, would depend on the specific
>corpus involved.

I don't think Big5 itself has Plane 2 characters. Even
vendor extension stuff must have made it into Extension A,
which is in the BMP. Even otherwise, unless it's a list
of characters in Plane 2 or something else Plane 2-specific,
there'd probably be only about 1 in a thousand or so from
Plane 2, so it's mostly irrelevant.
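To put a rough number on "mostly irrelevant", here is a small Python sketch of the arithmetic. It assumes every character is either a three-octet BMP character or a four-octet Plane 2 character; the function name `avg_utf8_octets` and the sample shares other than 1 in a thousand are hypothetical.

```python
# Illustrative sketch: average UTF-8 octets per character when a fraction
# of the text comes from Plane 2 (four octets) and the rest from the BMP
# CJK range (three octets). Other character classes are ignored.

def avg_utf8_octets(plane2_fraction: float) -> float:
    """Average UTF-8 octets per character for a given Plane 2 share."""
    if not 0.0 <= plane2_fraction <= 1.0:
        raise ValueError("plane2_fraction must be between 0 and 1")
    return 3 * (1 - plane2_fraction) + 4 * plane2_fraction


for share in (0.0, 0.001, 0.01):
    print(f"Plane 2 share {share:.3f}: "
          f"{avg_utf8_octets(share):.3f} octets per character")
```

At 1 in a thousand the average moves from 3.000 to 3.001 octets per character, which is why the Plane 2 share hardly changes the overall picture.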

>But, coming back to my opening comments, at today's bandwidth,
>processing, and storage costs, it would take a rather large body
>of text for any of this to make any difference.

Yes. These days, I'm less worried by large emails, but more
by their total number :-(.


Abel | 10 Jul 2008 11:41

Re: Body parts

Dear Charles,
What kind of efficiency do you mean?
Could you explain more?

BRs

Abel

> Can anyone provide data on the efficiency of UTF-8 and BIG5, for example?
Frank Ellermann | 10 Jul 2008 00:02

Re: Body parts

Shawn Steele wrote:

> I would like to see that the body SHOULD be encoded in UTF-8
> for UTF8SMTP.  Is there any such language and I just missed it?

No such language, you didn't miss it.  Sooner or later we will
use UTF-8 everywhere, but that is not directly related to EAI:

Users in an EAI-friendly environment might still have reasons
to prefer other charsets in their communications, e.g., if
some of their correspondents cannot handle UTF-8 well.  MUAs
supporting EAI likely offer UTF-8 as default, anything else
would be bizarre.  Adding a "you SHOULD do the obvious" norm
does not really help, or does it?

 Frank
