Re: Body parts
Martin Duerst <duerst <at> it.aoyama.ac.jp>
2008-07-11 02:33:13 GMT
At 03:03 08/07/11, John C Klensin wrote:
>So, one answer to your question is that for CJK characters that
>are coded in the Unicode BMP (Plane 0), the storage ratio is
>going to be strictly two octets of Big5 to three octets of
>UTF-8.
Yes. That's the basic ratio.
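The two-to-three ratio can be checked with a minimal sketch, assuming Python's built-in `big5` and `utf-8` codecs. The sample string and the names `sample`, `big5_size` and `utf8_size` are arbitrary choices for illustration, not something from this thread:

```python
# Sample text is an assumption: six common Traditional Chinese characters,
# all in the BMP and all covered by Big5.
sample = "中文電子郵件"

# Encode the same text both ways and count the resulting octets.
big5_size = len(sample.encode("big5"))    # two octets per character
utf8_size = len(sample.encode("utf-8"))   # three octets per BMP ideograph

print(f"{len(sample)} chars: Big5 = {big5_size} octets, UTF-8 = {utf8_size} octets")
print(f"UTF-8 / Big5 = {utf8_size / big5_size}")
```

For text that also contains ASCII (one octet in both encodings), the overall ratio moves closer to 1, so 3:2 is the worst case for BMP-only text.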
>If one starts mixing in characters from Plane 2, then
>UTF-8 starts occupying four octets. I don't know whether there
>are any such characters in Big5 but, if there are, the
>efficiency ratio would depend on the percentage of Plane 2
>characters and that, in turn, would depend on the specific
>corpus involved.
I don't think Big5 itself has Plane 2 characters. Even
vendor extension stuff must have made it into Extension A,
which is in the BMP. Even otherwise, unless it's a list
of characters in Plane 2 or something else Plane 2-specific,
there'd probably be only about 1 in a thousand or so from
Plane 2, so it's mostly irrelevant.
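To put a rough number on "mostly irrelevant", here is a small sketch that takes the 1-in-a-thousand figure above at face value (it is a guess, not a measurement). The function name `average_utf8_octets` and the two sample code points are assumptions made for illustration:

```python
# U+4E2D is an ordinary BMP ideograph; U+20000 is the first code point of
# Plane 2 (CJK Unified Ideographs Extension B). Both are illustrative picks.
bmp_char = "\u4e2d"
plane2_char = "\U00020000"

def average_utf8_octets(plane2_fraction: float) -> float:
    """Expected UTF-8 octets per character for a mix of BMP and Plane 2 CJK."""
    bmp_len = len(bmp_char.encode("utf-8"))        # 3 octets
    plane2_len = len(plane2_char.encode("utf-8"))  # 4 octets
    return (1 - plane2_fraction) * bmp_len + plane2_fraction * plane2_len

# Assumed mix: one Plane 2 character per thousand.
avg = average_utf8_octets(0.001)
print(f"average UTF-8 octets per character: {avg:.3f}")
```

Under that assumption the average rises from 3 to about 3.001 octets per character, which is why the Plane 2 share barely changes the comparison.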
>But, coming back to my opening comments, at today's bandwidth,
>processing, and storage costs, it would take a rather large body
>of text for any of this to make any difference.
Yes. These days, I'm less worried by large emails, but more
by their total number.