I think some languages pick an encoding for characters/strings and use that where appropriate in their runtime and libraries.
Java picked UTF-16, has a native 16-bit char type, makes every char/string bear that overhead, and falls back to surrogate pairs when a character does not fit in 2 bytes.
Go says a string is an immutable sequence of bytes. It has some built-in support for UTF-8 (in the range clause), but you do not have to bear the compiler/runtime overhead of analyzing the bytes until you need to. This is typically when you need to iterate over the actual characters, or slice into valid strings. At that point, your application takes the computation hit. There are libraries and built-in features to help with that:
- the range clause, for iterating over UTF-8 encoded strings
- unicode/utf8 and unicode/utf16, for walking runes in either format
- exp/utf8string, for caching the results of UTF-8 string introspection such as length, runes, etc. (à la what Java does)
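To make the "pay only when you need it" point concrete, here is a minimal sketch using the range clause, unicode/utf8 and unicode/utf16. The helper names (byteAndRuneCounts, runeOffsets, utf16Units) are made up for illustration and are not part of any library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// byteAndRuneCounts returns the length of s in bytes and in runes.
// len(s) is a constant-time lookup; counting runes has to walk the bytes.
func byteAndRuneCounts(s string) (nbytes, nrunes int) {
	return len(s), utf8.RuneCountInString(s)
}

// runeOffsets returns the byte offset at which each rune of s starts,
// as reported by the range clause.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

// utf16Units returns how many 16-bit code units s needs in UTF-16,
// which is roughly what a Java String would store.
func utf16Units(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	nbytes, nrunes := byteAndRuneCounts("héllo, 世界")
	fmt.Println("bytes:", nbytes, "runes:", nrunes)
	fmt.Println("rune offsets:", runeOffsets("aé世"))
	fmt.Println("UTF-16 units:", utf16Units("a😀"))
}
```

Only the calls that actually look at the contents (the range loop, RuneCountInString, the []rune conversion) do any decoding work; len does not.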
Coming from a deep Java background, I've grown to really appreciate the Go approach in practice. I've missed nothing and gained a lot from the Go design. I think it helps that the initial UTF-8 author is behind this.
As Rob mentioned, there's further, deeper support coming for Unicode. See exp/norm for some beginnings.
On Monday, April 2, 2012 3:31:04 AM UTC-4, LRN wrote:
On Monday, April 2, 2012 5:48:12 AM UTC+4, mark.edw...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
However, using an 8-bit Unicode string to contain arbitrary byte data sounds like a mistake to me. You'd never know if the sequences of valid UTF-8 in the byte data were really characters or something else, a SJIS source (or a JPEG source) could well contain fragments of valid UTF-8. Iterating through the SJIS would get the wrong runes, wouldn't divide up on the right character boundaries, etc. Programming languages that handle this carefully distinguish between an array of bytes and an 8-bit Unicode string. Often the separation was made after it became obvious that mixing them up causes no end of errors: an example is
http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Exactly the point.
Every time a program gets a chunk of bytes from anything that is not directly controlled by it (from another program or from a file), it HAS to validate that this chunk of bytes is a UTF-8-encoded string (or not, depending on, say, the file format). The "unicode/utf8" package does that, but you have to invoke the right functions yourself when it is appropriate.
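A minimal sketch of that validation step, assuming the input is supposed to be UTF-8 text. decodeText and errNotUTF8 are hypothetical names; utf8.Valid is the real function from "unicode/utf8".

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errNotUTF8 is a hypothetical error value for this sketch.
var errNotUTF8 = errors.New("input is not valid UTF-8")

// decodeText accepts bytes from an untrusted source and returns them
// as a string only if they form valid UTF-8.
func decodeText(raw []byte) (string, error) {
	if !utf8.Valid(raw) {
		return "", errNotUTF8
	}
	return string(raw), nil
}

func main() {
	good := []byte("caf\xc3\xa9")         // "café" encoded as UTF-8
	bad := []byte{0x83, 0x65, 0x83, 0x58} // 0x83 is a continuation byte with no lead byte

	fmt.Println(decodeText(good))
	fmt.Println(decodeText(bad))
}
```

Nothing forces you to call this; a plain string(raw) conversion would succeed on both inputs without complaint, which is exactly why the check has to be done by hand at the boundary.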
In almost all cases strings are just byte arrays. The fact that the compiler understands UTF-8-encoded string literals (and is able to produce byte arrays with the right contents) is irrelevant. The use of runes and UTF-8 string literals might have contributed to the confusion, but Go simply doesn't have 8-bit Unicode strings.
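A small sketch of what "just byte arrays" means in practice. rawBytes is a made-up helper name, and the byte values are arbitrary data chosen only because they are not valid UTF-8.

```go
package main

import "fmt"

// rawBytes copies the bytes of s out unchanged. Converting between
// string and []byte never validates or transcodes anything.
func rawBytes(s string) []byte {
	return []byte(s)
}

func main() {
	// Arbitrary binary data, not valid UTF-8, stored in a string.
	blob := "\xff\xd8\xff\xe0"
	fmt.Println(len(blob))                      // length in bytes
	fmt.Printf("%#x\n", blob[1])                // indexing yields a byte, not a rune
	fmt.Println(string(rawBytes(blob)) == blob) // round trip preserves every byte

	// A UTF-8 string literal is just the bytes the compiler emitted for it.
	e := "é"
	fmt.Println(len(e), rawBytes(e))
}
```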
The only exception is the for loop range clause, which iterates over
_runes_ rather than bytes, and (AFAICS from the language spec) assumes
that the string is UTF-8-encoded. And it does its kinda-validation only
one rune at a time.
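Here is a sketch of that one-rune-at-a-time behaviour on a string that is not valid UTF-8. Per the language spec, an invalid sequence yields U+FFFD and the loop then advances by a single byte. rangeRunes is a hypothetical helper name.

```go
package main

import "fmt"

// rangeRunes collects the runes that a range loop yields for s.
func rangeRunes(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "a\xffb" // 0xff can never appear in valid UTF-8
	for i, r := range s {
		fmt.Printf("offset %d: %U\n", i, r)
	}
	fmt.Println(len(rangeRunes(s)))
}
```

The loop neither panics nor stops at the bad byte, so you cannot tell invalid input from a genuine U+FFFD in the data just by ranging over it.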
The exception is probably for increased usability - UTF-8 is one of the best encodings for strings, so whenever someone wants to iterate over a string, it makes sense to use UTF-8. Unlike Python, Go doesn't have generators (well, you could write something like that with channels, and a 3-clause for loop could also be used), so there had to be one and only one way to iterate over a string with a range clause - and that happened to be the one-rune-out-of-UTF-8-at-a-time way (because for bytes you have []byte, and choosing any other encoding over UTF-8 would have been suboptimal).
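For comparison, a sketch of the 3-clause alternative: decoding runes by hand with utf8.DecodeRuneInString, next to a plain byte-wise loop. runesManually and bytesManually are hypothetical names; the decoding loop should produce the same runes as a range loop.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesManually decodes s one rune at a time with a 3-clause for loop.
func runesManually(s string) []rune {
	var out []rune
	for i, w := 0, 0; i < len(s); i += w {
		r, width := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		w = width // 1 for an invalid byte, so the loop always advances
	}
	return out
}

// bytesManually walks s byte by byte, with no decoding at all.
func bytesManually(s string) []byte {
	var out []byte
	for i := 0; i < len(s); i++ {
		out = append(out, s[i])
	}
	return out
}

func main() {
	s := "aé世"
	fmt.Printf("%U\n", runesManually(s))
	fmt.Printf("% x\n", bytesManually(s))
}
```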
Python 3 can afford the luxury of storing strings in UCS-2/UCS-4, Go can't. As for conversion, this is one of the pains you get when moving to Python 3 (you have to explicitly convert from bytes to string, specifying an encoding scheme).
But I agree with the OP, the FAQ should explain all that.