Karl | 30 Mar 21:17 2012

Unicode FAQ, and other impressions

Hi,

One thing I found surprisingly different / radical about Go was its
Unicode handling.  Strings are compared bytewise, len returns the
number of bytes, [] returns a byte, yet all string literals are
assumed to be UTF-8 and the for loop iterates using runes.
That's a bit odd, and yet there seems to be minimal doc and rationale
on this.  How many newbies won't realize a map of strings doesn't do
Unicode normalization of the keys?  How exactly do I get a map of
UnicodeNormalizedStrings?   Where is the Unicode FAQ, for people
coming from other languages?  For god's sake, the Wikipedia page for Go
has one word matching 'unicode'... and your FAQ isn't any better!
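The behaviours listed above can be seen in a few lines. This is only a minimal sketch: the helper names describe and distinctKeys are made up for illustration, and the sample strings are arbitrary.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length, the rune count, and the first byte
// of s. It assumes s is non-empty. (Hypothetical helper.)
func describe(s string) (byteLen, runeCount int, first byte) {
	return len(s), utf8.RuneCountInString(s), s[0]
}

// distinctKeys counts how many separate map keys the given strings
// produce. Keys are compared byte for byte. (Hypothetical helper.)
func distinctKeys(keys ...string) int {
	m := make(map[string]bool)
	for _, k := range keys {
		m[k] = true
	}
	return len(m)
}

func main() {
	s := "h\u00e9llo" // U+00E9 takes two bytes in UTF-8
	n, r, b := describe(s)
	fmt.Println(n, r, b) // byte length, rune count, first byte

	// range decodes UTF-8: i is a byte offset, c is a rune.
	for i, c := range s {
		fmt.Printf("%d:%c ", i, c)
	}
	fmt.Println()

	// Precomposed U+00E9 versus "e" plus combining acute U+0301: the same
	// text to a reader, but different bytes, so two different map keys.
	fmt.Println(distinctKeys("\u00e9", "e\u0301"))
}
```

If normalized keys are wanted, the strings have to be normalized by the program before they are used as keys; the map itself never does it.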

Which brings me to my next point: the doc on map sucks.  "The key can
be of any type for which the equality operator is defined."  Excuse me,
what is this operator and how do I define it on my type?  Or am I not
able to at all?

Anyway, THANK you for properly zeroing memory from new().  (Objective-C
does that too.)  I HATE worrying about and trying to find
uninitialized memory problems via valgrind for C++.  Thank you for not
crashing when returning a pointer to a local var, and abstracting any
differences for variables living on the heap and on stack.    Thank
you for having a clean syntax (named return parameters, consistent
ordering) and gofmt (I initially didn't like x*x having no space but
now I see it's due to binding precedence).  I like the easy doc /
comment syntax too.

Cheers on V1.  I have no immediate plans to use it, but as someone
who has done a lot of C++ I like the direction.  I think C++ won't be

Ian Lance Taylor | 30 Mar 21:46 2012

Re: Unicode FAQ, and other impressions

Karl <karl.pickett@...> writes:

> One thing I found surprisingly different / radical about GO was its
> unicode handling.  strings are compared bytewise, length returns
> number of bytes, [] returns a byte, yet all string literals are
> assumed to be utf-8 unicode and the for loop iterates using runes.

String literals need not contain valid UTF-8.  You can use \x and
friends to put arbitrary bytes in string literals.

> That's a bit odd, and yet there seems to be minimal doc and rationale
> on this.  How many newbies won't realize a map of strings doesn't do
> unicode normalization of the keys?  How exactly do I get a map of
> UnicodeNormalizedStrings?   Where is the Unicode FAQ, for people
> coming from other languages?  For gods sake, the wikipedia page for go
> has one word matching 'unicode'... and your faq isn't any better!

We are actively working on support for normalization, collation, etc.,
but it's a complex subject and it's important to get it right.  You can
see some initial ideas in the exp/norm package.

> Which brings me to my next point.. the doc on map sucks.  "The key can
> be of any type for which the equality operator is defined"  excuse me,
> what is this operator and how do I define it on my type?  Or am I not
> able to at all?

You can not define equality operators.  I assume you are quoting
Effective Go.  Right after that clause, the document lists the types
that can be used as map key types.
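As a rough illustration of which types work as keys: the built-in == is all there is, so a type is usable as a key exactly when the language already knows how to compare it. The point type and visits function below are hypothetical names, not anything from Effective Go.

```go
package main

import "fmt"

// point is a hypothetical key type. All of its fields are comparable,
// so the built-in == works on it and it can be used as a map key.
type point struct {
	x, y int
}

// visits counts how often each point occurs in ps.
func visits(ps []point) map[point]int {
	counts := make(map[point]int)
	for _, p := range ps {
		counts[p]++
	}
	return counts
}

func main() {
	counts := visits([]point{{1, 2}, {3, 4}, {1, 2}})
	a, b := point{1, 2}, point{3, 4}
	fmt.Println(counts[a], counts[b])

	// Arrays of comparable elements are comparable too.
	key := [4]byte{0xde, 0xad, 0xbe, 0xef}
	digests := map[[4]byte]string{key: "example"}
	fmt.Println(digests[key])

	// Slice, map, and function types are not comparable, so a
	// declaration like this one would not compile:
	//     var bad map[[]byte]string
}
```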


karl.pickett | 30 Mar 22:17 2012

Re: Unicode FAQ, and other impressions

On Friday, March 30, 2012 2:46:02 PM UTC-5, Ian Lance Taylor wrote:

Karl <karl.pickett@...> writes:

> One thing I found surprisingly different / radical about GO was its
> unicode handling.  strings are compared bytewise, length returns
> number of bytes, [] returns a byte, yet all string literals are
> assumed to be utf-8 unicode and the for loop iterates using runes.

String literals need not contain valid UTF-8.  You can use \x and
friends to put arbitrary bytes in string literals.


Yes, I noticed that and verified that I got the replacement character (err... rune).
 

> Which brings me to my next point.. the doc on map sucks.  "The key can
> be of any type for which the equality operator is defined"  excuse me,
> what is this operator and how do I define it on my type?  Or am I not
> able to at all?

You can not define equality operators.  I assume you are quoting
Effective Go.  Right after that clause, the document lists the types
that can be used as map key types.


The words 'any type' give a large amount of false hope, but thank you for crushing it.  I would suggest placing 'You can not define equality operators' nearby, e.g. (The equality operator, ==, can not be defined by a user).

Anyway, if a good Go Unicode FAQ appears that compares Go with Python 3 (for example, Python's utf8b round-tripping to system interfaces), I'd like to read it.  Fortunately I don't have to do much Unicode work, but I do enjoy lurking on the topic.
 

Ian

Martin Geisler | 31 Mar 12:16 2012

Re: Unicode FAQ, and other impressions

Ian Lance Taylor <iant@...> writes:

> Karl <karl.pickett@...> writes:
>
>> One thing I found surprisingly different / radical about GO was its
>> unicode handling. strings are compared bytewise, length returns
>> number of bytes, [] returns a byte, yet all string literals are
>> assumed to be utf-8 unicode and the for loop iterates using runes.
>
> String literals need not contain valid UTF-8. You can use \x and
> friends to put arbitrary bytes in string literals.

Why is that even allowed? Why not require people to use a []byte if
they're going to pass around arbitrary bytes?

The answer might be that a []byte is not immutable, but that seems to be
an orthogonal issue, i.e., using a string where you really want an
immutable []byte feels like a misuse of string to me.

-- 
Martin Geisler

Mercurial links: http://mercurial.ch/

Jan Mercl | 31 Mar 12:57 2012

Re: Unicode FAQ, and other impressions



On Saturday, March 31, 2012 12:16:49 PM UTC+2, Martin Geisler wrote:
Ian Lance Taylor writes:

> Karl writes:
>
>> One thing I found surprisingly different / radical about GO was its
>> unicode handling. strings are compared bytewise, length returns
>> number of bytes, [] returns a byte, yet all string literals are
>> assumed to be utf-8 unicode and the for loop iterates using runes.
>
> String literals need not contain valid UTF-8. You can use \x and
> friends to put arbitrary bytes in string literals.

Why is that even allowed? Why not require people to use a []byte if
they're going to pass around arbitrary bytes?

- Those "arbitrary bytes" could just as easily be text in some perfectly valid encoding that happens not to be UTF-8. If such data were not allowed as a value of type string, those encodings would become second-class citizens in Go, which is not desirable.

- Any byte sequence, whatever its encoding (e.g. a gob), can be used as a map key when it is held in a string. A []byte cannot. This is probably even more important than the previous reason.
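A small sketch of the map-key point, under the assumption that the records arrive as raw []byte values. The function name countByKey is made up, and the two sample bytes are, to my knowledge, the Shift-JIS encoding of one kanji; any non-UTF-8 bytes would serve equally well.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countByKey tallies raw byte records, using their bytes as map keys.
// A []byte cannot be a key, so each record is converted to a string.
// The conversion copies the bytes; it does not validate or re-encode them.
func countByKey(records [][]byte) map[string]int {
	counts := make(map[string]int)
	for _, rec := range records {
		counts[string(rec)]++
	}
	return counts
}

func main() {
	// Assumed for illustration: 0x93 0xFA as a Shift-JIS character.
	// Either way, this byte pair is not valid UTF-8.
	sjis := []byte{0x93, 0xfa}
	fmt.Println(utf8.ValidString(string(sjis)))

	counts := countByKey([][]byte{sjis, []byte("go"), sjis})
	fmt.Println(counts[string(sjis)], counts["go"])
}
```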

Rob 'Commander' Pike | 31 Mar 14:30 2012

Re: Unicode FAQ, and other impressions

Strings are *not* required to be UTF-8. Go source code *is* required
to be UTF-8. There is a complex path between the two.

In short, there are three kinds of strings, and you're conflating
them, a common misunderstanding. They are:

1) the substring of the source that lexes into a string literal.
2) a string literal.
3) a value of type string.

Only the first is required to be UTF-8. The second is required to be
written in UTF-8, but its contents are interpreted various ways (*)
and may encode arbitrary bytes. The third can contain any bytes at
all.

Try this on:

var s string = "\xFF語"

Source substring: "\xFF語", UTF-8 encoded. The data:
22 5c 78 46 46 e8 aa 9e 22

String literal: \xFF語 (between the quotes). The data:
5c 78 46 46 e8 aa 9e

The string value (unprintable; this is a UTF-8 stream). The data:
ff e8 aa 9e

And for the record, the characters (code points):

<erroneous byte FF, will appear as U+FFFD if you range over the string value>
語 U+8a9e

Please make a note of it.

-rob

* Examples:
\u1234 \U00012345 ÿ 語 correspond to various numbers of bytes in UTF-8
\xFF and \377 each encode exactly one byte, not UTF-8
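The string value and the code points given above can be checked with a short program. This is just a sketch; hexBytes, runes and escapeLens are illustrative helper names.

```go
package main

import "fmt"

// hexBytes formats the bytes of s as space-separated hex pairs.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

// runes collects the values a range loop over s yields.
func runes(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

// escapeLens returns the byte lengths of three one-"character" literals:
// an octal byte escape, a hex byte escape, and a Unicode escape.
func escapeLens() (octal, hex, unicode int) {
	return len("\377"), len("\xFF"), len("\u00FF")
}

func main() {
	var s string = "\xFF語"
	fmt.Println(hexBytes(s))     // ff e8 aa 9e
	fmt.Printf("%U\n", runes(s)) // [U+FFFD U+8A9E]
	fmt.Println(escapeLens())    // 1 1 2
}
```

The last line shows the difference between the byte escapes and \u: the first two put a single byte FF into the string, while \u00FF is stored as the two-byte UTF-8 encoding of U+00FF.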

Martin Geisler | 31 Mar 17:15 2012

Re: Unicode FAQ, and other impressions

"Rob 'Commander' Pike" <r <at> golang.org> writes:

> Strings are *not* required to be UTF-8. Go source code *is* required
> to be UTF-8. There is a complex path between the two.
>
> In short, there are three kinds of strings, and you're conflating
> them, a common misunderstanding. They are:
>
> 1) the substring of the source that lexes into a string literal.
> 2) a string literal.
> 3) a value of type string.

This is all perfectly clear.

My question was really why the string type is allowed to carry non-UTF-8
data. Especially when the built-in range construct has a clear
preference for UTF-8 encoded characters.

-- 
Martin Geisler

Mercurial links: http://mercurial.ch/

Rob 'Commander' Pike | 31 Mar 23:40 2012

Re: Unicode FAQ, and other impressions

Performance for one thing. If the string must always be valid UTF-8
then relatively expensive validation is required for many operations.
Plus making those operations able to fail complicates the interface.
Better to put such things in libraries.
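A sketch of what "in libraries" can look like in practice: validation happens once, at the point where the program chooses to do it, and the failure is an ordinary error value. The names requireUTF8, sanitize and errNotUTF8 are hypothetical; utf8.ValidString and strings.ToValidUTF8 are standard library functions (the latter needs Go 1.13 or later).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errNotUTF8 is a hypothetical sentinel error for this sketch.
var errNotUTF8 = errors.New("input is not valid UTF-8")

// requireUTF8 rejects input that is not valid UTF-8. The cost of the
// scan is paid only by callers that ask for it.
func requireUTF8(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", errNotUTF8
	}
	return s, nil
}

// sanitize replaces each run of invalid bytes with U+FFFD instead of
// failing.
func sanitize(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	_, err := requireUTF8("bad\xff")
	fmt.Println(err)
	fmt.Printf("%q\n", sanitize("a\xffb"))
}
```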

Interoperability for another. What happens if you're reading a
database and it returns a string in Shift-JIS? If you know it's
Shift-JIS, you can plan for it and use the language to help you. If Go
said "no" that would not be helpful.

But mostly it's because there's no compelling need to restrict them
this way. The programmer should have some freedom in their use.

-rob

mark.edward.davis | 2 Apr 03:48 2012

Re: Unicode FAQ, and other impressions

The Unicode standard also defines an "8-bit Unicode string", which is broader than a UTF-8 string. However, those subsegments of the string that are valid UTF-8 are interpreted as UTF-8. The purpose is both performance and functionality: an 8-bit Unicode string that contains valid UTF-8 can be split at any byte boundary into two pieces. Both pieces are valid 8-bit Unicode strings, even though they may not be valid UTF-8 strings. Those fragments can be combined back into the original 8-bit Unicode string without problem. (If an implementation always forced fragments to be valid UTF-8, that wouldn't work; you'd either get an exception or have the data mangled, e.g. by substituting FFFD or 1A.)


However, using an 8-bit Unicode string to contain arbitrary byte data sounds like a mistake to me. You'd never know if the sequences of valid UTF-8 in the byte data were really characters or something else; an SJIS source (or a JPEG source) could well contain fragments of valid UTF-8. Iterating through the SJIS would get the wrong runes, wouldn't divide up on the right character boundaries, etc. Programming languages that handle this carefully distinguish between an array of bytes and an 8-bit Unicode string. Often the separation was made after it became obvious that mixing them up causes no end of errors: an example is http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

On Saturday, March 31, 2012 2:40:05 PM UTC-7, r wrote:
Performance for one thing. If the string must always be valid UTF-8
then relatively expensive validation is required for many operations.
Plus making those operations able to fail complicates the interface.
Better to put such things in libraries.

Interoperability for another. What happens if you're reading a
database and it returns a string in Shift-JIS? If you know it's
Shift-JIS, you can plan for it and use the language to help you. If Go
said "no" that would not be helpful.

But mostly it's because there's no compelling need to restrict them
this way. The programmer should have some freedom in their use.

-rob

Rob 'Commander' Pike | 2 Apr 05:18 2012

Re: Unicode FAQ, and other impressions

Go does not have Unicode strings. Go has some modest support for
Unicode in strings, but that's a different point. As I tried to
explain earlier in this thread, this is a frequent source of
misunderstanding.

There will be a more complete story for Unicode in Go at some point,
but what's there now is less than rudimentary.

-rob

LRN | 2 Apr 09:31 2012

Re: Unicode FAQ, and other impressions



On Monday, April 2, 2012 5:48:12 AM UTC+4, mark.edw...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
However, using an 8-bit Unicode string to contain arbitrary byte data sounds like a mistake to me. You'd never know if the sequences of valid UTF-8 in the byte data were really characters or something else, a SJIS source (or a JPEG source) could well contain fragments of valid UTF-8. Iterating through the SJIS would get the wrong runes, wouldn't divide up on the right character boundaries, etc. Programming languages that handle this carefully distinguish between an array of bytes and an 8-bit Unicode string. Often the separation was made after it became obvious that mixing them up causes no end of errors: an example is http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Exactly the point.
Every time a program gets a chunk of bytes from anything that is not directly controlled by it (from another program or from a file), it HAS to validate this chunk of bytes to be a UTF-8-encoded (or not, depending on, say, file format) string. The "unicode/utf8" package does that, but you have to invoke the right functions yourself when it is appropriate.
In almost all cases strings are just byte arrays. The fact that the compiler understands UTF-8-encoded string literals (and is able to produce byte arrays with the right contents) is irrelevant. The use of runes and UTF-8 string literals might have contributed to the confusion, but Go simply doesn't have 8-bit Unicode strings.
The only exception is the for loop range expression, which iterates over _runes_ rather than bytes, and (AFAICS from the language specs) assumes that the string is UTF-8-encoded. And it does its kinda-validation only one rune at a time.
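To make the "one rune at a time" behaviour concrete, here is a sketch that walks a string by hand with utf8.DecodeRuneInString and compares the result with what range produces. The function names decodeAll and rangeAll are invented for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll walks s the way a range loop does, one rune at a time.
// An invalid byte yields utf8.RuneError (U+FFFD) with width 1, and
// decoding simply resumes at the next byte.
func decodeAll(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		r, width := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += width
	}
	return out
}

// rangeAll collects the runes produced by the built-in range clause.
func rangeAll(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "a\xff\u00e9" // one stray byte between two valid characters
	fmt.Printf("%U\n", decodeAll(s))
	fmt.Printf("%U\n", rangeAll(s))

	// A three-clause loop sees the raw bytes instead.
	for i := 0; i < len(s); i++ {
		fmt.Printf("%02x ", s[i])
	}
	fmt.Println()
}
```

Neither loop stops or reports an error at the bad byte; a program that cares has to check for utf8.RuneError with width 1, or validate up front.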

The exception is probably for increased usability - UTF-8 is one of the best encodings for strings, so whenever someone wants to iterate over a string, it makes sense to use UTF-8. Unlike Python, Go doesn't have generators (well, you could write something like that with channels... 3-clause for-loop could also be used), so there had to be one and only one way to iterate over a string with a range clause  - and that happened to be one-rune-out-of-UTF-8-at-a-time way (because for bytes you have []byte, and choosing any other encoding over UTF-8 would have been suboptimal).

Python3 can afford the luxury of storing strings in UCS-2/UCS-4, Go can't. As for conversion, this is one of the pains you get when moving to Python 3 (you have to explicitly convert from bytes to string, specifying an encoding scheme).

But I agree with the OP: the FAQ should explain all that.

Ugorji Nwoke | 2 Apr 15:01 2012

Re: Unicode FAQ, and other impressions

To reiterate what LRN said,


I think some languages pick an encoding for characters/strings and use that where appropriate in their runtime and libraries.

Java picked UTF-16, has a native char type, makes every char/string bear the overhead, and does magic when a character extends beyond 2 bytes.

Go says a string is an immutable slice of bytes. It has some built-in support for UTF-8 (in the range clause), but says you do not have to bear the compiler/runtime overhead of analyzing the bytes until you need to. This is typically when you need to iterate over the actual characters, or slice into valid strings. At that time, your application takes the computation hit. There are libraries and built-in functions to help with that:
- the range clause for iterating over UTF-8 encoded strings
- unicode/utf8 and unicode/utf16 for walking runes in either format
- exp/utf8string for caching the results of a UTF-8 string introspection like length, runes, etc. (à la what Java does)
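As a small example of the library route, the sketch below converts between a Go string and UTF-16 code units with unicode/utf16. The wrapper names toUTF16 and fromUTF16 are made up; it assumes the input string is valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 converts a UTF-8 encoded Go string into UTF-16 code units.
// Characters outside the Basic Multilingual Plane become surrogate pairs.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

// fromUTF16 converts UTF-16 code units back into a UTF-8 encoded string.
func fromUTF16(units []uint16) string {
	return string(utf16.Decode(units))
}

func main() {
	s := "a\u8a9e\U0001F600" // 1-byte, 3-byte and 4-byte characters in UTF-8
	units := toUTF16(s)

	// Bytes in UTF-8, number of runes, number of UTF-16 code units.
	fmt.Println(len(s), len([]rune(s)), len(units)) // 8 3 4
	fmt.Printf("%04x\n", units)                     // [0061 8a9e d83d de00]
	fmt.Println(fromUTF16(units) == s)              // true
}
```

The cost of the conversion is paid only where the program calls it, which is the trade-off described above.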

Coming from a deep Java background, I've grown to really appreciate the Go approach in practice. I've missed nothing and gained a lot from the Go design. I think it helps that one of the original authors of UTF-8 is behind this.

As Rob mentioned, there's further deeper support coming for Unicode. See exp/norm for some beginnings.


