Paul Koning | 19 May 2010 17:29

wchar_t encoding?

Gents,

I'm working on a patch to gdb 7.1 to make it work on NetBSD.  The issue
is that GDB 7 uses iconv to handle character strings, and uses wide
chars internally so it can handle various non-ASCII scripts.

The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t".  That means "whatever the encoding is for the
wchar_t data type".  GNU libiconv supports that, so on platforms that
use that library things are fine.

NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
So I proposed a patch that substitutes what appears to be used instead,
namely UCS-4 in platform native byte order (so "ucs-4le" on x86, for
example).  This seems to work.
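
A minimal sketch of the substitution described above, not the actual gdb
patch.  The helper names native_ucs4_name() and open_wide_converter() are
made up for illustration, and the sketch assumes the local iconv accepts
the names "UCS-4LE" and "UCS-4BE":

```c
#include <assert.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: iconv name for UCS-4 in host byte order. */
const char *native_ucs4_name(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    /* On a little-endian host the low-order byte is stored first. */
    memcpy(&first, &probe, 1);
    return first == 1 ? "UCS-4LE" : "UCS-4BE";
}

/* Hypothetical helper: ask for "wchar_t" first and fall back to native
 * UCS-4 when the local iconv does not know that name. */
iconv_t open_wide_converter(const char *from)
{
    iconv_t cd;

    assert(from != NULL);
    cd = iconv_open("wchar_t", from);
    if (cd == (iconv_t)-1)
        cd = iconv_open(native_ucs4_name(), from);
    return cd;
}
```

The fallback is only correct if wchar_t really holds UCS-4 values on the
platform, which is exactly the open question.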

The trouble is that I'm getting pushback on the patch, because of
concerns that the encoding used for wchar_t is not actually UCS-4.  In
particular, there is this article:
http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
which says that on Solaris and FreeBSD the encoding of wchar_t is
"undocumented and locale dependent".  (Ye gods!)

Now, NetBSD is not FreeBSD... so... what is the answer for NetBSD?  Is
it like FreeBSD?  (If so, it would be good to fix that.)  Or is it a
fixed encoding, and if so, is it indeed ucs-4?

Thanks,
	paul


Martin Husemann | 19 May 2010 17:35

Re: wchar_t encoding?

On Wed, May 19, 2010 at 11:29:38AM -0400, Paul Koning wrote:
> NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.

It's probably easiest to add an alias for this and get that change pulled
up.

Martin

Paul Koning | 19 May 2010 19:55

RE: wchar_t encoding?

> On Wed, May 19, 2010 at 11:29:38AM -0400, Paul Koning wrote:
> > NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
> 
> It's probably easiest to add an alias for this and get that change pulled
> up.

That's one approach.  Another is to teach gdb to ask for a different
name, which isn't a particularly hard bit of configure machinery.

The problem is "alias for what?"

	paul

Martin Husemann | 19 May 2010 21:10

Re: wchar_t encoding?

On Wed, May 19, 2010 at 01:55:47PM -0400, Paul Koning wrote:
> The problem is "alias for what?"

Yes - I don't know, and I'd argue, gdb shouldn't know either ;-)

Martin

Paul Koning | 19 May 2010 21:44

RE: wchar_t encoding?

> On Wed, May 19, 2010 at 01:55:47PM -0400, Paul Koning wrote:
> > The problem is "alias for what?"
> 
> Yes - I don't know, and I'd argue, gdb shouldn't know either ;-)

I guess I didn't explain well enough.

What's going on is this: the target being debugged has strings that gdb
needs to handle.  It is told what encoding is used for those strings
(via user commands, defaulting in some suitable way).

To allow for maximum flexibility, any internal processing on those
strings is done in wchar_t form.  Some of the work involves calling
various wide char support routines, like iswprint().  Those functions
assume (perhaps implicitly) a particular encoding, perhaps ucs-4,
perhaps something else.  For example, in Solaris the answer (apparently)
is "something else and it depends on the locale".

So when GDB reads a string from the target it feeds it to iconv and asks
it to convert from whatever was specified as the target's encoding into
"the encoding that the wchar support routines expect to find in wchar_t
data".

GDB doesn't particularly want to know what that encoding is, but it has
to ask for a specific encoding or iswprint() will get the wrong answer.
This is why libiconv supports the encoding name "wchar_t" in the first
place.
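
The flow described above can be sketched roughly as follows.  This is an
illustration, not gdb's code: count_printable() is a hypothetical helper,
the 256-element buffer is an arbitrary limit, and the sketch assumes an
iconv that accepts the name "wchar_t" (as GNU libiconv does):

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: convert len bytes in encoding `from` to wchar_t
 * and count the printable wide characters.  Returns -1 on any failure,
 * including input that does not fit the fixed-size buffer. */
long count_printable(const char *from, const char *bytes, size_t len)
{
    wchar_t wide[256];
    char *in = (char *)bytes;   /* iconv takes char **, it does not modify the input */
    char *out = (char *)wide;   /* destination is really a wchar_t buffer */
    size_t inleft = len, outleft = sizeof wide;
    size_t rc, n, i;
    long printable = 0;
    iconv_t cd;

    assert(from != NULL && bytes != NULL);
    cd = iconv_open("wchar_t", from);
    if (cd == (iconv_t)-1)
        return -1;
    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    /* Number of wide characters actually produced. */
    n = (sizeof wide - outleft) / sizeof(wchar_t);
    for (i = 0; i < n; i++)
        if (iswprint((wint_t)wide[i]))
            printable++;
    return printable;
}
```

iswprint() only gives the right answer if the values iconv wrote are in
whatever representation the C library uses for wchar_t in the current
locale.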

If it's possible to add that encoding name to the iconv in NetBSD, that
would be a good solution.  I tried to read the iconv code and got
[...]

Valeriy E. Ushakov | 20 May 2010 05:55

Re: wchar_t encoding?

Paul Koning <Paul_Koning <at> dell.com> wrote:

> I'm working on a patch to gdb 7.1 to make it work on NetBSD.  The issue
> is that GDB 7 uses iconv to handle character strings, and uses wide
> chars internally so it can handle various non-ASCII scripts.
> 
> The trouble for NetBSD is that it asks iconv to translate to a character
> set named "wchar_t".  That means "whatever the encoding is for the
> wchar_t data type".  GNU libiconv supports that, so on platforms that
> use that library things are fine.
>
> The trouble is that I'm getting pushback on the patch, because of
> concerns that the encoding used for wchar_t is not actually UCS-4.
> In particular, there is this article:
> http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
> which says that on Solaris and FreeBSD the encoding of wchar_t is
> "undocumented and locale dependent".  (Ye gods!)

Why are they so surprised about that?  C99 says:

       3.7.3
       [#1] wide character
       bit  representation  that fits in an object of type wchar_t,
       capable of representing any character in the current locale

It's simply impossible to always use unicode as the only encoding for
wchar_t, since not all charsets are 1:1 with unicode.

Besides, iconv does not return (fsvo "return") wide strings, it
returns good old pointer to char.  Do they pass a pointer to wchar_t
[...]

Paul Koning | 20 May 2010 16:06

RE: wchar_t encoding?

> http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
> > which says that on Solaris and FreeBSD the encoding of wchar_t is
> > "undocumented and locale dependent".  (Ye gods!)
> 
> Why are they so surprised about that?  C99 says:
> 
>        3.7.3
>        [#1] wide character
>        bit  representation  that fits in an object of type wchar_t,
>        capable of representing any character in the current locale
> 
> It's simply impossible to always use unicode as the only encoding for
> wchar_t, since not all charsets are 1:1 with unicode.

That wasn't "they" -- the editorial comment was mine.  I thought that
Unicode by now is complete enough to be able to handle other charsets.
It sounds like that's not true, or at least wasn't 12 years ago.  Can
you give an example of a charset for which Unicode is not sufficient?

> Besides, iconv does not return (fsvo "return") wide strings, it
> returns good old pointer to char.  Do they pass a pointer to wchar_t
> as destination?

Yes.  The iconv documentation says that the arguments are buffer
pointers, so their type is whatever the source or destination encoding
name implies.

> If they just assume it's going to be a pointer to wide string, then
> correct implementation of "wchar_t" is for iconv to convert to a plain
[...]

Valeriy E. Ushakov | 20 May 2010 19:46

Re: wchar_t encoding?

Paul Koning <Paul_Koning <at> dell.com> wrote:

>> Or do they actually assume it's gonna be utf32?
> 
> No, that's exactly the issue.
> 
> The C99 rule you quoted says (or at least implies) that the encoding of
> wchar_t is locale dependent.  So the question is: how does a program
> find out WHAT encoding wchar_t uses right now?  I don't see any API for
> obtaining that information.  Clearly this is necessary -- how else can a
> program construct properly encoded wide char data if it needs to do so
> (as GDB does)?

There's an API to convert between plain chars/strings and wide
chars/strings, and there is a stdio API for wide chars/strings.

Why is it necessary to know the wide char bit patterns?
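
As an illustration of that point, here is a minimal sketch that classifies
characters using only the standard conversion API, without ever looking
at the wide char bit patterns.  count_printable_mb() is a hypothetical
helper, and the input is assumed to be in the charset of the current
LC_CTYPE locale:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count printable characters in a multibyte string
 * encoded in the current LC_CTYPE charset.  Returns -1 on invalid input. */
long count_printable_mb(const char *s)
{
    mbstate_t st;
    size_t left;
    long printable = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* initial conversion state */
    left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or incomplete sequence */
        if (n == 0)
            break;                  /* NUL; not reachable within strlen(s) bytes */
        if (iswprint((wint_t)wc))
            printable++;
        s += n;
        left -= n;
    }
    return printable;
}
```

This only covers strings already in the locale's own charset; it does not
by itself handle a target string in some other encoding.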

SY, Uwe
-- 
uwe <at> stderr.spb.ru                       |       Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/          |       Ist zu Grunde gehen

Paul Koning | 20 May 2010 21:57

RE: wchar_t encoding?

> >> Or do they actually assume it's gonna be utf32?
> >
> > No, that's exactly the issue.
> >
> > The C99 rule you quoted says (or at least implies) that the encoding of
> > wchar_t is locale dependent.  So the question is: how does a program
> > find out WHAT encoding wchar_t uses right now?  I don't see any API for
> > obtaining that information.  Clearly this is necessary -- how else can a
> > program construct properly encoded wide char data if it needs to do so
> > (as GDB does)?
> 
> There's api to convert between plain chars/strings and wide
> chars/strings, there is stdio api for wide chars/strings.
> 
> Why is that necessary to know the wide char bit patterns?

Maybe it isn't.  I'm trying to solve the problem GDB needs to solve with
minimal changes to GDB.

What it needs to do: it's given a string (narrow or wide) on a target
system.  It's told (by the user, defaulted in some suitable way) what
encoding that string has.  GDB reads that string from memory.  It then
wants to do something with it, for example print it.  It also needs to
do some basic processing, for example test for non-printable characters.

The current scheme is, in outline:
[...]

Valeriy E. Ushakov | 20 May 2010 20:11

Re: wchar_t encoding?

Paul Koning <Paul_Koning <at> dell.com> wrote:

>> It's simply impossible to always use unicode as the only encoding for
>> wchar_t, since not all charsets are 1:1 with unicode.
> 
> That wasn't "they" -- the editorial comment was mine.  I thought that
> Unicode by now is complete enough to be able to handle other charsets.
> It sounds like that's not true, or at least wasn't 12 years ago.  Can
> you give an example of a charset for which Unicode is not sufficient?

I can invent an infinite number of them - it's a matter of principle :).
The whole point is that the C locale API (warts and all) is supposed to be
*completely* agnostic of charset internals: you should be able to define
external locale information as grokked by your C library, set your
LC_CTYPE & co. accordingly, and a well-behaved C program is supposed to
just work.

For a real-life example, consider something like CSX (classical
Sanskrit extended - a charset used to represent Latin transliteration
of classical Sanskrit).  It has e.g. a character for "r with dot below
with macron with acute".  Of course you can represent it using Unicode
(you can iconv between CSX and utf*), but you will need a sequence of
combining marks, i.e. it's not a 1:1 mapping, so a Unicode wchar_t
cannot represent that character.
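
To make the example concrete, here is a small sketch of how that
character comes out as Unicode code points.  It assumes the shortest
form is the precomposed U+1E5D (r with dot below and macron) followed by
the combining acute U+0301; the CSX code for the character is not shown
because it is not given here, and the function names are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "r with dot below, macron and acute" as Unicode code points:
 * U+1E5D LATIN SMALL LETTER R WITH DOT BELOW AND MACRON, then
 * U+0301 COMBINING ACUTE ACCENT. */
static const uint32_t r_dot_macron_acute[] = { 0x1E5D, 0x0301 };

/* Number of code points in the sequence above. */
size_t r_dot_macron_acute_len(void)
{
    return sizeof r_dot_macron_acute / sizeof r_dot_macron_acute[0];
}

/* Hypothetical check: a wchar_t holding one Unicode code point can only
 * represent a character whose code point sequence has length 1. */
int fits_in_one_wchar(const uint32_t *cps, size_t n)
{
    assert(cps != NULL);
    return n == 1;
}
```

One CSX character therefore maps to more than one Unicode code point,
which is the 1:1 problem described above.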

SY, Uwe
-- 
uwe <at> stderr.spb.ru                       |       Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/          |       Ist zu Grunde gehen


Paul Koning | 20 May 2010 17:01

RE: wchar_t encoding?

> > ...
> > The trouble for NetBSD is that it asks iconv to translate to a character
> > set named "wchar_t".  That means "whatever the encoding is for the
> > wchar_t data type".  GNU libiconv supports that, so on platforms that
> > use that library things are fine.

I did some digging to see how libiconv implements that feature.

If  __LIBC_ISO_10646__ is defined then it simply aliases this to an
appropriate width Unicode (ucs2 or ucs4).  That applies to Linux, for
example.

If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
then it uses that function.  More precisely, a conversion to "wchar_t"
first converts to Unicode, which is then fed into mbrtowc to produce the
wchar_t encoding.  mbrtowc knows about any locale issues...

I guess that means that "multibyte" is Unicode, or UTF-8???  I don't see
that documented in any manpage.  It also means that if you have a source
character that's not in Unicode but is in whatever encoding wchar_t
uses, it would not be handled by the libiconv implementation of iconv()
because it uses Unicode as an intermediate form.
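
The first case can be sketched as a compile-time check.  This is not
libiconv's source: wchar_unicode_alias() and check_ascii_wide() are
hypothetical names, and the sketch assumes the macro meant above is the
C99 macro __STDC_ISO_10646__, which an implementation defines when
wchar_t values are ISO 10646 code points:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: which fixed-width Unicode form "wchar_t" could be
 * aliased to, or NULL when the C library makes no such promise. */
const char *wchar_unicode_alias(void)
{
#ifdef __STDC_ISO_10646__
    if (sizeof(wchar_t) == 4)
        return "UCS-4";
    if (sizeof(wchar_t) == 2)
        return "UCS-2";
#endif
    return NULL;    /* encoding of wchar_t is not known at compile time */
}

/* Sanity check for the Unicode case: 'A' must have the value 0x41. */
void check_ascii_wide(void)
{
#ifdef __STDC_ISO_10646__
    assert(L'A' == 0x41);
#endif
}
```

When the function returns NULL, the wide char encoding has to be reached
through the locale-aware conversion functions instead.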

	paul

Valeriy E. Ushakov | 20 May 2010 19:58

Re: wchar_t encoding?

Paul Koning <Paul_Koning <at> dell.com> wrote:
>> > ...
>> > The trouble for NetBSD is that it asks iconv to translate to a character
>> > set named "wchar_t".  That means "whatever the encoding is for the
>> > wchar_t data type".  GNU libiconv supports that, so on platforms that
>> > use that library things are fine.
> 
> I did some digging to see how libiconv implements that feature.
> 
> If  __LIBC_ISO_10646__ is defined then it simply aliases this to an
> appropriate width Unicode (ucs2 or ucs4).  That applies to Linux, for
> example.
> 
> If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
> then it uses that function.  More precisely, a conversion to "wchar_t"
> first converts to Unicode, which is then fed into mbrtowc to produce the
> wchar_t encoding.  mbrtowc knows about any locale issues...
>
> I guess that means that "multibyte" is Unicode, or UTF-8???  I don't see
> that documented in any manpage.  It also means that if you have a source
> character that's not in Unicode but is in whatever encoding wchar_t
> uses, it would not be handled by the libiconv implementation of iconv()
> because it uses Unicode as an intermediate form.

Yeah, this fallback seems bogus.  mbtowc & co. expect the source to be
in the current charset, so it's wrong to feed them Unicode data (even if
wchar_t *is* always Unicode internally).

[...]

James Chacon | 20 May 2010 20:30

Re: wchar_t encoding?

On Thu, May 20, 2010 at 8:01 AM, Paul Koning <Paul_Koning <at> dell.com> wrote:
>> > ...
>> > The trouble for NetBSD is that it asks iconv to translate to a character
>> > set named "wchar_t".  That means "whatever the encoding is for the
>> > wchar_t data type".  GNU libiconv supports that, so on platforms that
>> > use that library things are fine.
>
> I did some digging to see how libiconv implements that feature.
>
> If  __LIBC_ISO_10646__ is defined then it simply aliases this to an
> appropriate width Unicode (ucs2 or ucs4).  That applies to Linux, for
> example.
>
> If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
> then it uses that function.  More precisely, a conversion to "wchar_t"
> first converts to Unicode, which is then fed into mbrtowc to produce the
> wchar_t encoding.  mbrtowc knows about any locale issues...
>
> I guess that means that "multibyte" is Unicode, or UTF-8???  I don't see
> that documented in any manpage.  It also means that if you have a source
> character that's not in Unicode but is in whatever encoding wchar_t
> uses, it would not be handled by the libiconv implementation of iconv()
> because it uses Unicode as an intermediate form.
>

I think part of your problem here is mixing terminology. Unicode is
not an encoding, it's simply a definition of code points mapping to
specific glyphs. UTF-8/16/32/shift-JIS/etc are all "encodings".
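
To illustrate the distinction between a code point and its encoding,
here is a minimal sketch of one encoding, UTF-8, applied to a code point.
utf8_encode() is a hypothetical helper written for this note, not part
of any library mentioned in the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.  Returns the number of bytes
 * written (1 to 4), or 0 for surrogates and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The same code point U+20AC is the three bytes E2 82 AC in UTF-8 but a
single 32-bit value 0x000020AC in UTF-32; the code point stays the same
and only the byte representation differs.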
[...]

Valeriy E. Ushakov | 20 May 2010 21:07

Re: wchar_t encoding?

James Chacon <chacon.james <at> gmail.com> wrote:
> On Thu, May 20, 2010 at 8:01 AM, Paul Koning <Paul_Koning <at> dell.com> wrote:
>>> > ...
>>> > The trouble for NetBSD is that it asks iconv to translate to a character
>>> > set named "wchar_t".  That means "whatever the encoding is for the
>>> > wchar_t data type".  GNU libiconv supports that, so on platforms that
>>> > use that library things are fine.
>>
>> I did some digging to see how libiconv implements that feature.
>>
>> If  __LIBC_ISO_10646__ is defined then it simply aliases this to an
>> appropriate width Unicode (ucs2 or ucs4).  That applies to Linux, for
>> example.
>>
>> If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
>> then it uses that function.  More precisely, a conversion to "wchar_t"
>> first converts to Unicode, which is then fed into mbrtowc to produce the
>> wchar_t encoding.  mbrtowc knows about any locale issues...
>>
>> I guess that means that "multibyte" is Unicode, or UTF-8???  I don't see
>> that documented in any manpage.  It also means that if you have a source
>> character that's not in Unicode but is in whatever encoding wchar_t
>> uses, it would not be handled by the libiconv implementation of iconv()
>> because it uses Unicode as an intermediate form.
> 
> I think part of your problem here is mixing terminology. Unicode is
> not an encoding,

[...]

Neil Booth | 23 May 2010 05:32

Re: wchar_t encoding?

Paul Koning wrote:-

> Gents,
> 
> I'm working on a patch to gdb 7.1 to make it work on NetBSD.  The issue
> is that GDB 7 uses iconv to handle character strings, and uses wide
> chars internally so it can handle various non-ASCII scripts.
> 
> The trouble for NetBSD is that it asks iconv to translate to a character
> set named "wchar_t".  That means "whatever the encoding is for the
> wchar_t data type".  GNU libiconv supports that, so on platforms that
> use that library things are fine.
> 
> NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
> So I proposed a patch that substitutes what appears to be used instead,
> namely UCS-4 in platform native byte order (so "ucs-4le" on x86, for
> example).  This seems to work.
> 
> The trouble is that I'm getting pushback on the patch, because of
> concerns that the encoding used for wchar_t is not actually UCS-4.  In
> particular, there is this article:
> http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
> which says that on Solaris and FreeBSD the encoding of wchar_t is
> "undocumented and locale dependent".  (Ye gods!)
> 
> Now, NetBSD is not FreeBSD... so... what is the answer for NetBSD?  Is
> it like FreeBSD?  (If so, it would be good to fix that.)  Or is it a
> fixed encoding, and if so, is it indeed ucs-4?

NetBSD uses citrus.  From what I've figured out, there are 2 wchar_t
[...]

