Madhu | 23 May 21:37

utf16 branch

[I've not gotten an answer for this from personal mail so I'm posting
this to the list]

I've noticed a bunch of commits to a new utf16 branch which I assume
is expected to replace the 8bit CMUCL version when it starts working
well.

Is there any specific application for which this is being done?  Are
there any users on this list who will benefit from CMUCL using UTF16
for strings internally?  (and would those applications not be possible
with using a native 8bit lisp?)

Is this a sponsored effort?  Why UTF16?

What are the expected benefits to CMUCL from switching to an internal
unicode encoding?  Is there any benefit from the overhead?

--
Madhu

Raymond Toy (RT/EUS | 28 May 18:41

Re: utf16 branch

Madhu wrote:
> [I've not gotten an answer for this from personal mail so I'm posting
> this to the list]
> 
> I've noticed a bunch of commits to a new utf16 branch which I assume
> is expected to replace the 8bit CMUCL version when it starts working
> well.

I believe that is the intent.  But that is a long way off.  What is 
currently on the branch is support for wide characters and strings, and 
that's about all.  It's currently usable as long as you don't use wide 
characters.  I'm sure there are lots of bugs still lurking as well.

> 
> Is there any specific application for which this is being done?  Are
> there any users on this list who will benefit from CMUCL using UTF16
> for strings internally?  (and would those applications not be possible
> with using a native 8bit lisp?)

I do not know of specific applications, but unicode is one of the two 
most requested features (the other being a 64-bit version).  I don't 
think they'll benefit from utf16 per se, but I do assume they will 
benefit from unicode support.

> 
> Is this a sponsored effort?  Why UTF16?

If it is sponsored, please send some sponsoring my way. :-)

I think utf16 is a tradeoff.  It provides "most" of characters without 

Fred Gilham | 28 May 19:09

Small core broken?

So I tried my usual routine of rebuilding a snapshot to make a smaller 
core.  This makes a significant difference in size---a 25mb core becomes 
a 15mb core in the base build.

Anyway, it broke when I tried rebuilding with the result of the first 
rebuild.  It gave a bus error compiling code/macros.lisp.

I think this is worth fixing because the small core build is a good 
regression test for the byte compiler infrastructure.  Unless that is 
being deprecated?

Is there a way to trace compilation to figure out what form caused the 
compilation to break?

--

-- 
Fred Gilham                           gilham <at> ai.sri.com
Time is nature's way of keeping everything from happening
at once.  Unfortunately, it doesn't always work.

Raymond Toy (RT/EUS | 28 May 19:59

Re: Small core broken?

Fred Gilham wrote:
> So I tried my usual routine of rebuilding a snapshot to make a smaller 
> core.  This makes a significant difference in size---a 25mb core becomes 
> a 15mb core in the base build.
> 
> Anyway it broke when I tried rebuilding with the first rebuild.  It gave 
> a bus error compiling code/macros.lisp.

No usable backtrace?

> 
> I think this is worth fixing because the small core build is a good 
> regression test for the byte compiler infrastructure.  Unless that is 
> being deprecated?

Yes, it's a good test.  I don't think anyone is deprecating the byte 
compiler.

FWIW, I just did a small build on Linux using current CVS sources. 
Built ok.  (First time I've done that in a loooong time.)

Ray

Alex Goncharov | 29 May 04:19

Re: Small core broken?

,--- You/Fred (Wed, 28 May 2008 10:09:57 -0700) ----*
| So I tried my usual routine of rebuilding a snapshot to make a smaller 
| core.  This makes a significant difference in size---a 25mb core becomes 
| a 15mb core in the base build.

Forgive my ignorance -- how do you do that? (I don't see anything
obvious in `src/tools'.)

| Anyway it broke when I tried rebuilding with the first rebuild.  It gave 
| a bus error compiling code/macros.lisp.

On what platform (release included) are you trying this?

Thanks,

-- Alex -- alex-goncharov <at> comcast.net --

/*
 * Diplomacy is to do and say, the nastiest thing in the nicest way.
 * 
 * -- Balfour
 */

Raymond Toy | 29 May 05:05

Re: Small core broken?

Alex Goncharov wrote:
> ,--- You/Fred (Wed, 28 May 2008 10:09:57 -0700) ----*
> | So I tried my usual routine of rebuilding a snapshot to make a smaller 
> | core.  This makes a significant difference in size---a 25mb core becomes 
> | a 15mb core in the base build.
>
> Forgive my ignorance -- how do you do that? (I don't see anything
> obvious in `src/tools'.)
>   

Add :small to *features* when you build.   You can do this by adding it 
to setenv.lisp.

Ray

Madhu | 31 May 07:02

Re: utf16 branch


  |Date: Wed, 28 May 2008 12:41:18 -0400
  |From: "Raymond Toy (RT/EUS)" <raymond.toy <at> ericsson.com>

  |> Is there any specific application for which this is being done?
  |> Are there any users on this list who will benefit from CMUCL
  |> using UTF16 for strings internally?  (and would those
  |> applications not be possible with using a native 8bit lisp?)
  |
  |I do not know of specific applications, but unicode is one of the
  |two most requested features (the other being a 64-bit version).  I
  |don't think they'll benefit from utf16 per se, but do assume they
  |will benefit with unicode support.

I searched the cmucl-imp lists for `unicode' on gmane and did not come
across any specific requests.  Without a specific application and
without specific users in sight, there is the danger of this becoming
an "intellectual exercise": the parts that worked well will continue
working except with the performance pessimised, the parts for which the
system is being changed do not get tested and work poorly or not at
all.  [I'm making the claim based on user experience with unicode
ports of other software]

  |I think utf16 is a tradeoff.  It provides "most" of characters without 
  |bloating memory too much or affecting caches too much.  The resulting 
  |lisp.core is about 5% bigger.

  |> What are the expected benefits to CMUCL from switching to an internal
  |> unicode encoding?  Is there any benefit from the overhead?


Raymond Toy | 31 May 15:09

Re: utf16 branch

Madhu wrote:
>   |Date: Wed, 28 May 2008 12:41:18 -0400
>   |From: "Raymond Toy (RT/EUS)" <raymond.toy <at> ericsson.com>
>   
>   |> Is there any specific application for which this is being done?
>   |> Are there any users on this list who will benefit from CMUCL
>   |> using UTF16 for strings internally?  (and would those
>   |> applications not be possible with using a native 8bit lisp?)
>   |
>   |I do not know of specific applications, but unicode is one of the
>   |two most requested features (the other being a 64-bit version).  I
>   |don't think they'll benefit from utf16 per se, but do assume they
>   |will benefit with unicode support.
>
> I searched the cmucl-imp lists for `unicode' on gmane and did not come
> across any specific requests.  Without a specific application and
>

For whatever reason, I get most of these requests in private.  Granted, 
not a lot, but more than any other feature requests.

> working except with the performance pessimised, the parts for which the
> system is being changed do not get tested and work poorly or not at
> all.  [I'm making the claim based on user experience with unicode
> ports of other software]
>

I can guarantee there will be bugs. :-(

> * for bivalent network streams -- like http and https streams
>   (directly accessing fd-stream data)
> * with mmaped files [including backed files] which you can call read-char on
> * with the run-program interface, for sending input/output to `dup'ed fds

Carl Shapiro | 31 May 22:09

Re: utf16 branch

On Sat, May 31, 2008 at 6:09 AM, Raymond Toy <toy.raymond <at> gmail.com> wrote:
> I believe the only optimization with 8-bit strings was when calling a C
> function with a string argument.  Before, the address of the string could
> be sent.  Now we have to convert the 16-bit string to an 8-bit array.  In
> all other cases, I think some kind of copy had to happen.

Notably, if an application's running time is dominated by the conversion of characters between encodings there probably isn't much value in performing the translation in the first place.  (An obvious exception would be a program whose sole purpose is to transcode character data.)  A user can always read character data straight into byte arrays and pass those byte arrays to C by reference whenever a string is required.
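
The two paths contrasted above can be sketched roughly as follows.  This is
an illustration in Python, not CMUCL code, and the helper names are
hypothetical: data kept as raw bytes is handed on untouched, while text held
as 16-bit code units has to be narrowed into a fresh 8-bit buffer first.

```python
def pass_bytes_through(raw: bytes) -> bytes:
    """Byte-array path: no decoding, the very same object is handed on."""
    return raw


def narrow_utf16_to_8bit(utf16le: bytes) -> bytes:
    """16-bit path: decode the UTF-16 code units, then copy into 8-bit form.

    Raises UnicodeEncodeError if a character does not fit in 8 bits.
    """
    return utf16le.decode("utf-16-le").encode("latin-1")


raw = b"hello, world"
# What a 16-bit string representation would hold for the same text.
wide = raw.decode("latin-1").encode("utf-16-le")

print(len(raw), len(wide))                  # 12 24
print(pass_bytes_through(raw) is raw)       # True: nothing was copied
print(narrow_utf16_to_8bit(wide) == raw)    # True: same bytes, via a new copy
```

The sketch only shows where the extra copy arises; it says nothing about how
expensive that copy is in a real Lisp image.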

Madhu | 1 Jun 03:33

Re: utf16 branch

* "Carl Shapiro"
| Notably, if an application's running time is dominated by the
| conversion of characters between encodings there probably isn't much
| value in performing the translation in the first place.  (An obvious
| exception would be a program whose sole purpose is to transcode
| character data.)  A user can always read character data straight into
| byte arrays and pass those byte arrays to C by reference whenever a
| string is required.

This argument is irrelevant if you were answering the question of the
need for any translation in the first place.  Perhaps you are justifying
your design of switching the implementation to use widechar because the
costs will not affect running time of "most applications".  

The inescapable result of using your utf16 design is that the CMUCL
implementation will be doing copying/translation at every place where
strings are used (this sort of translation copying could be done
OPTIONALLY by the user at the user level).  In an alternative design it
may entirely be possible to avoid forcing the lisp to do the translation
everywhere, but instead give efficient mechanisms for translation to the
user.  Again I found the tradeoff arguments (which I noticed in a CCed
message) that you gave in support of your utf16 design over the
alternative to be bogus.

--
Madhu

Carl Shapiro | 2 Jun 19:13

Re: utf16 branch

On Sat, May 31, 2008 at 6:33 PM, Madhu <enometh <at> meer.net> wrote:
> The inescapable result of using your utf16 design is that the CMUCL
> implementation will be doing copying/translation at every place where
> strings are used (this sort of translation copying could be done

No, it is not.  If you do not want character translation, do not ask for it to be performed.  The resulting character codes may not be correctly recognized by the system but that is no different from what you have now.

> OPTIONALLY by the user at the user level).  In an alternative design it

This translation cannot be performed optionally.  The Lisp would have no knowledge of the user translated characters (and their properties) and the string and character functions would remain as they are today, ignorant of anything outside of the ASCII repertoire.  That may be fine for what you are doing and other special circumstances but it leaves the rest of the world with no support for their character set.

> may entirely be possible to avoid forcing the lisp to do the translation
> everywhere, but instead give efficient mechanisms for translation to the
> user.  Again I found the tradeoff arguments (which I noticed in a CCed

Nothing precludes this.

> message) that you gave in support of your utf16 design over the
> alternative to be bogus.

I am sorry you feel this way.

Madhu | 3 Jun 00:00

Re: utf16 branch

  |Date: Mon, 2 Jun 2008 10:13:13 -0700
  |From: "Carl Shapiro" <carl.shapiro <at> gmail.com>
  |On Sat, May 31, 2008 at 6:33 PM, Madhu <enometh <at> meer.net> wrote:
  |
  |> The inescapable result of using your utf16 design is that the CMUCL
  |>
  |implementation will be doing copying/translation at every place where
  |> strings are used (this sort of translation copying could be done
  |
  |No, it is not.  If you do not want character translation, do not ask for it
  |to be performed.  The resulting character codes may not be correctly
  |recognized by the system but that is no different from what you have now.

I am missing something.  What is the `system' in question?  

If the system is expecting some encoding, I, the user, will copy the
string and convert it before passing it to the system.

  |
  |> OPTIONALLY by the user at the user level).  In an alternative design it
  |
  |
  |This translation cannot be performed optionally.  The Lisp would have no
  |knowledge of the user translated characters (and their properties) and the
  |string and character functions would remain as they are today, ignorant of
  |anything outside of the ASCII repertoire.  That may be fine for what you are
  |doing and other special circumstances but it leaves the rest of the world
  |with no support for their character set.

Can you define in terms of the CL API precisely what this support entails?

--
Madhu

Madhu | 3 Jun 00:24

Re: utf16 branch


I suspect I already know the answers to these questions, so it may be
best to wait for the implementation instead.

  |From: Madhu <madhu <at> cs.unm.edu>
  |Date: Mon, 02 Jun 2008 16:00:03 -0600
  |
  |  |Date: Mon, 2 Jun 2008 10:13:13 -0700
  |  |From: "Carl Shapiro" <carl.shapiro <at> gmail.com>
  |  |On Sat, May 31, 2008 at 6:33 PM, Madhu <enometh <at> meer.net> wrote:
  |  |
  |  |> The inescapable result of using your utf16 design is that the CMUCL
  |  |>
  |  |implementation will be doing copying/translation at every place where
  |  |> strings are used (this sort of translation copying could be done
  |  |
  |  |No, it is not.  If you do not want character translation, do not
  |  |ask for it to be performed.  The resulting character codes may
  |  |not be correctly recognized by the system but that is no
  |  |different from what you have now.
  |
  |I am missing something.  What is the `system' in question?  
  |
  |If the system is expecting some encoding, I, the user, will copy the
  |string and convert it before passing it to the system.
  |
  |
  |  |
  |  |> OPTIONALLY by the user at the user level).  In an alternative design it
  |  |
  |  |
  |  |This translation cannot be performed optionally.  The Lisp would
  |  |have no knowledge of the user translated characters (and their
  |  |properties) and the string and character functions would remain
  |  |as they are today, ignorant of anything outside of the ASCII
  |  |repertoire.  That may be fine for what you are doing and other
  |  |special circumstances but it leaves the rest of the world with
  |  |no support for their character set.
  |
  |Can you define in terms of the CL API precisely what this support entails?
  |
  |--
  |Madhu
  |

Carl Shapiro | 3 Jun 00:42

Re: utf16 branch

On Mon, Jun 2, 2008 at 3:00 PM, Madhu <madhu <at> cs.unm.edu> wrote:
> I am missing something.  What is the `system' in question?

The Lisp system.

> Can you define in terms of the CL API precisely what this support entails?

See chapters 13 and 12 of the standard.

Carl Shapiro | 3 Jun 00:43

Re: utf16 branch

On Mon, Jun 2, 2008 at 3:42 PM, Carl Shapiro <carl.shapiro <at> gmail.com> wrote:
> See chapters 13 and 12 of the standard.

Sorry, 13 and 16. 

Madhu | 3 Jun 00:59

Re: utf16 branch


I had posed the question so I could elaborate my response in the
specific context of the reply, but followed up immediately because I
anticipated the style of reply :)

  |> See chapters 13 and 12 of the standard.
  |
  |Sorry, 13 and 16.
  |

I see nothing there that forces introduction of the overhead in the
lisp system, which I am concerned about and have voiced upthread.

In deciding to support unicode (or some other character set) you
should be aware that you have no control over what data the user is
expected to process, and if the user makes no use of the new facility,
there should be no penalty.  However there is a penalty in the current
approach which may be unacceptable for the original user.

Carl Shapiro | 3 Jun 03:03

Re: utf16 branch

On Mon, Jun 2, 2008 at 3:59 PM, Madhu <madhu <at> cs.unm.edu> wrote:
> In deciding to support unicode (or some other character set) you
> should be aware that you have no control over what data the user is
> expected to process, and if the user makes no use of the new facility,
> there should be no penalty.  However there is a penalty in the current
> approach which may be unacceptable for the original user.

Users have, and will always have, the option to read character data as raw 8-bit bytes and decode them into strings or byte arrays any way they see fit.  Moreover, you can and will continue to be able to create your own character stream classes which encapsulate an 8-bit byte stream and circumvent the default behavior of the default character stream.  Users who care about speed and perform no or only light character processing can leave data in 8-bit arrays and pass them to C without any overhead if that is the format their C code expects.

Earlier you had stated that one of your reasons for generally disliking Unicode was that your character set of choice, ISCII, was not fully accommodated by Unicode.  It appears that all of ISCII-91 modulo Annex G is part of Unicode today.  As of late April the characters from Annex G are in active technical ballot and have code points tentatively assigned to them.  If these are not formally part of Unicode by the time the next release of CMUCL is ready you can add these characters to the Unicode character database on your own and write a trivial ISCII-91 external format that marshals all of the character data, including Annex G, into the proposed code points.  The net result would be a Lisp with first class support for your character set.

Knowing only what you have told me about your code, it seems like everything you need to process ISCII will be in place.  You will have direct support for the characters in your data and not have to rely on the Lisp being ignorant of 8-bit character codes in the 128..255 range.  If translating your strings to and from UTF-16 turns out to be a performance problem there are many well understood ways to deal with such issues.
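
A table-driven external format of the kind described above might look
roughly like this.  It is a sketch in Python rather than Lisp, and the
mapping is a placeholder, NOT the real ISCII-91 table: bytes 128..255 are
simply parked in the Unicode Private Use Area so that the round trip can be
shown.

```python
# Hypothetical 8-bit -> code point table; NOT the real ISCII-91 mapping.
# Bytes 0..127 map to ASCII, bytes 128..255 to the Private Use Area.
DECODE_TABLE = [b if b < 0x80 else 0xE000 + (b - 0x80) for b in range(256)]
ENCODE_TABLE = {cp: b for b, cp in enumerate(DECODE_TABLE)}


def decode(octets: bytes) -> str:
    """Turn 8-bit external data into a string of code points."""
    return "".join(chr(DECODE_TABLE[b]) for b in octets)


def encode(text: str) -> bytes:
    """Turn a string back into 8-bit data; KeyError if a char is unmapped."""
    return bytes(ENCODE_TABLE[ord(ch)] for ch in text)


data = bytes([0x41, 0xA4, 0xFF])
print(encode(decode(data)) == data)   # True: the round trip is lossless
```

A real external format would replace the placeholder table with the
published code point assignments; the shape of the code stays the same.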

Madhu | 3 Jun 12:14

Re: utf16 branch


[I find myself repeating the same claims; I'll try to stop posting
 after this]

  |> In deciding to support unicode (or some other character set) you
  |> should be aware that you have no control over what data the user
  |> is expected to process, and if the user makes no use of the new
  |> facility, there should be no penalty.  However there is a penalty
  |> in the current approach which may be unacceptable for the original
  |> user.
  |>
  |
  |Users have, and will always have, the option to read character data
  |as raw 8-bit bytes and decode them into strings or byte arrays any
  |way they see fit.

The whole point of using 8 bit mechanisms is to avoid an additional
encoding/decoding and/or copy step. 

A lot of CMUCL code I've seen takes advantage of the current CMUCL
implementation, which is now slated to change, and assumes an
8-bit-wide base-char.

Now converting to string will double the storage.

I will not be able to use mmap-backed strings anymore.

  | Moreover, you can and will continue to be able to create your own
  |character stream classes which encapsulate an 8-bit byte stream and
  |circumvent the default behavior of the default character stream.
  |Users who

I will not be able to subclass fd-stream to provide efficient bivalent
streams like http or https streams.

I will not be able to call run-program and use `dup(2)'ed bivalent
input and output streams anymore, or expect the same speed from it.

All applications where CMUCL sits as a pipe stand to lose.

  |care about speed and perform no or only light character processing
  |can leave data in 8-bit arrays and pass them to C without any
  |overhead if that is the format their C code expects.

Not if my data has to come from a lisp string.  I have to do the
conversion in this case.  

  |Earlier you had stated that one of your reasons for generally
  |disliking Unicode was that your character set of choice, ISCII, was
  |not fully accommodated by Unicode.

When I referred to unicode as a bureaucracy I was talking more of the
regimes it imposes on the programmer.

But my primary concern is with the eager elimination of support for 8
bit strings, and all the advantages it entails, even though the
advantages may sound alien to many.

Especially when I think there is a reasonable implementation strategy
where unicode support can be added without changing the base-char
implementation.  [This would have been the path followed if there had
been an application to start with.]

There has been a UNICODE branch in CMUCL since the early 2000s, using
UTF-8.  I'm surprised none of the people claiming a unicode CMUCL
requirement checked that out.

Helmut Eller | 3 Jun 23:15

Re: utf16 branch

* Madhu [2008-06-03 12:14+0200] writes:

> But my primary concern is with the eager elimination of support for 8
> bit strings, and all the advantages it entails, even though the
> advantages may sound alien to many.

Isn't the plan to make the internal string representation a (system)
build time choice?  Like the choice of GC algorithm (on some
platforms).

> Especially when I think there is a reasonable implementation strategy
> where unicode support can be added without changing the base-char
> implementation.  [This would have been the path followed if there had
> been an application to start with.]

ISTR that Duane Rettig (one of the Allegro implementors) said that they
use a single string representation instead of more generic strings with
multiple representations (like SBCL seems to do) because that avoids a
lot of dispatching.  He also said that their customers can choose
between a 8 bit or 16 bit version and that no customer asked for a 32
bit version.  Offering both a 8bit and 16bit version doesn't sound
unreasonable to me.

Helmut.

Carl Shapiro | 4 Jun 00:43

Re: utf16 branch

On Tue, Jun 3, 2008 at 2:15 PM, Helmut Eller <heller <at> common-lisp.net> wrote:
> lot of dispatching.  He also said that their customers can choose
> between a 8 bit or 16 bit version and that no customer asked for a 32
> bit version.  Offering both a 8bit and 16bit version doesn't sound
> unreasonable to me.

Allegro has special infrastructure to support the two character widths.  For example, its file compiler will separately compile functions for both character widths if it recognizes that a function depends on inline functions that are specialized to one representation or another (char-code, schar, etc.).  The loader will select the correct function for the image it runs in.  This enables the user to load fasl files into the 8-bit and 16-bit images without having to consider an additional dimension of the fasl file provenance.  It would be possible, but would take additional effort, to add this to the CMUCL compiler.

On a vaguely related note, the difference in memory between an 8-bit and 16-bit CMUCL image is +5% which matches the reported difference between an 8-bit and 16-bit ACL image.  

Madhu | 4 Jun 02:35

Re: utf16 branch


  |Date: Tue, 3 Jun 2008 15:43:24 -0700
  |From: "Carl Shapiro" <carl.shapiro <at> gmail.com>
  |
  |On a vaguely related note, the difference in memory between an 8-bit and
  |16-bit CMUCL image is +5% which matches the reported difference between an
  |8-bit and 16-bit ACL image.
  |

Is this the disk size of the core image or ROOM?  On linux x86, I
noticed that core size was 18% larger, fasl files around 15% larger.

On the AMD64 linux at hand I notice I have 24973312 (for 19e)
vs. 28082176 (for the 2008-05-25 unicode-utf16-branch), a 12%
increase.
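
The arithmetic behind that figure can be checked with a few lines.  This is
a small Python sketch; the two sizes are the byte counts quoted above and
the function name is just for illustration.

```python
def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


core_19e = 24973312      # bytes, 19e core
core_utf16 = 28082176    # bytes, 2008-05-25 unicode-utf16-branch core

print(f"{percent_increase(core_19e, core_utf16):.1f}%")   # 12.4%
```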

I must caution against this metric -- it depends on the string usage
of the program, because you cannot control the use of strings in the
user's application and the size of the memory would depend directly on
that.  I would like to compare run time memory when I'm processing
gigabytes of character data with about half the data resident in
memory.
--
Madhu

Raymond Toy (RT/EUS | 18 Jun 17:01

Re: utf16 branch

Madhu wrote:
>   |Date: Tue, 3 Jun 2008 15:43:24 -0700
>   |From: "Carl Shapiro" <carl.shapiro <at> gmail.com>
>   |
>   |On a vaguely related note, the difference in memory between an 8-bit and
>   |16-bit CMUCL image is +5% which matches the reported difference between an
>   |8-bit and 16-bit ACL image.
>   |
> 
> Is this the disk size of the core image or ROOM?  On linux x86, I
> noticed that core size was 18% larger, fasl files around 15% larger.
> 
> On the AMD64 linux at hand I notice I have 24973312 (for 19e)
> vs. 28082176 (for the 2008-05-25 unicode-utf16-branch), a 12%
> increase.
> 
> I must caution against this metric -- it depends on the string usage
> of the program, because you cannot control the use of strings in the
> user's application and the size of the memory would depend directly on
> that.  I would like to compare run time memory when I'm processing
> gigabytes of character data with about half the data resident in
> memory.

Of course, if all you're processing is strings, then the 5% estimate 
will be way off.
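
A back-of-envelope model makes the point concrete.  It is an assumption,
not a measurement: suppose only character storage widens from 8 to 16 bits
and everything else in the image stays the same size.  Overall growth then
equals the share of the image taken up by string data.  The sketch is in
Python and the function name is made up for illustration.

```python
def image_growth(string_fraction: float,
                 old_width: int = 8, new_width: int = 16) -> float:
    """Fractional growth of an image when only character storage widens.

    string_fraction is the share of the 8-bit image occupied by string
    characters (an input to the model, not something measured here).
    """
    if not 0.0 <= string_fraction <= 1.0:
        raise ValueError("string_fraction must be between 0 and 1")
    return string_fraction * (new_width / old_width - 1)


for fraction in (0.05, 0.5, 1.0):
    print(f"{fraction:.0%} strings -> image grows by "
          f"{image_growth(fraction):.0%}")
```

Under this model a 5% growth figure corresponds to an image that is about
5% string data, while a workload that is almost entirely strings roughly
doubles.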

In any case, the 8-bit version won't be going away any time soon.  It 
will be quite a while before the unicode version is complete enough to 
be the default.

Maintaining an 8-bit version isn't really all that difficult, but it 
does mean twice as much work, at least for making sure both 8-bit and 
unicode versions compile.

Ray

