Mauricio | 2 Dec 13:41

Suggestion for a basic Utf8 type.

Hi,

I would like to suggest a new basic type in Haskell. What if we had
something like this (with any other quoting character):

«Je ne parle pas français. Meu nome é Maurício. ¿Hablas español?»

This would be of type Utf8. I think now it is not a bad idea,
since Haskell source code is supposed to be utf-8. The internal
representation of this datatype would be a null-terminated utf-8
byte vector. No standard operations would be defined on that type,
i.e., it would be a “communication standard” between everybody,
but module writers could develop different basic usage based on
operations on them using Foreign. (I think it would be difficult
to set default operations, since there are so many things you can
do with utf-8.)
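
A minimal sketch of what such a type could look like with GHC's base
library. The names Utf8, toUtf8, withUtf8, byteLength and fromUtf8 are
hypothetical, and toUtf8 only stands in for the proposed «...» literal;
nothing here is an existing standard interface.

```haskell
import Foreign.C.String (CString)
import Foreign.C.Types (CChar)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree)
import Foreign.Marshal.Array (lengthArray0)
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (utf8)

-- Hypothetical opaque type: a pointer to a NUL-terminated UTF-8 buffer.
newtype Utf8 = Utf8 (ForeignPtr CChar)

-- Stand-in for the proposed literal: encode a String as UTF-8 into a
-- malloc'd, NUL-terminated buffer that is freed by a finalizer.
toUtf8 :: String -> IO Utf8
toUtf8 s = do
  p <- GF.newCString utf8 s
  Utf8 <$> newForeignPtr finalizerFree p

-- The only "standard" access: hand the raw pointer to a module author.
withUtf8 :: Utf8 -> (CString -> IO a) -> IO a
withUtf8 (Utf8 fp) = withForeignPtr fp

-- Example operation built on Foreign: count bytes before the NUL.
byteLength :: Utf8 -> IO Int
byteLength u = withUtf8 u (lengthArray0 0)

-- Example operation built on Foreign: decode back to a String.
fromUtf8 :: Utf8 -> IO String
fromUtf8 u = withUtf8 u (GF.peekCString utf8)
```

One consequence of the NUL-terminated representation in this sketch is
that text containing U+0000 cannot be stored whole; a length-prefixed
buffer would avoid that.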

Pros:

  * There would be no doubt you can use utf-8 when using
    this, since there's no conversion involved.

  * Cleaner code on utf8 operations, maybe. There are many utf8
    modules today with different goals in mind; I think it would be
    nice if they could share this common basic type and a common
    underlying implementation.

Cons:

  * Probably, many. I have no deep understanding of Haskell.

Jason Dusek | 2 Dec 17:16

Re: Suggestion for a basic Utf8 type.

  Unlike native Strings, this would have the potential for a
  runtime parse error at every character.
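
  To make the point concrete, here is a minimal sketch of a strict
  UTF-8 decoder over a list of bytes. The name decodeUtf8 and the
  choice to report the byte offset of the first bad sequence are
  assumptions made for illustration; real libraries differ in how
  they report or replace malformed input.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode UTF-8 bytes, or return the offset of the first malformed
-- sequence. Every character is a point where decoding can fail.
decodeUtf8 :: [Word8] -> Either Int String
decodeUtf8 = go 0
  where
    go _ [] = Right []
    go i (b : bs)
      | b < 0x80           = (chr (fromIntegral b) :) <$> go (i + 1) bs
      | b .&. 0xE0 == 0xC0 = multi i 1 (fromIntegral b .&. 0x1F) 0x80 bs
      | b .&. 0xF0 == 0xE0 = multi i 2 (fromIntegral b .&. 0x0F) 0x800 bs
      | b .&. 0xF8 == 0xF0 = multi i 3 (fromIntegral b .&. 0x07) 0x10000 bs
      | otherwise          = Left i  -- stray continuation or invalid lead byte

    -- Read n continuation bytes and reject truncated, overlong,
    -- surrogate and out-of-range encodings.
    multi i n acc0 minCp bs =
      let (conts, rest) = splitAt n bs
          cp = foldl (\a c -> (a `shiftL` 6) .|. (fromIntegral c .&. 0x3F))
                     acc0 conts
          ok = length conts == n
            && all (\c -> c .&. 0xC0 == 0x80) conts
            && cp >= minCp
            && cp <= 0x10FFFF
            && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then (chr cp :) <$> go (i + 1 + n) rest else Left i
```

  A [Char] value, by contrast, has already been decoded, so no such
  failure can occur while traversing it.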

--
_jsn

Bayley, Alistair | 2 Dec 14:32

RE: Suggestion for a basic Utf8 type.

> From: haskell-cafe-bounces <at> haskell.org 
> [mailto:haskell-cafe-bounces <at> haskell.org] On Behalf Of Mauricio
> 
> I would like to suggest a new basic type in Haskell. What if we had
> something like this (with any other quoting character):
>
> «Je ne parle pas français. Meu nome é Maurício. ¿Hablas español?»
>
> This would be of type Utf8. I think now it is not a bad idea,
> since Haskell source code is supposed to be utf-8. The internal
> representation of this datatype would be a null-terminated utf-8
> byte vector. ...

Stream fusion on Haskell Unicode strings - Tom Harper
http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf

I don't know what its status is. The original implementation used UTF-16 rather than UTF-8.

Alistair

Mauricio | 2 Dec 17:50

Re: Suggestion for a basic Utf8 type.

 >> I would like to suggest a new basic type in Haskell. What if we had
 >> something like this (with any other quoting character):
 >>
 >> «Je ne parle pas français. (...) ¿Hablas español?»
 >>
 >> This would be of type Utf8. I think now it is not a bad idea,
 >> since Haskell source code is supposed to be utf-8. The internal
 >> representation of this datatype would be a null-terminated utf-8
 >> byte vector. ...

 > Stream fusion on Haskell Unicode strings - Tom Harper
 > http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf
 > (...)

Actually, what I suggest is quite different, in points I see as
worthwhile:

* His focus is on speed and memory; my goal is more elegant and
   safe code.

* His approach consolidates Prelude. My approach allows complete
   elimination of Prelude. If we had a Utf8 basic type, we could
   have modules with many different basic types, and many different
   ideas on how to 'read «something» :: <sometype>'. In the future,
   we could write a module to implement some sort of not yet
   invented numeral type, which another module would allow to be
   read from Chinese kanji.

* He wants to preserve many properties of [Char]. I think the Utf8
   type should have no standard properties at all. See next

Jason Dusek | 2 Dec 18:27

Re: Re: Suggestion for a basic Utf8 type.

  So this proposal is more than a UTF8 type, since it
  encompasses a move away from text as lists. What interfaces
  would we have to text in this proposal?

--
_jsn

Mauricio | 2 Dec 19:17

Re: Suggestion for a basic Utf8 type.

 >   So this proposal is more than a UTF8 type, since it
 >   encompasses a move away from text as lists. What interfaces
 >   would we have to text in this proposal?
 >

Normal users would import modules with specific interfaces, like
functions or instances.

One possible such module would be Streams, like those suggested in
the previous article. Others could offer functionality I don't
know of -- maybe there's some useful interface for Japanese or
Greek users that we (non-Japanese, non-Greek) don't imagine.

My first attempt would be PortugueseText, with a type that could
only be built from Portuguese "primitives" or read from Utf8 with
possible errors, and converted to Utf8, of course. That type would
always convert to Utf8 with correct diacritics, and sort according
to the latest Portuguese agreements. Mapping over syllables could
be allowed, which makes sense in syllabic languages. Quotes,
questions, parentheses etc. could be done with functions like
'quote «Ser ou não ser»'.
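
A rough sketch of how such a module could be shaped. Everything in it
is hypothetical: the FromUtf8 and ToUtf8 classes, the `accepted` rule
(which is only a crude character whitelist, not real Portuguese
orthography), and the Utf8 stand-in, which wraps a String here purely
so the example is self-contained.

```haskell
import Data.Char (isAlpha, isAscii, isDigit, isPunctuation, isSpace)

-- Stand-in for the proposed basic type. It wraps decoded text here;
-- the proposal would make it a raw UTF-8 buffer instead.
newtype Utf8 = Utf8 String
  deriving (Eq, Show)

-- Hypothetical classes that interface modules could agree on.
class FromUtf8 a where
  readUtf8 :: Utf8 -> Either String a

class ToUtf8 a where
  toUtf8 :: a -> Utf8

-- In a real module the constructor would not be exported, so values
-- could only be built through readUtf8 or the module's own primitives.
newtype PortugueseText = PortugueseText String
  deriving (Eq, Show)

-- Illustrative rule only: ASCII letters, digits, spaces and
-- punctuation, plus accented Portuguese letters and the « » quotes.
accepted :: Char -> Bool
accepted c
  | isAscii c = isAlpha c || isDigit c || isSpace c || isPunctuation c
  | otherwise = c `elem` "áàâãéêíóôõúçÁÀÂÃÉÊÍÓÔÕÚÇ«»"

-- Reading can fail, and the error names the first rejected character.
instance FromUtf8 PortugueseText where
  readUtf8 (Utf8 s) =
    case filter (not . accepted) s of
      []      -> Right (PortugueseText s)
      (c : _) -> Left ("character not accepted: " ++ [c])

instance ToUtf8 PortugueseText where
  toUtf8 (PortugueseText s) = Utf8 s

-- Wrap a text in quotation marks, as in 'quote «Ser ou não ser»'.
quote :: PortugueseText -> PortugueseText
quote (PortugueseText s) = PortugueseText ("«" ++ s ++ "»")
```

With these definitions, reading «Ser ou não ser» succeeds, while
reading «¿Hablas español?» is rejected at the first character outside
the whitelist.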

Another could be SimpleEnglishTextAsList, which could offer something
close to what we have today, with functions for uppercasing,
lowercasing and well-behaved (unambiguous) sorting.
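
As a small illustration of that second idea, the sketch below keeps
the list-of-Char representation but hides it behind a newtype. The
names SimpleEnglishText, fromAscii, upper, lower and sortTexts are made
up for this example, and restricting input to ASCII is an assumption
about what "simple English" might mean.

```haskell
import Data.Char (isAscii, toLower, toUpper)
import Data.List (sortBy)
import Data.Ord (comparing)

-- Plain list-based text, restricted to ASCII (illustrative choice).
newtype SimpleEnglishText = SimpleEnglishText String
  deriving (Eq, Show)

-- Build a value only if every character is ASCII.
fromAscii :: String -> Maybe SimpleEnglishText
fromAscii s
  | all isAscii s = Just (SimpleEnglishText s)
  | otherwise     = Nothing

upper, lower :: SimpleEnglishText -> SimpleEnglishText
upper (SimpleEnglishText s) = SimpleEnglishText (map toUpper s)
lower (SimpleEnglishText s) = SimpleEnglishText (map toLower s)

-- Case-insensitive order, with the original spelling as a tie-breaker
-- so that the result does not depend on the input order.
sortTexts :: [SimpleEnglishText] -> [SimpleEnglishText]
sortTexts = sortBy (comparing key)
  where
    key (SimpleEnglishText s) = (map toLower s, s)
```

The tie-breaker is what makes the ordering total: "Apple" and "apple"
compare equal case-insensitively, so the raw spelling decides.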

Writers of very basic modules would have to touch Utf8 using
Foreign. So, maybe the only standard interface would be a
(ForeignPtr?) pointer to a null-terminated block of memory. This
would make Foreign a new Prelude, maybe. In the end, this is just