Andrew Pennebaker | 10 Nov 03:15 2012

Motion to unify all the string data types

Frequently when I'm coding in Haskell, the crux of my problem is converting between all the stupid string formats.


You've got String, ByteString, Lazy ByteString, Text, [Word], and on and on... I constantly have to look up how to convert between them, and the OverloadedStrings GHC extension doesn't work, and sometimes ByteString.pack/unpack doesn't work, because it deals in [Word8], not [Char]. AAAAAAAAAAAAAAAAAAAH!!!
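
To make the [Word8] versus [Char] mismatch concrete, here is a minimal sketch, assuming only the bytestring package. The names rawBytes, asChars and literal are made up for illustration. Data.ByteString works in bytes, Data.ByteString.Char8 offers a Char-based view of the same type (truncating each Char to 8 bits), and OverloadedStrings lets a literal be a ByteString once the type is known.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Data.ByteString packs and unpacks bytes, so String literals do not fit.
rawBytes :: [Word8]
rawBytes = B.unpack (B.pack [104, 105])

-- Data.ByteString.Char8 exposes the same ByteString type through Char.
-- Each Char is truncated to 8 bits, so this is only safe for ASCII/Latin-1.
asChars :: String
asChars = BC.unpack (BC.pack "hi")

-- With OverloadedStrings, a literal becomes a ByteString when annotated.
literal :: B.ByteString
literal = "hi"

main :: IO ()
main = do
  print rawBytes                    -- [104,105]
  putStrLn asChars                  -- hi
  print (literal == BC.pack "hi")   -- True
```

The type annotation on literal matters: without a known target type, an overloaded literal can be ambiguous, which may be one reason the extension appears not to work.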

Haskell is a wonderful playground for experimentation. I've started to notice that many Hackage libraries are simply instances of typeclasses designed a while ago, and their underlying implementations are free to play around with various optimizations... But they ideally all expose the same interface through typeclasses.

Can we do the same with String? Can we pick a good compromise of lazy vs strict, flexible vs fast, and all use the same data structure? My vote is for type String = [Char], but I'm willing to switch to another data structure, just as long as it's consistently used.

--
Cheers,

Andrew Pennebaker
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Johan Tibell | 10 Nov 04:00 2012

Re: Motion to unify all the string data types

Hi Andrew,

On Fri, Nov 9, 2012 at 6:15 PM, Andrew Pennebaker <andrew.pennebaker <at> gmail.com> wrote:
> Frequently when I'm coding in Haskell, the crux of my problem is converting between all the stupid string formats.
>
> You've got String, ByteString, Lazy ByteString, Text, [Word], and on and on... I constantly have to look up how to convert between them, and the OverloadedStrings GHC extension doesn't work, and sometimes ByteString.pack/unpack doesn't work, because it deals in [Word8], not [Char]. AAAAAAAAAAAAAAAAAAAH!!!
>
> Haskell is a wonderful playground for experimentation. I've started to notice that many Hackage libraries are simply instances of typeclasses designed a while ago, and their underlying implementations are free to play around with various optimizations... But they ideally all expose the same interface through typeclasses.
>
> Can we do the same with String? Can we pick a good compromise of lazy vs strict, flexible vs fast, and all use the same data structure? My vote is for type String = [Char], but I'm willing to switch to another data structure, just as long as it's consistently used.

tl;dr: Use strict Text and ByteStrings.

We need at least two string types, one for byte strings and one for Unicode strings, as these are two semantically different concepts. You see that most modern languages use two types (e.g. str and unicode in Python). For Unicode strings, String is not a good candidate; it's slow, uses a lot of memory, doesn't hide its representation [1], and finally, it encourages people to do the wrong thing from a Unicode perspective [2].

As a community we should primarily use strict ByteStrings and Texts. There are uses for the lazy variants (i.e. they are sometimes more efficient), but in general the strict versions should be preferred. Choosing to use these two types can sometimes be a bit frustrating, as lots of code (e.g. the base package) uses Strings. But if we don't start using them the pain will never end. One of the main pain points is that the I/O layer uses Strings, which is both inconvenient and wrong (e.g. a socket returns bytes, not Unicode code points, yet the recv function returns a String). We really need to create a more sane I/O layer.

If you use ByteString and Text, you shouldn't see calls to pack/unpack in your code (except if you want to interact with legacy code), as the correct way to go between the two is via the encode and decode functions in the text package (for example encodeUtf8 and decodeUtf8 in Data.Text.Encoding).
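
As a minimal sketch of going between the two types, assuming UTF-8 is the wire encoding (other encodings have their own functions in Data.Text.Encoding): encodeUtf8 is total, while decodeUtf8' returns an Either so that invalid input is reported instead of throwing. The helper names encode, decode, roundTrips and rejects are made up for illustration.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Text -> bytes: always succeeds.
encode :: T.Text -> B.ByteString
encode = TE.encodeUtf8

-- bytes -> Text: can fail, because not every byte sequence is valid UTF-8.
decode :: B.ByteString -> Either UnicodeException T.Text
decode = TE.decodeUtf8'

-- Encoding and then decoding gives back the original Text.
roundTrips :: T.Text -> Bool
roundTrips t = either (const False) (== t) (decode (encode t))

-- True when the bytes are not valid UTF-8.
rejects :: B.ByteString -> Bool
rejects = either (const True) (const False) . decode

main :: IO ()
main = do
  let t = T.pack "sm\246rg\229s"            -- "smörgås", 7 code points
  print (T.length t, B.length (encode t))   -- (7,9)
  print (roundTrips t)                      -- True
  print (rejects (B.pack [0xff]))           -- True
```

The two lengths differ because ö and å each take two bytes in UTF-8, which is exactly why bytes and text are kept as separate types.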

As for type classes, I don't think we use them enough. Perhaps because Haskell wasn't developed as an engineering language, some good software engineering principles (code against an interface, not a concrete implementation) aren't used in our base libraries. One specific example is the lack of a sequence abstraction/type class that all the string, list, and vector types could implement. Right now all these types try to implement a compatible interface (i.e. the traditional list interface), without a language mechanism to express that this is what they do.

1. If String had been designed as an abstract type, we could simply have switched its implementation for a more efficient one, and we would not have had to create a new Text type.

2. By having the primary interface of a Unicode data type be a sequence, we encourage users to work on strings element-wise, which can lead to errors as Unicode code points don't correspond well to the human concept of a character (for example, the Swedish ä character can be represented using either one or two code points). A sequence view is sometimes useful, if you're implementing more high-level transformations, but often you should use functions that operate on the whole string, such as toUpper :: Text -> Text.
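
A small sketch of both points in footnote 2, assuming only the text package (which by itself does not normalize Unicode). The names precomposed, decomposed and elementWise are made up for illustration. The same visible ä has two representations of different length, and an element-wise toUpper misses a case mapping that the whole-string toUpper handles.

```haskell
module Main where

import qualified Data.Char as C
import qualified Data.Text as T

-- U+00E4: ä as a single precomposed code point.
precomposed :: T.Text
precomposed = T.pack "\228"

-- U+0061 U+0308: a followed by a combining diaeresis.
decomposed :: T.Text
decomposed = T.pack "a\776"

-- Element-wise upper-casing, one code point at a time.
elementWise :: T.Text -> T.Text
elementWise = T.map C.toUpper

main :: IO ()
main = do
  print (T.length precomposed, T.length decomposed)  -- (1,2)
  print (precomposed == decomposed)                  -- False
  let street = T.pack "stra\223e"                    -- "straße"
  -- Char-by-Char mapping cannot turn one ß into two letters.
  print (T.length (elementWise street))              -- 6
  -- The whole-string function can: ß becomes SS.
  print (T.length (T.toUpper street))                -- 7
```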

Cheers,
  Johan

Roman Cheplyaka | 10 Nov 07:22 2012

Re: Motion to unify all the string data types

* Johan Tibell <johan.tibell <at> gmail.com> [2012-11-09 19:00:04-0800]
> As a community we should primarily use strict ByteStrings and Texts. There
> are uses for the lazy variants (i.e. they are sometimes more efficient),
> but in general the strict versions should be preferred.

I'm fairly surprised by this advice.

I think that lazy BS/Text are a much safer default.

If there's not much text it wouldn't matter anyway, but for large
amounts using strict BS/Text would disable incremental
producing/consuming (except when you're using some kind of an iteratee
library).

Can you explain your reasoning?

Roman
Johan Tibell | 10 Nov 17:57 2012

Re: Motion to unify all the string data types

On Fri, Nov 9, 2012 at 10:22 PM, Roman Cheplyaka <roma <at> ro-che.info> wrote:

> * Johan Tibell <johan.tibell <at> gmail.com> [2012-11-09 19:00:04-0800]
> > As a community we should primarily use strict ByteStrings and Texts. There
> > are uses for the lazy variants (i.e. they are sometimes more efficient),
> > but in general the strict versions should be preferred.
>
> I'm fairly surprised by this advice.
>
> I think that lazy BS/Text are a much safer default.
>
> If there's not much text it wouldn't matter anyway, but for large
> amounts using strict BS/Text would disable incremental
> producing/consuming (except when you're using some kind of an iteratee
> library).
>
> Can you explain your reasoning?

It better communicates intent. A lazy byte string, for example, can be used for two separate things:

 * to model a stream of bytes, or
 * to avoid costs due to concatenating strings.

By using a strict byte string you make it clear that you're not trying to do the former (at some potential cost due to the latter). When you want to do the former, it should be clear to the consumer that he/she had better consume the string in an incremental manner, so as to preserve laziness and avoid space leaks (caused by forcing the whole string).
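
The difference in intent can be sketched as follows, assuming the bytestring package. The names stream, totalLength and whole are made up for illustration. A lazy ByteString is a lazy list of strict chunks, so a consumer can walk it chunk by chunk, whereas toStrict forces everything into one buffer.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- A lazy ByteString modelling a stream: three strict chunks.
stream :: BL.ByteString
stream = BL.fromChunks (map BC.pack ["chunk one, ", "chunk two, ", "chunk three"])

-- Incremental consumption: only one chunk needs to be looked at at a time.
totalLength :: BL.ByteString -> Int
totalLength = sum . map B.length . BL.toChunks

-- Forcing the whole stream into a single strict buffer.
whole :: B.ByteString
whole = BL.toStrict stream

main :: IO ()
main = do
  print (length (BL.toChunks stream))  -- 3
  print (totalLength stream)           -- 33
  print (B.length whole)               -- 33
```

A function that takes a strict ByteString makes no promise about streaming, which is the signal to callers described above.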

-- Johan

Bas van Dijk | 12 Nov 08:52 2012

Re: Motion to unify all the string data types

On 10 November 2012 17:57, Johan Tibell <johan.tibell <at> gmail.com> wrote:
> It better communicates intent. A lazy byte string, for example, can be used
> for two separate things:
>
>  * to model a stream of bytes, or
>  * to avoid costs due to concatenating strings.
>
> By using a strict byte string you make it clear that you're not trying to do
> the former (at some potential cost due to the latter). When you want to do
> the former, it should be clear to the consumer that he/she had better consume
> the string in an incremental manner, so as to preserve laziness and avoid
> space leaks (caused by forcing the whole string).

Good advice.

And when you want to do the latter you should use a Builder[1] (or [2]
if you're working with text).
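
A minimal sketch of the Builder approach, assuming the bytestring and text packages at versions that ship Data.ByteString.Builder and Data.Text.Lazy.Builder. The functions csvLine, render and joinText are made up for illustration. Pieces are combined as Builders and materialised once at the end, rather than concatenating strings step by step.

```haskell
module Main where

import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.List (intersperse)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB

-- Build a comma-separated line of decimal numbers as bytes.
csvLine :: [Int] -> BB.Builder
csvLine xs = mconcat (intersperse (BB.char7 ',') (map BB.intDec xs))

-- Run the byte Builder once and view the result as a String.
render :: BB.Builder -> String
render = BLC.unpack . BB.toLazyByteString

-- The same idea for text: combine many pieces, then produce lazy Text.
joinText :: [String] -> TL.Text
joinText = TB.toLazyText . mconcat . map TB.fromString

main :: IO ()
main = do
  putStrLn (render (csvLine [1, 2, 3]))                       -- 1,2,3
  putStrLn (TL.unpack (joinText ["Hello", ", ", "world"]))    -- Hello, world
```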

Bas

[1] http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html
[2] http://hackage.haskell.org/packages/archive/text/0.11.2.3/doc/html/Data-Text-Lazy-Builder.html
Gábor Lehel | 10 Nov 13:37 2012

Re: Motion to unify all the string data types

On Sat, Nov 10, 2012 at 4:00 AM, Johan Tibell <johan.tibell <at> gmail.com> wrote:
> As for type classes, I don't think we use them enough. Perhaps because
> Haskell wasn't developed as an engineering language, some good software
> engineering principles (code against an interface, not a concrete
> implementation) aren't used in out base libraries. One specific example is
> the lack of a sequence abstraction/type class, that all the string, list,
> and vector types could implement. Right now all these types try to implement
> a compatible interface (i.e. the traditional list interface), without a
> language mechanism to express that this is what they do.

I think the challenge is designing an abstraction that everyone is
comfortable with. If you just make everything a class method
(ListLike), it's ugly. If you don't, how do you figure out what goes
in the class and what gets implemented on top of it? Is there any
principled reason for it, or is it just ad hoc? How do you make sure
that none of the implementations suffers a performance decrease? What
about sequential vs. random access (list vs. array) issues? Should an
interface be implemented if it's semantically reasonable, but slow? If
you treat everything as a uniform sequence, doesn't that bring back
the Unicode issues again? (And can you make it work for all of
Text/ByteString (kind *), boxed Vectors and lists (* -> *), and
unboxed vectors (* -> * with a constraint)? What about operations that
change the element type? Surely it's possible with TypeFamilies,
ConstraintKinds, and PolyKinds all available, but I'm not sure if it's
obvious. Can it go into the Prelude if it uses extensions? Should it
also support other containers, like Maps? And so on.)

So my impression is that the reason the problem hasn't been solved yet
is that it's hard. We do have some useful things: Functor, Foldable,
Traversable, and the classes in Data.Key[1], but for starters none of
them can be implemented by Text and ByteString, so that brings us back
to square one.
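
One way the TypeFamilies route could look, as a rough sketch only: the class Sequence, its methods and the function rev are hypothetical names, not an existing library API. An associated type gives each container its element type, so monomorphic types such as Text and ByteString (kind *) fit alongside lists.

```haskell
{-# LANGUAGE TypeFamilies #-}
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Word (Word8)

-- Hypothetical sequence class; Elem names the element type per container.
class Sequence s where
  type Elem s
  fromL :: [Elem s] -> s
  toL   :: s -> [Elem s]
  len   :: s -> Int
  len = length . toL          -- generic default, possibly slow

instance Sequence [a] where
  type Elem [a] = a
  fromL = id
  toL   = id

instance Sequence T.Text where
  type Elem T.Text = Char
  fromL = T.pack
  toL   = T.unpack
  len   = T.length            -- native implementation

instance Sequence B.ByteString where
  type Elem B.ByteString = Word8
  fromL = B.pack
  toL   = B.unpack
  len   = B.length

-- Written once against the interface, usable at every instance.
rev :: Sequence s => s -> s
rev = fromL . reverse . toL

main :: IO ()
main = do
  print (T.unpack (rev (T.pack "abc")))     -- "cba"
  print (B.unpack (rev (B.pack [1, 2, 3]))) -- [3,2,1]
  print (len [True, False])                 -- 2
```

The sketch also shows the open questions from above: rev goes through an intermediate list, which is exactly the kind of performance cost a real design would have to avoid, and treating Text as a sequence of Char brings back the Unicode concerns.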

But a constructive idea: what if strict Text and ByteString were both
synonyms for unboxed Vectors (already available in ByteString's
case[2])? What if, for lazy Text and ByteString, we either had lazy
Vectors to make them synonyms of, or a 'data Lazy v' which made a lazy
chunked sequence out of any underlying strict Vector-ish type? That
would cut down on the number of types, which is a good thing in
itself, and it would suggest an obvious way to abstract over them: the
existing Functor/Foldable/Traversable/Data.Key classes extended with
an associated constraint. I'm not sure how many of the use cases that
would cover, but certainly a lot more than we have now. It wouldn't
solve every one of the questions above, but it answers many of them,
and it seems like a good compromise. The big drawbacks I can see are
that (a) it would be a *lot* of work, especially if we want to be
completely uncompromising on performance, and (b) I'm not sure how
pinned arrays and interoperation with C would be handled without
making it complicated again. (Though I suppose we could just punt and
have ByteString be a synonym for Vector.Storable (pinned) and Text for
Vector.Unboxed (not pinned) to mirror the current situation. Or maybe
we could have a pinArray# primop?)
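
The 'data Lazy v' idea could be sketched like this, as an illustration only and not a worked-out design: a lazy list of strict chunks, parameterised over the chunk type. The names Lazy, fromChunks, toChunks, append and toStrict are chosen here by analogy with the lazy ByteString API, and Text stands in for "any strict Vector-ish type".

```haskell
module Main where

import qualified Data.Text as T

-- A lazy chunked sequence over any strict chunk type v.
newtype Lazy v = Lazy [v]

fromChunks :: [v] -> Lazy v
fromChunks = Lazy

toChunks :: Lazy v -> [v]
toChunks (Lazy cs) = cs

-- Appending only links the chunk lists; no chunk is copied.
append :: Lazy v -> Lazy v -> Lazy v
append (Lazy xs) (Lazy ys) = Lazy (xs ++ ys)

-- Collapsing to one strict value needs a way to concatenate chunks.
toStrict :: Monoid v => Lazy v -> v
toStrict = mconcat . toChunks

-- Because the chunk list is lazy, even an endless stream is fine
-- as long as it is consumed incrementally.
endless :: Lazy T.Text
endless = fromChunks (cycle [T.pack "ab", T.pack "cd"])

main :: IO ()
main = do
  let short = append (fromChunks [T.pack "ab"]) (fromChunks [T.pack "cd"])
  print (toStrict short == T.pack "abcd")           -- True
  print (map T.unpack (take 3 (toChunks endless)))  -- ["ab","cd","ab"]
```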

Anyway, if I'm blue-sky dreaming, that's what looks appealing to me.

[1] http://hackage.haskell.org/packages/archive/keys/3.0.1/doc/html/Data-Key.html
[2] http://hackage.haskell.org/package/vector-bytestring

--

-- 
Your ship was destroyed in a monadic eruption.
Alberto G. Corona | 10 Nov 15:16 2012

Re: Motion to unify all the string data types

Andrew:

There is a ListLike package, which provides this nice abstraction, but I don't know if it is ready and/or complete enough for serious usage.
I'm thinking of using it for the same reasons.

Does anyone have experience with it to share?


--
Alberto.
Francesco Mazzoli | 10 Nov 16:41 2012

Re: Motion to unify all the string data types

At Sat, 10 Nov 2012 15:16:30 +0100,
Alberto G. Corona  wrote:
> There is a ListLike package, which provides this nice abstraction, but I
> don't know if it is ready and/or complete enough for serious usage.  I'm
> thinking of using it for the same reasons.
>
> Does anyone have experience with it to share?

I've used it in the past and it's solid; it's been around for a while and the
original author knows his Haskell.

Things I don't like:

* The classes are huge:
  <http://hackage.haskell.org/packages/archive/ListLike/3.1.6/doc/html/Data-ListLike.html#t:ListLike>.
  I'd much rather have all those utility functions outside the type
  class, for no particular reason other than the ugliness of the type class.

* It defines its own wrappers for `ByteString':
  <http://hackage.haskell.org/packages/archive/ListLike/3.1.6/doc/html/Data-ListLike.html#t:CharString>.

* It doesn't have instances for `Text', you have to resort to the
  `listlike-instances' package.

In any case, I think it's on the right track; I'd really like something like
that, but much simpler, to be in `base'.

Francesco

John Lato | 12 Nov 04:21 2012

Re: Motion to unify all the string data types

From: Francesco Mazzoli <f <at> mazzo.li>

> At Sat, 10 Nov 2012 15:16:30 +0100,
> Alberto G. Corona  wrote:
> > There is a ListLike package, which provides this nice abstraction, but I
> > don't know if it is ready and/or complete enough for serious usage.  I'm
> > thinking of using it for the same reasons.
> >
> > Does anyone have experience with it to share?
>
> I've used it in the past and it's solid; it's been around for a while and the
> original author knows his Haskell.
>
> Things I don't like:
>
> * The classes are huge:
>   <http://hackage.haskell.org/packages/archive/ListLike/3.1.6/doc/html/Data-ListLike.html#t:ListLike>.
>   I'd much rather have all those utility functions outside the type
>   class, for no particular reason other than the ugliness of the type class.

Speaking as the ListLike maintainer, I'd like this too.  But it's difficult to do so without sacrificing performance.  In some cases, sacrificing *a lot* of performance.  So they have to be class members.

However, there's no reason ListLike has to remain a single monolithic class.  I'd prefer an API that's split up into several classes, as was done in Edison.  Then 'ListLike' itself would just be a type synonym, or possibly a small type class with the appropriate superclasses.
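
As a rough sketch of both points, with entirely hypothetical class names (Buildable, Measurable, ListLikeLite) that are not the real ListLike or Edison API: the interface is split into small classes, the combined class is just a bundle of superclasses, and a method with a generic default stays in the class so that an instance can override it with a fast native version.

```haskell
{-# LANGUAGE TypeFamilies #-}
module Main where

import qualified Data.Text as T

-- Construction, split out as its own small class.
class Monoid s => Buildable s where
  type Item s
  singleton :: Item s -> s
  -- Kept as a class member so instances can replace the slow default.
  fromItems :: [Item s] -> s
  fromItems = mconcat . map singleton

-- Size queries, as a separate class.
class Measurable s where
  size :: s -> Int

-- The "big" class is only a bundle of superclasses, with no methods.
class (Buildable s, Measurable s) => ListLikeLite s

instance Buildable [a] where
  type Item [a] = a
  singleton x = [x]
  fromItems = id

instance Measurable [a] where
  size = length

instance ListLikeLite [a]

instance Buildable T.Text where
  type Item T.Text = Char
  singleton = T.singleton
  fromItems = T.pack          -- native override of the default

instance Measurable T.Text where
  size = T.length

instance ListLikeLite T.Text

-- A utility defined outside the classes, in terms of their methods.
repeated :: ListLikeLite s => Int -> Item s -> s
repeated n = fromItems . replicate n

main :: IO ()
main = do
  print (size (repeated 3 'x' :: T.Text))  -- 3
  print (repeated 2 True :: [Bool])        -- [True,True]
```

If fromItems were an ordinary function instead of a class member, Text would be stuck with the singleton-by-singleton default, which illustrates the performance concern.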

However this seems like a lot of work for relatively little payoff, which makes it a low priority for me.

> * It defines its own wrappers for `ByteString':
>   <http://hackage.haskell.org/packages/archive/ListLike/3.1.6/doc/html/Data-ListLike.html#t:CharString>.

The community's view on newtypes is funny.  On the one hand, I see all the time the claim "Just use a newtype wrapper to write instances for ..." (e.g. the recent suggestion of 'instance Num a => Num (a,a)').  On the other, nobody actually seems to want to use these newtype wrappers.  Maybe it clutters the code?  I don't know.
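
For readers who have not seen the pattern, here is a minimal sketch of the newtype-wrapper approach for the Num (a,a) example. The wrapper name Pair and the component-wise semantics are assumptions made for illustration.

```haskell
module Main where

-- Wrapping the tuple gives the instance a home without touching (a, a).
newtype Pair a = Pair (a, a)
  deriving (Eq, Show)

-- Component-wise arithmetic on the wrapped tuple.
instance Num a => Num (Pair a) where
  Pair (a, b) + Pair (c, d) = Pair (a + c, b + d)
  Pair (a, b) * Pair (c, d) = Pair (a * c, b * d)
  negate (Pair (a, b))      = Pair (negate a, negate b)
  abs (Pair (a, b))         = Pair (abs a, abs b)
  signum (Pair (a, b))      = Pair (signum a, signum b)
  fromInteger n             = Pair (fromInteger n, fromInteger n)

main :: IO ()
main = do
  print (Pair (1, 2) + Pair (3, 4) :: Pair Int)  -- Pair (4,6)
  print (Pair (1, 2) * 10 :: Pair Int)           -- Pair (10,20)
```

The clutter is visible even here: every use site has to wrap and unwrap with Pair, which may be part of why such wrappers see little use.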

I couldn't think of a better way to implement this functionality; patches would be gratefully accepted.  Anyway, you really shouldn't use these wrappers unless you're using a ByteString to represent ASCII text.  Which you shouldn't be doing anyway.  If you're using a ByteString to represent a sequence of bytes, you needn't ever encounter CharString.


> * It doesn't have instances for `Text', you have to resort to the
>   `listlike-instances' package.

Given that text and vector are both in the Haskell Platform, I wouldn't object to these instances being rolled into the main ListLike package.  Any comments on this?

John L. 
Francesco Mazzoli | 12 Nov 11:26 2012

Re: Motion to unify all the string data types

At Mon, 12 Nov 2012 11:21:42 +0800,
John Lato wrote:
> Speaking as the ListLike maintainer, I'd like this too.  But it's difficult to
> do so without sacrificing performance.  In some cases, sacrificing *a lot* of
> performance.  So they have to be class members.
> 
> However, there's no reason ListLike has to remain a single monolithic class.
> I'd prefer an API that's split up into several classes, as was done in Edison.
> Then 'ListLike' itself would just be a type synonym, or possibly a small type
> class with the appropriate superclasses.

Interesting.  Are we sure that we can't convince GHC to inline the functions
with enough pragmas?

> However this seems like a lot of work for relatively little payoff, which
> makes it a low priority for me.

Fair enough.

> The community's view on newtypes is funny.  On the one hand, I see all the
> time the claim "Just use a newtype wrapper to write instances for ..."
> (e.g. the recent suggestion of 'instance Num a => Num (a,a)').  On the other,
> nobody actually seems to want to use these newtype wrappers.  Maybe it
> clutters the code?  I don't know.
> 
> I couldn't think of a better way to implement this functionality; patches
> would be gratefully accepted.  Anyway, you really shouldn't use these wrappers
> unless you're using a ByteString to represent ASCII text.  Which you shouldn't
> be doing anyway.  If you're using a ByteString to represent a sequence of
> bytes, you needn't ever encounter CharString.

Well, newtypes are good; the problem is that either you use well-accepted ones
(e.g. the `Sum' and `Product' in base) or otherwise it's not worth it, because
people are going to unwrap them and use their own.  What I would do is simply
define those instances in separate modules.

> Given that text and vector are both in the Haskell Platform, I wouldn't object
> to these instances being rolled into the main ListLike package.  Any comments
> on this?

I think it's much better, especially for Text, since if you use ListLike you are
probably using it with Text (at least in my experience).  Not a big deal anyway.

Francesco.
Francesco Mazzoli | 12 Nov 11:31 2012

Re: Motion to unify all the string data types

At Mon, 12 Nov 2012 10:26:01 +0000,
Francesco Mazzoli wrote:
> Interesting.  Are we sure that we can't convince GHC to inline the functions
> with enough pragmas?

Inline and SPECIALIZE :).
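
For concreteness, a small sketch of what such pragmas look like on utility functions kept outside a class. The functions total and count are made up for illustration, and Foldable stands in for a ListLike-style class. The pragmas are hints to GHC; whether they recover the performance John mentions is not something this sketch establishes.

```haskell
module Main where

import Data.Foldable (foldl')

-- A utility written against an interface instead of being a class member.
total :: (Foldable t, Num a) => t a -> a
total = foldl' (+) 0
-- Keep the unfolding available so other modules can specialise it...
{-# INLINABLE total #-}
-- ...and ask for a dedicated copy at a commonly used type.
{-# SPECIALIZE total :: [Int] -> Int #-}

-- A tiny helper that GHC is asked to inline at every call site.
count :: Foldable t => t a -> Int
count = foldl' (\n _ -> n + 1) 0
{-# INLINE count #-}

main :: IO ()
main = do
  print (total [1, 2, 3 :: Int])  -- 6
  print (count "abc")             -- 3
```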

Francesco.
