On Fri, Nov 9, 2012 at 6:15 PM, Andrew Pennebaker <andrew.pennebaker <at> gmail.com> wrote:
Frequently when I'm coding in Haskell, the crux of my problem is converting between all the stupid string formats.
You've got String, ByteString, Lazy ByteString, Text, [Word], and on and on... I have to constantly look up how to convert between them, and the OverloadedStrings GHC extension doesn't work, and sometimes ByteString.unpack doesn't work, because it expects [Word8], not [Char]. AAAAAAAAAAAAAAAAAAAH!!!
Haskell is a wonderful playground for experimentation. I've started to notice that many Hackage libraries are simply instances of typeclasses designed a while ago, and their underlying implementations are free to play around with various optimizations... But they ideally all expose the same interface through typeclasses.
Can we do the same with String? Can we pick a good compromise of lazy vs strict, flexible vs fast, and all use the same data structure? My vote is for type String = [Char], but I'm willing to switch to another data structure, just as long as it's consistently used.
tl;dr: Use strict Text and ByteString.
We need at least two string types, one for byte strings and one for Unicode strings, as these are two semantically different concepts. You see that most modern languages use two types (e.g. str and unicode in Python). For Unicode strings, String is not a good candidate; it's slow, uses a lot of memory, doesn't hide its representation [1], and finally, it encourages people to do the wrong thing from a Unicode perspective [2].
As a community we should primarily use strict ByteString and Text. There are uses for the lazy variants (they are sometimes more efficient), but in general the strict versions should be preferred. Choosing to use these two types can sometimes be a bit frustrating, as lots of code (e.g. the base package) uses Strings. But if we don't start using them, the pain will never end. One of the main pain points is that the I/O layer uses Strings, which is both inconvenient and wrong (e.g. a socket returns bytes, not Unicode code points, yet the recv function returns a String). We really need to create a saner I/O layer.
If you use ByteString and Text, you shouldn't see calls to pack/unpack in your code (except where you need to interact with legacy code), as the correct way to go between the two is via the encoding and decoding functions in the text package (e.g. encodeUtf8 and decodeUtf8 in Data.Text.Encoding).
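To make the conversion concrete, here is a minimal sketch that assumes UTF-8 is the encoding you want. The helper names toBytes and fromBytes are made up for illustration; the functions they wrap come from Data.Text.Encoding in the text package.

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import           Data.Text.Encoding.Error (UnicodeException)

-- Hypothetical helper: Text -> bytes. Encoding cannot fail, but you
-- have to choose an encoding explicitly (UTF-8 is assumed here).
toBytes :: Text -> B.ByteString
toBytes = TE.encodeUtf8

-- Hypothetical helper: bytes -> Text. Arbitrary bytes need not be
-- valid UTF-8, so the failure case is returned as a Left value
-- instead of being thrown as an exception.
fromBytes :: B.ByteString -> Either UnicodeException Text
fromBytes = TE.decodeUtf8'
```

Loaded into GHCi, fromBytes (toBytes (T.pack "abc")) should give back Right "abc", while fromBytes (B.pack [0xff]) should give a Left, since 0xff on its own is not valid UTF-8.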
As for type classes, I don't think we use them enough. Perhaps because Haskell wasn't developed as an engineering language, some good software engineering principles (code against an interface, not a concrete implementation) aren't used in our base libraries. One specific example is the lack of a sequence abstraction/type class that all the string, list, and vector types could implement. Right now all these types try to implement a compatible interface (i.e. the traditional list interface), without a language mechanism to express that this is what they do.
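As a rough sketch of what such an abstraction might look like: the Sequence class below, its methods, and seqLength are all hypothetical names invented for this example, not an existing library API. It uses an associated type family to name each container's element type.

```haskell
{-# LANGUAGE TypeFamilies #-}

import qualified Data.ByteString as B
import qualified Data.Text as T
import           Data.Word (Word8)

-- Hypothetical class: a tiny slice of the traditional list interface.
class Sequence s where
  type Elem s
  empty  :: s
  cons   :: Elem s -> s -> s
  uncons :: s -> Maybe (Elem s, s)

instance Sequence [a] where
  type Elem [a] = a
  empty           = []
  cons            = (:)
  uncons []       = Nothing
  uncons (x : xs) = Just (x, xs)

instance Sequence T.Text where
  type Elem T.Text = Char
  empty  = T.empty
  cons   = T.cons
  uncons = T.uncons

instance Sequence B.ByteString where
  type Elem B.ByteString = Word8
  empty  = B.empty
  cons   = B.cons
  uncons = B.uncons

-- Written once against the interface, usable with every instance.
seqLength :: Sequence s => s -> Int
seqLength = go 0
  where
    go n s = case uncons s of
      Nothing        -> n
      Just (_, rest) -> go (n + 1) rest
```

With this in place, seqLength works unchanged on a list, a Text, or a ByteString. A real design would need many more methods and efficient per-type overrides, so treat this only as an illustration of the idea.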
1. If String had been designed as an abstract type, we could simply have switched its implementation for a more efficient one, and we wouldn't have had to create a new Text type.
2. By having the primary interface of a Unicode data type be a sequence, we encourage users to work on strings element-wise, which can lead to errors as Unicode code points don't correspond well to the human concept of a character (for example, the Swedish ä character can be represented using either one or two code points). A sequence view is sometimes useful, if you're implementing more high-level transformations, but often you should use functions that operate on the whole string, such as toUpper :: Text -> Text.
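A small sketch of the pitfall described in note 2, using only the text package and Data.Char. The names precomposed, decomposed, and upperPerChar are made up for the example, and the German ß is an additional illustration that is not taken from the note itself.

```haskell
import qualified Data.Char as C
import qualified Data.Text as T

-- "ä" as a single precomposed code point (U+00E4, decimal 228).
precomposed :: T.Text
precomposed = T.pack "\228"

-- "ä" as 'a' followed by a combining diaeresis (U+0308, decimal 776).
decomposed :: T.Text
decomposed = T.pack "a\776"

-- Element-wise upper-casing: each Char maps to exactly one Char, so
-- conversions that change the number of code points cannot be expressed.
upperPerChar :: T.Text -> T.Text
upperPerChar = T.map C.toUpper
```

Both values render as the same character, yet T.length precomposed is 1 while T.length decomposed is 2, and the two compare as unequal because no normalization is applied. Likewise, the whole-string T.toUpper turns "straße" into "STRASSE", a change in length that the element-wise upperPerChar cannot make.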