Ganesh Sittampalam | 11 Sep 00:22 2012

HTTP and character encodings

Hi,

tl;dr: I'd like to remove the String instances from the HTTP package.

The HTTP library is overloaded on the type for request and response
bodies; there are instances for String and both strict and lazy ByteStrings.

Unfortunately, the String instance is rather broken. A String ought to
represent Unicode data, but the HTTP wire format is bytes, and HTTP
makes no attempt to handle encoding.

In particular uploaded data (e.g. in POSTs) gets silently truncated and
downloaded data is improperly embedded as one byte per character no
matter what encoding the server advertises in the Content-Type header.
(https://github.com/haskell/HTTP/issues/28)

I've spent a while investigating the option of making HTTP encode and
decode Strings appropriately, but my tentative conclusion is that it's
too hard:

- on upload we'd have to pick an encoding by default - probably UTF-8 -
and also add it to the Content-Type header which may involve messing
with any header supplied by the user. If the user supplied a different
encoding in Content-Type then we probably would need to notice and
respect that.

- on upload Content-Length may also need to be managed somehow.

- on download we'd need to be able to handle at least common encodings
that the server might send, but on Windows even common encodings like
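iso-8859-* don't exist and there aren't always appropriate substitutes.

- on download we'd also really want to parse HTML/XML documents looking
for in-document specifications of the encoding in META tags and XML
declarations (see http://www.w3.org/QA/2008/03/html-charset.html)

- we'd need to also parse Content-Type to detect when the data is
supposed to be binary, and then check that it is actually 8-bit clean on
upload. If the user doesn't supply Content-Type at all, then what?

I think the right way to do this would be to have proper high-level and
low-level APIs where only the high-level API supports strings but also
does a lot more active management of standard HTTP headers like
content-type/content-length. But HTTP as it stands is a long way from
doing that and a short-term fix is needed.

So I'm reluctantly drawn to the conclusion that the only reasonable
thing to do is to remove the String instances from HTTP completely for now.

I imagine this could be quite disruptive, but on the other hand people
using the String instance are getting silently broken behaviour and a
couple of people have been bitten by this recently.

Any thoughts?

Cheers,

Ganesh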
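
As a rough illustration of the upload-side points in the list above
(a default encoding, the Content-Type header and Content-Length), here
is a minimal sketch. It assumes UTF-8 as the default and uses the text
package for encoding; the helper name utf8Body and the representation of
headers as plain string pairs are made up for this example and are not
part of the HTTP package.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper (not part of the HTTP package): encode a textual
-- body as UTF-8 and derive the two headers that depend on the encoding.
utf8Body :: String -> String -> ([(String, String)], B.ByteString)
utf8Body mediaType body =
  ( [ ("Content-Type", mediaType ++ "; charset=utf-8")
      -- Content-Length counts encoded bytes, not characters.
    , ("Content-Length", show (B.length bytes))
    ]
  , bytes
  )
  where
    bytes = TE.encodeUtf8 (T.pack body)
```

For the two-character body "h\233" the encoded form is three bytes long,
so Content-Length has to be computed after encoding. The sketch does not
attempt the harder part mentioned above, namely reconciling this with a
Content-Type header the user has already supplied.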

Bryan O'Sullivan | 11 Sep 00:44 2012

Re: HTTP and character encodings

On Mon, Sep 10, 2012 at 3:22 PM, Ganesh Sittampalam <ganesh <at> earth.li> wrote:

> I imagine this could be quite disruptive, but on the other hand people
> using the String instance are getting silently broken behaviour and a
> couple of people have been bitten by this recently.

I'm in favour of broken code breaking explicitly, rather than silently doing the wrong thing, so +1 to nuking the String instances in spite of the up-front pain.
_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Erik Hesselink | 11 Sep 09:43 2012

Re: HTTP and character encodings

On Tue, Sep 11, 2012 at 12:44 AM, Bryan O'Sullivan <bos <at> serpentine.com> wrote:
> On Mon, Sep 10, 2012 at 3:22 PM, Ganesh Sittampalam <ganesh <at> earth.li> wrote:
>>
>> I imagine this could be quite disruptive, but on the other hand people
>> using the String instance are getting silently broken behaviour and a
>> couple of people have been bitten by this recently.
>
> I'm in favour of broken code breaking explicitly, rather than silently doing
> the wrong thing, so +1 to nuking the String instances in spite of the
> up-front pain.

+1. I've been bitten by the broken String instances, and now only use
the ByteString instances.

Erik
Henning Thielemann | 11 Sep 09:59 2012

Re: HTTP and character encodings


On Mon, 10 Sep 2012, Ganesh Sittampalam wrote:

> tl;dr: I'd like to remove the String instances from the HTTP package.
>
> The HTTP library is overloaded on the type for request and response
> bodies; there are instances for String and both strict and lazy Bytestrings.

This instance was also kind of broken, because it used 
TypeSynonymInstances without need.
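
For readers unfamiliar with the remark: an instance declared directly on
String needs language extensions, but the same effect can be had in plain
Haskell 98 by going through a helper class on the element type (the trick
Show uses with showList). A minimal sketch follows; the class names Body
and CharBody are hypothetical and are not the classes the HTTP package
actually defines.

```haskell
import Data.Word (Word8)

-- Helper class on the element type; only Char gets an instance.
class CharBody c where
  listToBytes :: [c] -> [Word8]

instance CharBody Char where
  -- One byte per Char, keeping only the low 8 bits of the code point.
  listToBytes = map (fromIntegral . fromEnum)

-- The class the rest of the code would use.
class Body a where
  toBytes :: a -> [Word8]

-- The instance head is [c] with a type variable, so no
-- TypeSynonymInstances or FlexibleInstances is required.
instance CharBody c => Body [c] where
  toBytes = listToBytes
```

With this, toBytes "AB" type-checks at String without any extension.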
Christian Maeder | 11 Sep 10:30 2012

Re: HTTP and character encodings

Am 11.09.2012 00:22, schrieb Ganesh Sittampalam:
> Hi,
>
> tl;dr: I'd like to remove the String instances from the HTTP package.
>
> The HTTP library is overloaded on the type for request and response
> bodies; there are instances for String and both strict and lazy Bytestrings.
>
> Unfortunately, the String instance is rather broken. A String ought to
> represent Unicode data, but the HTTP wire format is bytes, and HTTP
> makes no attempt to handle encoding.

if you remove the String instance I would need to encode my strings 
manually (and maybe worse than it is done now).

Which instance does the package cabal-install use?

Which alternative (better maintained) packages could I use if I have to 
change my code anyway?

The header of Network.HTTP contains a "Portability" field saying "non-portable
(not tested)", but the package contains a test-suite.
Are tests (or their lack) a portability issue?

(I've seen packages claiming portability with plenty of GHC extensions
that probably only work for certain GHC versions on a few architectures.)

Cheers Christian

>
> In particular uploaded data (e.g. in POSTs) gets silently truncated and
> downloaded data is improperly embedded as one byte per character no
> matter what encoding the server advertises in the Content-Type header.
> (https://github.com/haskell/HTTP/issues/28)
>
> I've spent a while investigating the option of making HTTP encode and
> decode Strings appropriately, but my tentative conclusion is that it's
> too hard:
>
> - on upload we'd have to pick an encoding by default - probably UTF-8 -
> and also add it to the Content-Type header which may involve messing
> with any header supplied by the user. If the user supplied a different
> encoding in Content-Type then we probably would need to notice and
> respect that.
>
> - on upload Content-Length may also need to be managed somehow.
>
> - on download we'd need to be able to handle at least common encodings
> that the server might send, but on Windows even common encodings like
> iso-8859-* don't exist and there aren't always appropriate substitutes.
>
> - on download we'd also really want to parse HTML/XML documents looking
> for in-document specifications of the encoding  in META tags and XML
> declarations (see http://www.w3.org/QA/2008/03/html-charset.html)
>
> - we'd need to also parse Content-Type to detect when the data is
> supposed to be binary, and then check that it is actually 8-bit clean on
> upload. If the user doesn't supply Content-Type at all, then what?
>
> I think the right way to do this would be to have proper high-level and
> low-level APIs where only the high-level API supports strings but also
> does a lot more active management of standard HTTP headers like
> content-type/content-length. But HTTP as it stands is a long way from
> doing that and a short-term fix is needed.
>
> So I'm reluctantly drawn to the conclusion that the only reasonable
> thing to do is to remove the String instances from HTTP completely for now.
>
> I imagine this could be quite disruptive, but on the other hand people
> using the String instance are getting silently broken behaviour and a
> couple of people have been bitten by this recently.
>
> Any thoughts?
>
> Cheers,
>
> Ganesh
>
Ben Millwood | 11 Sep 15:06 2012

Re: HTTP and character encodings

On Tue, Sep 11, 2012 at 9:30 AM, Christian Maeder
<Christian.Maeder <at> dfki.de> wrote:
> if you remove the String instance I would need to encode my strings manually
> (and maybe worse than it is done now).

This isn't actually that hard, and particularly it would be easy to do
a better job than the current one if you used a real encoding package
like text or utf8-string.
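
A minimal sketch of what that could look like with the text package,
assuming the caller wants UTF-8 in both directions. The function names
encodeBody and decodeBody are invented for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode a String as UTF-8 bytes before handing it to a ByteString API.
encodeBody :: String -> B.ByteString
encodeBody = TE.encodeUtf8 . T.pack

-- Decode a UTF-8 response body, reporting invalid input instead of
-- silently producing garbage.
decodeBody :: B.ByteString -> Either String String
decodeBody bytes =
  case TE.decodeUtf8' bytes of
    Left err -> Left (show err)
    Right t  -> Right (T.unpack t)
```

Decoding returns Either so that a body which is not valid UTF-8 (for
example because the server used another charset) shows up as an explicit
error the caller has to handle.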

> Which instance does the package cabal-install use?

Looks like it uses both String and ByteString in various pieces of the
code. But it would probably be a sensible idea to switch to ByteString
anyway.

> Which alternative (better maintained) packages could I use if I have to
> change my code anyway?
>
> The header of Network.HTTP contains a "Portability" saying
> "non-portable (not tested)", but the package contains a test-suite.
> Are tests (or their lack) a portability issue?

There is no standardised meaning of the Portability field, as far as I
know, so it's probably best to ignore this.

Yours,
Ben
Ganesh Sittampalam | 11 Sep 19:38 2012

Re: HTTP and character encodings

On 11/09/2012 09:30, Christian Maeder wrote:
> Am 11.09.2012 00:22, schrieb Ganesh Sittampalam:
>> Hi,
>>
>> tl;dr: I'd like to remove the String instances from the HTTP package.
>>
>> The HTTP library is overloaded on the type for request and response
>> bodies; there are instances for String and both strict and lazy
>> Bytestrings.
>>
>> Unfortunately, the String instance is rather broken. A String ought to
>> represent Unicode data, but the HTTP wire format is bytes, and HTTP
>> makes no attempt to handle encoding.
> 
> if you remove the String instance I would need to encode my strings
> manually (and maybe worse than it is done now).

The obvious way to encode them is to use ByteString.Char8.pack which is
exactly what HTTP does now. I can't really think of anything worse that
someone might do by accident.
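
To make the failure mode concrete, here is a small sketch of what
Char8.pack and Char8.unpack do to non-ASCII data. It only exercises the
bytestring functions themselves; the helper names wireBytes and
bytesAsChars are made up for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Upload direction: Char8.pack keeps only the low 8 bits of each Char,
-- so code points above 255 are silently truncated.
wireBytes :: String -> [Word8]
wireBytes = B.unpack . BC.pack

-- Download direction: Char8.unpack turns every byte into one Char,
-- whatever encoding the bytes were really in.
bytesAsChars :: [Word8] -> String
bytesAsChars = BC.unpack . B.pack
```

For instance, wireBytes "\8364" (the euro sign, U+20AC) gives [172], the
single byte 0xAC, and bytesAsChars [195,169] (the UTF-8 encoding of
U+00E9) gives the two-character string "\195\169" rather than "\233".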

> Which alternative (better maintained) packages could I use if I have to
> change my code anyway?

There's http-conduit, which also doesn't support String, but does
support https and has a much cleaner interface. If conduit ever made it
into the Platform then it would be an obvious choice to replace HTTP;
but I still have some faith in lazy IO which is one of the reasons why I
put effort into the HTTP package.

Cheers,

Ganesh
Ben Millwood | 11 Sep 20:27 2012

Re: HTTP and character encodings

On Tue, Sep 11, 2012 at 6:38 PM, Ganesh Sittampalam <ganesh <at> earth.li> wrote:
> There's http-conduit, which also doesn't support String, but does
> support https and has a much cleaner interface. If conduit ever made it
> into the Platform then it would be an obvious choice to replace HTTP;
> but I still have some faith in lazy IO which is one of the reasons why I
> put effort into the HTTP package.

As an aside, the major reason I support HTTP over something like
http-conduit is the latter's titanic dependency list. I think
especially as a dependency of cabal-install that's something of a
dealbreaker:

$ cabal install http-conduit | grep 'new package' | wc -l
[...]
47
Vincent Hanquez | 11 Sep 22:36 2012

Re: HTTP and character encodings

On 09/11/2012 07:27 PM, Ben Millwood wrote:
> As an aside, the major reason I support HTTP over something like
> http-conduit is the latter's titanic dependency list. I think
> especially as a dependency of cabal-install that's something of a
> dealbreaker:
>
> $ cabal install http-conduit | grep 'new package' | wc -l
> [...]
> 47

I'm not sure what you want to demonstrate here; the number of packages couldn't
be a more irrelevant metric. Would you prefer a package that includes everything
in one giant codebase?

For example, the whole Haskell TLS stack is responsible for at least 10~15
packages in http-conduit's list. I could easily put everything in one giant
package, OpenSSL style. However, I think it makes more sense to build bricks
(asn1, crypto hashes, ...) that can be reused in different libraries/programs
(and indeed they are).

Now cabal-install is a bit of a special case, and keeping HTTP working is
probably a good idea. But at the Platform level, while I agree the amount of
work required is huge and not without controversies, keeping HTTP instead of
http-conduit just makes it likely the platform will be (is) irrelevant for many
people.

-- 
Vincent
Henning Thielemann | 11 Sep 22:45 2012

Re: HTTP and character encodings


On Tue, 11 Sep 2012, Vincent Hanquez wrote:

> On 09/11/2012 07:27 PM, Ben Millwood wrote:
>> As an aside, the major reason I support HTTP over something like
>> http-conduit is the latter's titanic dependency list. I think
>> especially as a dependency of cabal-install that's something of a
>> dealbreaker:
>> 
>> $ cabal install http-conduit | grep 'new package' | wc -l
>> [...]
>> 47
>
> I'm not sure what do you want to demonstrate here, number of packages 
> couldn't be a more irrelevant metric. Would you prefer a package that 
> includes everything in one giant codebase ?

I also hesitate to depend on packages with very many dependencies - 
although I write such packages myself. Chances are high that one of the 
imported packages fails to compile on a certain system or compiler 
version.

> For example, the whole haskell TLS stack is responsible for at least 10~15 
> packages in http-conduit's list.

I haven't checked whether it is possible, but maybe there are ways to let
the user plug in the TLS functionality if he needs it. Then http-conduit
would not need to depend on it.
Vincent Hanquez | 12 Sep 00:00 2012

Re: HTTP and character encodings

On 09/11/2012 09:45 PM, Henning Thielemann wrote:
> I also hesitate to depend on packages with very many dependencies - although I 
> write such packages myself. Chances are high that one of the imported packages 
> fails to compile on a certain system or compiler version.
>

I suppose if more stuff were in the platform, that would make it less likely though.

>
>> For example, the whole haskell TLS stack is responsible for at least 10~15 
>> packages in http-conduit's list.
>
> I haven't checked whether it is possible, but maybe there are ways to let the 
> user plug in the TLS functionality if he needs it. Then conduit-http would not 
> need to depend on it.

I think it depends on how much control you need over the stack. For simple use,
"open a TLS socket and give me a raw bytestream", it's possible. I believe
that's how people use HTTP with HsOpenSSL. I think a great many possibilities
are lost with this approach.

-- 
Vincent
Ian Lynagh | 11 Sep 23:00 2012

Re: HTTP and character encodings

On Tue, Sep 11, 2012 at 09:36:37PM +0100, Vincent Hanquez wrote:
> 
> keeping HTTP instead of http-conduit just make it
> likely the platform will be (is) irrelevant for many people.

Are you saying that http-conduit is better, more popular, or both, than
HTTP?

According to
    http://packdeps.haskellers.com/reverse/http-conduit
    http://packdeps.haskellers.com/reverse/HTTP
http-conduit has 33 reverse-deps to HTTP's 139, although these
measurements are somewhat flawed as they don't take into account whether
some of HTTP's rev-deps are old packages that have been abandoned, and
all things being equal you'd expect HTTP to have more users as it's part
of the HP.

Thanks
Ian
Vincent Hanquez | 12 Sep 00:30 2012

Re: HTTP and character encodings

On 09/11/2012 10:00 PM, Ian Lynagh wrote:
> On Tue, Sep 11, 2012 at 09:36:37PM +0100, Vincent Hanquez wrote:
>> keeping HTTP instead of http-conduit just make it
>> likely the platform will be (is) irrelevant for many people.
> Are you saying that http-conduit is better, more popular, or both, than
> HTTP?

It's hard to get any solid and comparable numbers here: HTTP is a much older
package and it's part of the HP.

I do however think http-conduit is currently more popular (yesod,
enumerator/conduit, etc.) and more featureful than HTTP.

HTTP is probably doing a fine job for lots of users, as long as they are happy
with lazy IO and no https without jumping through hoops; however, http-conduit
provides a superset of this, with the bonus of streaming IO, https, SOCKS, and more.

-- 
Vincent
Ben Millwood | 11 Sep 23:29 2012

Re: HTTP and character encodings

On Tue, Sep 11, 2012 at 9:36 PM, Vincent Hanquez <tab <at> snarc.org> wrote:
> On 09/11/2012 07:27 PM, Ben Millwood wrote:
>>
>> As an aside, the major reason I support HTTP over something like
>> http-conduit is the latter's titanic dependency list. I think
>> especially as a dependency of cabal-install that's something of a
>> dealbreaker:
>>
>> $ cabal install http-conduit | grep 'new package' | wc -l
>> [...]
>> 47
>
> I'm not sure what do you want to demonstrate here, number of packages
> couldn't be a more irrelevant metric. Would you prefer a package that
> includes everything in one giant codebase ?

Sorry, I realise in retrospect that my original message was
misleading, I should have been more clear: http-conduit is a great
package and I would recommend it for /most/ HTTP applications. But
there is virtue to having an /alternative/ that is much less capable
but much less heavyweight in terms of the things it needs. Especially
since some people will want to install cabal-install without already
/having/ cabal-install, using the bootstrap script, which needs to
manually download and install the entire transitive dependency list.
Imagine if that was 47 packages!

The way that http-conduit is designed and built is definitely correct.
It should be in many small packages so that it can be reused. But HTTP
with a lightweight dependency list also has its place.
Ganesh Sittampalam | 12 Sep 00:08 2012

Re: HTTP and character encodings

On 11/09/2012 19:27, Ben Millwood wrote:
> On Tue, Sep 11, 2012 at 6:38 PM, Ganesh Sittampalam <ganesh <at> earth.li> wrote:
>> There's http-conduit, which also doesn't support String, but does
>> support https and has a much cleaner interface. If conduit ever made it
>> into the Platform then it would be an obvious choice to replace HTTP;
>> but I still have some faith in lazy IO which is one of the reasons why I
>> put effort into the HTTP package.
> 
> As an aside, the major reason I support HTTP over something like
> http-conduit is the latter's titanic dependency list. I think
> especially as a dependency of cabal-install that's something of a
> dealbreaker:

For what it's worth this isn't my view; I would happily add
dependencies to HTTP if they would improve it, although as I'm
constrained by what's in the Platform I won't be going wild any time
soon. The cabal-install bootstrap process should be improved if the
dependencies become prohibitive, e.g. by having hackage generate a
complete download bundle.

Though honestly, I wonder if HTTP should be in the Platform at all. It
certainly wouldn't get in if it were proposed today. On the other hand
an http client is an important battery.

Ganesh
Brandon Allbery | 12 Sep 00:23 2012

Re: HTTP and character encodings

On Tue, Sep 11, 2012 at 6:08 PM, Ganesh Sittampalam <ganesh <at> earth.li> wrote:
> Though honestly, I wonder if HTTP should be in the Platform at all. It
> certainly wouldn't get in if it were proposed today. On the other hand
> an http client is an important battery.

I think it's mostly there by dint of being a dependency of cabal-install? 

--
brandon s allbery                                      allbery.b <at> gmail.com
wandering unix systems administrator (available)     (412) 475-9364 vm/sms

Christian Maeder | 12 Sep 12:09 2012

Re: HTTP and character encodings

Am 11.09.2012 19:38, schrieb Ganesh Sittampalam:
> On 11/09/2012 09:30, Christian Maeder wrote:
>> Am 11.09.2012 00:22, schrieb Ganesh Sittampalam:
>>> Hi,
>>>
>>> tl;dr: I'd like to remove the String instances from the HTTP package.
>>>
>>> The HTTP library is overloaded on the type for request and response
>>> bodies; there are instances for String and both strict and lazy
>>> Bytestrings.
>>>
>>> Unfortunately, the String instance is rather broken. A String ought to
>>> represent Unicode data, but the HTTP wire format is bytes, and HTTP
>>> makes no attempt to handle encoding.
>>
>> if you remove the String instance I would need to encode my strings
>> manually (and maybe worse than it is done now).
>
> The obvious way to encode them is to use ByteString.Char8.pack which is
> exactly what HTTP does now. I can't really think of anything worse that
> someone might do by accident.

My main use-case is simpleHTTP that is bound to the String instance, 
currently. There are no such short-cuts for byte-strings, are there?

I'd suggest making a proper byte-string interface first and then
deprecating the String stuff.

(before calling Char8.pack, strings could be checked or filtered for 
"isAscii")
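
A minimal sketch of the checking variant, assuming that rejecting
non-ASCII input outright is acceptable for the upload side. The name
packAscii is invented for the example.

```haskell
import Data.Char (isAscii)
import qualified Data.ByteString.Char8 as BC

-- Only pack strings that consist purely of ASCII characters, so that
-- Char8.pack can never truncate a code point.
packAscii :: String -> Maybe BC.ByteString
packAscii s
  | all isAscii s = Just (BC.pack s)
  | otherwise     = Nothing
```

A filtering variant would instead drop or replace the offending
characters, which changes the data rather than refusing it.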

Cheers Christian
Ganesh Sittampalam | 12 Sep 23:57 2012

Re: HTTP and character encodings

On 12/09/2012 11:09, Christian Maeder wrote:

> My main use-case is simpleHTTP that is bound to the String instance,
> currently. There are no such short-cuts for byte-strings, are there?

That's a good point. I guess I would make simpleHTTP overloaded while I
was making breaking changes anyway.

> I'ld suggest to make a proper byte-string interface first 

What do you mean by "proper"? Unfortunately I don't really have time to
do any substantial refactoring in the near future.

Given lots of time now, I'd immediately make high-level and low-level
interfaces with encoding only handled in the high-level one.

> and then deprecate the String stuff.

Is it possible to deprecate an instance?

I could perhaps instead provide an escape hatch with a newtype like
UnsafeChar8String or something, either temporarily or permanently.
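
Roughly what I have in mind, as a sketch only: the Body class below is a
simplified stand-in invented for this example, not the class HTTP really
uses, and the newtype does not exist in any released version.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Simplified stand-in for the library's body class (hypothetical).
class Body a where
  toWire   :: a -> B.ByteString
  fromWire :: B.ByteString -> a

instance Body B.ByteString where
  toWire   = id
  fromWire = id

-- Users who really want the old one-byte-per-Char behaviour have to
-- ask for it by name.
newtype UnsafeChar8String = UnsafeChar8String { getUnsafeChar8String :: String }
  deriving (Eq, Show)

instance Body UnsafeChar8String where
  toWire   = BC.pack . getUnsafeChar8String
  fromWire = UnsafeChar8String . BC.unpack
```

The behaviour is unchanged from the current String instance; the point
is only that the unsafety becomes visible at the use site.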

> (before calling Char8.pack, strings could be checked or filtered for
> "isAscii")

The problem is more on the download side; if it's a wide encoding like
UTF-16, even 7-bit cleanliness isn't enough to make Char8.unpack safe.
On the upload side, automatically using UTF-8 would probably be good enough.
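
To illustrate the download problem, a small sketch using the text
package to produce a UTF-16 body (the names here are made up for the
example). Every byte is below 128, yet Char8.unpack still gives the
wrong string.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- "hi" as a server might send it in little-endian UTF-16.
utf16Body :: B.ByteString
utf16Body = TE.encodeUtf16LE (T.pack "hi")

-- A 7-bit cleanliness check passes for this body.
isSevenBit :: B.ByteString -> Bool
isSevenBit = B.all (< 128)

-- Char8.unpack interleaves NUL characters with the real ones.
naive :: String
naive = BC.unpack utf16Body

-- Decoding with the right codec recovers the text.
decoded :: String
decoded = T.unpack (TE.decodeUtf16LE utf16Body)
```

Here isSevenBit utf16Body is True, naive is "h\NULi\NUL" and decoded is
"hi".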

Cheers,

Ganesh
Christian Maeder | 13 Sep 10:21 2012

Re: HTTP and character encodings

Am 12.09.2012 23:57, schrieb Ganesh Sittampalam:
> On 12/09/2012 11:09, Christian Maeder wrote:
>
>> My main use-case is simpleHTTP that is bound to the String instance,
>> currently. There are no such short-cuts for byte-strings, are there?
>
> That's a good point. I guess I would make simpleHTTP overloaded while I
> was making breaking changes anyway.

Ah, I thought about something like "simpleByteStringHTTP".

>> I'ld suggest to make a proper byte-string interface first
>
> What do you mean by "proper"? Unfortunately I don't really have time to
> do any substantial refactoring in the near future.
>
> Given lots of time now, I'd immediately make high-level and low-level
> interfaces with encoding only handled in the high-level one.
>
>> and then deprecate the String stuff.
>
> Is it possible to deprecate an instance?

I believe not. So forget deprecation (just document it) but consider
remaining backward compatible.

> I could perhaps instead provide an escape hatch with a newtype like
> UnsafeChar8String or something, either temporarily or permanently.
>
>> (before calling Char8.pack, strings could be checked or filtered for
>> "isAscii")
>
> The problem is more on the download side; if it's a wide encoding like
> UTF-16, even 7-bit cleanliness isn't enough to make Char8.unpack safe.

Just to make the string instance work, it is enough to ignore encoding
and return only ASCII bytes as chars or change bytes 128--255 to a
replacement ASCII char (e.g. '?').

For proper encodings other functions or (text) instances must be used.

> On the upload side, automatically using UTF-8 would probably be good enough.
>
> Cheers,
>
> Ganesh
>
>
Ganesh Sittampalam | 13 Sep 19:31 2012

Re: HTTP and character encodings

On 13/09/2012 09:21, Christian Maeder wrote:

> Just to make the string instance work, it is enough to ignore encoding
> and return only ascii bytes as chars or change bytes 128--255 to a
> replacement ascii char (i.e. '?').

I don't think it's really any better than using Char8.unpack. Depending
on the actual encoding you'll get variously broken results either way.
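
A small sketch comparing the two on a UTF-8 body, assuming the
replacement character is '?'. The function name asciiOrQuestion is
invented for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- The proposed alternative: keep ASCII bytes, replace bytes 128..255.
asciiOrQuestion :: B.ByteString -> String
asciiOrQuestion = map replace . BC.unpack
  where
    replace c
      | c < '\128' = c
      | otherwise  = '?'

-- "caf" followed by the UTF-8 encoding of U+00E9.
sample :: B.ByteString
sample = B.pack [99, 97, 102, 195, 169]
```

BC.unpack sample gives "caf\195\169" and asciiOrQuestion sample gives
"caf??"; neither is the five bytes' intended four-character text.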

Cheers,

Ganesh
Duncan Coutts | 12 Sep 23:34 2012

Re: HTTP and character encodings

On 11 September 2012 00:22, Ganesh Sittampalam <ganesh <at> earth.li> wrote:

> So I'm reluctantly drawn to the conclusion that the only reasonable
> thing to do is to remove the String instances from HTTP completely for now.
>
> I imagine this could be quite disruptive, but on the other hand people
> using the String instance are getting silently broken behaviour and a
> couple of people have been bitten by this recently.
>
> Any thoughts?

Yes. And I'd be in favour of removing the class entirely. Just use a
single ByteString type. I don't think the overloading buys us
anything. As for the effect on cabal-install, I've no problem with
making the appropriate fixes.

As for the pipes, conduits etc etc. My hope is that will stabilise at
some point with a clear right winner and we can adopt one of them, add
it to the platform etc.

(Personally I hope the "doing it right" approach of pipes works out in practice)

Duncan
Ganesh Sittampalam | 13 Sep 19:32 2012

Re: HTTP and character encodings

On 12/09/2012 22:34, Duncan Coutts wrote:

> Yes. And I'd be in favour of removing the class entirely. Just use a
> single ByteString type. I don't think the overloading buys us
> anything.

Which one should it use, lazy bytestring?

I'm not particularly keen on removing the overloading as I don't think
keeping it costs much for now and I kind of like the idea. We could even
replace String with [Word8] though that seems rather pointless in practice.

On the other hand if there are strong feelings in favour of removing it,
now is a good opportunity since there'll be a breaking change anyway.

Ganesh
Bryan O'Sullivan | 13 Sep 20:24 2012

Re: HTTP and character encodings

On Thu, Sep 13, 2012 at 10:32 AM, Ganesh Sittampalam <ganesh <at> earth.li> wrote:


>> Yes. And I'd be in favour of removing the class entirely. Just use a
>> single ByteString type. I don't think the overloading buys us
>> anything.
>
> Which one should it use, lazy bytestring?

Probably yes, assuming we want to retain the ability to lazily stream responses. Which is very nearly the only raison d'etre of the HTTP package at this point.

> I'm not particularly keen on removing the overloading as I don't think
> keeping it costs much for now and I kind of like the idea.

It doesn't cost much, but it also seems to no longer have any benefit, which suggests that it could usefully be dropped.
Ganesh Sittampalam | 13 Sep 22:11 2012

Re: HTTP and character encodings

On 13/09/2012 19:24, Bryan O'Sullivan wrote:

> Probably yes, assuming we want to retain the ability to lazily stream
> responses. Which is very nearly the only raison d'etre of the HTTP
> package at this point.

Also that it's in the Platform and is kind of needed there for
cabal-install.

>     I'm not particularly keen on removing the overloading as I don't think
>     keeping it costs much for now and I kind of like the idea.
> 
> 
> It doesn't cost much, but it also seems to no longer have any benefit,
> which suggests that it could usefully be dropped.

I view easy switching between lazy and strict bytestrings as a benefit.
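
For comparison, if the API were fixed to lazy ByteString, switching
would come down to explicit conversions like the ones sketched below.
The helper names are made up for the example; toStrict, fromStrict and
toChunks are the bytestring library's own functions.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Force a lazily streamed body into one strict buffer.
forceBody :: BL.ByteString -> B.ByteString
forceBody = BL.toStrict

-- Wrap a strict body so it can be passed to a lazy-only API.
liftBody :: B.ByteString -> BL.ByteString
liftBody = BL.fromStrict

-- Inspect the chunk structure that makes streaming possible.
chunkSizes :: BL.ByteString -> [Int]
chunkSizes = map B.length . BL.toChunks
```

Note that forceBody copies the whole body into memory, which gives up
the streaming behaviour for that response.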

Ganesh
Herbert Valerio Riedel | 13 Sep 14:46 2012

Re: HTTP and character encodings

Ganesh Sittampalam <ganesh <at> earth.li> writes:

[...]

> So I'm reluctantly drawn to the conclusion that the only reasonable
> thing to do is to remove the String instances from HTTP completely for now.

+1 

...and in case this is open for debate: +1 for getting rid of the
typeclass abstraction altogether (like e.g. Duncan suggested)
