Semyon Kholodnov | 21 Feb 10:58 2013

How to input Unicode string in Haskell program?

Imagine we have this simple program:

module Main(main) where

main = do
    x <- getLine
    putStrLn x

Now I want to run it, enter "résumé 履歴書 резюме" and see the same
string printed back. The first problem is that my computer runs Windows,
which means that I can't use ghci ":main" or the result of "ghc main.hs"
to enter such an outrageous string: the Windows console is locked to one
specific local code page, and no code page contains Latin-1, Cyrillic
and Kanji characters at the same time.

But there is also WinGHCi. So I do ":main" and copy-paste this string
into the window (it works, because Windows has had Unicode for 20 years
now), but the output is all messed up. In a rather curious way,
actually: the input string is converted to a UTF-8 byte string, and its
bytes are then treated as characters from my local code page.
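
The garbling described above can be reproduced in a few lines. This is only a sketch: it assumes the `text` package is available, and it uses Latin-1 as a stand-in for the local code page, which the post does not name. The helper `mojibake` is a hypothetical name introduced for illustration.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode the text as UTF-8, then misread every byte as one Latin-1 character.
-- (Latin-1 is an assumption standing in for the unnamed local code page.)
mojibake :: T.Text -> T.Text
mojibake = TE.decodeLatin1 . TE.encodeUtf8

main :: IO ()
main = do
    let s = T.pack "r\233sum\233"   -- "résumé", written with escapes
    -- Compare character counts rather than printing the text itself,
    -- so the demonstration does not depend on the console encoding.
    print (T.length s)
    print (T.length (mojibake s))
    -- U+00E9 is the bytes C3 A9 in UTF-8, read back as U+00C3 U+00A9.
    print (mojibake (T.pack "\233") == T.pack "\195\169")
```

Each accented letter turns into two characters, so the garbled string is longer than the original; a real single-byte code page such as a Cyrillic one would produce different garbage by the same mechanism.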

So it appears that I have no way to enter Unicode strings into my
Haskell programs by hand, and I have to read them from files. That's sad,
and I refuse to think I am the first one with such a problem, so I
assume there is a solution or workaround. Would someone please tell
me what it is? Apart from "Just stick to the 128 characters of ASCII",
of course.


MigMit | 21 Feb 12:05 2013

Re: How to input Unicode string in Haskell program?

Have you tried running ghci inside Emacs?


Alexander V Vershilov | 21 Feb 12:07 2013

Re: How to input Unicode string in Haskell program?

The problem is that Prelude.getLine decodes its input using the current
locale. For example, if you have a UTF-8 locale, everything works out of
the box:

> $ runhaskell 1.hs
> résumé 履歴書 резюме
> résumé 履歴書 резюме

But if you change the locale, you get an error:

> $ LANG="C" runhaskell 1.hs
> résumé 履歴書 резюме
> 1.hs: <stdin>: hGetLine: invalid argument (invalid byte sequence)

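To force Haskell to use UTF-8, you can read the string as a byte sequence
and decode it as UTF-8, for example:

import qualified Data.ByteString.Char8 as S
import qualified Data.Text.Encoding as T

main :: IO ()
main = do
    -- Read raw bytes, bypassing the locale, and decode them as UTF-8.
    x <- fmap T.decodeUtf8 S.getLine
    -- Encode back to UTF-8 bytes so the output bypasses the locale as well.
    S.putStrLn (T.encodeUtf8 x)

Now the code works regardless of the locale: input from the shell is always
decoded as UTF-8, whatever the user's locale settings are.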
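
One caveat with this approach is that `decodeUtf8` throws an exception when the bytes are not valid UTF-8. The variant below is a sketch of one way around that, assuming the `text` and `bytestring` packages; `decodeLenient` is a hypothetical helper name, and replacing bad bytes with U+FFFD is just one possible policy.

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Decode bytes as UTF-8, replacing invalid sequences with U+FFFD
-- instead of throwing an exception.
decodeLenient :: S.ByteString -> T.Text
decodeLenient = TE.decodeUtf8With lenientDecode

main :: IO ()
main = do
    line <- S.getLine
    -- Write the result back as UTF-8 bytes, independent of the locale.
    S.putStrLn (TE.encodeUtf8 (decodeLenient line))
```

If silently replacing bytes is not acceptable, `decodeUtf8'` returns an `Either` instead, so the program can report the problem to the user.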

--
Alexander


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Erik Hesselink | 21 Feb 13:44 2013

Re: How to input Unicode string in Haskell program?

You can also set the locale encoding for a handle (e.g.
System.IO.stdin) from code using `System.IO.hSetEncoding` [0].

Erik

[0] http://hackage.haskell.org/packages/archive/base/latest/doc/html/System-IO.html#v:hSetEncoding
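
A minimal sketch of this suggestion, applied to the program from the original post. It assumes UTF-8 is the encoding wanted on both handles; setting it on stdout as well is an addition made here so that the echo is written the same way it was read.

```haskell
import System.IO (hSetEncoding, stdin, stdout, utf8)

main :: IO ()
main = do
    -- Override the locale-derived encoding on both standard handles.
    hSetEncoding stdin utf8
    hSetEncoding stdout utf8
    x <- getLine
    putStrLn x
```

This only changes how the Haskell runtime decodes and encodes; whether the terminal or console actually sends and displays UTF-8 is a separate matter.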

Jon Fairbairn | 22 Feb 11:38 2013

Re: How to input Unicode string in Haskell program?

Alexander V Vershilov <alexander.vershilov <at> gmail.com> writes:

> The problem is that Prelude.getLine uses current locale to load characters:
> for example if you have utf8 locale, then everything works out of the box:
>
>> $ runhaskell 1.hs
>> résumé 履歴書 резюме
>> résumé 履歴書 резюме
>
> But if you change locale you'll have error:
>
>> LANG="C" runhaskell 1.hs
>> résumé 履歴書 резюме
>> 1.hs: <stdin>: hGetLine: invalid argument (invalid byte sequence)

That seems to be correct behaviour: the only way to know the
meaning of the bits input by a user is what encoding the user
says they are in.

But in general this issue is an instance of inheriting sins from
the OS: the meaning of the bit pattern in a file should be part
of the file, but we are stuck with OSs that use a global
variable (which should be anathema to Haskell). So if user A has
locale set one way and inputs a file and sends the filename to
user B on the same system, user B might well see something
completely different to A when looking at the file.

> To force haskell use UTF8 you can load string as byte sequence
> and convert it to UTF-8 characters

But of course, the programmer can only hope that UTF-8 will work
here. If the user is typing in KOI8-R, reading it as UTF-8 is
going to be wrong.
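
A small illustration of this point, as a sketch assuming the `text` and `bytestring` packages. The KOI8-R bytes for "резюме" are written out by hand, and `isValidUtf8` is a hypothetical helper introduced for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- "резюме" encoded as KOI8-R (bytes written out by hand for this example).
koi8Bytes :: B.ByteString
koi8Bytes = B.pack [0xD2, 0xC5, 0xDA, 0xC0, 0xCD, 0xC5]

-- True when the bytes form a valid UTF-8 sequence.
isValidUtf8 :: B.ByteString -> Bool
isValidUtf8 = either (const False) (const True) . TE.decodeUtf8'

main :: IO ()
main = do
    -- The KOI8-R bytes are rejected by a UTF-8 decoder.
    print (isValidUtf8 koi8Bytes)
    -- The same word encoded as UTF-8 is accepted.
    print (isValidUtf8 (TE.encodeUtf8 (T.pack "резюме")))
```

Here the mismatch is at least detectable because the bytes happen to be invalid UTF-8; two single-byte encodings confused with each other would decode without any error and simply show the wrong letters.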

-- 
Jón Fairbairn                                 Jon.Fairbairn <at> cl.cam.ac.uk

Semyon Kholodnov | 22 Feb 17:17 2013

Re: How to input Unicode string in Haskell program?

I would like to point out again that I am talking about Windows. I
don't care about Linux. I'm sure you have already thrown away all those
stupid legacy single- and multibyte code pages and migrated to UTF-8
completely, but that's not quite the current state of Windows. The
console still doesn't cope with Unicode very well.

Anyway, the problem is partially solved: I patched my WinGHCi so it no
longer chokes on Unicode input, and as for the compiled .exe... I'll see.

Albert Y. C. Lai | 22 Feb 21:15 2013

Re: How to input Unicode string in Haskell program?

On 13-02-21 04:58 AM, Semyon Kholodnov wrote:
> — Windows console is locked to one specific local code page, and no
> codepage contains Latin-1, Cyrillic and Kanji symbols at the same
> time.

The Windows console is not locked to an anti-international code page; it
only defaults to one.

Use CHCP 65001 to switch to the UTF-8 code page.
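
To check what the Haskell runtime makes of the code page after such a switch, a small diagnostic program can help. This is a sketch using only `System.IO`; the exact encoding names printed depend on the platform and GHC version, so none are shown here.

```haskell
import System.IO (hGetEncoding, localeEncoding, stdin, stdout)

main :: IO ()
main = do
    -- The encoding GHC derived from the environment (code page or locale).
    putStrLn ("locale: " ++ show localeEncoding)
    -- The encodings currently attached to the standard handles;
    -- Nothing would mean the handle is in binary mode.
    hGetEncoding stdin  >>= \e -> putStrLn ("stdin:  " ++ show e)
    hGetEncoding stdout >>= \e -> putStrLn ("stdout: " ++ show e)
```

Running it once before and once after CHCP 65001 in the same console window should show whether the change was picked up.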

Unfortunately, the code page and encoding are only half of the battle; the
other half is fonts. Most Windows fonts are incomplete; all Windows
fixed-width fonts are incomplete. (Silver lining: Arial Unicode is
sufficiently complete.) Therefore, you may be unable to display some
characters, but they are still the correct characters.
