jeff p | 31 Aug 07:59 2012

Data.Text UTF-8 question

Hello,

I have a sample file (attached) which I cannot read into Text:

    Prelude Control.Applicative> Data.Text.IO.readFile "foo"
    *** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)

    Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
    "*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

So it seems that foo doesn't contain valid UTF-8. However,
System.IO.UTF8 has no problem reading the data:

    Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
    "3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"

Shouldn't these all have the same behavior?

I am running on Mac OS X 10.8.1, with GHC 7.4.2 and text-0.11.2.3.

thanks for any insight,
  Jeff
Attachment (foo): application/octet-stream, 247 bytes

Gregory Collins | 31 Aug 09:27 2012

Re: Data.Text UTF-8 question

On Fri, Aug 31, 2012 at 7:59 AM, jeff p <mutjida <at> gmail.com> wrote:

> Hello,
>
> I have a sample file (attached) which I cannot read into Text:
>
>     Prelude Control.Applicative> Data.Text.IO.readFile "foo"
>     *** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)
>
>     Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
>     "*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
>
> So it seems that foo doesn't contain valid UTF-8. However,
> System.IO.UTF8 has no problem reading the data:
>
>     Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
>     "3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"
>
> Shouldn't these all have the same behavior?

\65533 is the Unicode replacement character U+FFFD. This means that the source text is not valid UTF-8; the decoder in System.IO.UTF8 is silently replacing the invalid bytes, while the others throw an exception. If you want the same behaviour with the Text decoder, use Data.Text.Encoding.decodeUtf8With, which takes an error handler and so lets you replicate this.

It's likely, however, that your input text is in some other encoding such as ISO-8859-1. Use the text-icu package (http://hackage.haskell.org/package/text-icu) to decode such files.
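A minimal sketch of both suggestions, assuming only the bytestring and text packages. The `sample` bytes are the six bytes around the failure (typed in by hand rather than read from the attachment), and `decodeLenient` is a name made up for this example. `decodeLatin1` is used here instead of text-icu because ISO-8859-1 needs no conversion table; it may not be available in text releases as old as 0.11.2.3. Exactly where the replacement character lands relative to the neighbouring bytes may differ between text versions, so no output is shown for the lenient case.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- The bytes "Mart", then 0xED, then "n": valid ISO-8859-1, invalid UTF-8.
sample :: B.ByteString
sample = B.pack [0x4D, 0x61, 0x72, 0x74, 0xED, 0x6E]

-- Lenient UTF-8 decoding: instead of throwing, invalid input is
-- replaced with U+FFFD, similar to what System.IO.UTF8 did above.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

main :: IO ()
main = do
  -- Contains '\65533' where the bad byte sequence was.
  print (decodeLenient sample)
  -- Treats every byte as one ISO-8859-1 character, so nothing is lost.
  print (decodeLatin1 sample)
```

The lenient decode never fails but loses the accented letter; the Latin-1 decode keeps it, provided the guess about the file's encoding is right.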

G
--
Gregory Collins <greg <at> gregorycollins.net>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Albert Y. C. Lai | 31 Aug 22:31 2012

Re: Data.Text UTF-8 question

On 12-08-31 01:59 AM, jeff p wrote:
> I have a sample file (attached) which I cannot read into Text:
>
>      Prelude Control.Applicative> Data.Text.IO.readFile "foo"
>      *** Exception: utf8.txt: hGetContents: invalid argument (invalid
> byte sequence)
>
>      Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$>
> Data.ByteString.Char8.readFile "foo"
>      "*** Exception: Cannot decode byte '\x6e':
> Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

At offsets from 0x55 to 0x5A:

0x4D 0x61 0x72 0x74 0xED 0x6E

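This is not valid UTF-8: 0xED is the lead byte of a three-byte sequence and must be followed by two continuation bytes in the range 0x80 to 0xBF, but the next byte is 0x6E ('n'), which is the byte the decodeUtf8 error complains about. In ISO-8859-1, these bytes spell "Martín".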

"Martín" in UTF-8 is 0x4D 0x61 0x72 0x74 0xC3 0xAD 0x6E, and it takes 
one more byte.
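The byte counts above can be checked with a few lines of Haskell. This is only a sketch: `hexBytes` and `martin` are names invented for the example, and the í is written as the escape `\237` (U+00ED) so that the encoding of the source file itself does not matter.

```haskell
import qualified Data.ByteString as B
import Data.Char (toUpper)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Numeric (showHex)

-- Render each byte as 0xNN (uppercase hex, zero-padded to two digits).
hexBytes :: B.ByteString -> String
hexBytes = unwords . map fmt . B.unpack
  where
    fmt w = "0x" ++ pad (map toUpper (showHex w ""))
    pad s = replicate (2 - length s) '0' ++ s

-- "Martín", with the accented letter given as a numeric escape.
martin :: T.Text
martin = T.pack "Mart\237n"

main :: IO ()
main = do
  -- prints: 0x4D 0x61 0x72 0x74 0xC3 0xAD 0x6E
  putStrLn (hexBytes (encodeUtf8 martin))
  -- prints: 7 (one byte more than the six bytes found in the file)
  print (B.length (encodeUtf8 martin))
```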

And like Gregory Collins says, different UTF-8 decoders may handle 
errors differently. Some abort. Some others fill in a special character.
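If neither aborting nor substituting U+FFFD is what you want, a third option is to decode strictly and fall back to another encoding on failure. The sketch below assumes the fallback should be ISO-8859-1, which is only a guess about files like this one; `decodeUtf8OrLatin1` is a hypothetical helper, not part of the text package. `decodeUtf8'` reports failure as a `Left` value instead of throwing.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- Hypothetical helper: strict UTF-8 first, ISO-8859-1 as a fallback.
-- The fallback cannot fail, because every byte is a valid ISO-8859-1
-- character, so a wrong guess yields mojibake rather than an error.
decodeUtf8OrLatin1 :: B.ByteString -> T.Text
decodeUtf8OrLatin1 bytes =
  case decodeUtf8' bytes of
    Right text -> text
    Left _     -> decodeLatin1 bytes

main :: IO ()
main = do
  -- The ISO-8859-1 bytes from the file: UTF-8 decoding fails, fallback is used.
  print (decodeUtf8OrLatin1 (B.pack [0x4D, 0x61, 0x72, 0x74, 0xED, 0x6E]))
  -- The same name correctly encoded as UTF-8: decoded directly.
  print (decodeUtf8OrLatin1 (B.pack [0x4D, 0x61, 0x72, 0x74, 0xC3, 0xAD, 0x6E]))
```

Both calls produce the same Text value, "Martín".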
