Martin Hilbig | 11 Feb 14:56 2013

Text.JSON and utf8

hi,

tl;dr: i propose this patch to Text/JSON/String.hs and would like to
know why it is needed:

 @@ -375,7 +375,7 @@
    where
    go s1 =
      case s1 of
-      (x   :xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
+      (x   :xs) | x < '\x20' -> '\\' : encControl x (go xs)
        ('"' :xs)              -> '\\' : '"'  : go xs
        ('\\':xs)              -> '\\' : '\\' : go xs
        (x   :xs)              -> x    : go xs

i recently stumbled upon CouchDB telling me i'm sending invalid json.

i basically read lines from a utf8 file with german umlauts and send
them to CouchDB using Text.JSON and Database.CouchDB.

   $ file lines.txt
   lines.txt: UTF-8 Unicode text

let's take 'ö' as an example. i use LANG=de_DE.utf8

ghci tells me:

 > 'ö'
'\246'


Gregory Collins | 11 Feb 15:14 2013

Re: Text.JSON and utf8

Don't use the json package; use aeson instead. (It's much faster and handles encoding issues correctly.)

G


On Mon, Feb 11, 2013 at 2:56 PM, Martin Hilbig <lists <at> mhilbig.de> wrote:
hi,

tl;dr: i propose this patch to Text/JSON/String.hs and would like to
know why it is needed:

@@ -375,7 +375,7 @@
   where
   go s1 =
     case s1 of
-      (x   :xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
+      (x   :xs) | x < '\x20' -> '\\' : encControl x (go xs)
       ('"' :xs)              -> '\\' : '"'  : go xs
       ('\\':xs)              -> '\\' : '\\' : go xs
       (x   :xs)              -> x    : go xs


i recently stumbled upon CouchDB telling me i'm sending invalid json.

i basically read lines from a utf8 file with german umlauts and send
them to CouchDB using Text.JSON and Database.CouchDB.

  $ file lines.txt
  lines.txt: UTF-8 Unicode text

let's take 'ö' as an example. i use LANG=de_DE.utf8

ghci tells me:

> 'ö'
'\246'

> putChar '\246'
ö

> putChar 'ö'
ö

> :m + Text.JSON Database.CouchDB
> runCouchDB' $ newNamedDoc (db "foo") (doc "bar") (showJSON $ toJSObject [("test","ö")])
*** Exception: HTTP/1.1 400 Bad Request
Server: CouchDB/1.2.1 (Erlang OTP/R15B03)
Date: Mon, 11 Feb 2013 13:24:49 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 48
Cache-Control: must-revalidate

couchdb log says:

  Invalid JSON: {{error,{10,"lexical error: invalid bytes in UTF8 string.\n"}},<<"{\"test\":\"<F6>\"}">>}

this is indeed hex ö:

> :m + Numeric
> putChar $ toEnum $ fst $ head $ readHex "f6"
ö

if i apply the above patch and reinstall JSON and CouchDB, the doc
creation works:

> runCouchDB' $ newNamedDoc (db "db") (doc "foo") (showJSON $ toJSObject [("test", "ö")])
Right someRev

but i don't get back the ö i expected:

> Just (_,_,x) <- runCouchDB' $ getDoc (db "foo") (doc "bar") :: IO (Maybe (Doc,Rev,JSObject String))
> let Ok y = valFromObj "test" =<< readJSON x :: Result String
> y
"\195\188"
> putStrLn y
ü

apparently everything works fine with curl:

$ curl localhost:5984/db/foo -XPUT -d '{"test": "ö"}'
{"ok":true,"id":"foo","rev":"someOtherRev"}
$ curl localhost:5984/db/foo
{"_id":"bars","_rev":"someOtherRev","test":"ö"}

so how can i get my precious ö back? what am i doing wrong, or does Text.JSON need another patch?

another question: why does encControl in Text/JSON/String.hs handle the
cases x < '\x100' and x < '\x1000' even though they can never be
reached with the old predicate in encJSString (x < '\x20')?

finally: is '\x7e' the right literal for the job?
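
For context on that literal: '\x7e' is '~', the last printable ASCII character, so a guard of the form x < '\x20' || x > '\x7e' escapes everything outside printable ASCII. A minimal sketch of that style of escaping follows. It is not the json package's actual code; `escapeJSON` and `hex4` are hypothetical names, and code points above 0xFFFF (which would need a surrogate pair) are deliberately not handled.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- | Hypothetical helper: escape a String for use inside a JSON string
-- literal, writing every character outside printable ASCII as \uXXXX.
escapeJSON :: String -> String
escapeJSON = concatMap esc
  where
    esc '"'  = "\\\""
    esc '\\' = "\\\\"
    esc c
      | c < '\x20' || c > '\x7e' = "\\u" ++ hex4 (ord c)
      | otherwise                = [c]
    -- Left-pad to four hex digits. Assumes the code point is <= 0xFFFF;
    -- larger values would need a surrogate pair, which is omitted here.
    hex4 n = let h = showHex n "" in replicate (4 - length h) '0' ++ h

main :: IO ()
main = putStrLn (escapeJSON "\246")  -- prints \u00f6
```

Output escaped this way is pure ASCII, so it stays valid regardless of which byte encoding is applied afterwards.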

thanks for reading

have fun
martin

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe



--
Gregory Collins <greg <at> gregorycollins.net>

Iavor Diatchki | 16 Feb 18:59 2013

Re: Text.JSON and utf8

Hello Martin,

the change that you propose seems to already be in json-0.7. Perhaps you just need to 'cabal update' and install the most recent version?

About your other question: I have not used CouchDB, but a common mistake is to mix up strings and bytes. Perhaps the `getDoc` function does not do utf-8 decoding and so it is giving you back a list of bytes (as a String)?

In general, the JSON package only converts between JSON and String, and is agnostic to what encoding is used to represent the strings. There are other packages that convert Strings into bytes (e.g., http://hackage.haskell.org/package/utf8-string), so typically you want to encode the string to bytes before you export it (say, to CouchDB), and decode it back into a string just after you've imported it.
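
The encode-before-export, decode-after-import round trip described above can be sketched with nothing but base. This is only an illustration under stated assumptions, not the utf8-string implementation: `utf8Encode` and `utf8Decode` are hypothetical names, each byte is represented as a Char below 256, and malformed input is not detected.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Encode Unicode code points as UTF-8, one Char per output byte.
utf8Encode :: String -> String
utf8Encode = concatMap (map chr . enc . ord)
  where
    enc c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Decode well-formed UTF-8 (one Char per byte) back to code points.
-- No validation is performed; this is a sketch, not a robust decoder.
utf8Decode :: String -> String
utf8Decode = go . map ord
  where
    go [] = []
    go (b:bs)
      | b < 0x80  = chr b : go bs
      | b < 0xE0  = multi 1 (b .&. 0x1F) bs
      | b < 0xF0  = multi 2 (b .&. 0x0F) bs
      | otherwise = multi 3 (b .&. 0x07) bs
    -- Fold n continuation bytes into the bits taken from the lead byte.
    multi n hi bs =
      let (cs, rest) = splitAt n bs
          cp = foldl (\acc x -> (acc `shiftL` 6) .|. (x .&. 0x3F)) hi cs
      in chr cp : go rest

main :: IO ()
main = do
  print (utf8Encode "\246")      -- "\195\182", the two UTF-8 bytes of 'ö'
  print (utf8Decode "\195\188")  -- "\252", i.e. the single character 'ü'
```

Under these assumptions, a two-element String such as "\195\188" coming back from the database is one UTF-8-encoded character that has not been decoded yet.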

-Iavor