nadine.and.henry | 13 Oct 16:37 2012

character encoding problems with hxt (I think)


Dear Haskellers,

I'm trying to write some code that grabs countries and provinces from the
iso_3166 files on Linux systems.  I seem to be running into some kind of
character encoding problem.  file says iso_3166_2.xml is a utf8 file, and
isutf8 agrees, but when I run the following code, it crashes.

utf8Copy makes a byte-for-byte copy, as expected.
noCrash reads and writes the document without crashing, but the accented
characters in the strings show up garbled.  Just search for "DE" and you'll
see what I mean.  crash (on my system, Debian testing) produces the error
message below.

Can anyone enlighten me on what is going on?

Thanks in advance.
Henry Laxen

------------------------------------------------------------------------
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core
import Data.List
import qualified System.IO.UTF8 as U

isoFile = "/usr/share/xml/iso-codes/iso_3166_2.xml"

countZerosInLines = length . filter (\x -> x == '0') . concat

utf8Copy = do
(Continue reading)

Uwe Schmidt (FH Wedel | 14 Oct 13:51 2012

Re: character encoding problems with hxt (I think)

Hi Henry,

it's not an encoding error but an error concerning the validation
of the document.

The document "/usr/share/xml/iso-codes/iso_3166_2.xml"
is not valid with respect to its internal DTD.
The opening tag

<iso_3166_country code="GH">

on line 3711
does not have a closing tag. That closing tag should be inserted
directly in front of

    <!-- Greenland -->
<iso_3166_country code="GL" />

on line 3735. Furthermore, the "/" that marks the "GL" element
as empty must be removed.

So it's not an error in HXT, but HXT has found an error in the
input file. The input file is not valid XML.
Solution: Turn off validation and live with the wrong structure
of that document, or correct the XML file (and give the
maintainers of that file a bug report).
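
For the first option, here is a minimal sketch of reading the file with
validation turned off. It assumes the hxt package is installed; the helper
name countryCodes is made up for this example and is not part of the
original code. It uses multi rather than deep so that country elements
are still found even if the tree structure is wrong and one ends up
nested inside another.

```haskell
import Text.XML.HXT.Core

isoFile :: FilePath
isoFile = "/usr/share/xml/iso-codes/iso_3166_2.xml"

-- Hypothetical helper: collect the "code" attribute of every
-- iso_3166_country element, without validating against the DTD.
countryCodes :: FilePath -> IO [String]
countryCodes path =
  runX $
    readDocument [withValidate no] path        -- parse only, skip DTD validation
    >>> multi (isElem >>> hasName "iso_3166_country")
    >>> getAttrValue "code"

main :: IO ()
main = countryCodes isoFile >>= mapM_ putStrLn
```

With validation off, HXT only requires the document to parse, so a
structural error like the one described above is passed through to your
code rather than reported.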

That said, the error message HXT gave you leaves
some room for improvement.
