wren ng thornton | 28 Apr 23:14 2013

Space leak in hexpat-0.20.3/List-0.5.1

Hello all,

So I'm processing a large XML file which is a database of about 170k
entries, each of which is a reasonable enough size on its own, and I only
need streaming access to the database (basically printing out summary data
for each entry). Excellent, sounds like a job for SAX.

However, after whipping up a simplified version of the program using
hexpat, there's a space leak. Near as I can tell, it's not a problem with
my code, it's a problem with Data.List.Class (or hexpat's use thereof).
The simplified code follows, just compile it for profiling and use hp2ps
to see what I mean. The file I'm running it on can be found at:

    ftp://ftp.monash.edu.au/pub/nihongo/JMdict.gz

Any ideas on what the problem really is, or how to fix it?

----------------------------------------------------------------
----------------------------------------------------------------
module JMdict (main) where

import           Text.XML.Expat.SAX   (SAXEvent(..))
import qualified Text.XML.Expat.SAX   as SAX
import           Text.XML.Expat.Tree  (NodeG(..))
import qualified Text.XML.Expat.Tree  as DOM
import qualified Text.XML.Expat.Proc  as Proc
import qualified Text.XML.Expat.Internal.NodeClass as Node
import qualified Data.ByteString.Lazy as BL
import           Data.Char            (isSpace)
import           Data.Text.IO         as TIO

oleg | 1 May 07:57 2013

Re: Space leak in hexpat-0.20.3/List-0.5.1


Wren Thornton wrote:
> So I'm processing a large XML file which is a database of about 170k
> entries, each of which is a reasonable enough size on its own, and I only
> need streaming access to the database (basically printing out summary data
> for each entry). Excellent, sounds like a job for SAX.

Indeed a good job for a SAX-like parser. XMLIter is exactly such a
parser, and it generates an event stream quite like that of Expat. Also,
your application is somewhat similar to the following:
        http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs

So, it superficially seems XMLIter should be up for the task. Can you
explain which elements you are counting? BTW, xml_enum already checks
for the well-formedness of XML (including the start-end tag
balance, and many more criteria). One can assume that the XMLStream
corresponds to the well-formed document and only count the desired
start tags (or end tags, for that matter).
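
To illustrate counting only the desired start tags, here is a minimal
sketch. It does not use XMLIter or hexpat: the Event type, the
countStartTags function, and the "entry" tag name are all hypothetical
stand-ins for whatever event stream the real parser produces. The point
is only that a strict left fold consumes the events one at a time and
keeps nothing but the running count.

```haskell
module Main (main) where

import Data.List (foldl')

-- Hypothetical stand-in for a SAX-style event type (an assumption for
-- illustration, not the type exported by hexpat or XMLIter).
data Event
  = StartTag String
  | EndTag String
  | Chars String
  deriving (Eq, Show)

-- Count the start tags with the given name. foldl' forces the
-- accumulator at every step, so no chain of (+1) thunks builds up and
-- already-consumed events can be garbage collected.
countStartTags :: String -> [Event] -> Int
countStartTags name = foldl' step 0
  where
    step n (StartTag t) | t == name = n + 1
    step n _                        = n

-- A tiny hand-written event stream, assumed to be well-formed.
sample :: [Event]
sample =
  [ StartTag "JMdict"
  , StartTag "entry", Chars "a", EndTag "entry"
  , StartTag "entry", EndTag "entry"
  , EndTag "JMdict"
  ]

main :: IO ()
main = print (countStartTags "entry" sample)
```

Because the stream is assumed to be well-formed, the end tags are
simply skipped rather than matched against the start tags.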
