Christian Maeder | 20 Feb 12:30 2014

haskell xml parsing for larger files?

Hi,

I'm having difficulty parsing "large" XML files (> 100 MB).
A plain SAX parser, as provided by hexpat, is fine. However,
constructing a tree consumes too much memory on a 32-bit machine.

see http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248

I suspect that sharing strings when constructing trees might greatly 
reduce memory requirements. What are suitable libraries for string pools?

Before trying to implement something myself, I'd like to ask who else
has tried to process large XML files (and met similar memory problems).

I have not yet investigated xml-conduit and hxt for our purpose. (These 
look scary.)

In fact, I've basically used the content trees from "The (simple) xml
package", and switching to another tree type is no fun, particularly if
it does not gain much.

Thanks Christian
Mathieu Boespflug | 20 Feb 16:06 2014

Re: haskell xml parsing for larger files?

Hi Christian,

as regards your question about sharing strings, there are a number of
libraries on Hackage to achieve this, e.g. in the context of compiler
symbols. To cite only a few: intern, stringtable-atom, simple-atom.
I'm sure there are others.

Best,
--
Mathieu Boespflug
Founder at http://tweag.io.
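
The string-sharing idea discussed above can be sketched without any of the
named libraries. The following is only an illustration using Data.Map from
containers; the names Pool, share, Tree and shareTree are hypothetical and are
not the API of intern, stringtable-atom or simple-atom. It keeps the first copy
of every distinct string and makes later occurrences point at that copy.

```haskell
import qualified Data.Map.Strict as M
import Data.List (mapAccumL)

-- | Hypothetical string pool: each distinct string maps to its first-seen copy.
type Pool = M.Map String String

-- | Return the pooled copy of a string, adding the string to the pool if new.
share :: Pool -> String -> (Pool, String)
share pool s = case M.lookup s pool of
  Just canonical -> (pool, canonical)       -- reuse the copy already stored
  Nothing        -> (M.insert s s pool, s)  -- first occurrence becomes canonical

-- | Simplified stand-in for an XML element tree (assumption: names only).
data Tree = Node String [Tree]
  deriving (Eq, Show)

-- | Rebuild a tree so that equal element names share one String in memory.
shareTree :: Pool -> Tree -> (Pool, Tree)
shareTree pool (Node name kids) =
  let (pool1, name') = share pool name
      (pool2, kids') = mapAccumL shareTree pool1 kids
  in  (pool2, Node name' kids')
```

To lower peak memory, share would have to be applied while the tree is being
built (for example in the handler of each SAX event), not to a finished tree as
shareTree does here. The lookup itself costs a comparison per string, so
whether this pays off depends on how repetitive the names and attribute values
are.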


malcolm.wallace | 20 Feb 16:49 2014

Re: haskell xml parsing for larger files?

Is your usage pattern over the constructed tree likely to be a lazy prefix traversal?  If so, then HaXml supports lazy construction of the parse tree.  Some plots appear at the end of this paper, showing how memory usage can be reduced to a constant, even for very large inputs (1 million tree nodes):

http://www.cs.york.ac.uk/plasma/publications/pdf/partialparse.pdf
Regards, Malcolm
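
As a rough illustration of what "lazy construction" plus "prefix traversal"
means, here is a minimal sketch. It is not HaXml's API: the Event and Tree
types and the functions forest and labels are hypothetical stand-ins. The tree
is built from a stream of parse events only as far as the consumer looks, so a
traversal that reads a prefix never forces the rest of the input.

```haskell
-- | Hypothetical SAX-like events (assumption: names and text only).
data Event = Open String | Close | Text String

-- | Hypothetical tree type built from the events.
data Tree = Node String [Tree] | Leaf String

-- | Lazily turn a stream of events into a forest plus the unconsumed events.
-- The let-bound tuple patterns are lazy, so children and siblings are only
-- parsed when something demands them.
forest :: [Event] -> ([Tree], [Event])
forest (Open n : es) =
  let (kids, rest)  = forest es
      (sibs, rest') = forest rest
  in  (Node n kids : sibs, rest')
forest (Text t : es) =
  let (sibs, rest) = forest es
  in  (Leaf t : sibs, rest)
forest (Close : es) = ([], es)
forest []           = ([], [])

-- | Prefix (pre-order) traversal yielding element names and text.
labels :: Tree -> [String]
labels (Node n kids) = n : concatMap labels kids
labels (Leaf t)      = [t]
```

Because of the laziness, the sketch even works on an unbounded event stream:
with events = Open "root" : cycle [Open "a", Text "x", Close], taking the first
few labels of the root terminates. Whether memory really stays constant depends
on the consumer dropping its references to the part of the tree already
visited.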

Christian Maeder | 20 Feb 17:02 2014

Re: haskell xml parsing for larger files?

I'm afraid our use case is not a lazy prefix traversal.
I'm more shocked that about 100 MB of XML content does not fit (as a tree)
into 3 GB of memory.

Christian
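
A back-of-the-envelope estimate makes the gap between 100 MB and 3 GB less
surprising. The sketch below assumes that text is stored as String (a linked
list of Char), that roughly all 100 MB of input end up in such strings, and
that GHC needs 3 machine words per list cell plus 2 more for each Char that is
not a shared, preallocated one. The function and constant names are
hypothetical, and tree-node overhead is ignored.

```haskell
-- | Bytes per machine word on a 32-bit platform.
wordBytes :: Integer
wordBytes = 4

-- | Assumed input size: 100 MB, one character per byte.
inputChars :: Integer
inputChars = 100 * 1000 * 1000

-- | Heap bytes for n characters held as String, given words per character.
stringBytes :: Integer -> Integer -> Integer
stringBytes wordsPerChar n = n * wordsPerChar * wordBytes

-- | 3 words per character: list cell only, Char values shared.
bestCase :: Integer
bestCase = stringBytes 3 inputChars    -- 1200000000 bytes, about 1.2 GB

-- | 5 words per character: list cell plus a separately boxed Char.
worstCase :: Integer
worstCase = stringBytes 5 inputChars   -- 2000000000 bytes, about 2.0 GB

-- | A copying collector can transiently need about twice the live data.
peakDuringGC :: Integer -> Integer
peakDuringGC live = 2 * live
```

Under these assumptions the strings alone take 1.2 to 2.0 GB, and a major
collection could need 2.4 to 4.0 GB for a moment, which is in the range of the
3 GB mentioned above. Packed representations such as ByteString or Text, or
sharing repeated strings, attack exactly this per-character overhead.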

Mateusz Kowalczyk | 20 Feb 17:42 2014

Re: haskell xml parsing for larger files?


HXT will not work for you; you will run out of memory on files of about
30 MB. I don't know about xml-conduit, but I'd love to hear how it goes if
you try it.

-- 
Mateusz K.
