mharper3 | 4 May 03:14

(no subject)

Hi lxml-dev:

I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following (minimal
reproduction) code:

<code>
import lxml.etree

wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
context = lxml.etree.iterparse(wiki_xml_filename, events=("end",))
for action, elem in context:
    pass
</code>

The crash usually occurs about halfway through the file (around <page> 3,000,000). The same code runs on
smaller MediaWiki XML files (200 MB) without error. I only get this error for this very large XML file (in
this case about 13 GB uncompressed). I had no trouble parsing the same file with the Python standard
library SAX parser, but it is much slower and I don't like its API.

I'm using libxml2-2.6.32 (also used earlier versions), python 2.5.2, python-lxml 2.0.5 (also tried
earlier versions), Kubuntu 8.04 with 2.6.24 kernel (also tested on opensuse 10.3 with earlier kernel).

Some of the exceptions are MemoryErrors. The machine running the code has 4gb of ram. The kernel does not
appear to significantly hit the swap during the run.

Here are the errors:

*** glibc detected *** python: free(): invalid pointer: 0x08220a15 ***
Aborted


Stefan Behnel | 4 May 07:34

Re: (no subject)

Hi,

mharper3 <at> uiuc.edu wrote:
> I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following (minimal
> reproduction) code:
> 
> <code>
> import lxml.etree
> 
> wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
> context = lxml.etree.iterparse(wiki_xml_filename, events=("end",))
> for action, elem in context:
>     pass
> </code>
> 
> The crash usually occurs about halfway through the file (around <page>
> 3,000,000). The same code runs on smaller mediawiki xml files (200 mb)
> without error. I only get this error for this very large xml file (in this
> case about 13gb uncompressed). I had no trouble parsing the same file with
> the python standard library sax parser, but it is much slower and I don't
> like its api.
>
> Some of the exceptions are MemoryErrors. The machine running the code has
> 4gb of ram. The kernel does not appear to significantly hit the swap during
> the run.

iterparse() builds a tree in memory, so parsing a 13gb file on a 4gb RAM
machine will fail - *unless* you clean up the parts of the tree that you no
longer need.
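
A minimal sketch of that kind of cleanup, under a few assumptions: the
helper name count_elements and the tiny in-memory sample are made up for
illustration, and the per-element "work" is just a counter. In the real
dump the <page> elements most likely live in a MediaWiki export namespace,
so the tag would need the "{namespace-uri}page" form.

```python
import io
import lxml.etree

def count_elements(source, tag):
    """Stream through `source`, count `tag` elements, and free each one.

    Hypothetical helper for illustration; replace the counter with real work.
    """
    count = 0
    # events must be a tuple, hence the trailing comma in ("end",)
    context = lxml.etree.iterparse(source, events=("end",), tag=tag)
    for event, elem in context:
        count += 1        # per-element processing would go here
        elem.clear()      # drop this element's text, attributes and children
        # The parent still references the siblings that came before this
        # element, so delete them from the front of the parent as well.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return count

# Small stand-in document (an assumption, not the real dump layout).
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
print(count_elements(sample, "page"))  # prints 2
```

Calling elem.clear() alone leaves the emptied elements attached to the
root, which still adds up over millions of pages; deleting the preceding
siblings keeps the in-memory tree to roughly one <page> at a time.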
