Moshe Cohen | 29 Aug 01:42

very long files with many XML entity refs

I have a sample XML file which contains <text>&#135;&#135; ... </text> with 8,000,000 (eight million) repetitions of '&#135;'.

A test program for loading it and then writing it is:

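import sys
#import cElementTree as ET
from lxml import etree as ET

# Parse the file named on the command line, then serialize it again.
# No encoding is passed to write(), so the serializer's default is used.
with open(sys.argv[1], 'rb') as f:
    et = ET.ElementTree(file=f)
et.write('ooo')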

When it is run with cElementTree, it completes successfully in about 1 minute.
When it is run with lxml, it does not complete even after 12 hours, and the process sits constantly at 100% CPU.
Further testing showed that it reaches the 'write' statement quite quickly and gets stuck there.

Is this a bug, or is lxml just dead slow relative to cElementTree for this action?

Notes:
1) Nothing special about '&#135;', it is just a simple sample with the same character repeating. The original problem showed up with a long file of various entity refs (some encoding of binary data).
2) Testing with shorter files (thousands of characters) showed similar speed for cElementTree and lxml.
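
For anyone wanting to reproduce this at different sizes, here is a minimal sketch of a generator for such a sample file. The function name write_sample, its parameters, and the chunk size are illustrative assumptions, not part of the original test setup.

```python
def write_sample(path, repetitions, charref="&#135;"):
    """Write <text>CHARREF...CHARREF</text> to path and return the byte count.

    The content is written in chunks so that a large repetition count does
    not require building the whole document in memory first.
    """
    unit = charref.encode("ascii")
    chunk_size = 10_000  # arbitrary chunk size, chosen for illustration
    full_chunks, remainder = divmod(repetitions, chunk_size)
    written = 0
    with open(path, "wb") as out:
        written += out.write(b"<text>")
        for _ in range(full_chunks):
            written += out.write(unit * chunk_size)
        written += out.write(unit * remainder)
        written += out.write(b"</text>")
    return written


if __name__ == "__main__":
    import sys

    # Usage: python make_sample.py OUTPUT_FILE REPETITIONS
    print(write_sample(sys.argv[1], int(sys.argv[2])), "bytes written")
```

Running it with a few thousand repetitions and then with several million should cover both the fast case and the slow case described above.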

TIA
Moshe

Stefan Behnel | 29 Aug 09:24

Re: very long files with many XML entity refs

Moshe Cohen wrote:
> I have a sample XML file which  contains <text>&#135;&#135; .... </text>
> with 8,000,000 (eight million) repetitions of '&#135'.
>
> A test program for loading it and then writing it is:
>
> import sys
> #import cElementTree as ET
> from lxml import etree as ET
> f=open(sys.argv[1])
> et = ET.ElementTree(file = f)
> et.write('ooo')
>
> When it is run with cElementTree , it completes successfully in about 1
> minute.

*That* is slow. :)

> When it is run with lxml, it does not complete, even after 12 hours!!!
> and the process is constantly at 100% CPU.
> Further testing showed it reaches the 'write' statement quite fast and is
> stuck in there.
>
> Is this a bug or is lxml just dead slow relative to cElementTree , for
> this action?
>
> Notes:
> 1) Nothing special about '&#135;', it is just a simple sample with the
> same character repeating. The original problem showed up with a long file
> of various entity refs (some encoding of binary data).

Well, yes, there is something special about &#135;: it is not ASCII, but
you are serializing to "US-ASCII" (you pass no encoding to write()), which
means that libxml2 has to write every non-ASCII character as a numeric
character reference.

According to timeit, writing the file out as UTF-8 takes 173 milliseconds
on my machine:

    et.write("eout.xml", encoding="UTF-8")
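
The difference between the two encodings can be seen on a tiny document. The following is only a sketch and uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that the encoding parameter has the same effect in both for this case. The helper name serialize is made up for illustration.

```python
import xml.etree.ElementTree as ET


def serialize(xml_source, encoding):
    """Parse xml_source and return the tree serialized as bytes."""
    root = ET.fromstring(xml_source)
    return ET.tostring(root, encoding=encoding)


sample = "<text>&#135;&#135;</text>"

# US-ASCII cannot represent U+0087, so each character has to be written
# back out as a numeric character reference.
ascii_bytes = serialize(sample, "us-ascii")

# UTF-8 can represent U+0087 directly (two bytes per character), so no
# character references are needed.
utf8_bytes = serialize(sample, "utf-8")

print(ascii_bytes)  # b'<text>&#135;&#135;</text>'
print(utf8_bytes)   # b'<text>\xc2\x87\xc2\x87</text>'
```

With the ASCII output, the serializer has to escape every single character of the text, which is the code path that is slow here.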

The complete I/O cycle runs in about two seconds on my machine (after
warm-up), which is a lot faster than one minute :)

However, I do agree that the charref encoding in your example seems to be
impressively slow in libxml2, and I have no idea why. It looks like a bug
to me. I'll ask on the libxml2 list.
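
One way to check whether the charref path scales worse than linearly is to time both encodings at a few sizes. This is a rough sketch, again using the standard library's xml.etree.ElementTree as a stand-in, so its numbers say nothing about libxml2 itself. The same harness could be pointed at lxml by swapping the import, assuming it is installed. The names build_tree and time_serialization are hypothetical.

```python
import timeit
import xml.etree.ElementTree as ET


def build_tree(repetitions):
    """Build <text>...</text> holding `repetitions` copies of U+0087."""
    root = ET.Element("text")
    root.text = "\x87" * repetitions
    return ET.ElementTree(root)


def time_serialization(tree, encoding, repeat=3):
    """Return the best wall-clock time in seconds for one serialization."""
    root = tree.getroot()
    timings = timeit.repeat(
        lambda: ET.tostring(root, encoding=encoding),
        number=1,
        repeat=repeat,
    )
    return min(timings)


if __name__ == "__main__":
    # If the time for "us-ascii" grows much faster than the input size,
    # that points at a non-linear algorithm in the escaping code.
    for n in (1_000, 10_000, 100_000):
        tree = build_tree(n)
        for encoding in ("us-ascii", "utf-8"):
            seconds = time_serialization(tree, encoding)
            print(f"{n:>8} chars  {encoding:<8}  {seconds:.6f} s")
```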

Stefan
