Johannes | 15 Dec 16:50 2011

iteratively parsing HTML with lxml

I recently had a situation where I needed to parse a very large (50MB) HTML document in an environment with limited memory. At first I thought that lxml.etree.iterparse was going to be the wrong tool for the job, as etree is for parsing XML, but I saw that it accepts an html parameter, which gave me some hope.


Unfortunately, when I tried to parse the document, XMLSyntaxErrors were raised, specifically an error caused by an element having the same attribute defined twice.

Is it not possible to iteratively parse HTML with the same flexibility as the standard lxml HTML parser?

In its current state, the only way I could get the document to parse was to delete each line that threw a syntax error and then start parsing again from scratch. Hardly ideal.

On the other hand, the stdlib HTMLParser dealt with the document without raising any exceptions. I'd love to be able to use lxml.etree.iterparse with lxml's XPath support though.
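
For reference, the stdlib approach can be sketched roughly as follows. This is a minimal illustration, not the code actually used on the 50MB document: it assumes Python 3, where the module is html.parser (in Python 2 it was HTMLParser), and the names TagCounter, parse_in_chunks and the sample markup are hypothetical. It feeds the input in small chunks so the whole document never has to sit in memory, and it records elements that repeat an attribute name instead of failing on them.

```python
import io
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical handler: counts start tags and notes repeated attributes."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.duplicate_attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; repeated names are kept as-is.
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        names = [name for name, _ in attrs]
        repeated = sorted({n for n in names if names.count(n) > 1})
        if repeated:
            self.duplicate_attrs.append((tag, repeated))


def parse_in_chunks(stream, chunk_size=64 * 1024):
    """Feed a text stream to the parser one chunk at a time."""
    parser = TagCounter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The parser buffers incomplete tags until the next feed() call.
        parser.feed(chunk)
    parser.close()
    return parser


# Made-up sample with a duplicated attribute and an unclosed <p>.
sample = io.StringIO(
    '<html><body><p class="a" class="b">one</p><p>two</body></html>'
)
result = parse_in_chunks(sample, chunk_size=16)
```

With a real file, the StringIO object would be replaced by an open file handle. Note that this gives no tree and therefore no XPath, which is exactly the limitation mentioned above.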

Any clarification would be great!

Johannes
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Stefan Behnel | 15 Dec 17:19 2011

Re: iteratively parsing HTML with lxml

Johannes, 15.12.2011 16:50:
> I recently had a situation where I needed to parse a very large (50MB) HTML
> document, in an environment with memory use limitations. At first I thought
> that lxml.etree.iterparse was going to be the wrong tool for the job, as
> etree is for parsing XML, but I saw that there was a html parameter, so
> that gave me some hope.
>
> Unfortunately, when I tried to parse the document I got XMLSyntaxErrors
> being raised. Specifically an error due to an element having the same
> attribute defined twice.
>
> Is it not possible to iteratively parse HTML with the same flexibility as
> the standard lxml HTML parser?

There is currently no "recover" option for iterparse(), so, no, you can't
parse broken HTML with it. I'm not sure whether libxml2 applies its recovery
mechanisms before or after the SAX level, so it may or may not be possible
to implement that. However, recovery likely happens before the SAX interface
is called, which is what lxml uses for iterparse(), so it may be easy after
all. Feel free to give it a try.

Stefan