Re: iteratively parsing HTML with lxml
Stefan Behnel <stefan_ml <at> behnel.de>
2011-12-15 16:19:02 GMT
Johannes, 15.12.2011 16:50:
> I recently had a situation where I needed to parse a very large (50MB) HTML
> document, in an environment with memory use limitations. At first I thought
> that lxml.etree.iterparse was going to be the wrong tool for the job, as
> etree is for parsing XML, but I saw that there was an html parameter, so
> that gave me some hope.
> Unfortunately, when I tried to parse the document I got XMLSyntaxErrors
> being raised. Specifically, an error due to an element having the same
> attribute defined twice.
> Is it not possible to iteratively parse HTML with the same flexibility as
> the standard lxml HTML parser?
There is currently no "recover" option, so, no, you can't parse broken HTML
with iterparse(). I'm not sure whether libxml2 applies its recovery
mechanisms before or after the SAX level, so it may or may not be possible
to implement one. However, it's likely that recovery happens before the SAX
callbacks, which is the interface lxml's iterparse() builds on, so it may be
easy after all. Feel free to give it a try.
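In the meantime, one workaround is lxml's target parser interface: etree.HTMLParser() accepts a target object and can be driven with feed()/close(), so the recovering HTML parser delivers SAX-like events without building the whole tree. A sketch, assuming the Collector class and the sample input (both invented for illustration):

```python
from lxml import etree

class Collector:
    """Parser target: receives events instead of building a full tree."""
    def __init__(self):
        self.events = []
    def start(self, tag, attrib):
        self.events.append(("start", tag))
    def end(self, tag):
        self.events.append(("end", tag))
    def data(self, text):
        pass  # discard character data to keep memory use low
    def close(self):
        return self.events

# Duplicate 'id' attribute: invalid, but the HTML parser recovers.
broken = b'<html><body><div id="a" id="b">text</div></body></html>'

parser = etree.HTMLParser(target=Collector())
parser.feed(broken)       # feed() can be called repeatedly with chunks
events = parser.close()   # returns whatever Collector.close() returns
print(events)
```

Feeding the document in chunks keeps memory bounded, since the target never accumulates the tree. (Later lxml releases also added etree.HTMLPullParser for this use case.)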
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de