15 May 06:21
Re: html entities and lxml.html.ElementSoup
From: roger patterson <rogerpatterson <at> gmail.com>
Subject: Re: html entities and lxml.html.ElementSoup
Newsgroups: gmane.comp.python.lxml.devel
Date: 2008-05-15 04:21:52 GMT
Hi Viksit,

What you typed was correct, except you have to note that
lxml.html.soupparser.convert_tree(soup) returns a *list* of root elements,
so you can't just do a lxml.etree.tostring() on the list. Depending on your
HTML, choosing the first element will probably work.

I have moved to the trunk now, so am working well with the new
lxml.html.soupparser. But if you're stuck on that branch, then that
work-around worked for me.

Hope it works for you!

cheers
-Roger

2008/5/14 Viksit Gaur <viksit <at> aya.yale.edu>:
> Hi there,
>
>>Roger Patterson wrote:
>>> I'm getting an interesting situation. When using the very cool
>>> ElementSoup add-on to lxml.html with certain source-html files that
>>> already encode entities (eg. £), using the ElementSoup.parse()
>>> messes up the entities.
>
> I'm running into the same problem.
>
>>It looks like it's not the parse(), but rather the serialisation. What
>>happens is that the entity references end up in the /text/ content,
>>which is clearly wrong as it leads to re-escaping of the references on
>>the way out.
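
The two points made in this thread (serialize one element out of the list of roots, and an entity reference left in text content gets escaped again on output) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages, the list `roots` merely imitates what convert_tree(soup) is described as returning, and `serialize_first` is a hypothetical helper, not part of lxml.

```python
import html
import xml.etree.ElementTree as ET


def serialize_first(roots):
    """Serialize the first element of a list of root elements.

    Hypothetical helper: mirrors the "choose the first element"
    workaround, since a list itself cannot be passed to tostring().
    """
    if not roots:
        raise ValueError("no root elements to serialize")
    return ET.tostring(roots[0], encoding="unicode")


# Stand-in (assumption) for the list of roots returned by convert_tree(soup).
p = ET.Element("p")
p.text = "Price: &pound;5"  # entity reference wrongly left in the text content
roots = [p]

# The serializer escapes the "&", so the reference is escaped a second time.
escaped_twice = serialize_first(roots)

# Decoding the reference before it is stored as text avoids the problem.
p.text = html.unescape("Price: &pound;5")
decoded = serialize_first(roots)

print(escaped_twice)
print(decoded)
```

With lxml installed, the same shape should apply by taking the first item of the list before calling lxml.etree.tostring(), but that is not verified here against the branch discussed above.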