roger patterson | 15 May 06:21

Re: html entities and lxml.html.ElementSoup

Hi Viksit,

What you typed was correct, but note that
lxml.html.soupparser.convert_tree(soup) returns a *list* of root
elements, so you can't call lxml.etree.tostring() directly on the list.
Depending on your HTML, serialising just the first element will
probably work.
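
A rough sketch of that work-around, in case it is useful. It assumes a
current lxml with BeautifulSoup 4 (the bs4 package) installed; the names
first_root and soup_to_string are made up for illustration and are not
part of lxml.

```python
def first_root(roots):
    """Pick the first element from the list that convert_tree() returns."""
    if not roots:
        raise ValueError("convert_tree() returned no root elements")
    return roots[0]


def soup_to_string(html):
    """Parse HTML with BeautifulSoup, convert it to lxml, serialise the first root."""
    # Imported here so first_root() stays usable without lxml or bs4 installed.
    from bs4 import BeautifulSoup
    from lxml import etree
    from lxml.html import soupparser

    soup = BeautifulSoup(html, "html.parser")
    # convert_tree() returns a list of elements, not a single element.
    roots = soupparser.convert_tree(soup)
    # tostring() needs one element, so serialise only the first root.
    return etree.tostring(first_root(roots), encoding="unicode")
```

Calling soup_to_string() on a document with several top-level elements
serialises only the first one; loop over the list instead if you need
them all.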

I have moved to the trunk now, so I'm working happily with the new
lxml.html.soupparser.  But if you're stuck on that branch, that
work-around worked for me.  Hope it works for you!
cheers
-Roger

2008/5/14 Viksit Gaur <viksit <at> aya.yale.edu>:
> Hi there,
>
>>Roger Patterson wrote:
>>> I'm getting an interesting situation.  When using the very cool
>>> ElementSoup add-on to lxml.html with certain source-html files that
>>> already encode entities (eg. &#163;), using the ElementSoup.parse()
>>> messes up the entities.
>
> I'm running into the same problem.
>
>>It looks like it's not the parse(), but rather the serialisation. What
>>happens is that the entity references end up in the /text/ content,
>>which is clearly wrong as it leads to re-escaping of the references on
>>the way out.
>