Richard Baron Penman | 24 Aug 13:16

Text obscured by subelement

hello,

I have a document with a format like this:
<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>

I want to extract 'text1text3text5' from <doc> but the text attribute returns just 'text1'. Here is an example:

from lxml import html
doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print doc.text # 'text1'
print doc.tail # None
print doc.text_content() # 'text1text2text3text4text5'

for child in doc:
    child.drop_tree()
print doc.text # 'text1text3text5'


From the example you can see I can get what I want by first dropping the subelements.
Is there a better way to access this text?

regards,
Richard
_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
John J Lee | 25 Aug 00:03

Re: Text obscured by subelement

On Sun, 24 Aug 2008, Richard Baron Penman wrote:
>
> I have a document with a format like this:
> <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
>
> I want to extract 'text1text3text5' from <doc> but the text attribute
> returns just 'text1'. Here is an example:
>
> from lxml import html
> doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
[...]
> From the example you can see I can get what I want by first dropping the
> subelements.
> Is there a better way to access this text?
[...]

I only have 1.3.6 installed, so don't have the HTML support, but you want 
to use the .tail of the b elements I think.  With the XML API:

from lxml.etree import fromstring
doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
b1, b2 = doc.getchildren()
print doc.text + b1.tail + b2.tail
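
To make the .text/.tail split concrete, here is a minimal sketch of where each piece of text ends up in the tree. It assumes either lxml or, as a fallback, the standard library's xml.etree.ElementTree, which uses the same .text/.tail model; the name `pieces` is only for illustration.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same .text/.tail model

doc = etree.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')

# An element keeps the text before its first child in .text, and the
# text that follows its own closing tag in .tail.
pieces = [(child.tag, child.text, child.tail) for child in doc]

print(doc.text)  # text1
print(pieces)    # [('b', 'text2', 'text3'), ('b', 'text4', 'text5')]
```

So 'text3' and 'text5' are not lost; they hang off the two b elements as their tails.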

John
Piet van Oostrum | 25 Aug 00:16

Re: Text obscured by subelement

>>>>> John J Lee <jjl <at> pobox.com> (JJL) wrote:

>JJL> On Sun, 24 Aug 2008, Richard Baron Penman wrote:
>>> 
>>> I have a document with a format like this:
>>> <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
>>> 
>>> I want to extract 'text1text3text5' from <doc> but the text attribute
>>> returns just 'text1'. Here is an example:
>>> 
>>> from lxml import html
>>> doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
>JJL> [...]
>>> From the example you can see I can get what I want by first dropping the
>>> subelements.
>>> Is there a better way to access this text?
>JJL> [...]

>JJL> I only have 1.3.6 installed, so don't have the HTML support, but you want 
>JJL> to use the .tail of the b elements I think.  With the XML API:

>JJL> from lxml.etree import fromstring
>JJL> doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
>JJL> b1, b2 = doc.getchildren()
>JJL> print doc.text + b1.tail + b2.tail

print doc.text+''.join(c.tail for c in doc.getchildren())
-- 
Piet van Oostrum <piet <at> cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet <at> vanoostrum.org
Stefan Behnel | 25 Aug 04:36

Re: Text obscured by subelement


Piet van Oostrum wrote:
>>>>>> John J Lee <jjl <at> pobox.com> (JJL) wrote:
>> JJL> doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
>> JJL> b1, b2 = doc.getchildren()
>> JJL> print doc.text + b1.tail + b2.tail
> 
> print doc.text+''.join(c.tail for c in doc.getchildren())

print doc.text+''.join(c.tail for c in doc)
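
One caveat with the one-liner: .text and .tail are None when there is no text at that position, so the concatenation raises a TypeError for input such as '<doc><b>text2</b></doc>'. A hedged sketch of a more defensive variant follows; the helper name `own_text` is hypothetical, and the import falls back to the standard library's ElementTree if lxml is not available.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same .text/.tail model


def own_text(element):
    """Return the text directly inside `element`, skipping text inside its children."""
    parts = [element.text or '']  # .text is None if nothing precedes the first child
    parts.extend(child.tail or '' for child in element)  # .tail may be None as well
    return ''.join(parts)


doc = etree.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print(own_text(doc))  # text1text3text5

no_direct_text = etree.fromstring('<doc><b>text2</b></doc>')
print(repr(own_text(no_direct_text)))  # ''
```

Unlike the drop_tree() approach from the original post, this leaves the tree unmodified. With lxml itself, ''.join(doc.xpath('text()')) should give the same result, since text() selects only the text nodes that are direct children of the element.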

Stefan
