Max Ivanov | 23 Aug 09:02

.text_content() should leave spaces. Tests included

Hi! I've run into another strange behaviour. lxml.html.HTMLParser
produces html elements with similair API as Etree elements, but with
some additions. One of them is .text_content() method. Some quote from
docs:  "Returns the text content of the element, including the text
content of its children, with no markup."

So according to description it transforms
"<span>element1</span><span>element2</span>" to "element1element2".
Notice the lack of space between contents of two elements. From my
point of view, that's make this method quite useless, it would be
better if it produce "element1 element2" from same string. Here is a
test fro test_htmlparser.py:

     def test_html_text_content(self):
         from lxml.html import HTMLParser
         element = self.etree.HTML(self.html_str, parser=HTMLParser())
         self.assertEquals(element.text_content(),"test page title")
Stefan Behnel | 23 Aug 09:32

Re: .text_content() should leave spaces. Tests included

Hi,

Max Ivanov wrote:
> I've run into another strange behaviour.

I wouldn't call that "strange behaviour". What you want is a new feature.

> lxml.html.HTMLParser
> produces html elements with similair API as Etree elements, but with
> some additions. One of them is .text_content() method. Some quote from
> docs:  "Returns the text content of the element, including the text
> content of its children, with no markup."
> 
> So according to description it transforms
> "<span>element1</span><span>element2</span>" to "element1element2".
> Notice the lack of space between contents of two elements.

Exactly as in the HTML source, I would say. Given your specific example, I
don't think a browser would display it any different.

> From my
> point of view, that's make this method quite useless, it would be
> better if it produce "element1 element2" from same string. Here is a
> test fro test_htmlparser.py:
> 
>      def test_html_text_content(self):
>          from lxml.html import HTMLParser
>          element = self.etree.HTML(self.html_str, parser=HTMLParser())
>          self.assertEquals(element.text_content(),"test page title")

(Continue reading)

Max Ivanov | 23 Aug 09:57

Re: .text_content() should leave spaces. Tests included

>> So according to description it transforms
>> "<span>element1</span><span>element2</span>" to "element1element2".
>> Notice the lack of space between contents of two elements.
>
> Exactly as in the HTML source, I would say. Given your specific example, I
> don't think a browser would display it any different.
>
Maybe <span> examples are not suitable here. but .text_content() on
"<html><head><title>test</title></head><body><h1>page
title</h1></body></html>" displaying "testpage title" instead of "test
page title" is definitely wrong. Imagine what would happen with
<table> with multiple td's and tr's - it'll transform it to one big
word without spaces. Do you think that it is correct?. Easiest way
will be but spaces between content of any two tags and keep all other
symbols between tags.

>Feel free to provide a patch.
text_method is an alias for XPath("string()"). But I didn't find any
description of just plain string() function, everything I found is an
"string text()" which according to wikipedia returns text content of
elements only one level lower. So I don't understand how all that
works =)
Mike Meyer | 23 Aug 10:34
Face
Favicon

Re: .text_content() should leave spaces. Tests included

On Sat, 23 Aug 2008 11:57:19 +0400
"Max Ivanov" <ivanov.maxim <at> gmail.com> wrote:

> >> So according to description it transforms
> >> "<span>element1</span><span>element2</span>" to "element1element2".
> >> Notice the lack of space between contents of two elements.
> >
> > Exactly as in the HTML source, I would say. Given your specific example, I
> > don't think a browser would display it any different.
> >
> Maybe <span> examples are not suitable here. but .text_content() on
> "<html><head><title>test</title></head><body><h1>page
> title</h1></body></html>" displaying "testpage title" instead of "test
> page title" is definitely wrong. Imagine what would happen with
> <table> with multiple td's and tr's - it'll transform it to one big
> word without spaces. Do you think that it is correct?. Easiest way
> will be but spaces between content of any two tags and keep all other
> symbols between tags.

Easiest way to what? Fix this broken behavior? But it'll break the
correct behavior where inline tags are used to change the rendering of
elements in a word (like <span color="blue">bl</span><span
color="green">een</span>).

If you want it to look like what a browser might render, you want to
put spaces between block elements but not inline elements. Of course,
whether a particular tag is inline or not can be changed by whatever
style sheets are in use. And title - well, it's contents aren't
rendered in the contents of the page at all. So maybe they should just
vanish?
(Continue reading)

Max Ivanov | 26 Aug 11:19

Re: .text_content() should leave spaces. Tests included

The way I've implementent text_content() analog. I've no idea abouth
XPath, so maybe some of checks could be implemented as XPath
processing instruction. Thats' just scratch to show an idea, no deep
testing at all but results are ok for me.

inlinetags = [ <tags list from
http://htmlhelp.com/reference/html40/inline.html> ] #except <br>

     for el in doc.iter():
         if el.text and (el.tag not in self.inlinetags):
             el.text = ''.join((' ',el.text))
         if el.tail and (el.tag not in self.inlinetags):
             el.tail += ' '
         if el.tag == 'br':
             if el.tail and not el.tail.startswith('\n'):
                 el.tail = '\n'+el.tail
             else:
                 el.tail = '\n'
             el.drop_tag()
Stefan Behnel | 26 Aug 18:11

Re: .text_content() should leave spaces. Tests included

Hi,

Max Ivanov wrote:
> The way I've implementent text_content() analog. I've no idea abouth
> XPath, so maybe some of checks could be implemented as XPath
> processing instruction. Thats' just scratch to show an idea, no deep
> testing at all but results are ok for me.
> 
> inlinetags = [ <tags list from
> http://htmlhelp.com/reference/html40/inline.html> ] #except <br>

Note that there's lxml.html.defs, which should contain what you want here.

Also, by moving the "br" test to the top in your code above, you can just
leave it in the inlinetags set.

>      for el in doc.iter():
>          if el.text and (el.tag not in self.inlinetags):
>              el.text = ''.join((' ',el.text))
>          if el.tail and (el.tag not in self.inlinetags):
>              el.tail += ' '
>          if el.tag == 'br':
>              if el.tail and not el.tail.startswith('\n'):
>                  el.tail = '\n'+el.tail
>              else:
>                  el.tail = '\n'
>              el.drop_tag()

You're modifying the tree here, which is inacceptable for a function that
returns a (partial) string serialisation. Apart from that, this seems like a
(Continue reading)

Ian Bicking | 28 Aug 19:37
Gravatar

Re: .text_content() should leave spaces. Tests included

Max Ivanov wrote:
> The way I've implementent text_content() analog. I've no idea abouth
> XPath, so maybe some of checks could be implemented as XPath
> processing instruction. Thats' just scratch to show an idea, no deep
> testing at all but results are ok for me.

Just FYI, there's some code to translate HTML to text here: 
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py (not based on lxml).

--

-- 
Ian Bicking : ianb <at> colorstudy.com : http://blog.ianbicking.org

Gmane