Viksit Gaur | 16 May 04:57

Efficient methods to build a tree out of HTML structure?

Hi all,

I was wondering - what would be the most efficient method to access all 
the elements in the DOM tree, in some order, using lxml.etree?

The methods I currently see in the docs return an iterator like 
ElementDepthFirstIterator or iterwalk, which have two issues:

1) The first gives a flat sequence of elements, so I lose the 
child/parent structure

2) Things like iterwalk do return "start" and "end" actions - but 
instead of first doing an iterwalk and then parsing the results, is 
there a better way to construct the tree when iterwalk itself is running?

Or perhaps there is some method I've missed completely?

Quick note on what I'm trying to do - graphically represent the DOM 
structure of a page using a library like NetworkX.

Cheers,
Viksit
Stefan Behnel | 16 May 11:14

Re: Efficient methods to build a tree out of HTML structure?

Hi,

Viksit Gaur wrote:
> 2) Things like iterwalk do return "start" and "end" actions - but 
> instead of first doing an iterwalk and then parsing the results, is 
> there a better way to construct the tree when iterwalk itself is running?

I don't understand what you mean here. Are you modifying the tree during the
iteration? Or are you thinking of some kind of pipelining?

Stefan
Viksit Gaur | 16 May 11:28

Re: Efficient methods to build a tree out of HTML structure?

Hi,

Stefan Behnel wrote:
> Hi,
> 
> Viksit Gaur wrote:
>> 2) Things like iterwalk do return "start" and "end" actions - but 
>> instead of first doing an iterwalk and then parsing the results, is 
>> there a better way to construct the tree when iterwalk itself is running?
> 
> I don't understand what you mean here. Are you modifying the tree during the
> iteration? Or are you thinking of some kind of pipelining?

Hmm. The problem I faced was finding a way to assign a unique ID to each 
element on the page.

Let's say I construct an iterwalk object. But, during this phase, I would 
like to not only build the tree, but also add some of my own information 
to each node (such as a unique ID to each element). I'm not sure how to 
do this without extending the etree.so file inside which iterwalk is 
implemented.

Cheers,
Viksit

Stefan Behnel | 16 May 11:56

Re: Efficient methods to build a tree out of HTML structure?


Viksit Gaur wrote:
> The problem I faced was finding a way to assign a unique ID to each
> element on the page.
> 
> Let's say I construct an iterwalk object. But, during this phase, I would
> like to not only build the tree, but also add some of my own information
> to each node (such as a unique ID to each element).

I still don't understand what you mean by "build the tree". You can't
construct a tree and run iterwalk at the same time. iterparse() will do that
in case you are parsing.
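
A minimal sketch of that, assuming the input is well-formed markup in a
file-like object. SAMPLE, parse_with_ids and the ids side table are made-up
names for illustration, and the fallback import is only there so the snippet
also runs where lxml is not installed:

```python
import io
from itertools import count

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module offers the same basic iterparse() interface.
    import xml.etree.ElementTree as etree

# Hypothetical sample input; any well-formed document would do.
SAMPLE = b"<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>"


def parse_with_ids(source):
    """Parse `source` and number each element as its start tag is seen."""
    counter = count(1)
    ids = {}      # element -> unique number (hypothetical side table)
    root = None
    # iterparse() builds the tree itself; "start" events fire in document order.
    for event, elem in etree.iterparse(source, events=("start",)):
        if root is None:
            root = elem          # the first start event belongs to the root
        ids[elem] = next(counter)
    return root, ids


root, ids = parse_with_ids(io.BytesIO(SAMPLE))
# The finished tree is still a normal element tree with parent/child structure.
order = [(elem.tag, ids[elem]) for elem in root.iter()]
```

Keeping the numbers in a dict leaves the parsed tree itself untouched, and
root.iter() visits the elements in the same document order in which the
start events arrived.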

Stefan
Dennis Benzinger | 16 May 12:28

Re: Efficient methods to build a tree out of HTML structure?

On 16.05.2008 11:56, Stefan Behnel wrote:
> 
> Viksit Gaur wrote:
>> The problem I faced was finding a way to assign a unique ID to each
>> element on the page.
>> 
>> Let's say I construct an iterwalk object. But, during this phase, I would
>> like to not only build the tree, but also add some of my own information
>> to each node (such as a unique ID to each element).
> 
> I still don't understand what you mean by "build the tree". You can't
> construct a tree and run iterwalk at the same time. iterparse() will do that
> in case you are parsing.
> [...]

I think he is talking about his own tree: the one he is building to
visualize the structure of the XML data.
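
Something along these lines, perhaps. This is only a rough sketch of how I
read the request, assuming an already parsed lxml tree; build_outline and
the nested-dict layout are invented here for illustration. A stack of
currently open nodes recovers the parent/child structure from the "start"
and "end" events while iterwalk is running:

```python
from lxml import etree


def build_outline(root):
    """Mirror an lxml tree as nested dicts while iterwalk is running."""
    stack = []     # nodes whose "end" event has not arrived yet
    top = None
    next_id = 0
    for event, elem in etree.iterwalk(root, events=("start", "end")):
        if event == "start":
            next_id += 1
            node = {"id": next_id, "tag": elem.tag, "children": []}
            if stack:
                # The innermost open node is the parent of this element.
                stack[-1]["children"].append(node)
            else:
                top = node
            stack.append(node)
        else:
            # "end": this element is complete, so close it.
            stack.pop()
    return top


# Hypothetical sample document.
doc = etree.fromstring("<html><body><p>a</p><p>b</p></body></html>")
outline = build_outline(doc)
```

The separate structure carries its own IDs, so nothing in etree itself has
to be extended.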

HTH,
Dennis Benzinger
Stefan Behnel | 16 May 12:46

Re: Efficient methods to build a tree out of HTML structure?

Hi,

Dennis Benzinger wrote:
> On 16.05.2008 11:56, Stefan Behnel wrote:
>> Viksit Gaur wrote:
>>> The problem I faced was finding a way to assign a unique ID to each
>>> element on the page.
>>>
>>> Let's say I construct an iterwalk object. But, during this phase, I would
>>> like to not only build the tree, but also add some of my own information
>>> to each node (such as a unique ID to each element).
>> I still don't understand what you mean by "build the tree". You can't
>> construct a tree and run iterwalk at the same time. iterparse() will do that
>> in case you are parsing.
>> [...]
> 
> I think he is talking about his own tree: the one he is building to
> visualize the structure of the XML data.

OK, but if that's the case, then I don't understand why iterating over the
tree and adding an id attribute to each node won't do the job.
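
A minimal sketch of what I mean, assuming the tree is already parsed. The
attribute name "node-id" and both helper functions are made up for this
example (a separate name avoids clobbering id attributes the page already
has), and the fallback import is only there so the snippet runs without
lxml:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is close enough for this sketch.
    import xml.etree.ElementTree as etree


def number_elements(root, attr="node-id"):
    """Give every element a unique attribute value, in document order."""
    # iter("*") restricts the walk to elements (no comments etc.).
    for number, elem in enumerate(root.iter("*"), start=1):
        elem.set(attr, "n%d" % number)


def parent_child_edges(root, attr="node-id"):
    """Return (parent id, child id) pairs describing the tree structure."""
    edges = []
    for parent in root.iter("*"):
        for child in parent.findall("*"):   # direct element children only
            edges.append((parent.get(attr), child.get(attr)))
    return edges


# Hypothetical sample document.
root = etree.fromstring("<html><body><p>a</p><p>b</p></body></html>")
number_elements(root)
edges = parent_child_edges(root)
```

A list of pairs like this could then be handed to a graph library, for
example through NetworkX's add_edges_from().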

Stefan
