Dieter Maurer | 23 Jun 2012 12:02

Support integration with other tree changing libxml2 based libraries

I am working on an integration of `lxml` and `libxmlsec` (the
XML security library) and I have hit an important problem:
`libxmlsec` functions can change the libxml2 document (tree)
and thereby seriously confuse `lxml`.

The major problem is that `libxmlsec` may unlink and release subtrees
leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees.
Fortunately, `libxmlsec` can be told not to release unlinked
subtrees but leave that to the application. But now, my application
must do that: release the subtree if and only if `lxml` will not do
that at a later time (because it has a reference to some node in the subtree).
Looking at the public `lxml` API, I have not found
such a function. I have come up with the following first version
of an `lxml_safe_release`:

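cdef int lxml_safe_release(_Document doc, xmlNode* c_node) except -1:
    # Let `lxml` get rid of the subtree: wrap *c_node* in a proxy and
    # drop that proxy again right away.
    # `elementFactory` returns a Python object and raises on failure;
    # the exception propagates through `except -1`, so no NULL check is needed.
    elementFactory(doc, c_node)
    return 0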

I hope that this will be sufficient to prevent the `SIGSEGV`.
However, I doubt that it is enough to make references into
unlinked subtrees work correctly. In similar situations,
`lxml` calls `moveNodeToDocument` to make the namespace references
inside the unlinked subtree self-contained. `moveNodeToDocument` is not
public and is far too complicated for me to want to include a copy
in my code.

I propose that future `lxml` versions should include a public

Stefan Behnel | 23 Jun 2012 15:57

Re: Support integration with other tree changing libxml2 based libraries

Dieter Maurer, 23.06.2012 12:02:
> I am working on an integration of `lxml` and `libxmlsec` (the
> XML security library)

Cool. I'm sure that a lot of people will be happy about this.

> and I have hit an important problem:
> `libxmlsec` functions can change the libxml2 document (tree)
> and thereby seriously confuse `lxml`.

I can imagine. lxml's speed is built upon a couple of assumptions about the
tree, including that it can figure out when a tree must be discarded from
memory based on the Python proxy Elements it finds in it.

> The major problem is that `libxmlsec` may unlink and release subtrees
> leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees.
> Fortunately, `libxmlsec` can be told not to release unlinked
> subtrees but leave that to the application.

Hmm - but if they are getting unlinked from the tree, how do you find them?
Does libxmlsec have a callback for this?

> But now, my application
> must do that: release the subtree if and only if `lxml` will not do
> that at a later time (because it has a reference to some node in the subtree).
> Looking at the public `lxml` API, I have not found
> such a function.

The public C-API of lxml has mostly grown out of the needs of
lxml.objectify. It may eventually grow further based on other requirements.

Dieter Maurer | 24 Jun 2012 11:09

Re: Support integration with other tree changing libxml2 based libraries

Stefan Behnel <stefan_ml <at> behnel.de> writes:
> Dieter Maurer, 23.06.2012 12:02:
> ...
>> I propose that future `lxml` versions should include a public
>> `safe_release` function for such purposes.
>
> Maybe a new "removeNodeFromDocument()" API function could first check for
> proxies, and then either deallocate or fix up the tree to be stand-alone.

That would be ideal.
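
A rough illustration of that idea, as a pure-Python toy model rather than
lxml code: the `Node` class, its `has_proxy` flag and the function names
are all hypothetical stand-ins, and the "fix up the tree to be stand-alone"
step (namespace handling and the like) is left out entirely. It only shows
the decision "free the subtree if and only if no proxy points into it".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a libxml2 node (hypothetical, not lxml's API)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    has_proxy: bool = False  # True if a Python proxy references this node
    freed: bool = False      # set instead of actually releasing memory


def subtree_has_proxy(node: Node) -> bool:
    """Return True if any node in the subtree is referenced by a proxy."""
    return node.has_proxy or any(subtree_has_proxy(c) for c in node.children)


def free_subtree(node: Node) -> None:
    """Mark the whole subtree as released, children first."""
    for child in node.children:
        free_subtree(child)
    node.freed = True


def remove_node_from_document(node: Node) -> str:
    """Free an unlinked subtree only if no proxy still points into it."""
    if subtree_has_proxy(node):
        # A proxy is alive: keep the memory and leave the eventual
        # release to whoever owns that proxy (fix-up step omitted).
        return "kept"
    free_subtree(node)
    return "freed"
```

For example, `remove_node_from_document(Node("a", [Node("b", has_proxy=True)]))`
returns `"kept"` and leaves every node unfreed, while the same tree
without the proxy is freed completely.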

> ...
>> Another, but less serious problem: some `libxmlsec` functions
>> replace a node inside the tree (e.g. a node is replaced by an
>> `EncryptedData` node representing the node in an encrypted form).
>> It would be nice if I could "retarget" an `lxml` proxy referencing
>> the replaced node to point to the replacing node. This way,
>> `lxml` objects with references to the proxy would see the new
>> state rather than the confusing picture resulting from the proxy
>> now referring to an unlinked node.
> ...
>> Of course, the "retarget"ing is not trivial. It is not sufficient
>> to give the proxy a new "_c_node"; its class, too, might need to
>> be adapted. This would be possible as long as the two classes
>> had the same "C" layout for their objects. Is `lxml` supposed
>> to support proxy classes with differing "C" layouts? (I expect "yes"
>> as the answer.)
>
> From the POV of lxml the proxy is just a reference to an object of type (or
> subtype of) _Element. The problem is that the user most likely holds

Dieter Maurer | 24 Jun 2012 20:34

Re: Support integration with other tree changing libxml2 based libraries

Stefan Behnel <stefan_ml <at> behnel.de> writes:
> Dieter Maurer, 23.06.2012 12:02:
> ...
>> The major problem is that `libxmlsec` may unlink and release subtrees
>> leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees.
>> Fortunately, `libxmlsec` can be told not to release unlinked
>> subtrees but leave that to the application.
>
> Hmm - but if they are getting unlinked from the tree, how do you find them?
> Does libxmlsec have a callback for this?

I did not answer this question in my previous response because
I thought it was not relevant, but maybe I was wrong.

`libxmlsec` does not provide a callback, but it provides an option:
instead of releasing the unlinked subtrees internally, it makes them
available on the (context) object controlling the operation.
The consequence: these subtrees are already unlinked; they still have
their `doc` reference but they already lost their `parent` reference
(and likely other references related to their former position in the tree).

And one more (most likely irrelevant) detail:
`libxmlsec` provides the unlinked subtrees
in the form of an `xmlNodeList *`, i.e. the `next` pointer of
the individual subtree roots may still point somewhere (as part of the
node list).
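
To illustrate the node-list detail, here is a pure-Python toy model, not
libxml2 or `libxmlsec` code: `Root`, `link` and `drain_node_list` are
hypothetical names, and whether the real `next` pointers must be cleared
before `lxml` sees the nodes is an assumption. The sketch walks the list,
remembers each successor, and detaches every subtree root before handing
it on, so that no root still points at a sibling that may be released later.

```python
from typing import Callable, List, Optional


class Root:
    """Toy stand-in for an unlinked subtree root (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.next: Optional["Root"] = None


def link(roots: List[Root]) -> Optional[Root]:
    """Chain the roots through `next` and return the head of the list."""
    for left, right in zip(roots, roots[1:]):
        left.next = right
    return roots[0] if roots else None


def drain_node_list(head: Optional[Root],
                    handle: Callable[[Root], None]) -> List[str]:
    """Call handle(root) for every root, detaching it from the list first."""
    handled = []
    current = head
    while current is not None:
        following = current.next  # remember the successor before detaching
        current.next = None       # the root no longer points into the list
        handle(current)
        handled.append(current.name)
        current = following
    return handled
```

Here `handle` would be the place for a safe-release step; in the toy model
it can be any callable, for example `print` or a list's `append`.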

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de