Jochen Wiedmann | 2 Aug 2003 02:18

Re: Re: XMLDB API

Quoting Kevin O'Neill <kevin <at> rocketred.com.au>:

> Actually eXist does this quite well now in its XML-RPC driver. Internally
> it just stores the list of node ids matched. This keeps the result
> document reasonably small and allows the results to be lazy loaded so, for
> example, if you are paging your results you will only wear the cost of the
> 5 documents you actually call "next()" on.

I cannot see how this fixes the problem. It only increases the critical
limit. Besides, it would obviously be more performant to push out (and
read) the documents in streaming mode rather than fetching them one by one
by ID.

> The lack of iterators in methods like Collection.listCollections and
> Collection.listResources is a problem when the collection size gets large
> (I'm looking at a test that involves adding over 1,000,000 documents to a
> collection. Getting the list back involves a Java array of 1,000,000
> elements :S).

I found it critical with far fewer documents. On a heavily loaded web
server with 10 simultaneous requests and a total response size of 5 MB
each you'll see the result easily. The same server behaved astonishingly
well as soon as I changed its internals to pure streaming mode.

Jochen


Kevin O'Neill | 2 Aug 2003 02:48

Re: Re: XMLDB API

On Sat, 02 Aug 2003 02:18:49 +0200, Jochen Wiedmann wrote:

> Quoting Kevin O'Neill <kevin <at> rocketred.com.au>:
> 
>> Actually eXist does this quite well now in its XML-RPC driver.
>> Internally it just stores the list of node ids matched. This keeps the
>> result document reasonably small and allows the results to be lazy
>> loaded so, for example, if you are paging your results you will only
>> wear the cost of the 5 documents you actually call "next()" on.
> 
> I cannot see how this fixes the problem. It only increases the critical
> limit. Besides, it would obviously be more performant to push out (and
> read) the documents in streaming mode rather than fetching them one by
> one by ID.

Why is it obviously more performant? The driver can always page the
results, keeping the balance between access latency and memory consumption
at a reasonable level. The user may be allowed to tune these parameters by
setting things like the page size. Access to the pages may or may not
involve establishing a new connection; it is quite possible, with HTTP
keep-alive enabled, to hold the connection open. In this case there is a
small overhead for actually sending the request but you avoid the costly
connection establishment phase.
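
To make the paging idea concrete, here is a minimal Java sketch. PageFetcher
and PagingIterator are hypothetical names, not part of the XML:DB API or of
eXist's driver; fetch() stands in for one request to the server, and the page
size is the tuning knob mentioned above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for one remote request that returns a page of results. */
interface PageFetcher<T> {
    /** Returns up to pageSize items starting at offset; fewer items means the end was reached. */
    List<T> fetch(int offset, int pageSize);
}

/** Iterator that keeps at most one page in memory and fetches the next page on demand. */
class PagingIterator<T> implements Iterator<T> {
    private final PageFetcher<T> fetcher;
    private final int pageSize;
    private List<T> page = new ArrayList<>();
    private int posInPage = 0;
    private int nextOffset = 0;
    private boolean exhausted = false;

    PagingIterator(PageFetcher<T> fetcher, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.fetcher = fetcher;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        if (posInPage < page.size()) {
            return true;                       // still items left in the current page
        }
        if (exhausted) {
            return false;                      // a short page was already seen
        }
        page = fetcher.fetch(nextOffset, pageSize);
        posInPage = 0;
        nextOffset += page.size();
        if (page.size() < pageSize) {
            exhausted = true;                  // short (or empty) page: no further requests
        }
        return !page.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.get(posInPage++);
    }
}

class PagingDemo {
    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            all.add("doc" + i);
        }
        // In-memory fetcher used only for the demo; a real driver would issue a request here.
        PageFetcher<String> fetcher = (offset, size) -> new ArrayList<>(
                all.subList(Math.min(offset, all.size()), Math.min(offset + size, all.size())));
        Iterator<String> it = new PagingIterator<>(fetcher, 5);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

A larger page size means fewer round trips but more memory per page; a smaller
one means the opposite.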

The result set needs to be deterministic and cannot increase or decrease
during iteration. eXist solves this by listing the nodes that matched at the
time of the call (of course it is still possible for a node to be removed
before you call next()).
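
A rough Java sketch of that snapshot behaviour follows. NodeLoader and
SnapshotIterator are hypothetical names and do not reflect eXist's actual
classes; the assumption is that a lookup returns null once a node has been
removed, which the iterator surfaces as an empty Optional.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

/** Hypothetical lookup into the store; returns null when the node no longer exists. */
interface NodeLoader {
    String load(String nodeId);
}

/** Iterates over the ids that matched at query time and loads each node only on next(). */
class SnapshotIterator implements Iterator<Optional<String>> {
    private final List<String> ids;
    private final NodeLoader loader;
    private int position = 0;

    SnapshotIterator(List<String> matchedIds, NodeLoader loader) {
        this.ids = new ArrayList<>(matchedIds);   // defensive copy: the result set is now fixed
        this.loader = loader;
    }

    @Override
    public boolean hasNext() {
        return position < ids.size();
    }

    @Override
    public Optional<String> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // The node may have been removed since the snapshot was taken.
        return Optional.ofNullable(loader.load(ids.get(position++)));
    }
}

class SnapshotDemo {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("n1", "<a/>");
        store.put("n2", "<b/>");
        List<String> matched = new ArrayList<>(store.keySet());
        SnapshotIterator it = new SnapshotIterator(matched, store::get);
        store.remove("n2");                       // removed after the query, before next()
        while (it.hasNext()) {
            System.out.println(it.next().orElse("(removed)"));
        }
    }
}
```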

How does the push model work in a stateless environment like HTTP?

Jochen Wiedmann | 3 Aug 2003 12:39

Re: Re: Re: XMLDB API

Quoting Kevin O'Neill <kevin <at> rocketred.com.au>:

> Why is it obviously more performant?

The driver cannot do less work than simply piping results through to
the user at the very moment they arrive. No memory required (in the
case of SAX events). No object generation. No administrative work. It's
that simple. (Always think of the trivial processing task of writing the
result set into a servlet's output stream. In this case the attached SAX
handler might be a trivial XML writer or something similar.)
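
As an illustration of such a "trivial XML writer", here is a minimal Java
sketch of a SAX handler that writes every event to a Writer the moment it
arrives. StreamingXmlWriter and its echo() helper are hypothetical names; the
sketch ignores namespaces, comments and processing instructions, and in a
servlet the Writer would presumably be the response writer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Writes each SAX event straight to a Writer; no tree and no result objects are built. */
class StreamingXmlWriter extends DefaultHandler {
    private final Writer out;

    StreamingXmlWriter(Writer out) {
        this.out = out;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    private void write(String s) throws SAXException {
        try {
            out.write(s);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        write(tag.append('>').toString());
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        write(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        write("</" + qName + ">");
    }

    /** Demo helper: parses the given XML and returns what the handler wrote. */
    static String echo(String xml) {
        StringWriter buffer = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new StreamingXmlWriter(buffer));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(echo("<r><a x=\"1\">b &amp; c</a></r>"));
    }
}
```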

> How does the push model work in a stateless environment like HTTP?
> Couldn't the push be laid over an iterator? How do I implement a skip
> type process, e.g. I only want results 6 through 10?

What is the problem with HTTP? The HTTP response contains a single
result set, no matter how large. The result set is typically returned
as a single, large XML document. (That is best for performance. It may
also be a set of documents, for example in a MIME multipart document, in
which case an XML parser has to be invoked for each result document,
which is slower.) The driver typically creates a SAX parser consuming
the input stream and attaches a SAX ContentHandler that processes the
response. The SAX handler checks for error conditions and document
boundaries, forwarding SAX events to the user.
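
The boundary-detecting handler could look roughly like the Java sketch below.
It assumes a hypothetical envelope in which every child of the response's root
element is one result document; real Tamino, XML-RPC or SOAP responses are
laid out differently, and error checking is left out. ResultSplitter and
rootNames() are illustrative names only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Splits one large response into result documents by tracking element depth. */
class ResultSplitter extends DefaultHandler {
    private final ContentHandler user;
    private int depth = 0;   // 1 = envelope element (assumed), 2 = root of a result document

    ResultSplitter(ContentHandler user) {
        this.user = user;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        depth++;
        if (depth == 2) {
            user.startDocument();            // document boundary: a new result begins
        }
        if (depth >= 2) {
            user.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth >= 2) {
            user.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth >= 2) {
            user.endElement(uri, localName, qName);
        }
        if (depth == 2) {
            user.endDocument();              // document boundary: this result is complete
        }
        depth--;
    }

    /** Demo helper: returns the root element name of every result document found. */
    static List<String> rootNames(String responseXml) {
        final List<String> roots = new ArrayList<>();
        DefaultHandler collector = new DefaultHandler() {
            private boolean atRoot;

            @Override
            public void startDocument() {
                atRoot = true;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes a) {
                if (atRoot) {
                    roots.add(qName);
                    atRoot = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(responseXml)), new ResultSplitter(collector));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return roots;
    }

    public static void main(String[] args) {
        System.out.println(rootNames("<results><a><x/></a><b>t</b></results>"));
    }
}
```

The user's handler sees each result as its own startDocument/endDocument pair
while the response as a whole is parsed only once.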

> Is the connection to the store local or remote?

Whatever you want. In my personal case it has been a Tamino database
engine and the protocol was the proprietary Tamino response document,
but that's not too different from an XML-RPC or SOAP response besides

Kevin O'Neill | 4 Aug 2003 01:08

Re: Re: Re: XMLDB API

>> Why is it obviously more performant?
> 
> The driver cannot do less work than simply piping results through to the
> user at the very moment they arrive. No memory required (in the case
> of SAX events). No object generation. No administrative work. It's that
> simple. (Always think of the trivial processing task of writing the result
> set into a servlet's output stream. In this case the attached SAX handler
> might be a trivial XML writer or something similar.)

An iterator gives the driver developer control over the document
instantiation point.  

>> How does the push model work in a stateless environment like HTTP?
>> Couldn't the push be laid over an iterator? How do I implement a skip
>> type process, e.g. I only want results 6 through 10?
> 
> What is the problem with HTTP? The HTTP response contains a single result
> set, no matter how large.

Ahh ... but this is what you normally want to avoid. I recently had the
displeasure of working with a system that returned a set of files this
way. It was trivial to run the client out of memory by simply requesting a
large enough result set. Yes, there are workarounds (otherwise we would not
have been able to deliver the system).

Given what you have said I believe that your callback interface would be
pretty easy to lay on top of the iterator interface.
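
One possible reading of that layering, as a minimal Java sketch: a helper
drains an iterator and hands each result to a callback, with an optional range
to cover the "results 6 through 10" case. ResultCallback and IteratorPush are
hypothetical names and not part of the XML:DB API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical callback that receives results one at a time. */
interface ResultCallback<T> {
    void onResult(T result);
}

final class IteratorPush {
    private IteratorPush() {
    }

    /**
     * Pushes results first..last (1-based, inclusive) from the iterator to the callback.
     * Returns the number of results delivered.
     */
    static <T> int push(Iterator<T> results, int first, int last, ResultCallback<T> callback) {
        int index = 0;
        int delivered = 0;
        while (index < last && results.hasNext()) {
            T item = results.next();     // skipped items are still pulled from the iterator
            index++;
            if (index >= first) {
                callback.onResult(item);
                delivered++;
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            docs.add("doc" + i);
        }
        push(docs.iterator(), 6, 10, doc -> System.out.println(doc));
    }
}
```

Skipping this way only saves work if the underlying iterator loads documents
lazily; otherwise results 1 through 5 are still paid for.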

Having said that, I think that the ability to handle a result set via an
event stream is a good idea. Thanks for taking the time to discuss this

Jochen Wiedmann | 4 Aug 2003 02:16

Re: Re: Re: Re: XMLDB API

Quoting Kevin O'Neill <kevin <at> rocketred.com.au>:

> An iterator gives the driver developer control over the document
> instantiation point.  

No. Again, I keep talking about the case where I am parsing a SOAP
or XML-RPC response including multiple documents.

The driver developer is parsing an XML document. His task is to split
the whole response into multiple smaller pieces. For the sake of
performance, he must not convert the large document into an object
tree and work on that. Rather he will create a SAX handler or use an
XML pull parser and some simple state engine that detects
document boundaries. This will always stay the same.
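
For the pull-parser variant, a minimal Java sketch of such a state engine is
shown below. It uses javax.xml.stream (StAX) as the pull parser, which is a
choice made for this sketch since no specific parser is named above, and it
assumes a hypothetical envelope where each child of the root element is one
result document. PullBoundaryScanner is an illustrative name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Depth-counting state engine over a pull parser; reports where result documents begin and end. */
final class PullBoundaryScanner {
    private PullBoundaryScanner() {
    }

    static List<String> scan(String responseXml) {
        List<String> boundaries = new ArrayList<>();
        int depth = 0;   // 1 = envelope element (assumed), 2 = root of a result document
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(responseXml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        boundaries.add("start " + reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == 2) {
                        boundaries.add("end " + reader.getLocalName());
                    }
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        System.out.println(scan("<results><a><x/></a><b>t</b></results>"));
    }
}
```

Only the current event and a depth counter are held at any time, so the size
of the response does not affect memory use.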

> Ahh ... but this is what you normally want to avoid.

True. I *normally* want to avoid it. My proposal was to keep the iterator
API as it is, but extend it with an event model that allows streaming.

> I recently had the
> displeasure of working with a system that returned a set of files this
> way. It was trivial to run the client out of memory by simply requesting a
> large enough result set. Yes, there are workarounds (otherwise we would not
> have been able to deliver the system).

I would assume that there was some kind of error. Streaming is important
for exactly that case, because it avoids storing the result set in memory.
My original point was that the iterator API *forces* storing the
result set in memory in certain cases. (When parsing a large XML document