Re: Incremental indexing
Olly Betts <olly <at> survex.com>
2012-03-21 01:57:31 GMT
On Tue, Mar 20, 2012 at 01:01:48PM -0400, Marios Titas wrote:
> On Tue, Mar 20, 2012 at 09:34, Olly Betts <olly <at> survex.com> wrote:
> > Yes it does pretty well, assuming you're using Xapian 1.2 or later
> > 1.0.x. This change made a huge difference to the speed of adding and
> > removing tags from indexed email in notmuch.
>
> In order for me to understand how this works, consider the following scenario:
> 1. I retrieve a document with db.get_document. Assume that it has n terms.
> 2. I add k terms with doc.add_term or doc.add_posting.
> 3. I write the document back to the database with db.replace_document.
> So are you saying that the above procedure takes O(k) time? What about
> memory, is it O(k) too? Or does xapian load the entire document in
> memory (in which case it would be O(n+k) for both time and memory)?
The list of terms indexing the document needs updating, so the operation
has complexity O(n), but that's misleading as in practice it's the time
to update the posting lists which dominates, and we only update those
where there are changes. It's really something more like O(n + K.k)
where K is fairly large.
> There is also another problem with this approach, it is not easy to
> remove terms from a document. It would be nice if the document class
> had method to decrement the wdf of a term and if the new wdf is zero
> then remove it altogether.
There's not a single method to do this, but if you are indexing with
positional information then Document::remove_posting() can decrease the
wdf of a term and Document::remove_term() can remove a term completely:
http://xapian.org/docs/apidoc/html/classXapian_1_1Document.html#11eec629737e496a7077b87f763967a5
(Continue reading)