Marios Titas | 20 Mar 07:01 2012
Picon

Incremental indexing

Hi all,

I am trying to implement an Incremental indexing scheme. The problem
is that usually the modified documents are large but the modifications
are limited. Ideally, I would like to reindex only the modified parts
of these documents. If I am not mistaken, xapian cannot do that. Are
there any other approaches?

It would be nice if xapian supported something like the SQL "group
by". If it did, then it would be possible to break large documents
into several pieces which could be indexed independently. When
querying, these pieces would be then combined again using some
aggregate function similar to the SQL function sum.

Thanks
Jean-Francois Dockes | 20 Mar 10:36 2012

Re: Incremental indexing

Marios Titas writes:
 > Hi all,
 > 
 > I am trying to implement an Incremental indexing scheme. The problem
 > is that usually the modified documents are large but the modifications
 > are limited. Ideally, I would like to reindex only the modified parts
 > of these documents. If I am not mistaken, xapian cannot do that. Are
 > there any other approaches?
 > 
 > It would be nice if xapian supported something like the SQL "group
 > by". If it did, then it would be possible to break large documents
 > into several pieces which could be indexed independently. When
 > querying, these pieces would be then combined again using some
 > aggregate function similar to the SQL function sum.

Hi,

The Recoll Xapian-based desktop indexer implements the "break into pieces"
part for big text files. This is done so that the appropriate section of
the document can be loaded for previewing (useful for, ie, big log files).

It doesn't implement independant incremental re-indexing though because it
has no way to know which parts may have changed.

The document parts are linked by a common parent identifier which can be
used to get to the whole document. There is both an entry in the document
data record, used to get to the parent of a result document, and a "parent"
unique term for each part, used to find all the parts of a given parent
document (useful for deleting for example).

(Continue reading)

Sym Roe | 20 Mar 10:41 2012
Picon

Re: Incremental indexing

On Tue, Mar 20, 2012 at 6:01 AM, Marios Titas <redneb8888 <at> gmail.com> wrote:
> Hi all,
>
> I am trying to implement an Incremental indexing scheme. The problem
> is that usually the modified documents are large but the modifications
> are limited. Ideally, I would like to reindex only the modified parts
> of these documents. If I am not mistaken, xapian cannot do that. Are
> there any other approaches?

I don't know for sure, but I expect xapian is quite good at only
updating the updated parts of a document (in terms of what it writes
to disk).

> It would be nice if xapian supported something like the SQL "group
> by". If it did, then it would be possible to break large documents
> into several pieces which could be indexed independently. When
> querying, these pieces would be then combined again using some
> aggregate function similar to the SQL function sum.

Xapian::Enquire has set_collapse_key:
http://xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html#117ee547f5908e952e2e72d5a986d3bb

--

-- 
E: sym.roe <at> talusdesign.co.uk
M: 07742079314
 <at> symroe
Olly Betts | 20 Mar 14:34 2012

Re: Incremental indexing

On Tue, Mar 20, 2012 at 09:41:31AM +0000, Sym Roe wrote:
> On Tue, Mar 20, 2012 at 6:01 AM, Marios Titas <redneb8888 <at> gmail.com> wrote:
> > I am trying to implement an Incremental indexing scheme. The problem
> > is that usually the modified documents are large but the modifications
> > are limited. Ideally, I would like to reindex only the modified parts
> > of these documents. If I am not mistaken, xapian cannot do that. Are
> > there any other approaches?
> 
> I don't know for sure, but I expect xapian is quite good at only
> updating the updated parts of a document (in terms of what it writes
> to disk).

Yes it does pretty well, assuming you're using Xapian 1.2 or later
1.0.x.  This change made a huge difference to the speed of adding and
removing tags from indexed email in notmuch.

Cheers,
    Olly
Marios Titas | 20 Mar 18:01 2012
Picon

Re: Incremental indexing

On Tue, Mar 20, 2012 at 09:34, Olly Betts <olly <at> survex.com> wrote:
> Yes it does pretty well, assuming you're using Xapian 1.2 or later
> 1.0.x.  This change made a huge difference to the speed of adding and
> removing tags from indexed email in notmuch.

In order for me to understand how this works, consider the following scenario:
1. I retrieve a document with db.get_document. Assume that it has n terms.
2. I add k terms with doc.add_term or doc.add_posting.
3. I write the document back to the database with db.replace_document.
So are you saying that the above procedure takes O(k) time? What about
memory, is it O(k) too? Or does xapian load the entire document in
memory (in which case it would be O(n+k) for both time and memory)?

There is also another problem with this approach, it is not easy to
remove terms from a document. It would be nice if the document class
had method to decrement the wdf of a term and if the new wdf is zero
then remove it altogether.
Olly Betts | 21 Mar 02:57 2012

Re: Incremental indexing

On Tue, Mar 20, 2012 at 01:01:48PM -0400, Marios Titas wrote:
> On Tue, Mar 20, 2012 at 09:34, Olly Betts <olly <at> survex.com> wrote:
> > Yes it does pretty well, assuming you're using Xapian 1.2 or later
> > 1.0.x.  This change made a huge difference to the speed of adding and
> > removing tags from indexed email in notmuch.
> 
> In order for me to understand how this works, consider the following scenario:
> 1. I retrieve a document with db.get_document. Assume that it has n terms.
> 2. I add k terms with doc.add_term or doc.add_posting.
> 3. I write the document back to the database with db.replace_document.
> So are you saying that the above procedure takes O(k) time? What about
> memory, is it O(k) too? Or does xapian load the entire document in
> memory (in which case it would be O(n+k) for both time and memory)?

The list of terms indexing the document needs updating, so the operation
has complexity O(n), but that's misleading as in practice it's the time
to update the posting lists which dominates, and we only update those
where there are changes.  It's really something more like O(n + K.k)
where K is fairly large.

> There is also another problem with this approach, it is not easy to
> remove terms from a document. It would be nice if the document class
> had method to decrement the wdf of a term and if the new wdf is zero
> then remove it altogether.

There's not a single method to do this, but if you are indexing with
positional information then Document::remove_posting() can decrease the
wdf of a term and Document::remove_term() can remove a term completely:

http://xapian.org/docs/apidoc/html/classXapian_1_1Document.html#11eec629737e496a7077b87f763967a5
(Continue reading)

Marios Titas | 22 Mar 09:50 2012
Picon

Re: Incremental indexing

On Tue, Mar 20, 2012 at 21:57, Olly Betts <olly <at> survex.com> wrote:
> So you could check if the wdf goes to zero and call remove_term() if it
> does.

If I am not mistaken there is no fast way to find a term in a specific
document. So you have to go through all terms in the document and
remove the ones with a wdf equal to zero. Is there any other better
way?

Olly, do you have any thoughts about the other idea, the one about
group by? Let me give you an example so that you can see what I mean:
Suppose that you have a data set about some music albums.
Specifically, for each album you have information such as the name of
the artist or the titles of the songs but you also have a list of user
reviews. The challenge of course is how to update the search indices
efficiently when a new review is posted. With the group by + aggregate
function approach you could just index each review as a separate
document and include the album id as a value. When querying you would
have to group the documents by the album id and return one result per
group having a weight that would take into account the grouping. This
has the additional advantage that you could also search for individual
user reviews (i.e. w/o doing any grouping) as well as for albums.
Sym Roe | 22 Mar 10:57 2012
Picon

Re: Incremental indexing

On Thu, Mar 22, 2012 at 8:50 AM, Marios Titas <redneb8888 <at> gmail.com> wrote:
> document and include the album id as a value. When querying you would
> have to group the documents by the album id and return one result per
> group having a weight that would take into account the grouping. This
> has the additional advantage that you could also search for individual
> user reviews (i.e. w/o doing any grouping) as well as for albums.

Sorry to bring this up again, but can you explain why you can't use
set_collapse_key in this case?

From the docs:

"Another use is to group matches in a particular category (e.g. you
might collapse a mailing list search on the Subject: so that there's
only one result per discussion thread). In this case you can use
get_collapse_count() to give the user some idea how many other results
there are. And if you index the Subject: as a boolean term as well as
putting it in a value, you can offer a link to a non-collapsed search
restricted to that thread using a boolean filter."

http://xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html#117ee547f5908e952e2e72d5a986d3bb

This sounds very like the case you're talking about.

--

-- 
E: sym.roe <at> talusdesign.co.uk
M: 07742079314
 <at> symroe

Gmane