Eric B. Ridge | 11 Aug 2004 22:52

Re: Rqt for Features

On Aug 11, 2004, at 4:49 PM, Olly Betts wrote:

> On Mon, Aug 09, 2004 at 12:16:59PM +0100, Olly Betts wrote:
>> On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
>>> The cleanest method from outside of the API would be if 
>>> replace_document
>>> accepted a non-existent (to xapian) docid, in which case it adds the
>>> document rather than excepting (i.e. SQL's "REPLACE" behaviour).
>>
>> This is something I'd noticed might be useful.
>
> OK, this is now implemented in CVS.  Currently untested - I'll add some
> test cases for this shortly.  If you test it, let me know how it goes.

This is waay cool!  Using Xapian to index database records... now I'll 
be able to kill my SQL lookup table of primary key<-->xapian doc id, 
and just use my database primary keys as xapian document ids.  woo hoo!

eric
Samuel Liddicott | 11 Aug 2004 23:04

Re: Rqt for Features


Eric B. Ridge wrote:

> On Aug 11, 2004, at 4:49 PM, Olly Betts wrote:
>
>> On Mon, Aug 09, 2004 at 12:16:59PM +0100, Olly Betts wrote:
>>
>>> On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
>>>
>>>> The cleanest method from outside of the API would be if 
>>>> replace_document
>>>> accepted a non-existent (to xapian) docid, in which case it adds the
>>>> document rather than excepting (i.e. SQL's "REPLACE" behaviour).
>>>
>>>
>>> This is something I'd noticed might be useful.
>>
>>
>> OK, this is now implemented in CVS.  Currently untested - I'll add some
>> test cases for this shortly.  If you test it, let me know how it goes.
>
>
> This is waay cool!  Using Xapian to index database records... now I'll 
> be able to kill my SQL lookup table of primary key<-->xapian doc id, 
> and just use my database primary keys as xapian document ids.  woo hoo!
>
When maintaining Xapian databases this is correct.
Be aware that if you perform a search over multiple DBs, the ids 
will get munged into new ids in order to be unique over the result set.
There is (when I last heard) no simple way to convert from such munged 
ids back to the real id and real index.  This is only an issue if you 
will be searching over combined DBs.

Eric B. Ridge | 11 Aug 2004 23:15

Re: Rqt for Features

On Aug 11, 2004, at 5:04 PM, Samuel Liddicott wrote:

> When maintaining Xapian databases this is correct.
> Be aware that if you perform a search over multiple DBs, the ids 
> will get munged into new ids in order to be unique over the result 
> set.
> There is (when I last heard) no simple way to convert from such munged 
> ids back to the real id and real index.  This is only an issue if you 
> will be searching over combined DBs.

This is a good point.  Not one I had considered.  I'm not doing this, 
but it is something for me to keep in the back of my mind.

eric
Olly Betts | 11 Aug 2004 23:15

Re: Rqt for Features

On Wed, Aug 11, 2004 at 10:04:00PM +0100, Sam Liddicott wrote:
> There is (when I last heard) no simple way to convert from such munged 
> ids back to the real id and real index.

There's no API provided for this, but it's trivial to do if you know how
many databases you're searching over:

    true_docid = (multi_docid - 1) / no_of_databases + 1;

This relies on the algorithm used to map docids not changing, but I
think it's very unlikely it would.  Still, perhaps it would be worth
adding an API method for this.

Cheers,
    Olly
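
A minimal sketch of the mapping above, written as plain C++ with no Xapian calls. The true_docid line is the formula from the message. The database-index line (the remainder of the same division) and the inverse function are assumptions that follow from docids being interleaved across the databases starting at 1. SubDocId, split_multi_docid and make_multi_docid are hypothetical names, not part of the Xapian API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; not part of the Xapian API.
struct SubDocId {
    std::uint32_t db_index;    // 0-based index of the sub-database (assumed)
    std::uint32_t true_docid;  // docid inside that sub-database (1-based)
};

// Split a merged docid using the formula from the message above.
// Assumes docids start at 1 and are interleaved across the databases.
inline SubDocId split_multi_docid(std::uint32_t multi_docid,
                                  std::uint32_t no_of_databases) {
    assert(multi_docid >= 1 && no_of_databases >= 1);
    SubDocId r;
    r.db_index = (multi_docid - 1) % no_of_databases;
    r.true_docid = (multi_docid - 1) / no_of_databases + 1;
    return r;
}

// Inverse of the split above, under the same interleaving assumption.
inline std::uint32_t make_multi_docid(std::uint32_t db_index,
                                      std::uint32_t true_docid,
                                      std::uint32_t no_of_databases) {
    assert(true_docid >= 1 && db_index < no_of_databases);
    return (true_docid - 1) * no_of_databases + db_index + 1;
}
```

As the message notes, this depends on the docid mapping algorithm not changing, so it is best kept behind one small helper like this rather than repeated inline.
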
Eric B. Ridge | 11 Aug 2004 23:25

Rqt for Features 2

While we're on the topic of new features...

What about (at least right-truncation) wildcard support in the 
QueryParser?  I've probably mentioned this before, but I've written my 
own java-based query parser.  I've recently expanded it to support 
taking a Database reference along with the query string:
	
	public Query parse(Database db, String query);

If the parser has a Database reference and it spots a wildcard in the 
query string, it'll look up all the terms that match in the Database, 
and OR 'em together as a big group in the query.  This was a suggestion 
of Olly's from a long while back.

What are the odds of getting something like this hooked directly into 
Xapian?  Either the QueryParser or the Query/Matcher?

Also, quite frequently I need to simply find "all documents that do 
*not* contain term X".  I haven't been able to successfully construct a 
Query to make this happen.  If I'm remembering the API docs correctly, 
the null ctor for Query "matches no documents".  Would be useful to 
have a form that matches *all* documents.  Then you could do:

	Query q = Query(Query(ALL_DOCS), "foo", AND_NOT);

eric
Olly Betts | 11 Aug 2004 23:43

Re: Rqt for Features 2

On Wed, Aug 11, 2004 at 05:25:01PM -0400, Eric B. Ridge wrote:
> What about (at least right-truncation) wildcard support in the 
> QueryParser?

It wouldn't be very hard to add, especially if you could post the code
you use.

It could prove expensive on large databases so I wonder if it should be
optional or at least allow the minimum stem length to be specified (or
perhaps the maximum number of terms to allow expansion to).  It would
be good to allow control of other features too - some people may not
want to support boolean operators for example.

> What are the odds of getting something like this hooked directly into 
> Xapian?  Either the QueryParser or the Query/Matcher?

I don't think pushing it into the matcher would allow a more efficient
implementation, so I guess it should probably just go in the QueryParser,
at least for now.

> Also, quite frequently I need to simply find "all documents that do 
> *not* contain term X".  I haven't been able to successfully construct a 
> Query to make this happen.  If I'm remembering the API docs correctly, 
> the null ctor for Query "matches no documents".  Would be useful to 
> have a form that matches *all* documents.

There isn't one currently.  The plan is that the empty term should index
all documents to allow things like this, but my patch to do this hasn't
made much progress recently I'm afraid...
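
To illustrate why a match-all term would make "all documents that do not contain X" expressible, here is a rough sketch in plain C++ with no Xapian calls. It assumes posting lists are sorted lists of docids; all_docs stands in for the posting list of the proposed empty term, and all_docs_and_not is a hypothetical name.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using DocId = std::uint32_t;

// all_docs:  stand-in for the posting list of the proposed "empty term",
//            i.e. every docid in the database, sorted ascending.
// term_docs: posting list of term X, sorted ascending.
// Returns the docids in all_docs that are absent from term_docs,
// which is what "ALL_DOCS AND_NOT X" would be expected to match.
inline std::vector<DocId> all_docs_and_not(const std::vector<DocId>& all_docs,
                                           const std::vector<DocId>& term_docs) {
    std::vector<DocId> result;
    std::set_difference(all_docs.begin(), all_docs.end(),
                        term_docs.begin(), term_docs.end(),
                        std::back_inserter(result));
    return result;
}
```

Without a left-hand side that matches everything, AND_NOT has nothing to subtract from, which is the gap described above.
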


Eric B. Ridge | 12 Aug 2004 01:41

Re: Rqt for Features 2

On Aug 11, 2004, at 5:43 PM, Olly Betts wrote:

> On Wed, Aug 11, 2004 at 05:25:01PM -0400, Eric B. Ridge wrote:
>> What about (at least right-truncation) wildcard support in the
>> QueryParser?
>
> It wouldn't be very hard to add, especially if you could post the code
> you use.

it amounts to a TermIterator::skip_to() and a while() loop.  :)
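
A rough sketch of that skip-then-loop idea, using a sorted std::set as a stand-in for the database's term list so that it runs without Xapian. Here lower_bound plays the role of skip_to(). The expand_wildcard name and the max_terms cap (one of the limits suggested earlier in the thread) are assumptions for illustration only.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Expand a right-truncated wildcard such as "foo*" into the matching terms.
// all_terms: stand-in for the database's sorted list of all terms.
// prefix:    the text before the '*'.
// max_terms: hypothetical cap on the expansion; 0 means no cap.
inline std::vector<std::string> expand_wildcard(
        const std::set<std::string>& all_terms,
        const std::string& prefix,
        std::size_t max_terms = 0) {
    std::vector<std::string> matches;
    // Equivalent of skip_to(prefix): jump to the first term >= prefix.
    auto it = all_terms.lower_bound(prefix);
    // Walk forward while the terms still start with the prefix.
    while (it != all_terms.end() &&
           it->compare(0, prefix.size(), prefix) == 0) {
        if (max_terms != 0 && matches.size() == max_terms) break;
        matches.push_back(*it);
        ++it;
    }
    return matches;
}
```

The returned terms would then be ORed together into one query group, as described earlier in the thread.
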

> It would be good to allow control of other features too -

I'm going on vacation the first week of September (yay!).  I've got a 
long flight to my destination.  I can take a stab at implementing 
something at least equivalent to the basic matching I've done in java, 
and then we can tweak it from there...

> There isn't one currently.  The plan is that the empty term should 
> index
> all documents to allow things like this, but my patch to do this hasn't
> made much progress recently I'm afraid...

makes sense.  I'll keep my fingers crossed.

Another thing I'd like to see is range searching.  Date, number, and 
character ranges.  This could be done as another QueryParser thing 
(syntax like "1:5" or "1...5" maybe).  I haven't yet done this in my 
java-based parser but will likely be doing it soon.  What do you think 
about this?  I seem to remember reading/seeing that Omega does 
[...]

Olly Betts | 12 Aug 2004 15:11

Re: Rqt for Features 2

On Wed, Aug 11, 2004 at 07:41:33PM -0400, Eric B. Ridge wrote:
> On Aug 11, 2004, at 5:43 PM, Olly Betts wrote:
> >On Wed, Aug 11, 2004 at 05:25:01PM -0400, Eric B. Ridge wrote:
> >>What about (at least right-truncation) wildcard support in the
> >>QueryParser?
> >
> >It wouldn't be very hard to add, especially if you could post the code
> >you use.
> 
> it amounts to a TermIterator::skip_to() and a while() loop.  :)

I know, but a working implementation would still save me time.

> I'm going on vacation the first week of September (yay!).  I've got a 
> long flight to my destination.  I can take a stab at implementing 
> something at least equivalent to the basic matching I've done in java, 
> and then we can tweak it from there...

That would save me more time!

> Another thing I'd like to see is range searching.  Date, number, and 
> character ranges.  This could be done as another QueryParser thing 
> (syntax like "1:5" or "1...5" maybe).

Google uses the syntax 1..5 :

http://www.google.com/help/refinesearch.html#numrange
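
As a sketch of how a parser might recognise the 1..5 form, the following plain C++ function splits a token on ".." and accepts only non-negative integer endpoints. NumRange and parse_num_range are hypothetical names, and nothing here reflects an existing QueryParser feature. Date and character ranges are left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for a parsed numeric range.
struct NumRange {
    long low;
    long high;
};

// Parse a token of the form "LOW..HIGH" (non-negative integers only).
// Returns true and fills `out` on success; returns false otherwise.
inline bool parse_num_range(const std::string& token, NumRange& out) {
    const std::size_t sep = token.find("..");
    // Reject tokens with no separator or with an empty endpoint.
    if (sep == std::string::npos || sep == 0 || sep + 2 >= token.size()) {
        return false;
    }
    const std::string lo = token.substr(0, sep);
    const std::string hi = token.substr(sep + 2);
    auto all_digits = [](const std::string& s) {
        return !s.empty() &&
               s.find_first_not_of("0123456789") == std::string::npos;
    };
    if (!all_digits(lo) || !all_digits(hi)) return false;
    // Keep endpoints short enough that std::stol cannot overflow.
    if (lo.size() > 9 || hi.size() > 9) return false;
    const NumRange r{std::stol(lo), std::stol(hi)};
    if (r.low > r.high) return false;
    out = r;
    return true;
}
```

With this check a token such as "1...5" is rejected, because the text after the first ".." is not purely digits.
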

> What do you think about this?

[...]

