Olly Betts | 28 May 2012 23:44

GSoC xapian node binding thoughts

I was just having a look over the API notes:

https://github.com/mtibeica/node-xapian/blob/master/docs

Some feedback:

I wouldn't bother wrapping WritableDatabase::flush().  It's only there
for compatibility with older code, so for a new binding you can just
wrap commit().

Generally, uint32 isn't necessarily the right type to use everywhere,
and means things will go wrong if someone patches Xapian and rebuilds
it to use (e.g. 64 bit document ids).  Maybe it's hard to use the
appropriate Xapian::docid, Xapian::doccount, Xapian::termcount, etc
typedefs here though.

    A query consisting of two or more subqueries, opp-ed together.
    AND, OR, SYNONYM, NEAR and PHRASE can take any number of subqueries. 
    Other operators take only the first two subqueries.
    {
        op: string,
        queries: [ object_querystructure1, ... ]
    }

XOR can also take any number of subqueries.  And on trunk, OP_FILTER,
OP_AND_NOT, and OP_AND_MAYBE can also take any number of subqueries
(with OP(A, B, C) being interpreted as OP(OP(A, B), C)).
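
A minimal sketch of that left-nesting, using the {op, queries} structure from the API notes. The helper name foldLeftBinary is made up for illustration, and the {tname: ...} term queries follow the notation used in the notes; none of this is the binding's actual code.

```javascript
// Hypothetical helper: expands an n-ary query structure into nested
// binary ones, i.e. OP(A, B, C) -> OP(OP(A, B), C).
function foldLeftBinary(op, queries) {
  if (queries.length === 0) {
    throw new Error('at least one subquery is required');
  }
  // reduce() without an initial value starts from queries[0],
  // so a single subquery is returned unchanged.
  return queries.reduce(function (acc, next) {
    return { op: op, queries: [acc, next] };
  });
}

const folded = foldLeftBinary('AND_NOT', [
  { tname: 'a' }, { tname: 'b' }, { tname: 'c' }
]);
// folded.queries[0] is the inner AND_NOT of 'a' and 'b';
// folded.queries[1] is the term query for 'c'.
```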

Also, it would be nice to support a mixture of strings and query objects
as the subqueries (like we do in most of the dynamically typed languages).

Liam | 29 May 2012 07:23

Re: GSoC xapian node binding thoughts


On Mon, May 28, 2012 at 2:44 PM, Olly Betts <olly <at> survex.com> wrote:

> Generally, uint32 isn't necessarily the right type to use everywhere,
> and means things will go wrong if someone patches Xapian and rebuilds
> it to use (e.g. 64 bit document ids).

JavaScript doesn't currently support int64; its Number type is a double, which represents integers exactly only up to 2^53. We should probably raise an error if the Xapian build we're running against uses int64 doc ids.
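
To make the 2^53 limit concrete, here is a small demonstration in plain JavaScript. The constant names are illustrative only.

```javascript
// 2^53: up to here every non-negative integer has an exact double.
const LIMIT = Math.pow(2, 53);

// One past the limit cannot be represented and rounds back down.
console.log(LIMIT + 1 === LIMIT); // true

// A 32-bit docid is far below the limit, so it survives unchanged.
const UINT32_MAX = Math.pow(2, 32) - 1;
console.log(UINT32_MAX + 1 === UINT32_MAX); // false
```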

> XOR can also take any number of subqueries.  And on trunk, OP_FILTER,
> OP_AND_NOT, and OP_AND_MAYBE can also take any number of subqueries
> (with OP(A, B, C) being interpreted as OP(OP(A, B), C))

XOR is missing from the online docs for Query(Query::op, Iterator, Iterator, termcount)

We can include support for the other ops you mention and leave it commented out for now.

Marius, that Query object is missing a parameter:uint32 member.

> Also, it would be nice to support a mixture of strings and query objects
> as the subqueries (like we do in most of the dynamically typed languages).

You can include a term query in the list by writing {tname:'string'}, but certainly we could let 'string' be a shorthand for that.
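
A rough sketch of how that shorthand could be expanded before the structure reaches the binding. The function normalizeQuery is hypothetical; only the {op, queries} and {tname: ...} shapes come from the API notes.

```javascript
// Hypothetical normaliser: turns bare strings into {tname: ...} term
// queries and recurses into nested {op, queries} structures.
function normalizeQuery(q) {
  if (typeof q === 'string') {
    return { tname: q };
  }
  if (q !== null && typeof q === 'object' && Array.isArray(q.queries)) {
    // Copy the other members as they are and normalise each subquery.
    return Object.assign({}, q, { queries: q.queries.map(normalizeQuery) });
  }
  return q; // already a term query or some other structure
}

const mixed = { op: 'OR', queries: ['apple', { tname: 'banana' }] };
const normalized = normalizeQuery(mixed);
// normalized.queries is [ { tname: 'apple' }, { tname: 'banana' } ]
```

Copying the remaining members means extra fields on a query object are passed through unchanged.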

> I'm dubious about wrapping the various iterators as methods which read
> all the entries from the iterator and return an array.

We'll take optional start & count arguments for those.

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
James Aylett | 29 May 2012 09:32

Re: GSoC xapian node binding thoughts

I'd favour either complaining about overflow of numbers rather than types, or auto-converting to strings so it's a bit more like using Twitter Snowflake ids. Or I guess something like the math.Long trick from Closure.

Any idea when v8 will get 64 bit support? Or is this not on the roadmap at all?

On arrays, is there no way of deferring comprehension until later? So you could pass an array-like object around and slice it later.

J

Liam | 29 May 2012 10:37

Re: GSoC xapian node binding thoughts



On Tue, May 29, 2012 at 12:32 AM, James Aylett <james-xapian <at> tartarus.org> wrote:
> I'd favour either complaining about overflow of numbers rather than types, or auto-convert to strings so it's a bit more like using twitter snowflake ids. Or I guess something like the math.Long trick from Closure.

Isn't it poor form to work for a time and then overflow without warning? As for a workaround, I'd want to know how users need to manipulate doc ids in JS before choosing a hack...

> Any idea when v8 will get 64 bit support? Or is this not on the roadmap at all?

They're under discussion for the next draft of the ECMA standard, I believe.

> On arrays, is there no way of deferring comprehension until later? So you could pass an array-like object around and slice it later.

Comprehension? Slice? Sorry I've not run across those terms in this context.

One can define "getter" function members in an object or prototype (class); accessing the member implicitly invokes the function. However, we wouldn't use that mechanism to perform I/O, as getters only function synchronously.
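
For illustration, a getter on a plain object looks like this. The object and its members are made up; the point is only that the getter has to return its value on the spot.

```javascript
const doc = {
  _data: 'cached document data',
  // Reading doc.data runs this function, which must return a value
  // immediately; it cannot hand a result to a callback later.
  get data() {
    return this._data;
  }
};

console.log(doc.data); // prints "cached document data"
```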

James Aylett | 29 May 2012 18:03

Re: GSoC xapian node binding thoughts

I'd say it's better than refusing to compile, although it's somewhat moot right now. All numbers will overflow eventually, although I assume in Node you'd just get IEEE rounding behaviour? Basically, there's no nice solution.

By slicing / comprehension I mean you return an object that will act as an array but which doesn't read out the data at that point. This obviously has interesting interactions with the async model of Node, but having to think about paging seems like a step back from nice abstractions to me :-(
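
A minimal sketch of such a deferred list, assuming a synchronous underlying iterator (so it sidesteps the async question entirely). LazyList and the terms() generator are hypothetical stand-ins, not part of any binding.

```javascript
// Hypothetical wrapper: pulls entries from an underlying iterator only
// when a slice is requested, and remembers what it has already read.
function LazyList(iterator) {
  this.iterator = iterator;
  this.cache = [];
  this.done = false;
}

LazyList.prototype.slice = function (start, end) {
  // Read just far enough to cover the requested range.
  while (!this.done && this.cache.length < end) {
    const step = this.iterator.next();
    if (step.done) {
      this.done = true;
    } else {
      this.cache.push(step.value);
    }
  }
  return this.cache.slice(start, end);
};

// Stand-in for a term list; counts how many entries were actually read.
let reads = 0;
function* terms() {
  for (const t of ['apple', 'banana', 'cherry', 'date']) {
    reads += 1;
    yield t;
  }
}

const list = new LazyList(terms());
const middle = list.slice(1, 3); // ['banana', 'cherry']
// reads is now 3: 'date' was never fetched.
```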

J



Liam | 29 May 2012 20:45

Re: GSoC xapian node binding thoughts



On Tue, May 29, 2012 at 9:03 AM, James Aylett <james-xapian <at> tartarus.org> wrote:
> I'd say it's better than to refuse to compile, although it's somewhat moot right now. All numbers will overflow eventually, although I assume in Node you'd just get IEEE rounding behaviour? Basically, there's no nice solution.

It would be a runtime check, not compile-time. We'd compile against a suitably configured Xapian :-)

In what context are int64 doc ids necessary? What % of installations use them?

> Slicing / comprehension I mean you return an object that will act as an array but which doesn't read out the data at that point. Obviously has interesting interactions with the async model of Node, but having to think about paging seems like a step back from nice abstractions to me :-(

Well you know what they say about "nice abstractions"... The road to Hell is paved with them :-)

Seriously, lazy-loading is oversold from what I've seen. If you have data from real-world Xapian sites that shows a material advantage for it, I'd love to read...
Olly Betts | 30 May 2012 04:24

Re: GSoC xapian node binding thoughts

On Tue, May 29, 2012 at 11:45:34AM -0700, Liam wrote:
> On Tue, May 29, 2012 at 9:03 AM, James Aylett <james-xapian <at> tartarus.org> wrote:
> > I'd say it's better than to refuse to compile, although it's somewhat moot
> > right now. All numbers will overflow eventually, although I assume in Node
> > you'd just get IEEE rounding behaviour? Basically, there's no nice solution.
> 
> It would be a runtime check, not compile-time. We'd compile against a
> suitably configured Xapian :-)

If you change sizeof(Xapian::docid) (and/or the sizes of other types)
then that's an ABI change, so something built against xapian-core built
with one docid size simply won't work with xapian-core built with a
different docid size.

> In what context are int64 doc ids necessary? What % of installations use
> them?

They're obviously necessary if you have more than 4 billion documents.
You can also hit the limit sooner if you search several databases
together and the sizes are uneven (as the docids get interleaved).  They
are also handy if you have an external system with a numeric id which is
wider than 32 bits.
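
The uneven-sizes case can be illustrated with a little arithmetic. The interleaving formula below, (local - 1) * n + index + 1, is an assumption made for this sketch and is not stated in this thread; combinedDocid is a made-up name.

```javascript
// Assumed interleaving: with n databases searched together, document
// `localId` (1-based) in database `index` (0-based) gets this combined id.
function combinedDocid(localId, index, n) {
  return (localId - 1) * n + index + 1;
}

const UINT32_MAX = Math.pow(2, 32) - 1; // 4294967295

// One large database (3 billion documents) searched with one tiny one.
const highest = combinedDocid(3e9, 0, 2);
console.log(highest);              // 5999999999
console.log(highest > UINT32_MAX); // true
```

Under that assumption the combined id space overflows 32 bits even though only about 3 billion documents exist, because the small database still takes up every second id.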

I doubt many people use them currently, quite possibly nobody does.  But
that's likely to change in the foreseeable future.  We're probably near
the point where you could conceivably build an index with this many
documents on commodity hardware.

I was really just trying to check that the issue had been considered, as
unnecessarily hard-wiring in an assumption that these quantities are
32 bit would be short-sighted.

> Seriously, lazy-loading is oversold from what I've seen. If you have data
> from real-world Xapian sites that shows a material advantage for it, I'd
> love to read...

Any site searching a large Xapian database is relying heavily on lazy
loading.

Cheers,
    Olly
Liam | 31 May 2012 00:34

Re: GSoC xapian node binding thoughts

On Tue, May 29, 2012 at 7:24 PM, Olly Betts <olly <at> survex.com> wrote:
> If you change sizeof(Xapian::docid) (and/or the sizes of other types)
> then that's an ABI change, so something built against xapian-core built
> with one docid size simply won't work with xapian-core built with a
> different docid size.

So what happens when our lib tries to load or invoke the incompatible Xapian? Is it possible to prevent a crash?

> > In what context are int64 doc ids necessary? What % of installations use
> > them?
>
> I doubt many people use them currently, quite possibly nobody does.  But
> that's likely to change in the foreseeable future.  We're probably near
> the point where you could conceivably build an index with this many
> documents on commodity hardware.

We can support ids beyond 2^32 by converting to double (the JS Number type), which is exact up to 2^53. Beyond that the values stop converting correctly, meaning we'd throw an overflow error and the user would have to hack the binding himself.

Marius can you make a note to treat docid as a Number instead of uint32, and check the values from Xapian for overflow?
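
A sketch of what such a checked conversion might look like, written in JavaScript for illustration (in the binding it would live on the C++ side). It assumes the 64-bit id is available as two unsigned 32-bit halves, in the style of math.Long; docidToNumber is a hypothetical name.

```javascript
// Hypothetical conversion: the 64-bit id arrives as two unsigned 32-bit
// halves, high and low.
function docidToNumber(high, low) {
  // high * 2^32 + low stays below 2^53 only while high < 2^21.
  if (high >= Math.pow(2, 21)) {
    throw new RangeError('docid does not fit exactly in a Number');
  }
  return high * Math.pow(2, 32) + low;
}

console.log(docidToNumber(1, 5)); // 4294967301
```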

> > Seriously, lazy-loading is oversold from what I've seen. If you have data
> > from real-world Xapian sites that shows a material advantage for it, I'd
> > love to read...
>
> Any site searching a large Xapian database is relying heavily on lazy
> loading.

For an array, it's necessary, so we'll take start & count args when building arrays. For objects, I question the value of lazy loading, save for very large fields.
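
Roughly, building an array with start & count arguments could look like the following. The helper toArray is hypothetical, and a plain JavaScript iterator stands in for a wrapped Xapian iterator.

```javascript
// Hypothetical helper: reads at most `count` entries from an iterator,
// after skipping the first `start`, and returns them as an array.
function toArray(iterator, start, count) {
  start = start || 0;
  if (count === undefined) count = Infinity;
  const out = [];
  let index = 0;
  while (out.length < count) {
    const step = iterator.next();
    if (step.done) break;
    if (index >= start) out.push(step.value);
    index += 1;
  }
  return out;
}

const termIterator = ['a', 'b', 'c', 'd', 'e'][Symbol.iterator]();
console.log(toArray(termIterator, 1, 3)); // [ 'b', 'c', 'd' ]
```

Reading stops as soon as count entries have been collected, so the rest of the underlying list is never touched.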

Olly Betts | 4 Jun 2012 03:33

Re: GSoC xapian node binding thoughts

On Wed, May 30, 2012 at 03:34:53PM -0700, Liam wrote:
> On Tue, May 29, 2012 at 7:24 PM, Olly Betts <olly <at> survex.com> wrote:
> 
> > If you change sizeof(Xapian::docid) (and/or the sizes of other types)
> > then that's an ABI change, so something built against xapian-core built
> > with one docid size simply won't work with xapian-core built with a
> > different docid size.
> 
> So what happens when our lib tries to load or invoke the incompatible
> Xapian? Is it possible to prevent a crash?

Ideally someone building a modified libxapian with an incompatible ABI
would name it differently.  If they don't, there isn't much we can (or
should try to) do about it at this level.

> > I doubt many people use them currently, quite possibly nobody does.  But
> > that's likely to change in the foreseeable future.  We're probably near
> > the point where you could conceivably build an index with this many
> > documents on commodity hardware.
> 
> We can support more than 2^32 values by converting to double (JS type
> Number) for 2^53. But beyond that the values stop converting correctly,
> meaning we'd throw an overflow and the user would have to hack the binding
> himself.
> 
> Marius can you make a note to treat docid as a Number instead of uint32,
> and check the values from Xapian for overflow?

Perhaps it's better just to stick to an integer type, and add support
for 64-bit integers later if/when JavaScript supports them itself.

> > > Seriously, lazy-loading is oversold from what I've seen. If you have data
> > > from real-world Xapian sites that shows a material advantage for it, I'd
> > > love to read...
> >
> > Any site searching a large Xapian database is relying heavily on lazy
> > loading.
> 
> For an array, it's necessary, so we'll take start & count args when
> building arrays. For objects, I question the value of lazy loading, save
> for very large fields.

That's really the point - some of the things we wrap in C++ as iterators
are potentially vast.  Some are certainly less likely to be, and at the
moment some are constrained in size (e.g. we hold the entire list of
terms in a particular document in memory, albeit in compressed form) but
that's just a detail of the current implementation.  Providing a
consistent interface to a list of terms regardless of where it comes
from is useful in itself.

For the MSet and ESet, the list to be iterated has to be computed in
advance, and the process to compute it is fairly costly, but can be
significantly more efficient if we know up front how many entries are
actually wanted, and it's also common to want to present results as a
series of pages, so there we allow a "slice" to be specified by (start,
length).
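
As a small illustration of the paging case, a page number can be mapped onto such a (start, length) slice as below. pageToSlice is a hypothetical helper, not part of Xapian; in the C++ API the pair would correspond to the first and maxitems arguments of Enquire::get_mset().

```javascript
// Hypothetical helper: maps a 1-based page number onto the
// (start, length) pair that selects that page of results.
function pageToSlice(page, pageSize) {
  if (page < 1 || pageSize < 1) {
    throw new RangeError('page and pageSize must be at least 1');
  }
  return { start: (page - 1) * pageSize, length: pageSize };
}

console.log(pageToSlice(3, 10)); // { start: 20, length: 10 }
```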

Cheers,
    Olly
Olly Betts | 30 May 2012 03:19

Re: GSoC xapian node binding thoughts

On Mon, May 28, 2012 at 10:23:12PM -0700, Liam wrote:
> On Mon, May 28, 2012 at 2:44 PM, Olly Betts <olly <at> survex.com> wrote:
> > XOR can also take any number of subqueries.  And on trunk, OP_FILTER,
> > OP_AND_NOT, and OP_AND_MAYBE can also take any number of subqueries
> > (with OP(A, B, C) being interpreted as OP(OP(A, B), C))
> 
> XOR is missing from the online docs for Query(Query::op, Iterator,
> Iterator, termcount)

Thanks for pointing that out - it's been wrong for a while then (since
r3194, 2001-02-26):

    * Some modifications to XOR handling: should now behave like OR and 
      AND - doesn't need to be binary.  (*untested*) 

Now fixed on the 1.2 branch.  It was also missing that ELITE_SET can
take any number of subqueries in that comment, though it clearly says
it can elsewhere, and ELITE_SET would be rather useless if it only took
2 subqueries...

> We can include support for the other ops you mention and leave it commented
> out for now.

I would strongly recommend developing against trunk at this point
anyway.  You don't want to be wrapping anything which has been
deprecated in the C++ API, and it would be good to have wrappers done
for new features.  Once you have trunk wrapped, tweaking the wrappers to
work against 1.2 should be a simple matter of disabling a few parts.

> > Also, it would be nice to support a mixture of strings and query objects
> > as the subqueries (like we do in most of the dynamically typed languages).
> 
> You can include a term query in the list by writing {tname:'string'}, but
> certainly we could let 'string' be a shorthand for that.

It's largely syntactic sugar, but even syntactic sugar is still sweet.

Cheers,
    Olly
Liam | 30 May 2012 03:33

Re: GSoC xapian node binding thoughts


On Tue, May 29, 2012 at 6:19 PM, Olly Betts <olly <at> survex.com> wrote:
> On Mon, May 28, 2012 at 10:23:12PM -0700, Liam wrote:
> > On Mon, May 28, 2012 at 2:44 PM, Olly Betts <olly <at> survex.com> wrote:
> > > XOR can also take any number of subqueries.  And on trunk, OP_FILTER,
> > > OP_AND_NOT, and OP_AND_MAYBE can also take any number of subqueries
> > > (with OP(A, B, C) being interpreted as OP(OP(A, B), C))
> >
> > XOR is missing from the online docs for Query(Query::op, Iterator,
> > Iterator, termcount)
>
> Now fixed on the 1.2 branch.  It was also missing that ELITE_SET can
> take any number of subqueries in that comment, though it clearly says
> it can elsewhere, and ELITE_SET would be rather useless if it only took
> 2 subqueries...

Can I suggest you push an update to the online docs? Actually it'd be nice if those docs included changes on trunk, flagged in some way.

> > We can include support for the other ops you mention and leave it commented
> > out for now.
>
> I would strongly recommend developing against trunk at this point
> anyway.

Is there a trunk package? I seem to recall it's a lot of stuff to install & configure to build everything. That aspect of development makes me grumpy :-)


Olly Betts | 30 May 2012 04:39

Re: GSoC xapian node binding thoughts

On Tue, May 29, 2012 at 06:33:58PM -0700, Liam wrote:
> On Tue, May 29, 2012 at 6:19 PM, Olly Betts <olly <at> survex.com> wrote:
> 
> > Now fixed on the 1.2 branch.  It was also missing that ELITE_SET can
> > take any number of subqueries in that comment, though it clearly says
> > it can elsewhere, and ELITE_SET would be rather useless if it only took
> > 2 subqueries...
> 
> Can I suggest you push an update to the online docs?

Things aren't currently set up to make that easy to do, so I'll leave it
until 1.2.11 is released.  Since nobody noticed for 10 years, another
month doesn't seem a big deal.

> Actually it'd be nice if those docs included changes on trunk, flagged
> in some way.

I try to flag up major changes, but it's quite time intensive to revise
the 1.2 docs in the light of changes on trunk.  Patches welcome if
anyone wants to help.

> > I would strongly recommend developing against trunk at this point
> > anyway.
> 
> Is there a trunk package?

There's a 1.3.0 development snapshot, and you can download snapshots
which track trunk from here: http://oligarchy.co.uk/xapian/trunk/

But for a case like this, I'd recommend checking out the code from git,
as it's much simpler to stay current.

> I seem to recall it's a lot of stuff to install & configure to build
> everything. That aspect of development makes me grumpy :-)

It's pretty streamlined these days.  For Debian and Ubuntu, HACKING even
includes the apt-get command to install the packages you need.

Then it's just:

./bootstrap
./configure
make

Cheers,
    Olly
