Re: GSoC xapian node binding thoughts
Olly Betts <olly <at> survex.com>
2012-06-04 01:33:22 GMT
On Wed, May 30, 2012 at 03:34:53PM -0700, Liam wrote:
> On Tue, May 29, 2012 at 7:24 PM, Olly Betts <olly <at> survex.com> wrote:
> > If you change sizeof(Xapian::docid) (and/or the sizes of other types)
> > then that's an ABI change, so something built against xapian-core built
> > with one docid size simply won't work with xapian-core built with a
> > different docid size.
> So what happens when our lib tries to load or invoke the incompatible
> Xapian? Is it possible to prevent a crash?
Ideally someone building a modified libxapian with an incompatible ABI
would name it differently. If they don't, there isn't much we can (or
should try to) do about it at this level.
> > I doubt many people use them currently, quite possibly nobody does. But
> > that's likely to change in the foreseeable future. We're probably near
> > the point where you could conceivably build an index with this many
> > documents on commodity hardware.
> We can support more than 2^32 values by converting to double (JS type
> Number), which is exact up to 2^53. Beyond that the values stop
> converting correctly, so we'd throw an overflow error and the user would
> have to hack the binding.
> Marius, can you make a note to treat docid as a Number instead of uint32,
> and check the values from Xapian for overflow?
Perhaps it's better just to stick to an integer type, and add support
for wider docids if and when xapian-core actually switches to them.
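For illustration, the overflow check proposed above could look roughly
like this in a modern Node environment. This is only a sketch: the helper
name `docidToNumber` is made up, and it assumes the native layer hands
docids over as BigInt (which didn't exist at the time of this thread, but
makes the 2^53 boundary explicit):

```javascript
// Hypothetical helper: convert a docid from the native layer to a JS Number.
// JS Numbers are IEEE-754 doubles, exact only up to 2^53 - 1
// (Number.MAX_SAFE_INTEGER), so anything larger must be rejected.
function docidToNumber(docid) {
  // `docid` is assumed to be a BigInt handed over by the C++ binding.
  if (docid > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new RangeError(
      `docid ${docid} exceeds 2^53 - 1 and cannot be represented exactly as a JS Number`
    );
  }
  return Number(docid);
}
```

With this in place, a 64-bit docid from a modified xapian-core fails loudly
instead of silently rounding to the nearest representable double.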
> > > Seriously, lazy-loading is oversold from what I've seen. If you have data
> > > from real-world Xapian sites that shows a material advantage for it, I'd
> > > love to read...
> > Any site searching a large Xapian database is relying heavily on lazy
> > loading.
> For an array, it's necessary, so we'll take start & count args when
> building arrays. For objects, I question the value of lazy loading, save
> for very large fields.
That's really the point - some of the things we wrap in C++ as iterators
are potentially vast. Some are certainly less likely to be, and at the
moment some are constrained in size (e.g. we hold the entire list of
terms in a particular document in memory, albeit in compressed form) but
that's just a detail of the current implementation. Providing a
consistent interface to a list of terms regardless of where it comes
from is useful in itself.
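As a sketch of that consistent interface: a generator that pages through a
term list keeps only one page in memory at a time, whatever the backing
store is. The names here (`termsOf`, `fetchPage`) are hypothetical, not
part of any actual binding:

```javascript
// Lazily iterate a term list, fetching it one page at a time.
// fetchPage(start, count) is assumed to return an array of up to
// `count` terms starting at offset `start`.
function* termsOf(fetchPage, pageSize) {
  let start = 0;
  while (true) {
    const page = fetchPage(start, pageSize);
    if (page.length === 0) return;
    yield* page;               // hand terms out one at a time
    if (page.length < pageSize) return; // short page => end of list
    start += page.length;
  }
}
```

A caller iterates it like any other iterable (`for...of`, spread, etc.)
without knowing whether the terms came from memory or from disk.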
For the MSet and ESet, the list to be iterated has to be computed in
advance, and the process to compute it is fairly costly, but can be
significantly more efficient if we know up front how many entries are
actually wanted, and it's also common to want to present results as a
series of pages, so there we allow a "slice" to be specified by (start,
count).
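A paging helper in the binding could pass that slice straight through, in
the style of Xapian's C++ `Enquire::get_mset(first, maxitems)`. The
`enquire` object and its `getMset` method below are hypothetical binding
names, not a real API:

```javascript
// Fetch one page of results by telling the engine the (start, count)
// slice up front, so it never ranks matches it won't return.
const PAGE_SIZE = 10;

function getPage(enquire, pageNumber) {
  // Page 0 covers results 0..9, page 1 covers 10..19, and so on.
  return enquire.getMset(pageNumber * PAGE_SIZE, PAGE_SIZE);
}
```

Because the slice is known before the match is run, the engine only has to
maintain a bounded candidate set while scoring, which is where the
efficiency gain mentioned above comes from.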