Roman Chyla | 17 Jul 2012 18:44

TermEnum.docFreq() includes deleted docs

Hi,

Tests show that TermEnum.docFreq() returns the count of all docs,
including the deleted ones, which seems to (indirectly) contradict the
javadoc.

This frequency count is used to compute the uninverted index
(DocTermOrds.uninvert()). The code goes like:

      final int df = te.docFreq();
      if (df <= maxTermDocFreq) {

So, if I happen to have many deleted documents, and maxTermDocFreq is
low, then the term will be excluded (even if its frequency among the
live docs is OK). Most likely, the cache will be incomplete.
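
To make the effect concrete, here is a minimal standalone sketch with made-up numbers. Only the `df <= maxTermDocFreq` comparison is taken from the snippet above; the class name, the helper and all the counts are hypothetical, and nothing here calls Lucene.

```java
// Illustration only: hypothetical counts, no Lucene dependency.
final class DocFreqThresholdDemo {

    /** Mirrors the quoted check: a term is kept only if df <= maxTermDocFreq. */
    static boolean passesThreshold(int df, int maxTermDocFreq) {
        return df <= maxTermDocFreq;
    }

    public static void main(String[] args) {
        int maxTermDocFreq = 500;       // assumed (low) threshold
        int liveDocsWithTerm = 100;     // docs that still exist
        int deletedDocsWithTerm = 900;  // deleted docs that are still counted

        // The value docFreq() reports when deleted docs are included.
        int reportedDf = liveDocsWithTerm + deletedDocsWithTerm;

        // Reported count exceeds the threshold, so the term is excluded.
        System.out.println(passesThreshold(reportedDf, maxTermDocFreq));
        // The live count alone would have been accepted.
        System.out.println(passesThreshold(liveDocsWithTerm, maxTermDocFreq));
    }
}
```

With these numbers the reported count (1000) fails the check while the live count (100) would pass it.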

Can it be considered a feature? Or is it a bug?

Thanks,

  roman
Michael McCandless | 18 Jul 2012 23:26

Re: TermEnum.docFreq() includes deleted docs

On Tue, Jul 17, 2012 at 12:44 PM, Roman Chyla <roman.chyla <at> gmail.com> wrote:
> Hi,
>
> Tests show that TermEnum.docFreq() returns sum of all docs, including
> the deleted ones. Which seems to (indirectly) contradict the javadoc

That's right; fixing it to reflect deleted documents would be
prohibitively costly.

Hmm, which version/javadocs are you looking at?  IndexReader.docFreq at
least calls out this limitation.

> This frequency count is used to compute uninverted index
> (DocTermOrds.uninvert()). The code goes like:
>
>       final int df = te.docFreq();
>       if (df <= maxTermDocFreq) {
>
>
> So, if I happen to have many deleted documents, and maxTermDocFreq is
> low, then the term will be excluded (even if the freq of the livedocs
> is OK). Most likely, the cache will be incomplete.
>
> Can it be considered a feature? Or is it a bug?

Maybe we could pro-rate the returned docFreq by the percentage of
deleted documents?  It wouldn't be perfectly correct but on average
should have the right effect (keeping RAM consumption down)?
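
A rough sketch of what such pro-rating might look like, as plain Java with no Lucene dependency. The class and method names are hypothetical; `numDocs` (live docs) and `maxDoc` (live plus deleted) are named after the IndexReader accessors, and scaling by index-wide counts rather than per-segment counts is an assumption of this sketch, not something decided in this thread.

```java
// Sketch only: estimates a "live" docFreq by assuming deletions are
// spread evenly across terms. Not an actual Lucene API.
final class ProRatedDocFreq {

    /**
     * Scales rawDocFreq by the fraction of live documents (numDocs / maxDoc).
     * Integer division rounds the estimate down.
     */
    static int estimate(int rawDocFreq, int numDocs, int maxDoc) {
        if (maxDoc <= 0) {
            return 0; // empty index: nothing to scale
        }
        // Widen to long so the intermediate product cannot overflow an int.
        return (int) ((long) rawDocFreq * numDocs / maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 10_000;     // hypothetical: live + deleted docs
        int numDocs = 1_000;     // hypothetical: live docs only (90% deleted)
        int rawDocFreq = 1_000;  // hypothetical value reported for one term

        System.out.println(estimate(rawDocFreq, numDocs, maxDoc));
    }
}
```

The estimate is only right on average: a term whose documents were deleted more (or less) often than the index as a whole will be under- or over-counted.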

Can you open a Jira issue?  Thanks.