Mck | 9 Sep 09:31 2008

Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

-- original post was on solr's user list. --
-- i've reposted here as it's centered on the ShingleFilter which comes from lucene --

*ShortVersion*
 is there a way to make the ShingleFilter perform exact matching via
inserting ^ $ begin/end markers?

*LongVersion*
At sesam.no we want to replace a FAST (fast.no) Query Matching Server
with a Solr index.

The index we are trying to replace is not a regular index, but is specially
configured to perform phrase (and sub-phrase) matches against several
large lists (like an index with only a 'title' field).

I'm not sure of a correct, or logical, name for the behaviour we are
after, but it is like a combination of shingles and exact matching.

Our test list has 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
 "ijkl efgh", "efgh abcd", and "ijkl efgh abcd".

The query behaviour we are looking for is like:
   (i've included ^$ to denote the exact matching)

Original Query   --> Filtered Query
 abcd            -->  ^abcd$
"abcd efgh"      --> (^abcd$ ^"abcd efgh"$ ^efgh$)
"abcd efgh ijkl" --> (^abcd$ ^"abcd efgh"$ ^"abcd efgh ijkl"$ ^efgh$ ^"efgh ijkl"$ ^ijkl$)
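
The expansion in the table above can be sketched in plain Java. This is only an illustration of the desired behaviour, not how ShingleFilter or Solr implement it: the class `ShingleExpansion` and its methods are hypothetical names, and "exact matching" is modelled as whole-string equality against the list entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or Solr.
class ShingleExpansion {

    // Returns every contiguous sub-phrase of the query,
    // ordered by starting word and then by length.
    static List<String> shingles(String query) {
        String[] words = query.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    sb.append(' ');
                }
                sb.append(words[end]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    // Exact matching: a list entry is a hit only if it equals
    // one of the shingles in full (the ^...$ markers in the table).
    static List<String> exactHits(List<String> entries, String query) {
        List<String> shingles = shingles(query);
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            if (shingles.contains(entry)) {
                hits.add(entry);
            }
        }
        return hits;
    }
}
```

Under this model the query "abcd efgh ijkl" against the 9 test entries yields the six entries that are sub-phrases of it, and none of the three reversed entries.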


Mck | 9 Sep 18:58 2008

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

> *ShortVersion*
>  is there a way to make the ShingleFilter perform exact matching via
> inserting ^ $ begin/end markers?

Reading through the mailing list i see how exact matching can be done, a
la STFW to myself...

So the ShortVersion now stands:

For my query "abcd efgh ijkl"
Why does a (perfect looking) MultiPhraseQuery with
	termArrays[0] = { list_entry_shingles:abcd
			  list_entry_shingles:abcd efgh
			  list_entry_shingles:abcd efgh ijkl 
			}
	termArrays[1] = { list_entry_shingles:efgh
			  list_entry_shingles:efgh ijkl 
			}
	termArrays[2] = { list_entry_shingles:ijkl }

return only "abcd efgh ijkl" !?

(when the field is indexed as TextField this is the only hit i get)
(when the field is indexed as StrField i get zero hits!)

When the index contains 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
 "ijkl efgh", "efgh abcd", and "ijkl efgh abcd".

Does this MultiPhraseQuery actually require a match against *every* item
[...]

Steven A Rowe | 9 Sep 22:11 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Hi mck,

On 09/09/2008 at 12:58 PM, Mck wrote:
> > *ShortVersion*
> >  is there a way to make the ShingleFilter perform exact matching via
> > inserting ^ $ begin/end markers?
> 
> Reading through the mailing list i see how exact matching can
> be done, a la STFW to myself...
> 
> So the ShortVersion now stands:
> 
> For my query "abcd efgh ijkl"
> Why does a (perfect looking) MultiPhraseQuery with
> 	termArrays[0] = { list_entry_shingles:abcd
> 			  list_entry_shingles:abcd efgh
> 			  list_entry_shingles:abcd efgh ijkl
> 			}
> 	termArrays[1] = { list_entry_shingles:efgh
> 			  list_entry_shingles:efgh ijkl
> 			}
> 	termArrays[2] = { list_entry_shingles:ijkl }
> 
> return only "abcd efgh ijkl" !?
> 
> (when the field is indexed as TextField this is the only hit i get)
> (when the field is indexed as StrField i get zero hits!)
> 
> When the index contains 9 entries:
>  "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",
[...]

Mck | 9 Sep 22:38 2008

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> Looks to me like MultiPhraseQuery is getting in the way.  Shingles
> that begin at the same word are given the same position by
> ShingleFilter, and Solr's FieldQParserPlugin creates a
> MultiPhraseQuery when it encounters tokens in a query with the same
> position.  I think what you want is to convert queries into shingle
> disjunctions (*any* matching shingle results in a hit),  right?

Yes you're right Steve. thank you.

One way, i see now, to get the behaviour i want is to set the unigrams'
positionIncrement to zero instead of one.

For example in ShingleFilter.fillOutputBuffer(..) replacing the two
occurrences of
> .setPositionIncrement(1);
with
> .setPositionIncrement(0);

Then i end up with a MultiPhraseQuery with
        termArrays[0] = { list_entry_shingles:abcd
                          list_entry_shingles:abcd efgh
                          list_entry_shingles:abcd efgh ijkl 
                          list_entry_shingles:efgh
                          list_entry_shingles:efgh ijkl 
                          list_entry_shingles:ijkl }

and it works perfectly :-)
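
Why the increment change collapses three termArrays into one can be sketched with a simplified model. This is an assumption-laden illustration, not Lucene code: `PositionGrouping`, `Token`, `shingleTokens` and `termArrays` are hypothetical names, and the model assumes only what the thread states, namely that tokens sharing a position end up in the same term array and that a position increment of zero keeps a token at the previous token's position.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model for illustration; none of these names are Lucene API.
class PositionGrouping {

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits all shingles of the given words. Each unigram gets
    // unigramIncrement; longer shingles starting at the same word get 0.
    static List<Token> shingleTokens(String[] words, int unigramIncrement) {
        List<Token> tokens = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    sb.append(' ');
                }
                sb.append(words[end]);
                int inc = (end == start) ? unigramIncrement : 0;
                tokens.add(new Token(sb.toString(), inc));
            }
        }
        return tokens;
    }

    // Groups terms by position: a positive increment opens a new group,
    // a zero increment joins the current one.
    static List<List<String>> termArrays(List<Token> tokens) {
        List<List<String>> arrays = new ArrayList<>();
        for (Token t : tokens) {
            if (arrays.isEmpty() || t.positionIncrement > 0) {
                arrays.add(new ArrayList<>());
            }
            arrays.get(arrays.size() - 1).add(t.term);
        }
        return arrays;
    }
}
```

With `unigramIncrement` set to 1 the words abcd, efgh, ijkl produce three groups (3, 2 and 1 terms), matching the earlier MultiPhraseQuery. With 0 all six terms share one group, matching the single termArrays[0] shown above.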

I see no way of configuring this behaviour though. 

Steven A Rowe | 9 Sep 23:20 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/09/2008 at 4:38 PM, Mck wrote:
> 
> > Looks to me like MultiPhraseQuery is getting in the way.  Shingles
> > that begin at the same word are given the same position by
> > ShingleFilter, and Solr's FieldQParserPlugin creates a
> > MultiPhraseQuery when it encounters tokens in a query with the same
> > position.  I think what you want is to convert queries into shingle
> > disjunctions (*any* matching shingle results in a hit),  right?
> 
> Yes you're right Steve. thank you.
> 
> One way, i see now, to get the behaviour i want is to set the unigrams'
> positionIncrement to zero instead of one.
> 
> For example in ShingleFilter.fillOutputBuffer(..) replacing the two
> ocurrances of
> > .setPositionIncrement(1); 
> with 
> > .setPositionIncrement(0);
> 
> Then i end up with a MultiPhraseQuery with
>         termArrays[0] = { list_entry_shingles:abcd
>                           list_entry_shingles:abcd efgh
>                           list_entry_shingles:abcd efgh ijkl
>                           list_entry_shingles:efgh
>                           list_entry_shingles:efgh ijkl
>                           list_entry_shingles:ijkl }
> 
> and it works perfectly :-)

[...]

Mck | 10 Sep 09:55 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> probably better to change the one instance of .setPositionIncrement(0)
> to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be
> invoked, and the standard disjunction thing should happen.

Tried this.
As you say, i end up with a PhraseQuery instead:
        terms = { list_entry_shingles:abcd
                  list_entry_shingles:abcd efgh
                  list_entry_shingles:abcd efgh ijkl 
                  list_entry_shingles:efgh
                  list_entry_shingles:efgh ijkl 
                  list_entry_shingles:ijkl }

But this does not return the hits i want.
(It returns one hit if TextField and zero hits if StrField, the same
behaviour i mentioned before).

~mck

--

-- 
"Traveller, there are no paths. Paths are made by walking." Australian
Aboriginal saying 
| semb.wever.org | sesat.no | sesam.no |

Steven A Rowe | 10 Sep 17:10 2008

RE: RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Hi mck,

On 09/10/2008 at 3:55 AM, Mck wrote:
> > probably better to change the one instance of .setPositionIncrement(0)
> > to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be
> > invoked, and the standard disjunction thing should happen.
> 
> Tried this.
> As you say i end up with instead a
> PhraseQuery
>         terms = { list_entry_shingles:abcd
>                   list_entry_shingles:abcd efgh 
>                   list_entry_shingles:abcd efgh ijkl
>                   list_entry_shingles:efgh
>                   list_entry_shingles:efgh ijkl
>                   list_entry_shingles:ijkl
>                   }
> 
> But this does not return the hits i want.
> (It returns one hit if TextField and zero hits if StrField, the same
> behaviour i mentioned before).

Have you tried submitting the query without quotes?  (That's where the PhraseQuery likely comes from.)

Steve

Mck | 10 Sep 18:02 2008

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> > But this does not return the hits i want.
> 
> Have you tried submitting the query without quotes?  (That's where the
> PhraseQuery likely comes from.)

Yes. It does not work.
It returns just the unigrams, again the same behaviour as mentioned
earlier.

Debugging ShingleFilter in this case shows that no shingles are ever
constructed. There are 3 separate tokens in the query and that's all.

The ShingleFilter appears to only work, at least for me, on phrases.
I would think this correct as each shingle is in fact a sub-phrase of
the larger original phrase. Is that presumption correct?

~mck

--

-- 
"Great spirits have always encountered violent opposition from mediocre
minds. The mediocre mind is incapable of understanding the man who
refuses to bow blindly to conventional prejudices and chooses instead to
express his opinions courageously and honestly." Albert Einstein 
| semb.wever.org | sesat.no | sesam.no |

Steven A Rowe | 10 Sep 18:27 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/10/2008 at 12:02 PM, Mck wrote:
> > > But this does not return the hits i want.
> > 
> > Have you tried submitting the query without quotes? (That's where the
> > PhraseQuery likely comes from.)
> 
> Yes. It does not work. It returns just the unigrams, again the same
> behaviour as mentioned earlier.
> 
> Debugging ShingleFilter in this case it shows that no
> shingles are ever constructed. There are 3 separate tokens in the
> query and that's all.
> 
> The ShingleFilter appears to only work, at least for me, on phrases.
> I would think this correct as each shingle is in fact a sub-phrase to
> the larger original phrase. Is that presumption correct?

ShingleFilter has nothing to do with phrase queries (other than the fact that it can be used to replace them
in some circumstances).

I'm not an expert on Solr query parsing, but there *must* be a way to submit a query that is not turned into a
phrase query.  Really.  

And if you have configured an analyzer that includes a query-time filter, it should be invoked, regardless
of whether a phrase query is constructed.

Steve

Mck | 10 Sep 19:17 2008

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

> And if you have configured an analyzer that includes a query-time
> filter, it should be invoked, regardless of whether a phrase query is
> constructed.

Sorry Steve, i failed to explain this clearly.

Without phrasing the ShingleFilter is indeed invoked.
But it is used three separate times, once for each term:
 1) abcd
 2) efgh
 3) ijkl
So no shingles are generated.

With phrasing the ShingleFilter is used once:
 1) abcd efgh ijkl
And so all the shingles are generated.
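
The difference between the two cases can be sketched as follows. This is a minimal model under stated assumptions, not the Solr query parser: `QueryAnalysisModel` and its methods are hypothetical, and the "analyzer" here is just a function that returns every contiguous sub-phrase of whatever word list it is handed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model for illustration; not Solr or Lucene code.
class QueryAnalysisModel {

    // Stand-in for a shingling analyzer: all contiguous sub-phrases.
    static List<String> shingles(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.size(); end++) {
                if (end > start) {
                    sb.append(' ');
                }
                sb.append(words.get(end));
                out.add(sb.toString());
            }
        }
        return out;
    }

    // Unquoted query: each term is analyzed on its own, so the
    // analyzer never sees two adjacent words together.
    static List<String> analyzeUnquoted(String query) {
        List<String> out = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            out.addAll(shingles(Collections.singletonList(term)));
        }
        return out;
    }

    // Quoted query: the whole phrase is analyzed in a single pass.
    static List<String> analyzeQuoted(String phrase) {
        return shingles(Arrays.asList(phrase.trim().split("\\s+")));
    }
}
```

In this model the unquoted query yields only the three unigrams, while the quoted phrase yields all six shingles, which is the behaviour described above.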

I do not know Solr and Lucene well enough to appreciate how the
query parsing is working together here.

But what i do see, just within
no.apache.jakarta.lucene.queryParser.QueryParser.getFieldQuery(..)
is that there are three possible return values:
 BooleanQuery, MultiPhraseQuery, or PhraseQuery.

The remaining alternative is BooleanQuery and that happens when
positionCount (which is the sum of all the tokens' positionIncrements)
equals one. That's even tougher to achieve.
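
The choice between the three query types can be sketched as a small decision function. This is a simplified model built only from what this thread describes, not the QueryParser source: `FieldQueryModel`, `Kind` and `choose` are hypothetical names, the zero-token and single-token cases are ignored, and "several tokens at the same position" is assumed to mean that at least one token has a position increment of zero.

```java
// Simplified decision model for illustration; not the Lucene QueryParser.
class FieldQueryModel {

    enum Kind { BOOLEAN_QUERY, MULTI_PHRASE_QUERY, PHRASE_QUERY }

    // positionIncrements holds one entry per token produced by the analyzer.
    static Kind choose(int[] positionIncrements) {
        int positionCount = 0;                      // sum of all increments
        boolean severalTokensAtSamePosition = false;
        for (int inc : positionIncrements) {
            if (inc == 0) {
                severalTokensAtSamePosition = true; // token stacked on the previous one
            }
            positionCount += inc;
        }
        if (!severalTokensAtSamePosition) {
            return Kind.PHRASE_QUERY;               // every token has its own position
        }
        // Stacked tokens: a plain disjunction only if there is exactly one position.
        return positionCount == 1 ? Kind.BOOLEAN_QUERY : Kind.MULTI_PHRASE_QUERY;
    }
}
```

For the six shingles of a three-word phrase, increments of {1,0,0,1,0,1} (unigrams at 1) and {0,0,0,0,0,0} (everything at 0) both select MULTI_PHRASE_QUERY, {1,1,1,1,1,1} selects PHRASE_QUERY, and only a stream such as {1,0,0,0,0,0} would select BOOLEAN_QUERY.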

~mck

Steven A Rowe | 10 Sep 19:48 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/10/2008 at 1:17 PM, Mck wrote:
> Without phrasing the ShingleFilter is indeed invoked.
> But it is used three separate times for each term
>  1) abcd
>  2) efgh
>  3) ijkl
> So there is no shingles generated.

Ah, right, each individual token is sent through the analyzer.

> With phrasing the ShingleFilter it is used once
>  1) abcd efgh ijkl
> And so all the shingles are generated.

Wow, I don't see any alternatives to your solution.  

Your solution, on the one hand, however, is a kludge: you are disabling position information (by assigning
the same position to all tokens) in order to induce a particular behavior in the query parser, which may
change in the future.  Long term, I think this should be addressed: there should be a query parser that will
work directly with ShingleFilter, i.e., that will pass all tokens at once to it without requiring quotes.

On the other hand, I'm not sure how useful position information is for shingles in the general case: they
already have relative position info embedded within them.  And how likely is it that one would want to
perform a phrase/span query over shingles?  Pretty unlikely, methinks.

Anyhow, I suggest you change the name of the option you're adding in LUCENE-1380 to "disablePositions",
and make it boolean -- this better describes what you're trying to do.  When true, all position increments
would be set to zero.  It should default to false.

Steve

Mck | 15 Sep 11:56 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Steve,
> Your solution, on the one hand, however, is a kludge: you are
> disabling position information (by assigning the same position to all
> tokens) in order to induce a particular behavior in the query parser,
> which may change in the future.

I disagree.

I'm not disabling position information to induce particular behaviour in
the query parser.

I'm intentionally setting position information to zero as I wish _all_
shingles and unigrams to be synonyms of each other.

The query parser expects you to assign positionIncrement=0 for synonyms
in this manner.

The one kludge i see is that the QueryParser expects the total positions
found to be greater than or equal to one. It might not be intentionally
dealing with the total position count being zero. But the situation
where you have many synonyms is the same as having one token and it
having many synonyms, so positionCount=0 == positionCount=1.

I would think that both should lead to a BooleanQuery being constructed
by the QueryParser. (But the synonyms generated by the ShingleFilter are
in fact phrases so maybe it is wiser to use the MultiPhraseQuery.)

So all in all the QueryParser is behaving exactly as i would expect it
to.
The only logic being induced is setting positionIncrement=0 to indicate
[...]

Mck | 10 Sep 13:04 2008

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> [snip] The option thus should be named something like
> "coterminalPositionIncrement".  This seems like a reasonable addition,
> and a patch likely would be accepted, if it included unit tests.

Done.
https://issues.apache.org/jira/browse/LUCENE-1380

~mck

--

-- 
"The only thing I know, is that I know nothing." Socrates 
| semb.wever.org | sesat.no | sesam.no |