Mck | 9 Sep 2008 09:31

Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

-- original post was on solr's user list. --
-- i've reposted here as it's centered on the ShingleFilter which comes from lucene --

*ShortVersion*
 is there a way to make the ShingleFilter perform exact matching via
inserting ^ $ begin/end markers?

*LongVersion*
At sesam.no we want to replace a FAST (fast.no) Query Matching Server
with a Solr index.

The index we are trying to replace is not a regular index, but specially
configured to perform phrase (and sub-phrase) matches against several
large lists (like an index with only a 'title' field).

I'm not sure of a correct, or logical, name for the behaviour we are
after, but it is like a combination between Shingles and exact matching.

Our test list has 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl", "ijkl efgh", "efgh abcd", and "ijkl
efgh abcd".

The query behaviour we are looking for is like:
   (i've included ^$ to denote the exact matching)

Original Query   --> Filtered Query
 abcd            -->  ^abcd$
"abcd efgh"      --> (^abcd$ ^"abcd efgh"$ ^efgh$)
"abcd efgh ijkl" --> (^abcd$ ^"abcd efgh"$ ^"abcd efgh ijkl"$ ^efgh$ ^"efgh ijkl"$ ^ijkl$)

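For illustration only: the "Filtered Query" column above amounts to taking every contiguous sub-phrase of the original query. A minimal sketch in plain Java, without Lucene. The names ShingleSketch and shingles are hypothetical, and this is not how ShingleFilter itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleSketch {
    /** Returns every contiguous sub-phrase of the query, ordered by start word, then by length. */
    static List<String> shingles(String query) {
        String[] words = query.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(words[end]);
                // Each prefix of the words from 'start' onwards is one shingle.
                out.add(phrase.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("abcd efgh ijkl"));
    }
}
```

Each returned string would then have to match a list entry exactly (the ^...$ in the table), not merely occur inside it.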

Mck | 9 Sep 2008 18:58

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

> *ShortVersion*
>  is there a way to make the ShingleFilter perform exact matching via
> inserting ^ $ begin/end markers?

Reading through the mailing list i see how exact matching can be done, a
la STFW to myself...

So the ShortVersion now stands:

For my query "abcd efgh ijkl"
Why does a (perfect looking) MultiPhraseQuery with
	termArrays[0] = { list_entry_shingles:abcd
			  list_entry_shingles:abcd efgh
			  list_entry_shingles:abcd efgh ijkl 
			}
	termArrays[1] = { list_entry_shingles:efgh
			  list_entry_shingles:efgh ijkl 
			}
	termArrays[2] = { list_entry_shingles:ijkl }

return only "abcd efgh ijkl" !?

(when the field is indexed as TextField this is the only hit i get)
(when the field is indexed as StrField i get zero hits!)

When the index contains 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl", "ijkl efgh", "efgh abcd", and "ijkl
efgh abcd".

Does this MultiPhraseQuery actually require a match against *every* item

Steven A Rowe | 9 Sep 2008 22:11

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Hi mck,

On 09/09/2008 at 12:58 PM, Mck wrote:
> > *ShortVersion*
> >  is there a way to make the ShingleFilter perform exact matching via
> > inserting ^ $ begin/end markers?
> 
> Reading through the mailing list i see how exact matching can
> be done, a la STFW to myself...
> 
> So the ShortVersion now stands:
> 
> For my query "abcd efgh ijkl"
> Why does a (perfect looking) MultiPhraseQuery with
> 	termArrays[0] = { list_entry_shingles:abcd
> 			  list_entry_shingles:abcd efgh
> 			  list_entry_shingles:abcd efgh ijkl
> 			}
> 	termArrays[1] = { list_entry_shingles:efgh
> 			  list_entry_shingles:efgh ijkl
> 			}
> 	termArrays[2] = { list_entry_shingles:ijkl }
> 
> return only "abcd efgh ijkl" !?
> 
> (when the field is indexed as TextField this is the only hit i get)
> (when the field is indexed as StrField i get zero hits!)
> 
> When the index contains 9 entries:
>  "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",

Mck | 9 Sep 2008 22:38

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> Looks to me like MultiPhraseQuery is getting in the way.  Shingles
> that begin at the same word are given the same position by
> ShingleFilter, and Solr's FieldQParserPlugin creates a
> MultiPhraseQuery when it encounters tokens in a query with the same
> position.  I think what you want is to convert queries into shingle
> disjunctions (*any* matching shingle results in a hit),  right?

Yes you're right Steve. thank you.

One way, i see now, to get the behaviour i want is to set the unigrams'
positionIncrement to zero instead of one.

For example in ShingleFilter.fillOutputBuffer(..) replacing the two
occurrences of
> .setPositionIncrement(1);
with
> .setPositionIncrement(0);

Then i end up with a MultiPhraseQuery with
        termArrays[0] = { list_entry_shingles:abcd
                          list_entry_shingles:abcd efgh
                          list_entry_shingles:abcd efgh ijkl 
                          list_entry_shingles:efgh
                          list_entry_shingles:efgh ijkl 
                          list_entry_shingles:ijkl }

and it works perfectly :-)

I see no way of configuring this behaviour though. 
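To make the difference concrete, here is a rough model in plain Java of phrase-style positional matching over the 9 test entries. It is a simplification, not Lucene's MultiPhraseQuery. It assumes a TextField-like entry carries its shingles at the position of their first word, and a StrField-like entry is one term at one position. All names (PositionalMatchSketch, countHits, collapse, ...) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PositionalMatchSketch {
    static final List<String> ENTRIES = Arrays.asList(
        "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
        "ijkl efgh", "efgh abcd", "ijkl efgh abcd");

    /** Index i of the result holds every shingle that starts at word i. */
    static List<Set<String>> shinglesByPosition(String text) {
        String[] words = text.trim().split("\\s+");
        List<Set<String>> positions = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            Set<String> atPosition = new HashSet<>();
            StringBuilder phrase = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(words[end]);
                atPosition.add(phrase.toString());
            }
            positions.add(atPosition);
        }
        return positions;
    }

    /** Models an untokenized (StrField-like) entry: one term at one position. */
    static List<Set<String>> singleTerm(String text) {
        List<Set<String>> positions = new ArrayList<>();
        positions.add(new HashSet<>(Arrays.asList(text)));
        return positions;
    }

    /** Merges all query positions into one, as when every term shares a position. */
    static List<Set<String>> collapse(List<Set<String>> query) {
        Set<String> all = new HashSet<>();
        for (Set<String> terms : query) {
            all.addAll(terms);
        }
        List<Set<String>> one = new ArrayList<>();
        one.add(all);
        return one;
    }

    /** True if each query position finds one of its terms at consecutive entry positions. */
    static boolean matches(List<Set<String>> doc, List<Set<String>> query) {
        for (int offset = 0; offset + query.size() <= doc.size(); offset++) {
            boolean allPositionsMatch = true;
            for (int i = 0; i < query.size() && allPositionsMatch; i++) {
                Set<String> common = new HashSet<>(doc.get(offset + i));
                common.retainAll(query.get(i));
                allPositionsMatch = !common.isEmpty();
            }
            if (allPositionsMatch) {
                return true;
            }
        }
        return false;
    }

    /** Counts matching entries; shingledIndex selects how the entries are modelled. */
    static int countHits(boolean shingledIndex, List<Set<String>> query) {
        int hits = 0;
        for (String entry : ENTRIES) {
            List<Set<String>> doc = shingledIndex ? shinglesByPosition(entry) : singleTerm(entry);
            if (matches(doc, query)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Set<String>> query = shinglesByPosition("abcd efgh ijkl");
        System.out.println(countHits(true, query));             // three query positions, shingled entries
        System.out.println(countHits(false, query));            // three query positions, single-term entries
        System.out.println(countHits(false, collapse(query)));  // one query position, single-term entries
    }
}
```

Under these assumptions the three-position query hits one shingled entry and no single-term entries, while the collapsed query hits the six single-term entries that are sub-phrases of the query. That is consistent with the TextField and StrField results reported earlier in the thread.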

Steven A Rowe | 9 Sep 2008 23:20

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/09/2008 at 4:38 PM, Mck wrote:
> 
> > Looks to me like MultiPhraseQuery is getting in the way.  Shingles
> > that begin at the same word are given the same position by
> > ShingleFilter, and Solr's FieldQParserPlugin creates a
> > MultiPhraseQuery when it encounters tokens in a query with the same
> > position.  I think what you want is to convert queries into shingle
> > disjunctions (*any* matching shingle results in a hit),  right?
> 
> Yes you're right Steve. thank you.
> 
> One way, i see now, to get the behaviour i want is to set the unigrams'
> positionIncrement to zero instead of one.
> 
> For example in ShingleFilter.fillOutputBuffer(..) replacing the two
> occurrences of
> > .setPositionIncrement(1); 
> with 
> > .setPositionIncrement(0);
> 
> Then i end up with a MultiPhraseQuery with
>         termArrays[0] = { list_entry_shingles:abcd
>                           list_entry_shingles:abcd efgh
>                           list_entry_shingles:abcd efgh ijkl
>                           list_entry_shingles:efgh
>                           list_entry_shingles:efgh ijkl
>                           list_entry_shingles:ijkl }
> 
> and it works perfectly :-)


Mck | 10 Sep 2008 13:04

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> [snip] The option thus should be named something like
> "coterminalPositionIncrement".  This seems like a reasonable addition,
> and a patch likely would be accepted, if it included unit tests.

Done.
https://issues.apache.org/jira/browse/LUCENE-1380

~mck

--

-- 
"The only thing I know, is that I know nothing." Socrates 
| semb.wever.org | sesat.no | sesam.no |
Mck | 10 Sep 2008 09:55

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> probably better to change the one instance of .setPositionIncrement(0)
> to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be
> invoked, and the standard disjunction thing should happen.

Tried this.
As you say, i instead end up with a
PhraseQuery
        terms = { list_entry_shingles:abcd
                  list_entry_shingles:abcd efgh
                  list_entry_shingles:abcd efgh ijkl 
                  list_entry_shingles:efgh
                  list_entry_shingles:efgh ijkl 
                  list_entry_shingles:ijkl }

But this does not return the hits i want.
(It returns one hit if TextField and zero hits if StrField, the same
behaviour i mentioned before).

~mck

--

-- 
"Traveller, there are no paths. Paths are made by walking." Australian
Aboriginal saying 
| semb.wever.org | sesat.no | sesam.no |
Steven A Rowe | 10 Sep 2008 17:10

RE: RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Hi mck,

On 09/10/2008 at 3:55 AM, Mck wrote:
> > probably better to change the one instance of .setPositionIncrement(0)
> > to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be
> > invoked, and the standard disjunction thing should happen.
> 
> Tried this.
> As you say i end up with instead a
> PhraseQuery
>         terms = { list_entry_shingles:abcd
>                   list_entry_shingles:abcd efgh 
>                   list_entry_shingles:abcd efgh ijkl
>                   list_entry_shingles:efgh
>                   list_entry_shingles:efgh ijkl
>                   list_entry_shingles:ijkl
>                   }
> 
> But this does not return the hits i want.
> (It returns one hit if TextField and zero hits if StrField, the same
> behaviour i mentioned before).

Have you tried submitting the query without quotes?  (That's where the PhraseQuery likely comes from.)

Steve
Mck | 10 Sep 2008 18:02

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


> > But this does not return the hits i want.
> 
> Have you tried submitting the query without quotes?  (That's where the
> PhraseQuery likely comes from.)

Yes. It does not work.
It returns just the unigrams, again the same behaviour as mentioned
earlier.

Debugging ShingleFilter in this case shows that no shingles are ever
constructed. There are 3 separate tokens in the query and that's all.

The ShingleFilter appears to only work, at least for me, on phrases.
I would think this correct as each shingle is in fact a sub-phrase to
the larger original phrase. Is that presumption correct?

~mck

--

-- 
"Great spirits have always encountered violent opposition from mediocre
minds. The mediocre mind is incapable of understanding the man who
refuses to bow blindly to conventional prejudices and chooses instead to
express his opinions courageously and honestly." Albert Einstein 
| semb.wever.org | sesat.no | sesam.no |
Steven A Rowe | 10 Sep 2008 18:27

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/10/2008 at 12:02 PM, Mck wrote:
> > > But this does not return the hits i want.
> > 
> > Have you tried submitting the query without quotes? (That's where the
> > PhraseQuery likely comes from.)
> 
> Yes. It does not work. It returns just the unigrams, again the same
> behaviour as mentioned earlier.
> 
> Debugging ShingleFilter in this case it shows that no
> shingles are ever constructed. There are 3 separate tokens in the
> query and that's all.
> 
> The ShingleFilter appears to only work, at least for me, on phrases.
> I would think this correct as each shingle is in fact a sub-phrase to
> the larger original phrase. Is that presumption correct?

ShingleFilter has nothing to do with phrase queries (other than the fact that it can be used to replace them
in some circumstances).

I'm not an expert on Solr query parsing, but there *must* be a way to submit a query that is not turned into a
phrase query.  Really.  

And if you have configured an analyzer that includes a query-time filter, it should be invoked, regardless
of whether a phrase query is constructed.

Steve
Mck | 10 Sep 2008 19:17

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

> And if you have configured an analyzer that includes a query-time
> filter, it should be invoked, regardless of whether a phrase query is
> constructed.

Sorry Steve, i failed to explain this clearly.

Without phrasing the ShingleFilter is indeed invoked.
But it is invoked three separate times, once for each term
 1) abcd
 2) efgh
 3) ijkl
So no shingles are generated.

With phrasing the ShingleFilter is invoked once
 1) abcd efgh ijkl
And so all the shingles are generated.

I do not know Solr and Lucene well enough to appreciate how the
query parsing is working together here.

But what i do see, just within
no.apache.jakarta.lucene.queryParser.QueryParser.getFieldQuery(..)
is that there are three possible return values:
 BooleanQuery, MultiPhraseQuery, or PhraseQuery.

The remaining alternative is BooleanQuery and that happens when
positionCount (which is the sum of all the tokens' positionIncrements)
equals one. That's even tougher to achieve.
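As a reading aid, the branching described above can be modelled roughly as follows. This is a simplified sketch from memory of the Lucene 2.x QueryParser, not the actual source. The class and method names are hypothetical, and the empty and single-token cases are an assumption beyond the three return types listed above.

```java
import java.util.Arrays;
import java.util.List;

class QueryTypeSketch {
    /** Names the query type that the described branching would pick for the given increments. */
    static String queryTypeFor(List<Integer> positionIncrements) {
        if (positionIncrements.isEmpty()) {
            return "none";        // assumption: nothing to query
        }
        if (positionIncrements.size() == 1) {
            return "TermQuery";   // assumption: a lone token needs no compound query
        }
        int positionCount = 0;
        boolean severalTokensAtSamePosition = false;
        for (int increment : positionIncrements) {
            if (increment == 0) {
                severalTokensAtSamePosition = true;
            } else {
                positionCount += increment;
            }
        }
        if (!severalTokensAtSamePosition) {
            return "PhraseQuery";
        }
        return positionCount == 1 ? "BooleanQuery" : "MultiPhraseQuery";
    }

    public static void main(String[] args) {
        // Unigrams at increment 1, longer shingles at increment 0.
        System.out.println(queryTypeFor(Arrays.asList(1, 0, 0, 1, 0, 1)));
        // Every token at increment 0.
        System.out.println(queryTypeFor(Arrays.asList(0, 0, 0, 0, 0, 0)));
        // Every token at increment 1.
        System.out.println(queryTypeFor(Arrays.asList(1, 1, 1, 1, 1, 1)));
        // First token at increment 1, the rest at 0.
        System.out.println(queryTypeFor(Arrays.asList(1, 0, 0, 0, 0, 0)));
    }
}
```

In this model only the last case, where the increments sum to exactly one, reaches the BooleanQuery branch.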

~mck

Steven A Rowe | 10 Sep 2008 19:48

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/10/2008 at 1:17 PM, Mck wrote:
> Without phrasing the ShingleFilter is indeed invoked.
> But it is used three separate times for each term
>  1) abcd
>  2) efgh
>  3) ijkl
> So there is no shingles generated.

Ah, right, each individual token is sent through the analyzer.

> With phrasing the ShingleFilter it is used once
>  1) abcd efgh ijkl
> And so all the shingles are generated.

Wow, I don't see any alternatives to your solution.  

Your solution, on the one hand, however, is a kludge: you are disabling position information (by assigning
the same position to all tokens) in order to induce a particular behavior in the query parser, which may
change in the future.  Long term, I think this should be addressed: there should be a query parser that will
work directly with ShingleFilter, i.e., that will pass all tokens at once to it without requiring quotes.

On the other hand, I'm not sure how useful position information is for shingles in the general case: they
already have relative position info embedded within them.  And how likely is it that one would want to
perform a phrase/span query over shingles?  Pretty unlikely, methinks.

Anyhow, I suggest you change the name of the option you're adding in LUCENE-1380 to "disablePositions",
and make it boolean -- this better describes what you're trying to do.  When true, all position increments
would be set to zero.  It should default to false.

Steve

Mck | 15 Sep 2008 11:56

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Steve,
> Your solution, on the one hand, however, is a kludge: you are
> disabling position information (by assigning the same position to all
> tokens) in order to induce a particular behavior in the query parser,
> which may change in the future.

I disagree.

I'm not disabling position information to induce particular behaviour in
the query parser.

I'm intentionally setting position information to zero as I wish _all_
shingles and unigrams to be synonyms of each other.

The query parser expects you to assign positionIncrement=0 for synonyms
in this manner.

The one kludge i see is that the QueryParser expects the total positions
found to be greater than or equal to one. It might not be intentionally
dealing with the total position count being zero. But the situation
where you have many synonyms is the same as having one token and it
having many synonyms, so positionCount=0 == positionCount=1.

I would think that both should lead to a BooleanQuery being constructed
by the QueryParser. (But the synonyms generated by the ShingleFilter are
in fact phrases so maybe it is wiser to use the MultiPhraseQuery.)

So all in all the QueryParser is behaving exactly as i would expect it
to.
The only logic being induced is setting positionIncrement=0 to indicate

Chris Hostetter | 16 Sep 2008 22:59

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


: The query parser expects you to assign positionIncrement=0 for synonyms
: in this manner.

correct.

: The one kludge i see is that the QueryParser expects the total positions
: found to be greater than or equal to one. It might not be intentionally
: dealing with the total position count being zero. But the situation
: where you have many synonyms is the same as having one token and it
: having many synonyms, so positionCount=0 == positionCount=1.

there has definitely been some wonkiness in various places in the code
relating to the first token not having a positionIncrement of "1" ... i
don't remember the details, and maybe it works fine even if every token in
a stream is "0" but the safe thing to do is make sure the first token has
a positionIncrement of "1" and the "synonyms" after that use an increment
of "0"

This is important not only in case the Lucene internals freak out when 
the "first" token has an increment of "0" but also because you have no way 
of knowing if the first token you produce is really the first token being 
given to the IndexWriter (or QueryParser or what have you)

To be a well behaved TokenStream producer you can't assume you operate in
a vacuum:

1) multiple "Field" instances with the same field name could be added to a 
document, with an Analyzer that uses your Filter but doesn't define any 
particular positionIncrementGap ... if every token you produce has an 

Mck | 17 Sep 2008 10:17

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

Chris,

> the safe thing to do is make sure the first token has 
> a positionIncrement of "1" and the 'synonyms" after that use an
> increment of "0"

Yes it makes sense to have at minimum the first token with
positionIncrement=1
I didn't see outside "the vacuum" at all, thank you for explaining.

Does it make sense for me to rewrite the ShingleFilter patch to ensure
the first token returned always has positionIncrement=1 regardless if
enablePositions is true or false?

(i'll test too that BooleanQuery works as presumed in my case...)

Would such a rewrite of the ShingleFilter patch be a substitute for the
custom Analyzer you talk about?
(i'm pushing to keep any patch restricted to the ShingleFilter since my
gut feeling is still that's where the change in behaviour is).
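A small sketch of the increments this rewrite would produce, in plain Java. It assumes the behaviour proposed above: the first token always gets positionIncrement=1, and with enablePositions=false every later token gets 0. The actual patch in LUCENE-1380 may differ, and the names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PositionIncrementSketch {
    /**
     * Position increments for the shingles of a wordCount-word input, emitted in
     * the order: all shingles starting at word 0, then at word 1, and so on.
     */
    static List<Integer> increments(int wordCount, boolean enablePositions) {
        List<Integer> out = new ArrayList<>();
        for (int start = 0; start < wordCount; start++) {
            for (int end = start; end < wordCount; end++) {
                boolean firstTokenOverall = out.isEmpty();
                boolean unigram = (end == start);
                // The very first token always advances the position by one.
                // Later unigrams advance it only when positions are enabled.
                boolean advance = firstTokenOverall || (enablePositions && unigram);
                out.add(advance ? 1 : 0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(increments(3, true));
        System.out.println(increments(3, false));
    }
}
```

With positions disabled the increments sum to one, so every shingle and unigram sits at the same first position, as synonyms of each other.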

~mck

--

-- 
"Between two evils, I always pick the one I never tried before." Mae
West 
| semb.wever.org | sesat.no | sesam.no |
Chris Hostetter | 17 Sep 2008 20:50

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

: Does it make sense for me to rewrite the ShingleFilter patch to ensure
: the first token returned always has positionIncrement=1 regardless if
: enablePositions is true or false?

I don't know anything about ShingleFilter -- i've never looked at it and 
i'm not entirely certain i even understand what it does, let alone how 
your patch is attempting to modify it -- but if you've got code that takes 
in text and it produces multiple tokens as a result, then that first token 
it produces should (probably) have a non-zero positionIncrement.  if your 
code takes in "a" Token and produces multiple tokens to replace it, then 
the first token you produce should (probably) have the same 
positionIncrement as the input Token.

(Disclaimer: it's been a long time since i worked with the TokenStream
APIs, so i'm hoping my memory isn't faulty and someone else backs me up
with a "yeah, that's correct")

-Hoss
Mck | 17 Sep 2008 13:32

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

> (i'll test too that BooleanQuery works as presumed in my case...)

Indeed it works beautifully with a BooleanQuery.
I've updated the patch to LUCENE-1380

~mck

--

-- 
"If you have any trouble sounding condescending, find a Unix user to
show you how it's done." Scott Adams 
| semb.wever.org | sesat.no | sesam.no |
Chris Hostetter | 10 Sep 2008 23:47

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching


: change in the future.  Long term, I think this should be addressed: 
: there should be a query parser that will work directly with 
: ShingleFilter, i.e., that will pass all tokens at once to it without 
: requiring quotes.

I won't pretend that i've been following this thread all that closely, but
to chime in here real fast:  converting query strings to Query objects
usually involves two passes: pass one involves the QueryParser (or
something like it) checking for meta-markup; pass two involves the
QueryParser sending chunks of text to the appropriate Analyzer.

In Solr, when you use the default QParser, the input is passed to the
Lucene QueryParser which (in the absence of quotes) splits on whitespace,
and treats each chunk separately before passing them to the appropriate
Analyzer (it does this because each chunk could have a different field
name)

BUT! ... there are (as of Solr 1.3) many QParser options which can be
selected at query time.  Yonik even added a whole new string prefix syntax
so you can pick the QParser per individual querystring-esque param (e.g. q,
fq, bq, facet.query, etc.)

The FieldQParserPlugin in particular passes the entire querystring to the 
Analyzer for the field specified by an "f" param as a single chunk...

         {!field f=yourfieldName}Some input that can have spaces

http://localhost:8983/solr/select/?debugQuery=true&rows=0&q=%7B%21field+f%3Dname%7DFoo+Bar
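For reference, the q value in the URL above is just the URL-encoded form of the local-params string. A minimal sketch of building it with java.net.URLEncoder; the class and method names are hypothetical, and the host, port and field name are the ones from the example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class LocalParamsUrlSketch {
    /** Builds a select URL whose q parameter uses the {!field f=...} prefix. */
    static String selectUrl(String solrBaseUrl, String field, String input) {
        String q = "{!field f=" + field + "}" + input;
        try {
            // URLEncoder escapes '{', '!', '=' and '}' and turns spaces into '+'.
            return solrBaseUrl + "/select/?debugQuery=true&rows=0&q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "name", "Foo Bar"));
    }
}
```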


Mck | 11 Sep 2008 09:51

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On Wed, 2008-09-10 at 14:47 -0700, Chris Hostetter wrote:
> The FieldQParserPlugin in particular passes the entire querystring to
> the Analyzer for the field specified by an "f" param as a single chunk...
> 
>          {!field f=yourfieldName}Some input that can have spaces
> 
> http://localhost:8983/solr/select/?debugQuery=true&rows=0&q=%7B%21field+f%3Dname%7DFoo+Bar

But at the end of the day will
{!field f:list_entry_shingle}abcd efgh ijkl
still end up as
list_entry_shingle:"abcd efgh ijkl"
?

I was unable to get a url like
http://localhost:8080/solr/select/?debugQuery=true&rows=0&q={!field%20f:list_entry_shingle}abcd%20efgh%20ijkl
to work. I got 
> org.apache.lucene.queryParser.ParseException: Expected identifier at
> pos 9 str='{!field f:list_entry_shingle}abcd efgh ijkl'

I ask because the javadoc indicates this and from what i can see in
FieldQParserPlugin you still end up with one of the same three return
types:
BooleanQuery, MultiPhraseQuery, or PhraseQuery.

So as long as we have shingles with positionIncrement=0 and unigrams (or
the first shingles) with positionIncrement=1 you'll end up with a
MultiPhraseQuery.

Yonik Seeley | 11 Sep 2008 15:35

Re: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On Thu, Sep 11, 2008 at 3:51 AM, Mck <mick <at> semb.wever.org> wrote:
> On Wed, 2008-09-10 at 14:47 -0700, Chris Hostetter wrote:
>> The FieldQParserPlugin in particular passes the entire querystring to
>> the Analyzer for the field specified by an "f" param as a single chunk...
>>
>>          {!field f=yourfieldName}Some input that can have spaces
>>
>> http://localhost:8983/solr/select/?debugQuery=true&rows=0&q=%7B%21field+f%3Dname%7DFoo+Bar
>
> But at the end of the day will
> {!field f:list_entry_shingle}abcd efgh ijkl
> still end up as
> list_entry_shingle:"abcd efgh ijkl"
> ?

Yes... the "field" QParser was meant to duplicate the logic of the
Lucene QueryParser for a single field (while avoiding the need to
escape the content, etc.)  So this new syntax was an aside... not a
way to solve your problem unless you want to write your own QParser.

-Yonik
Steven A Rowe | 11 Sep 2008 00:44

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/10/2008 at 5:47 PM, Chris Hostetter wrote:
> BUT! ... there are (as of Solr 1.3) many QParser options which can be
> selected at query time.  Yonik even added a whole new string
> prefix syntax so you can pick the QParser per individual
> querystring-esque param (e.g. q, fq, bq, facet.query, etc.)
> 
> The FieldQParserPlugin in particular passes the entire
> querystring to the Analyzer for the field specified by an
> "f" param as a single chunk...
> 
>          {!field f=yourfieldName}Some input that can have spaces
> 
> http://localhost:8983/solr/select/?debugQuery=true&rows=0&q=%7B%21field+f%3Dname%7DFoo+Bar

Is the query-time syntax documented anywhere?  I can't find anything about this on the wiki.

Steve
Yonik Seeley | 11 Sep 2008 02:56

Re: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On Wed, Sep 10, 2008 at 6:44 PM, Steven A Rowe <sarowe <at> syr.edu> wrote:
> On 09/10/2008 at 5:47 PM, Chris Hostetter wrote:
>> BUT! ... there are (as of Solr 1.3) many QParser options which can be
>> selected at query time.  Yonik even added a whole new string
>> prefix syntax so you can pick the QParser per individual
>> querystring-esque param (e.g. q, fq, bq, facet.query, etc.)
>>
>> The FieldQParserPlugin in particular passes the entire
>> querystring to the Analyzer for the field specified by an
>> "f" param as a single chunk...
>>
>>          {!field f=yourfieldName}Some input that can have spaces
>>
>> http://localhost:8983/solr/select/?debugQuery=true&rows=0&q=%7B%21field+f%3Dname%7DFoo+Bar
>
> Is the query-time syntax documented anywhere?  I can't find anything about this on the wiki.

It's on my todo list for this release.
Javadoc is the best we have now... just look at the subclasses:
http://lucene.apache.org/solr/api/org/apache/solr/search/QParserPlugin.html

-Yonik
Mck | 10 Sep 2008 20:16

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

> Anyhow, I suggest you change the name of the option you're adding in
> LUCENE-1380 to "disablePositions", and make it boolean -- this better
> describes what you're trying to do.  When true, all position
> increments would be set to zero.  It should default to false.

disablePositions it is. thanks for helping steve.

~mck

--

-- 
"Do not seek to follow in the footsteps of those of old - seek what they
sought." Matsuo Basho 
| semb.wever.org | sesat.no | sesam.no |
Mck | 11 Sep 2008 14:08

Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On Wed, 2008-09-10 at 20:16 +0200, Mck wrote:
> > Anyhow, I suggest you change the name of the option you're adding in
> > LUCENE-1380 to "disablePositions", and make it boolean -- this better
> > describes what you're trying to do.  When true, all position
> > increments would be set to zero.  It should default to false.
> 
> disablePositions it is.

You mind if i reverse this to "enablePositions" (that by default is
true)?
I'm not very fond of double-negatives.
~mck

--

-- 
"If you are distressed by anything external, the pain is not due to the
thing itself, but to your estimate of it. This you have the power to
revoke." Marcus Aurelius 
| semb.wever.org | sesat.no | sesam.no |
Steven A Rowe | 11 Sep 2008 16:50

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching

On 09/11/2008 at 8:08 AM, Mck wrote:
> On Wed, 2008-09-10 at 20:16 +0200, Mck wrote:
> > > Anyhow, I suggest you change the name of the option you're adding in
> > > LUCENE-1380 to "disablePositions", and make it boolean -- this better
> > > describes what you're trying to do.  When true, all position
> > > increments would be set to zero.  It should default to false.
> > 
> > disablePositions it is.
> 
> You mind if i reverse this to "enablePositions" (that by default is
> true)? I'm not very fond of double-negatives. ~mck

Excellent idea.

Steve

Otis Gospodnetic | 8 Sep 2008 18:21

Re: Replacing FAST functionality at sesam.no

Just glancing over this.  I believe one of the recent shingle contributions over in Lucene contrib/ indeed
has the option to add those begin/end marker characters, so if this will solve your exact matching needs,
that's the thing to look at.  You'll have to write (and contribute?) a bit of glue to use it in Solr.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Mck <mick <at> semb.wever.org>
> To: solr-user <at> lucene.apache.org
> Sent: Monday, September 8, 2008 4:43:50 AM
> Subject: Re: Replacing FAST functionality at sesam.no
> 
> > I'm not very familiar with shingles but it seems to be that you should
> > have ShingleFilter at index time and make the query as a phrase query?
> 
> Then the entry "abcd efgh ijkl" would be indexed as 
> (abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl)
> 
> and a subsequent query "abcd" would return this entry.
> If this is so then this is not exact matching and not what we are
> looking for.
> 
> The filter behaviour we are looking for is like:
>    (i've included ^$ to denote the exact matching)
> 
> Original Query   --> Filtered Query
> abcd            -->  ^abcd$
> "abcd efgh"      --> (^abcd$ ^"abcd efgh"$ ^efgh$)

Otis Gospodnetic | 27 Aug 2008 20:28

Re: Replacing FAST functionality at sesam.no

The screenshot didn't make it.... (some attachments get stripped)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Glenn-Erik Sandbakken <glenn <at> sesam.no>
> To: solr-user <at> lucene.apache.org
> Sent: Wednesday, August 27, 2008 1:44:53 PM
> Subject: Replacing FAST functionality at sesam.no
> 
> At sesam.no we want to replace a FAST (fast.no) Query Matching Server
> with a Solr index.
> 
> The index we are trying to replace is not a regular index, but specially
> configured to perform phrases (and sub-phrases) matches against several
> large lists (like an index with only a 'title' field).
> 
> I'm not sure of a correct, or logical, name for the behavior we are
> after, but it is like a combination between Shingles and exact matching.
> 
> Some examples should explain it well.
> 
> Lets say we have the following list:
> > one two three
> > one two
> > two three
> > one
> > two

Glenn-Erik | 28 Aug 2008 11:10

Re: Replacing FAST functionality at sesam.no

> The screenshot didn't make it.... (some attachments get stripped)
I have put the screenshots here:
http://www.glennerik.com/solr/solrshingle1.gif
and here:
http://www.glennerik.com/solr/solrshingle2.gif
I also put the schema.xml here:
http://www.glennerik.com/solr/schema.xml

> This sounds very much like shingles of variable length (1 to
> length(terms in query)).
> Make sure you turn them into phrase queries and combine them with ORs
> and things should work then.
(from your answer on the dev mailing list)
We have always had the solrQueryParser defaultOperator="OR"
(but I have tested with AND just to see the result)
I am not sure what you mean with "turn them into phrase queries", we
don't know about query analysis phrasing.

- Glenn-Erik

Glenn-Erik Sandbakken | 27 Aug 2008 19:44

Replacing FAST functionality at sesam.no

At sesam.no we want to replace a FAST (fast.no) Query Matching Server
with a Solr index.

The index we are trying to replace is not a regular index, but specially
configured to perform phrase (and sub-phrase) matches against several
large lists (like an index with only a 'title' field).

I'm not sure of a correct, or logical, name for the behavior we are
after, but it is like a combination between Shingles and exact matching.

Some examples should explain it well.

Let's say we have the following list:
> one two three
> one two
> two three
> one
> two
> three
> three two
> two one
> one three
> three one

For the query "one two three", we need hits against, and only against:
> one two three
> one two
> two three
> one
> two

Svein Parnas | 27 Aug 2008 21:47

Re: Replacing FAST functionality at sesam.no


On 27 Aug 2008, at 19:44, Glenn-Erik Sandbakken wrote:

> At sesam.no we want to replace a FAST (fast.no) Query Matching Server
> with a Solr index.
>
> The index we are trying to replace is not a regular index, but specially
> configured to perform phrases (and sub-phrases) matches against several
> large lists (like an index with only a 'title' field).
>
> I'm not sure of a correct, or logical, name for the behavior we are
> after, but it is like a combination between Shingles and exact matching.
>
> Some examples should explain it well.

In order to do this, you can't use the ShingleFilter during indexing,
since a document like "one two three" and a query like "one two four"
will match because they have the shingle "one two" in common.

You will get what you want, I think, if you don't tokenize during
indexing (some normalization will be required if your lists aren't
normalized to begin with) and apply the ShingleFilter only to the
queries.
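
A minimal sketch of this idea, in plain Python rather than Solr or
Lucene code. The names shingles() and search() are made up for
illustration, and the index is modelled as a simple set of whole,
untokenized entries taken from the example list in the original post:

```python
def shingles(query, max_size=99):
    """Return every contiguous word n-gram of the query, unigrams included."""
    words = query.split()
    out = []
    for start in range(len(words)):
        for size in range(1, min(max_size, len(words) - start) + 1):
            out.append(" ".join(words[start:start + size]))
    return out


# Untokenized "index": each list entry is kept as one whole term.
index = {"one two three", "one two", "two three", "one", "two", "three",
         "three two", "two one", "one three", "three one"}


def search(query):
    """A hit is any entry that equals one of the query's shingles exactly."""
    return [s for s in shingles(query) if s in index]


print(search("one two three"))
print(search("one two four"))
```

In this model the query "one two four" still hits "one", "one two" and
"two", but never "one two three", because the entries themselves are
not broken into shingles.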

Svein


Glenn-Erik | 28 Aug 2008 11:19

Re: Replacing FAST functionality at sesam.no

>In order to do this, you can't use the ShingleFilter during indexing  
>since a document like "one two three" and a query like "one two four"  
>will match since they have the shingle "one two" in common.
Hello Svein, nice to meet you in this place =)
I have been trying with and without <analyzer type="index">
and also <analyzer type="query">
I have also been trying with and without outputUnigrams="true" for
analyzer type=index and analyzer type=query
And I have been trying with and without outputUnigramIfNoNgram="true"
for analyzer type=index (only)
I am pretty sure I have been trying all possible combinations of
switching all of this on and off.
I have never seen exactly the expected result.

>You will get what you want, I think, if you don't tokenize during  
>indexing (some normalization will be required if your lists aren't  
>normalized to begin with) and apply the ShingleFilter only to the  
>queries.
I also think that this sounds like the most logical configuration,
but such a configuration doesn't give us the expected results.
(Un?=)fortunately I am leaving on a two week vacation in one hour.
I'd love to follow up on this the coming days,
but Mick Semb Wever will be taking over this job for the next two weeks.

- Glenn-Erik Sandbakken

Mck | 6 Sep 2008 15:25

Re: Replacing FAST functionality at sesam.no

> but Mick Semb Wever will be taking over this job for the next two weeks.

back from holidays and taking over where Glenn-Erik left off. i'm very new
to Solr so please bear with me.

i'll run through our setup from scratch.

Our test list has 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
"ijkl efgh", "efgh abcd", and "ijkl efgh abcd".

I'm using a trunk build of Solr, and using the example/solr for the solr
home.

Editing schema.xml to put these entries in as type="string" and using
defaultOperator="OR" gives the expected exact matching functionality,
provided queries are quoted, e.g. /solr/select/?q="abcd efgh ijkl"

So then i change type="string" to type="shingleString" along with

> <fieldType name="shingleString" class="solr.StrField" positionIncrementGap="100" omitNorms="true">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>                 outputUnigramIfNoNgram="true" maxShingleSize="99" />
>       </analyzer>

Mck | 8 Sep 2008 09:30

Re: Replacing FAST functionality at sesam.no

> So then i change type="string" to type="shingleString" along with
> > [snip]
> >       <analyzer type="query">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >         <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> >                 outputUnigramIfNoNgram="true" maxShingleSize="99" />
> >       </analyzer>

Debugging ShingleFilter I see that without quotes the shingles
StringBuffer array consists of just the current token.

When the query does have quotes the shingles array fills up with the
expected shingles.
And the Query (in fact a MultiPhraseQuery)
  returned from SolrQueryParser.getFieldQuery()
  looks like

list_entry_shingle:"(abcd abcd efgh abcd efgh ijkl) (efgh efgh ijkl) ijkl"

I'm struggling to make sense of this.
How can the shingles be matched if they aren't quoted?
Why put the parentheses () when the query has default operator OR?

I would be expecting a Query instead like:
abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl

(With the ShingleFilter disabled, this query does indeed work perfectly.)
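
For illustration, the expected query string could be built on the
client side before it is sent to Solr. This is only a sketch in plain
Python under that assumption; shingle_query() is a hypothetical helper,
not part of Solr or Lucene:

```python
def shingle_query(query, max_size=99):
    """Build an OR-style query string: unigrams bare, longer shingles quoted."""
    words = query.split()
    clauses = []
    for start in range(len(words)):
        for size in range(1, min(max_size, len(words) - start) + 1):
            shingle = " ".join(words[start:start + size])
            # Multi-word shingles are quoted so each one stays a single phrase.
            clauses.append(f'"{shingle}"' if size > 1 else shingle)
    return " ".join(clauses)


print(shingle_query("abcd efgh ijkl"))
```

With defaultOperator="OR" the space-separated clauses are alternatives,
so any one exact shingle is enough for a hit.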

Am i barking up the wrong tree?
Is there a way to get the shingles phrased?

Shalin Shekhar Mangar | 8 Sep 2008 09:55

Re: Replacing FAST functionality at sesam.no

I'm not very familiar with shingles but it seems to me that you should apply
ShingleFilter at index time and issue the query as a phrase query?

On Mon, Sep 8, 2008 at 1:00 PM, Mck <mick <at> semb.wever.org> wrote:

> > So then i change type="string" to type="shingleString" along with
> > > [snip]
> > >       <analyzer type="query">
> > >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >         <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> outputUnigramIfNoNgram="true" maxShingleSize="99" />
> > >       </analyzer>
>
> Debugging ShingleFilter I see that without quotes the shingles
> StringBuffer array consists of just the current token.
>
> When the query does have quotes the shingles array fills up with the
> expected shingles.
> And the Query (in fact a MultiPhraseQuery)
>  returned from SolrQueryParser.getFieldQuery()
>  looks like
>
> list_entry_shingle:"(abcd abcd efgh abcd efgh ijkl) (efgh efgh ijkl) ijkl"
>
> I'm struggling to make sense of this.
> How can the shingles be matched if they aren't quoted?
> Why put the parentheses () when the query has default operator OR?
>
> I would be expecting a Query instead like:
> abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl

Mck | 8 Sep 2008 10:43

Re: Replacing FAST functionality at sesam.no

> I'm not very familiar with shingles but it seems to me that you should
> apply ShingleFilter at index time and issue the query as a phrase query?

Then the entry "abcd efgh ijkl" would be indexed as 
(abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl)

and a subsequent query "abcd" would return this entry.
If this is so then this is not exact matching and not what we are
looking for.
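
A small sketch of why index-time shingling is too permissive here, in
plain Python and not Solr code. The inverted index is modelled as a
dict from shingle to entries, and shingles() is an illustrative helper:

```python
def shingles(text):
    """Return every contiguous word n-gram of the text, unigrams included."""
    words = text.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


entries = ["abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",
           "ijkl", "ijkl efgh", "efgh abcd", "ijkl efgh abcd"]

# Index-time shingling: every shingle of an entry points back to that entry.
inverted = {}
for entry in entries:
    for shingle in shingles(entry):
        inverted.setdefault(shingle, set()).add(entry)

hits = sorted(inverted["abcd"])
print(hits)
```

In this model the query "abcd" returns five of the nine test entries,
including "abcd efgh ijkl", where exact matching should return only the
entry "abcd".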

The filter behaviour we are looking for is like:
   (i've included ^$ to denote the exact matching)

Original Query   --> Filtered Query
 abcd            -->  ^abcd$
"abcd efgh"      --> (^abcd$ ^"abcd efgh"$ ^efgh$)
"abcd efgh ijkl" --> (^abcd$ ^"abcd efgh"$ ^"abcd efgh ijkl"$ ^efgh$ ^"efgh ijkl"$ ^ijkl$)

~mck

-- 
"All stable processes we shall predict. All unstable processes we shall
control." John von Neumann 
| semb.wever.org | sesat.no | sesam.no |

Mck | 9 Sep 2008 22:38

Re: Replacing FAST functionality at sesam.no - ShingleFilter+exact matching


> Looks to me like MultiPhraseQuery is getting in the way.  Shingles
> that begin at the same word are given the same position by
> ShingleFilter, and Solr's FieldQParserPlugin creates a
> MultiPhraseQuery when it encounters tokens in a query with the same
> position.  I think what you want is to convert queries into shingle
> disjunctions (*any* matching shingle results in a hit),  right?

Yes you're right Steve. thank you.

One way, i see now, to get the behaviour i want is to set the unigrams'
positionIncrement to zero instead of one.

For example in ShingleFilter.fillOutputBuffer(..) replacing the two
occurrences of 
> .setPositionIncrement(1);
with
> .setPositionIncrement(0);

Then i end up with a MultiPhraseQuery with
        termArrays[0] = { list_entry_shingles:abcd
                          list_entry_shingles:abcd efgh
                          list_entry_shingles:abcd efgh ijkl 
                          list_entry_shingles:efgh
                          list_entry_shingles:efgh ijkl 
                          list_entry_shingles:ijkl }

and it works perfectly :-)
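
To show the effect of the change, here is a rough model in plain
Python of how tokens are grouped into termArrays by position. It does
not call Lucene; shingle_tokens() and term_arrays() are made-up names,
and the grouping rule (a token with increment 0 joins the previous
position) is a simplification of what the query parser does:

```python
def shingle_tokens(query, unigram_increment):
    """Return (term, position_increment) pairs for all shingles of the query."""
    words = query.split()
    tokens = []
    for start in range(len(words)):
        for size in range(1, len(words) - start + 1):
            term = " ".join(words[start:start + size])
            # Longer shingles always share the position of their first word.
            tokens.append((term, unigram_increment if size == 1 else 0))
    return tokens


def term_arrays(tokens):
    """Group tokens by position: increment 0 stays in the current array."""
    arrays = []
    for term, increment in tokens:
        if increment > 0 or not arrays:
            arrays.append([])
        arrays[-1].append(term)
    return arrays


print(term_arrays(shingle_tokens("abcd efgh ijkl", 1)))
print(term_arrays(shingle_tokens("abcd efgh ijkl", 0)))
```

With an increment of 1 the model yields three termArrays, one per
starting word; with an increment of 0 all six shingles land in a single
termArray, which then behaves like a disjunction over the shingles.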

I see no way of configuring this behaviour though. 

