Christian Stocker | 26 Mar 22:28 2012
Picon

Strange Search Performance problem with OR

Hi

We have the following search query

SELECT * FROM [own:unstructured] AS data WHERE data.guid = 'J7B1X'
		AND (ISDESCENDANTNODE(data, '/article')
		OR ISDESCENDANTNODE(data, '/import/article')
		)
		ORDER BY firstImportDate DESC

This query can take quite some time (up to 3 seconds, but it gets more
and more hte more data we have). In /article there's potentially a lot
of nodes, in /import/article usually almost nil.

If we now separate the query into 2:

SELECT * FROM [own:unstructured] AS data WHERE data.guid = 'J7B1X'
		AND ISDESCENDANTNODE(data, '/article')
		ORDER BY firstImportDate DESC

and

SELECT * FROM [own:unstructured] AS data WHERE data.guid = 'J7B1X'
		AND ISDESCENDANTNODE(data, '/import/article')
		ORDER BY firstImportDate DESC

Both queries take approx. 10ms (and return 0 or 1 resultset, more is not
possible). So quite fast.

Can anyone explain to me, why that is and how we could rewrite the query
(Continue reading)

David Buchmann | 27 Mar 09:10 2012
Picon

Re: Strange Search Performance problem with OR


i think the 2 queries are not equivalent. the first one is equivalent to

...
WHERE data.guid = 'J7B1X'
  AND (ISDESCENDANTNODE(data, '/article')

plus

WHERE
 ISDESCENDANTNODE(data, '/import/article')

(if you want the data.guid = ... to apply to both, you need paranthesis)

but if /import/article is almost empty, i still don't see why the
combined query should take so long unless jackrabbit/lucene are doing
something stupid.

cheers,david

Am 26.03.2012 22:28, schrieb Christian Stocker:
> Hi
> 
> We have the following search query
> 
> 
> SELECT * FROM [own:unstructured] AS data WHERE data.guid = 'J7B1X'
> 		AND (ISDESCENDANTNODE(data, '/article')
> 		OR ISDESCENDANTNODE(data, '/import/article')
> 		)
(Continue reading)

David Buchmann | 27 Mar 09:49 2012
Picon

Re: Strange Search Performance problem with OR


sorry, my bad. did not read correctly.
you do have the paranthesis so you did what you wanted to do.

looks like lucene/jackrabbit combine the 2 datasets first and filter
later...

what if you try

SELECT * FROM [own:unstructured] AS data
WHERE
    data.guid = 'J7B1X' AND ISDESCENDANTNODE(data, '/article')
  OR
    data.guid = 'J7B1X' AND ISDESCENDANTNODE(data, '/import/article')
ORDER BY firstImportDate DESC

if this is fast, then the jackrabbit query engine is not very clever...

cheers,david

Am 27.03.2012 09:10, schrieb David Buchmann:
> i think the 2 queries are not equivalent. the first one is equivalent to
> 
> ...
> WHERE data.guid = 'J7B1X'
>   AND (ISDESCENDANTNODE(data, '/article')
> 
> plus
> 
> WHERE
(Continue reading)

Christian Stocker | 27 Mar 09:55 2012
Picon

Re: Strange Search Performance problem with OR

Hi

On 27.03.12 09:49, David Buchmann wrote:
> sorry, my bad. did not read correctly.
> you do have the paranthesis so you did what you wanted to do.
> 
> looks like lucene/jackrabbit combine the 2 datasets first and filter
> later...
> 
> what if you try
> 
> 
> SELECT * FROM [own:unstructured] AS data
> WHERE
>     data.guid = 'J7B1X' AND ISDESCENDANTNODE(data, '/article')
>   OR
>     data.guid = 'J7B1X' AND ISDESCENDANTNODE(data, '/import/article')
> ORDER BY firstImportDate DESC

I tried that and I tried it again now. Same response time as the
original query.

Any hints from someone who knows the internal workings of jackrabbit/lucene?

chregu

> 
> if this is fast, then the jackrabbit query engine is not very clever...
> 
> cheers,david
(Continue reading)

Alex Parvulescu | 27 Mar 13:30 2012
Picon

Re: Strange Search Performance problem with OR

Hi Christian,

can you enable debug logs
on org.apache.jackrabbit.core.query.lucene.join.QueryEngine?
I'm curious to see what the constraits look like in the big query vs the 2
small ones.

This also goes for the join you've mentioned later in the thread, but I
just wanted to start with the first query ;)

alex

On Tue, Mar 27, 2012 at 9:55 AM, Christian Stocker <
christian.stocker <at> liip.ch> wrote:

> Hi
>
> On 27.03.12 09:49, David Buchmann wrote:
> > sorry, my bad. did not read correctly.
> > you do have the paranthesis so you did what you wanted to do.
> >
> > looks like lucene/jackrabbit combine the 2 datasets first and filter
> > later...
> >
> > what if you try
> >
> >
> > SELECT * FROM [own:unstructured] AS data
> > WHERE
> >     data.guid = 'J7B1X' AND ISDESCENDANTNODE(data, '/article')
(Continue reading)

Christian Stocker | 27 Mar 15:24 2012
Picon

Re: Strange Search Performance problem with OR

Hi Alex

unfortunately the debug is not that descriptive (from my POV :))

For the query with the OR

2012-03-27 15:14:44.086 DEBUG [606360852 <at> qtp-1418257117-3]
QueryEngine.java:465   SQL2 SELECT took 6642 ms. selector:
[own:unstructured] AS data, columns: [data.jcr:primaryType], constraint:
(data.guid = 'J7B1X') AND ((ISDESCENDANTNODE(data, [/article])) OR
(ISDESCENDANTNODE(data, [/import/article]))), offset 0, limit 5
2012-03-27 15:14:44.087 DEBUG [606360852 <at> qtp-1418257117-3]
QueryEngine.java:480   SQL2 SORT took 0 ms.
2012-03-27 15:14:44.087 DEBUG [606360852 <at> qtp-1418257117-3]
QueryEngine.java:117 SQL2 QUERY execute took 6643 ms. native sort is false.

For the query without the or

2012-03-27 15:15:44.774 DEBUG [606360852 <at> qtp-1418257117-3]
QueryEngine.java:465   SQL2 SELECT took 5 ms. selector:
[own:unstructured] AS data, columns: [data.jcr:primaryType], constraint:
(data.guid = 'J7B1X') AND (ISDESCENDANTNODE(data, [/article])), offset
0, limit 5
2012-03-27 15:15:44.775 DEBUG [606360852 <at> qtp-1418257117-3]
QueryEngine.java:480   SQL2 SORT took 0 ms.
2012-03-27 15:15:44.775 DEBUG [606360852 <at> qtp-1418257117-3]
QueryEngine.java:117 SQL2 QUERY execute took 6 ms. native sort is false.

For the later with no results:
(or as file https://gist.github.com/761084c44997d6c7ad11)
(Continue reading)

Alex Parvulescu | 27 Mar 17:49 2012
Picon

Re: Strange Search Performance problem with OR

hi,

> unfortunately the debug is not that descriptive (from my POV :))

yes, well the idea is to make it more user friendly :)
I added it with joins in mind so it doesn't say much (ahem, anything) about
how normal queries break down to lucene queries.

About the join query: it appears that in the case of the 0 hits, the query
engine is terribly inefficient, it fetches the entire subset of nodes:

> SQL2 JOIN LEFT SIDE took 8 ms. fetched 0 rows.
> SQL2 JOIN RIGHT SIDE took 845 ms. fetched 13055 rows.
...so it fetches 130k nodes and doesn't keep any of them.

whereas when it has something to work with it is able to generate some
proper conditions
> SQL2 JOIN LEFT SIDE took 18 ms. fetched 145 rows.
> SQL2 JOIN RIGHT SIDE took 67 ms. fetched 0 rows.

There is room for improvement here :)
I'm thinking if the left side of the join is empty, we can skip looking for
the right side entirely.
I'll create an issue to track this improvement.

good catch!

As for the other example, we can look into it tomorrow :)

best,
(Continue reading)

Alex Parvulescu | 27 Mar 18:15 2012
Picon

Re: Strange Search Performance problem with OR

Hi,

I've created JCR-3280 [0]. It would be interesting if you could take it for
a spin ;)

best,
alex

[0] https://issues.apache.org/jira/browse/JCR-3280

On Tue, Mar 27, 2012 at 5:49 PM, Alex Parvulescu
<alex.parvulescu <at> gmail.com>wrote:

> hi,
>
> > unfortunately the debug is not that descriptive (from my POV :))
>
> yes, well the idea is to make it more user friendly :)
> I added it with joins in mind so it doesn't say much (ahem, anything)
> about how normal queries break down to lucene queries.
>
> About the join query: it appears that in the case of the 0 hits, the query
> engine is terribly inefficient, it fetches the entire subset of nodes:
>
> > SQL2 JOIN LEFT SIDE took 8 ms. fetched 0 rows.
> > SQL2 JOIN RIGHT SIDE took 845 ms. fetched 13055 rows.
> ...so it fetches 130k nodes and doesn't keep any of them.
>
> whereas when it has something to work with it is able to generate some
> proper conditions
(Continue reading)

Christian Stocker | 27 Mar 12:01 2012
Picon

Re: Strange Search Performance problem with OR

Hi again

Here's another strange issue with query performance.

I have this query:

***
SELECT data.* FROM [own:unstructured] AS data LEFT OUTER JOIN
[nzz:unstructured] AS referring ON referring.reference = data.[jcr:uuid]
WHERE
        data.[phpcr:class] = 'Own\ApiBundle\Document\Article' AND
((data.publication = 'own')
        AND (CONTAINS(data.categories, 'KURZ'))
        AND (data.departmentSlugs = 'international'))
        AND ( ((data.permissionPath  NOT LIKE 'owns/%')))
        AND (data.title IS NOT NULL AND data.title <> '') AND
(data.teaser IS NOT NULL AND data.teaser <> '') AND (
        (referring.reference IS NOT NULL AND ISDESCENDANTNODE(referring,
'/article/2012/03/26'))
        OR ISDESCENDANTNODE(data, '/article/2012/03/26')
        ) ORDER BY data.modificationDate
***

This is fast enough when it finds something. When I change for example
data.departmentSlugs = 'international'
to something else, where it certainly isn't finding anything, it's damn
slow. 50ms vs. 2'500ms or so.

I'm sure there's an explenation for this :)

(Continue reading)


Gmane