Roan Kattouw | 1 Jul 2008 17:22
Picon
Favicon

Re: Schema change : category redirects

Nicolas Dumazet schreef:
> == Original idea : add a cl_final field ==
> <snip>
>
> * Use case #6 :
> When adding a category to a page, you have to fetch its corresponding
> cat_id for cl_to, (fairly easy), but you also have to fetch the right
> cl_final.
> You have to know first, if the category is a redirect. And if I'm
> right, with that schema, the only ways to tell this actually, are to
> retrieve the corresponding page_is_redirect in the page table, or to
> check for an entry in the redirect table.
> I believe that this forces us to compute a page-redirect join on
> page_id + a redirect-category join on page_title for each category
> title (see [3] ). If there's no results for that query, (can be caused
> if {1} the category page does not exist, {2} the category page exist
> but is not a redirect), the category is not a redirect, and else it
> returns us the cat_id for cl_final.
>   
You wouldn't need to go through the page table here if you added a 
cat_page field to the category table. It's not a big deal, though, 
because joining on page_namespace=14 AND page_title=cat_title is pretty 
cheap.
> * Use case #7 :
> Easy. SELECT ... FROM category_links WHERE cl_final = ##
>
> == But what about adding a cat_final field instead ? ==
>
> When A redirects to B, all A' category_links entries will share the
> same cl_final field. Being quite unexperienced, and very naive, I may
(Continue reading)

Simetrical | 1 Jul 2008 17:37
Picon

Re: Schema change : category redirects

On Tue, Jul 1, 2008 at 11:22 AM, Roan Kattouw <roan.kattouw@...> wrote:
> You wouldn't need to go through the page table here if you added a
> cat_page field to the category table.

Categories are not guaranteed to have associated pages, so we must
keep cat_title.  In that case cat_page is redundant and denormalized,
and should only be added if there's some specific performance benefit
to it, which there's not for any application I've heard.

> Can be done on cat_title=cl_to too, doesn't really matter. Joining on
> cat_id is cleaner of course, but I don't think it'll be any faster.

Joining on integers is considerably faster than joining on VARCHARs.
In InnoDB, joining on primary keys is considerably faster than joining
on anything else.  These shouldn't be neglected, in general.  However,
the speed of the join is not the limiting factor here, the problem is
you'll have to filesort the entire category.
Roan Kattouw | 1 Jul 2008 18:02
Picon
Favicon

Re: Schema change : category redirects

Simetrical schreef:
> On Tue, Jul 1, 2008 at 11:22 AM, Roan Kattouw <roan.kattouw@...> wrote:
>   
>> You wouldn't need to go through the page table here if you added a
>> cat_page field to the category table.
>>     
>
> Categories are not guaranteed to have associated pages, so we must
> keep cat_title.  In that case cat_page is redundant and denormalized,
> and should only be added if there's some specific performance benefit
> to it, which there's not for any application I've heard.
>   
I wasn't suggesting ditching cat_title. The fact that cat_page is 
sometimes 0 doesn't really matter, since categories without 
corresponding pages can't be redirects, so they don't have a redirect 
target (which is what the query was about). I believe there actually is 
a performance benefit to introducing cat_page, explained below.
>   
>> Can be done on cat_title=cl_to too, doesn't really matter. Joining on
>> cat_id is cleaner of course, but I don't think it'll be any faster.
>>     
>
> Joining on integers is considerably faster than joining on VARCHARs.
> In InnoDB, joining on primary keys is considerably faster than joining
> on anything else.  These shouldn't be neglected, in general.
I had the idea joining on integers would be faster, I guess I 
underestimated how much. I didn't know about the primary key thing, but 
it makes sense (it's called *primary* key for a reason).
>   However,
> the speed of the join is not the limiting factor here, the problem is
(Continue reading)

Simetrical | 1 Jul 2008 18:11
Picon

Re: Schema change : category redirects

On Tue, Jul 1, 2008 at 12:02 PM, Roan Kattouw <roan.kattouw@...> wrote:
> I had the idea joining on integers would be faster, I guess I
> underestimated how much. I didn't know about the primary key thing, but
> it makes sense (it's called *primary* key for a reason).

The primary key benefits are specific to InnoDB, since it clusters the
table data in the primary key (basically the table data is in the
leaves of the primary key B-tree, if I understand right).  A primary
key lookup is therefore one B-tree lookup instead of two.  In MyISAM,
and in many other DBMSes, the primary key isn't special.

> With the schema I was backing, yes. However, schema #1 doesn't eliminate
> the need to check for a category's redirect target when adding a page to
> it, and since joining on cat_page=page_id is faster than joining on
> cat_title=page_title (because of the int vs. varchar and primary vs.
> non-primary issues), that would constitute a "specific performance
> benefit" for adding cat_page, wouldn't it?

Not a big enough one.  Adding an extra column means every row is that
much larger, reducing key buffer efficiency and thereby hurting
performance slightly for all queries on the table.  And if you're
going to join on it you also may need an extra index (depending on
join direction), which takes time to maintain on every insert and
delete and also competes for the cache.  Plus you get the headache of
denormalization.

In this case it's almost certainly not worth it.  In the case of
cl_final, where you're avoiding cripplingly large filesorts, it's
definitely worth it, because otherwise the feature is completely
untenable.
(Continue reading)

Nicolas Dumazet | 7 Jul 2008 12:57
Picon

Re: Schema change : category redirects

Thanks, both of you, for your in-depth answers. It helped a lot :)

I have started a branch (category-redirects) to work on this. So far,
category redirects are working, and I've created a new Job subclass to
handle categorylinks big changes after a category move. Category moves
are apparently working, as far as the target category (not category
page) does not exist.

I am, however, considering to disallow category moves over an existing
category (again, the *category*, not its attached page) :
* When the user does this, he probably expects the categories to be
merged together.
* However as said before, the move has to be reversible: the proper
way to do this is to redirect the former category to the latter;
reversing this is only deleting a redirect.
* But actually, after that redirect, there will be one page for two
category objects. If we merged A and B to B, category A will redirect
to category B, while the only valid Page will be B. That might not be
a big problem, but it differs from the structure I currently use : 1
Category object <-> 1 Page object
* It gets worse if you consider a third Category C, that was
redirecting to A before the move : The user moves/merges A and B, so
he expects C to redirect to the merged categories. However, after the
move, we'll only have a redirect chain : C->A->B. And like double
redirects for pages, it wont work. I don't see a way to resolve this
use case in a reversible manner with my new schema.

Overriding move would be forbidden, and the user would have to create
a redirect. Do you see a case where the user would suffer from
creating a redirect instead of doing a hard, plain move ?
(Continue reading)

Simetrical | 7 Jul 2008 15:41
Picon

Re: Schema change : category redirects

On Mon, Jul 7, 2008 at 6:57 AM, Nicolas Dumazet <nicdumz@...> wrote:
> I am, however, considering to disallow category moves over an existing
> category (again, the *category*, not its attached page) :
> * When the user does this, he probably expects the categories to be
> merged together.
> * However as said before, the move has to be reversible: the proper
> way to do this is to redirect the former category to the latter;
> reversing this is only deleting a redirect.
> * But actually, after that redirect, there will be one page for two
> category objects. If we merged A and B to B, category A will redirect
> to category B, while the only valid Page will be B. That might not be
> a big problem, but it differs from the structure I currently use : 1
> Category object <-> 1 Page object

So you're saying that basically, just as you can't move a page over an
existing page, but have to manually create the redirect, so too here
you can't move a category over an existing category, but have to
manually create a redirect.  If I got that right, it seems reasonable
to me.

> There is a little problem with that constraint, though : if a category
> once contained pages, it will still exist. It means that the user can
> be asked not to move a category to an... empty category.
> The easy answer here is "just check for category membership, and allow
> the move when no categorylinks entries point to the target category";
> however, how will we deal with CategoryLinksUpdate Jobs that might be
> pending ? In other words, the target category might appear empty at
> the query time, when a move / redirect change is undergo.
> Should I try to address that ? If on moves, I make the job_title the
> target category title, searching the job table for
(Continue reading)


Gmane