Russell Blau | 8 Aug 2012 15:46

s1 replag update, suggestion, and question

(TL;DR? Skip down three paragraphs to the possible workaround....)  Last
month, I reported on the progress of SHA-1 updates from the WMF servers,
and noted that s1 replag was likely to continue to be a problem for a
number of weeks.  As I said then, the WMF was using (at least) three
processes to populate the SHA-1 field on three separate blocks of
revision records.  All these changes then were being replicated to the
Toolserver's copies of the databases, and this flood of updates was
causing the replag.

The three blocks were being populated at different rates (for reasons
that are beyond my knowledge). On July 23 at about 15:00 UTC, rosemary
(sql-s1-rr) completed updating the first of the three blocks. The other
blocks continued to be populated (and at some point the WMF started
another process to help finish off the slowest block), but the rate of
updates was somewhat less, and rosemary actually caught up on its
backlog and reached zero replag within about a day after this milestone.

The situation on thyme (sql-s1-user) is less favorable, as we all know.
The replag on that server got much higher to start with, and thyme
didn't even reach the end of the first block until Sunday August 5 at
about 12:00 UTC. Unlike the situation with rosemary, the reduced load
after this event did not make any noticeable difference to the replag,
which has continued to increase for the past three days at much the same
rate as before.  The next milestone will be completion of the second
major block, which looks like it will occur either late on Friday August
10 or early on Saturday August 11 UTC, barring any other major problems
(like the WMF server outage on Monday which caused replication at the TS
end to stop for several hours).  At that point, the load from SHA-1
updates should be roughly 30% of what it had been during July. One
would think that would allow the replag to drop, but since the events of
[...]

Daniel Schwen | 8 Aug 2012 17:35

Re: s1 replag update, suggestion, and question

I'm a little confused as to which DB server we are talking about. I
need access to

enwiki-p.db.toolserver.org
hap-s1-user.esi.toolserver.org.

is that sql-s1-user or sql-s1-rr or what?

Daniel

On Wed, Aug 8, 2012 at 7:46 AM, Russell Blau <russblau <at> imapmail.org> wrote:
> (TL;DR? Skip down three paragraphs to the possible workaround....)  Last
> month, I reported on the progress of SHA-1 updates from the WMF servers,
> and noted that s1 replag was likely to continue to be a problem for a
> number of weeks.  As I said then, the WMF was using (at least) three
> processes to populate the SHA-1 field on three separate blocks of
> revision records.  All these changes then were being replicated to the
> Toolserver's copies of the databases, and this flood of updates was
> causing the replag.
>
> The three blocks were being populated at different rates (for reasons
> that are beyond my knowledge). On July 23 at about 15:00 UTC, rosemary
> (sql-s1-rr) completed updating the first of the three blocks. The other
> blocks continued to be populated (and at some point the WMF started
> another process to help finish off the slowest block), but the rate of
> updates was somewhat less, and rosemary actually caught up on its
> backlog and reached zero replag within about a day after this milestone.
>
> The situation on thyme (sql-s1-user) is less favorable, as we all know.
> The replag on that server got much higher to start with, and thyme
> [...]

Russell Blau | 8 Aug 2012 17:56

Re: s1 replag update, suggestion, and question

On Wed, Aug 8, 2012, at 11:35 AM, Daniel Schwen wrote:
> I'm a little confused as to which DB server we are talking about. I
> need access to
> 
> enwiki-p.db.toolserver.org
> hap-s1-user.esi.toolserver.org.
> 
> is that sql-s1-user or sql-s1-rr or what?
> 
> Daniel

Daniel - there are two copies of enwiki-p; see [1] for details.  If you
need access to any database whose name starts with "u_" or "p_", you
need the sql-s1-user copy.  If you *don't* need access to those
databases, you ought to be using the sql-s1-rr copy, and you are degrading
the performance of your application if you don't.

The address "enwiki-p.db.toolserver.org" points to the sql-s1-user copy,
and is deprecated; you ought to be using either
"enwiki-p.rrdb.toolserver.org" or "enwiki-p.userdb.toolserver.org"
instead.

[1] https://wiki.toolserver.org/view/Database_access
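
A minimal sketch of that rule, for anyone who wants it in code. The
function name `s1_host` is hypothetical, and the mapping of the "userdb"
alias to sql-s1-user and the "rrdb" alias to sql-s1-rr is an assumption
based on the names; check [1] before relying on it.

```python
def s1_host(needs_user_dbs: bool) -> str:
    """Pick an enwiki-p host alias (hypothetical helper, not a TS tool)."""
    # Assumption: tools that touch u_* or p_* databases need the
    # sql-s1-user copy, reached via the "userdb" alias.
    if needs_user_dbs:
        return "enwiki-p.userdb.toolserver.org"
    # Everything else should use the round-robin (sql-s1-rr) copy.
    return "enwiki-p.rrdb.toolserver.org"


print(s1_host(needs_user_dbs=False))
```

The deprecated "enwiki-p.db.toolserver.org" name is deliberately never
returned.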

-- 
  Russell Blau
  russblau <at> imapmail.org

Platonides | 8 Aug 2012 17:59

Re: s1 replag update, suggestion, and question

On Wed, Aug 8, 2012 at 3:46 PM, Russell Blau <russblau <at> imapmail.org> wrote:
> There is a possible workaround.  The TS could treat this like a server
> outage; copy user databases from thyme to rosemary and then point
> sql-s1-user to rosemary, which currently has no replag. Rosemary would
> then have to handle twice the load, but thyme should start to recover
> very quickly with no user-generated queries hitting it. Once thyme has
> recovered, point sql-s1-rr to it.
>
> Downsides: (1) this would require several hours of downtime for
> sql-s1-user while the user databases are copied; all tools that require
> access to user databases would be offline entirely for this period. (2)
> it would have to wait until our volunteer TS admins have time to do it.
Actually, it could probably be reduced from "downtime" to "read-only
user databases". If thyme were writing a binlog, it could probably
keep accepting writes for most of the copy. This comes at
the expense of TS admin time, of course.

> (3) the added load on rosemary could cause replag to grow there,
> although I doubt it would come anywhere near the 14+ days replag we are
> dealing with now on thyme.

Depending on how fast thyme applies inserts with no queries running,
another option would be to copy the db from rosemary to thyme.
(I'm assuming that would take much longer than the downtime for moving
the user dbs, but it's just a guess; if it didn't, this could replace
that move.)

Russell Blau | 10 Aug 2012 14:19

Re: s1 replag update, suggestion, and question

On Wed, Aug 8, 2012, at 11:59 AM, Platonides wrote:
> 
> Depending on the insert speed without queries, another option would be
> the time needed for copying the db from rosemary to thyme.
> (I'm assuming it would be much slower than the downtime moving user
> dbs but it's just a guess, if it weren't this could replace that
> move).
> 
Well, based on the overwhelming response to my last message, I guess
nobody but me cares if thyme is lagged by three or four or five
weeks....

Thyme finished processing the updates in the second block a few hours
ago, but the replag is continuing to increase. This is very worrisome,
and possibly there is something else going on there that the SHA-1
updates have been masking.  All the TS admins seem to be on summer
holiday; is there anyone around who has mysql root access and can look
for problems on thyme?


-- 
  Russell Blau
  russblau <at> imapmail.org

DaB. | 10 Aug 2012 16:37

Re: s1 replag update, suggestion, and question

Hello,
At Friday 10 August 2012 16:34:07 DaB. wrote:
> All the TS admins seem to be on summer
> holiday; is there anyone around who has mysql root access and can look
> for problems on thyme?

I killed a few very long-running queries on thyme and, as far as I can
see, the replag is slowly decreasing.

Sorry for the lack of response from me over the last few days, but I was
busy with non-TS stuff (and Nosy has another very important thing to do
at the moment :-))

Sincerely,
DaB.


-- 
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885

Carl (CBM) | 10 Aug 2012 21:42

Re: s1 replag update, suggestion, and question

On Fri, Aug 10, 2012 at 8:19 AM, Russell Blau <russblau <at> imapmail.org> wrote:
> Well, based on the overwhelming response to my last message, I guess
> nobody but me cares if thyme is lagged by three or four or five
> weeks....

I found your post very helpful for a status update of the current
situation. The lag has a huge effect on the WP 1.0 bot that is used to
track article assessments on enwiki, and which has a large user
database as well.

> Thyme finished processing the updates in the second block a few hours
> ago, but the replag is continuing to increase. This is very worrisome,
> and possibly there is something else going on there that the SHA-1
> updates have been masking.  All the TS admins seem to be on summer
> holiday; is there anyone around who has mysql root access and can look
> for problems on thyme?

Just to see if it makes any difference I killed the running WP 1.0
process on thyme.  Right now the replag seems to be decreasing at a
tiny rate, less than 10 minutes per hour.  There are 411 hours of
replag.
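
As a rough back-of-the-envelope check, assuming the recovery rate stays
constant (which it may well not), the figures above can be turned into a
time-to-zero estimate. The function name `hours_to_clear` is only for
illustration.

```python
def hours_to_clear(replag_hours: float, recovery_min_per_hour: float) -> float:
    """Wall-clock hours until replag reaches zero at a constant recovery rate."""
    if recovery_min_per_hour <= 0:
        raise ValueError("replag is not decreasing")
    # Each wall-clock hour removes recovery_min_per_hour minutes of lag.
    return replag_hours * 60 / recovery_min_per_hour


# 411 hours of replag, recovering at 10 minutes per hour (the upper bound
# on the observed rate, so the real wait would be at least this long).
hours = hours_to_clear(411, 10)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
# prints "2466 hours, about 103 days"
```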

For what it's worth, I would personally prefer a short complete outage
(or make the server read-only) if that would leave us with no replag,
rather than waiting for weeks or months while the replag slowly
decreases.

- Carl

Russell Blau | 11 Aug 2012 13:00

Re: s1 replag update, suggestion, and question

On Fri, Aug 10, 2012, at 03:42 PM, Carl (CBM) wrote:
> 
> Just to see if it makes any difference I killed the running WP 1.0
> process on thyme.  Right now the replag seems to be decreasing at a
> tiny rate, less than 10 minutes per hour.  There are 411 hours of
> replag.
> 

Carl, that is a good idea and I've stopped all the scheduled dpl project
jobs that usually run on thyme, for 24 hours, to see if that helps.  If
anyone else could temporarily shut down their tools to help reduce the
load on the server, maybe we can all help improve the recovery rate.

-- 
  Russell Blau
  russblau <at> imapmail.org

Samuel Klein | 13 Aug 2012 20:32

Re: s1 replag update, suggestion, and question



On Fri, Aug 10, 2012 at 3:42 PM, Carl (CBM) <cbm.wikipedia <at> gmail.com> wrote:
> On Fri, Aug 10, 2012 at 8:19 AM, Russell Blau <russblau <at> imapmail.org> wrote:
> > Well, based on the overwhelming response to my last message, I guess
> > nobody but me cares if thyme is lagged by three or four or five
> > weeks....
>
> I found your post very helpful for a status update of the current
> situation. The lag has a huge effect on the WP 1.0 bot that is used to
> track article assessments on enwiki, and which has a large user
> database as well.

+1
