Lars Magne Ingebrigtsen | 9 Mar 22:07 2012

News server switchover sometime next week

As detailed in

http://lars.ingebrigtsen.no/2012/03/march-9th-2012.html

I've now finished benchmarking the various solutions, and am now
rsyncing over the entire Gmane spool to the new machine.  I expect to do
the actual switchover perhaps next weekend?  We'll see.  There'll be
some downtime, but it shouldn't be massive...

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/
Lars Magne Ingebrigtsen | 7 Apr 20:00 2012

Re: News server switchover sometime next week

Lars Magne Ingebrigtsen <larsi <at> gnus.org> writes:

> I've now finished benchmarking the various solutions, and am now
> rsyncing over the entire Gmane spool to the new machine.  I expect to do
> the actual switchover perhaps next weekend?  We'll see. 

Four weeks later, the sync still hasn't finished.  *sigh*  Well,
actually, I just started it two weeks ago, but still.

This isn't even about actually doing the sync.  I rsynced over 99% of
the spool from the backup spool to the new server two weeks ago.  Then I
started the final "mop-up" stuff, that basically just does a readdir on
the production server and the new server, and copies over the few
missing messages that arrived during the initial rsync.

But doing a readdir on, say, gmane.comp.kde.bugs, which has 600K files
in it, takes hours.  HOURS!  I'm serious.  It's been doing a readdir on
that group since I got up at noon, and it's now eight at night.

The good news is that that group is group number 10928, so there's now
just, er, 10K groups to go.  The better news is that the groups grow
smaller the newer the groups are, in general, so I would guesstimate
that it should be finished in about a week.  Or two.

At that point, I'll do the switchover and news.gmane.org should be
reliably usable again.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/

Lars Magne Ingebrigtsen | 7 Apr 20:09 2012

Re: News server switchover sometime next week

Oh, and I did some minor touch-ups to the statistics on the front page
(http://gmane.org/).  The stats now exclude all Gwene traffic.  So there
are 13K Gmane groups and 7K Gwene groups.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/
Tim Landscheidt | 7 Apr 22:44 2012

Re: News server switchover sometime next week

Lars Magne Ingebrigtsen <larsi <at> gnus.org> wrote:

>> I've now finished benchmarking the various solutions, and am now
>> rsyncing over the entire Gmane spool to the new machine.  I expect to do
>> the actual switchover perhaps next weekend?  We'll see.

> Four weeks later, the sync still hasn't finished.  *sigh*  Well,
> actually, I just started it two weeks ago, but still.

> This isn't even about actually doing the sync.  I rsynced over 99% of
> the spool from the backup spool to the new server two weeks ago.  Then I
> started the final "mop-up" stuff, that basically just does a readdir on
> the production server and the new server, and copies over the few
> missing messages that arrived during the initial rsync.

> But doing a readdir on, say, gmane.comp.kde.bugs, which has 600K files
> in it, takes hours.  HOURS!  I'm serious.  It's been doing a readdir on
> that group since I got up at noon, and it's now eight at night.

> The good news is that that group is group number 10928, so there's now
> just, er, 10K groups to go.  The better news is that the groups grow
> smaller the newer the groups are, in general, so I would guesstimate
> that it should be finished in about a week.  Or two.

> At that point, I'll do the switchover and news.gmane.org should be
> reliably usable again.

Eh, aren't messages still arriving, so you would have to do
the mop-up before the switchover *again*?  Or do you set a
group to read-only, rsync, change the storage location to
[...]

Lars Magne Ingebrigtsen | 7 Apr 23:36 2012

Re: News server switchover sometime next week

Tim Landscheidt <tim <at> tim-landscheidt.de> writes:

> Eh, aren't messages still arriving, so you would have to do
> the mop-up before the switchover *again*?

Nope.  There's a separate log-file-based backup process that copies over
articles as they arrive.  That's how the various mirrors are
maintained.  The log-file-based thing doesn't extend that far back into
the past, though...

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/
Lars Magne Ingebrigtsen | 8 Apr 02:08 2012

Re: News server switchover sometime next week

D'oh.

So, basically, I'm doing the mop-up sweep by doing a readdir on the
destination (i.e., /mirror/var/spool/news/articles/gmane/discuss, for
instance), and then on the source (an nfs-mounted
/var/spool/news/articles/gmane/discuss), and then seeing what files are
missing/should be deleted.

The idea being that readdir would need no stat()-ing, so it should be
fast.
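The comparison described above can be sketched roughly like this (a minimal sketch, not the actual sync script; the function name and the copy/delete decision logic are assumptions):

```python
import os

def mop_up(source_dir, dest_dir):
    """Compare two article directories by filename only -- a bare
    readdir on each side, no stat()-ing -- and report which articles
    are missing on the mirror and which should be deleted there."""
    src = set(os.listdir(source_dir))   # readdir on the production spool
    dst = set(os.listdir(dest_dir))     # readdir on the new server
    missing = src - dst                 # arrived during the initial rsync
    extra = dst - src                   # expired/cancelled on production
    return sorted(missing), sorted(extra)
```

Since article files are named by number and only the name sets are compared, the per-directory cost should be a single directory scan on each side.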

But then I thought...  what if the main problem here is that readdir
over NFS is just bone slow?  Reading huge directories over NFS probably
isn't the most benchmarked thing in the world.

So I tried mounting via sshfs instead.  Faster, but still kinda slow.
One positive effect was that the load on the server went down from 16 to
7, though, so, hey, win!

Then I straced sshfs (i.e., sftp-server) just for kicks, just to see
what it's doing while doing a readdir.  It did this fucking thing for
every file in the directory!

gettimeofday({1333843231, 411240}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
lstat("/var/spool/news/articles/gmane/discuss/8111", {st_mode=S_IFREG|0644, st_size=4502, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 6
lseek(6, 0, SEEK_CUR)                   = 0
[...]

Lars Magne Ingebrigtsen | 8 Apr 18:26 2012

Re: News server switchover sometime next week

So.  It finished the sweep last night.  To check I did:

ger:~# df -ih /var/spool/news/ /mirror/var/spool/news/
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
dough:/var/spool/news
                        216M    130M     86M   61% /var/spool/news
/dev/sdb1               239M    115M    125M   48% /mirror/var/spool/news

Urr.  There are 15M fewer articles on the new server.  OMG!  Did my new
version of the sync script delete a bunch of articles?

But, no, the missing articles are from the old version of the script.
For instance, gmane.linux.kernel has 700K fewer articles than on the old
server, and that group was swept a week ago, at least.

Looking in the kernel log, I find:

[892617.826777] NFS: directory linux/kernel contains a readdir loop.Please contact your server vendor

*sigh*

So, doing readdir over NFS that takes hours, on a directory that changes
(since it gets new messages written to it?), leads NFS to think that
there are readdir loops?  So readdir (probably) returned a truncated
list, which led the sync script to delete a bunch of articles on the
mirror.

That's my guess.
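A defensive check that would have caught this failure mode might look like the following (a sketch only, not the actual script; the 1% threshold is an assumption): if the source listing implies deleting an implausibly large share of the mirror, assume the listing was truncated and refuse.

```python
import os

def safe_extra_articles(source_dir, dest_dir, max_delete_fraction=0.01):
    """Only trust deletions if the source listing looks sane.
    A truncated readdir (e.g. the NFS readdir-loop case) makes the
    source look smaller than the mirror, which would wrongly flag
    huge numbers of mirror articles for deletion."""
    src = set(os.listdir(source_dir))
    dst = set(os.listdir(dest_dir))
    extra = dst - src
    if dst and len(extra) > max_delete_fraction * len(dst):
        raise RuntimeError(
            f"refusing to delete {len(extra)} of {len(dst)} articles; "
            "source listing may be truncated")
    return sorted(extra)
```

The same idea exists in rsync as `--max-delete`, which aborts a sync once a deletion limit is exceeded.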

So I've started the new version of the sweep on all the groups again,
[...]

Jeff Grossman | 8 Apr 18:36 2012

Re: News server switchover sometime next week

On Sun, 08 Apr 2012 18:26:44 +0200, Lars Magne Ingebrigtsen wrote:

>So.  It finished the sweep last night.  To check I did:
>
>ger:~# df -ih /var/spool/news/ /mirror/var/spool/news/
>Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>dough:/var/spool/news
>                        216M    130M     86M   61% /var/spool/news
>/dev/sdb1               239M    115M    125M   48% /mirror/var/spool/news
>
>Urr.  There are 15M fewer articles on the new server.  OMG!  Did my new
>version of the sync script delete a bunch of articles?
>
>But, no, the missing articles are from the old version of the script.
>For instance, gmane.linux.kernel has 700K fewer articles than on the old
>server, and that group was swept a week ago, at least.
>
>Looking in the kernel log, I find:
>
>[892617.826777] NFS: directory linux/kernel contains a readdir loop.Please contact your server vendor
>
>*sigh*
>
>So, doing readdir over NFS that takes hours, on a directory that changes
>(since it gets new messages written to it?), leads NFS to think that
>there are readdir loops?  So readdir (probably) returned a truncated
>list, which led the sync script to delete a bunch of articles on the
>mirror.
>
>That's my guess.
[...]

Lars Magne Ingebrigtsen | 10 Apr 10:04 2012

Re: News server switchover sometime next week

So.  What happened this time?

ger:~# df -ih /mirror/var/spool/news/articles/ /var/spool/news/articles/
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb1               239M    122M    117M   52% /mirror/var/spool/news
dough:/var/spool/news
                        216M    131M     86M   61% /var/spool/news

We were up to 131M messages, and then dropped down to 122M in the final
final mop-up sweep.

And that's because inetd helpfully has a 256 connections per minute
default, and my script didn't check for connection errors.  When it
wasn't able to connect, it deleted all messages in the groups in
question.
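The underlying bug is that a failed connect and an empty group looked identical to the script. One way to keep them distinct (a sketch with a hypothetical `list_remote_group` helper and made-up wire protocol, not the actual setup): raise on connection failure so it can never be mistaken for "no articles here".

```python
import socket

def list_remote_group(host, port, group, timeout=30):
    """Fetch a remote article listing. Raise instead of returning an
    empty list when the connection fails (e.g. inetd rate-limiting),
    so a failed connect is never treated as an empty group."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(f"LIST {group}\n".encode())
            data = b""
            while chunk := sock.recv(4096):
                data += chunk
    except OSError as exc:
        raise RuntimeError(
            f"could not list {group} on {host}: {exc}") from exc
    return data.decode().split()
```

With that distinction in place, a burst of inetd-refused connections aborts the sweep for those groups instead of emptying them on the mirror.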

*sigh*

Again.

Syncing again...

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/
