Nick Hill | 1 Jan 15:16 2004

New system ideas

I thought I would throw out some ideas for configurations for new Wikipedia
systems:

Assuming we have a fast and reliable infrastructure for Wikipedia to
operate on, I would hope and expect many more people to benefit from the
great project. The new hardware configuration needs to have redundancy
and reliability built in.

The DNS system is, by nature, distributed and designed to have many
systems adding redundancy to the resolution service.

A single ip address will always be a single point of failure, as the
routers leading up to the physical ip destination will take some time to
propagate a new physical destination for the ip address.

It therefore makes sense to have redundancy switching at the DNS level
and to have systems located at different network locations offering
equivalent service.

This brings issues of updates and authentication into question. The
slave machine cannot authenticate or take database updates. A mechanism
is needed to automatically nominate a machine as master or slave. The
DNS system and each machine would respond to this.

Each wiki server, when booted, will, by default, be a slave. Each wiki
server will periodically request master status from the arbitration
server. The arbitration server can have an algorithm thus:

a) Each wiki server requests master status every 5 minutes.
b) The arbitration/master DNS server checks whether the wiki server
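A rough sketch of the grant logic on the arbitration server (Python purely for
illustration; the state variables, the 15-minute timeout and the handler name
are assumptions, not part of the proposal):

import time

GRANT_TIMEOUT = 15 * 60    # drop a master that has not renewed its grant for 15 minutes
current_master = None      # hostname of the machine currently holding master status
last_seen = 0              # when the current master last requested/renewed its grant

def handle_master_request(hostname):
    """Called each time a wiki server asks for master status (every 5 minutes)."""
    global current_master, last_seen
    now = time.time()

    # The sitting master renews its grant simply by asking again.
    if hostname == current_master:
        last_seen = now
        return "granted"

    # Promote the requester if there is no master yet, or the master has gone silent.
    if current_master is None or now - last_seen > GRANT_TIMEOUT:
        current_master = hostname
        last_seen = now
        return "granted"

    return "denied"    # stay a slave

Whichever machine currently holds the grant is then the one the master
hostname should resolve to.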

Nick Hill | 1 Jan 17:20 2004

Re: New system ideas

Nick Hill wrote:

> A specific host name is used for all updates. 

For example, updates.wikipedia.org would receive form data from any 
update. Which physical server/ip address this goes to depends upon the 
result of the negotiation system I previously described.
> The 'master hostname'.
> After each grant, the ip address of the master hostname is compared with
> the ip address of the machine which just received a grant.

To put it more clearly, if a different machine has just been granted master 
status, the ip address which updates.wikipedia.org resolves to will 
change to the ip address of the machine which has just been granted master 
status. BIND reloads and the change is pushed to the secondary DNS servers.
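A minimal sketch of that switch, assuming the arbitration host also runs the
primary name server; the zone file path, the name server names and the reload
command are assumptions, not the real setup:

import socket
import subprocess
import time

ZONE_FILE = "/etc/bind/db.updates.wikipedia.org"    # hypothetical path and layout
MASTER_NAME = "updates.wikipedia.org"

def point_master_at(new_master_ip):
    # Only rewrite the zone if the name does not already resolve to the new master.
    if socket.gethostbyname(MASTER_NAME) == new_master_ip:
        return

    serial = int(time.time())    # a timestamp serves as an ever-increasing zone serial
    with open(ZONE_FILE, "w") as zone:
        zone.write("$TTL 300\n")
        zone.write("@ IN SOA ns0.wikipedia.org. hostmaster.wikipedia.org. "
                   "(%d 3600 600 86400 300)\n" % serial)
        zone.write("@ IN NS ns0.wikipedia.org.\n")
        zone.write("@ IN A  %s\n" % new_master_ip)

    # 'rndc reload' on BIND 9, 'ndc reload' on BIND 8; the secondaries then pick
    # up the new serial through a normal zone transfer.
    subprocess.run(["rndc", "reload", MASTER_NAME], check=True)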

> If it
> differs, the ip address of the master hostname is changed. This change
> is pushed to the slave DNS servers using ordinary BIND8 protocol. The
> DNS TTL for the domain name of the master wiki server would be set to
> 300 seconds. If a master went down for any reason, everyone should see
> the alternative master within 20 minutes.

Each wiki server has a cron job running which periodically runs a shell 
script. This shell script first checks that the web server and database are 
running as expected by performing an HTTP GET and comparing the 
result of the GET with the expected result. If they match, the database 
and web server are considered working. If this fails, the script exits. 
If the script keeps exiting, the arbitration server will never receive 
notification that the machine works. The arbitration server will remove the
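A bare-bones sketch of such a check (the message describes a shell script; this
is Python purely for illustration, and the check URL, the expected marker text
and the arbitration endpoint are all assumptions):

import sys
import urllib.request

CHECK_URL = "http://localhost/wiki/Special:Version"     # assumed local check page
EXPECTED = "MediaWiki"                                   # assumed marker in a healthy response
ARBITRATION_URL = "http://arbiter.example.org/request-master?host=wiki1"   # hypothetical

def main():
    try:
        page = urllib.request.urlopen(CHECK_URL, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        sys.exit(1)        # web server not answering: exit quietly and stay a slave

    if EXPECTED not in page:
        sys.exit(1)        # page served but wrong content: database or wiki is broken

    # Web server and database look healthy: ask the arbitration server for master status.
    urllib.request.urlopen(ARBITRATION_URL, timeout=10)

if __name__ == "__main__":
    main()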

Nick Hill | 1 Jan 19:57 2004

Re: New system ideas

I have uploaded a diagram in pdf and Openoffice draw illustrating the 
redundant system idea.

http://www.nickhill.co.uk/wikipedia-redundant-system.pdf
http://www.nickhill.co.uk/wikipedia-redundant-system.sxd
Timwi | 6 Jan 06:59 2004

Re: New system ideas


Nick Hill wrote:

> I have uploaded a diagram in pdf and Openoffice draw illustrating the 
> redundant system idea.
> 
> http://www.nickhill.co.uk/wikipedia-redundant-system.pdf

Hm. You said you were trying to eliminate single points of failure, but 
in this system it seems that what you call the "Arbitration Script" 
would be such a single point of failure?

I'm probably missing something. Sorry if I am.

Timwi
Nick Hill | 6 Jan 13:15 2004

Re: Re: New system ideas

Timwi wrote:
> 
> Nick Hill wrote:
> 
>> I have uploaded a diagram in pdf and Openoffice draw illustrating the 
>> redundant system idea.
>>
>> http://www.nickhill.co.uk/wikipedia-redundant-system.pdf
> 
> 
> Hm. You said you were trying to eliminate single points of failure, but 
> in this system it seems that what you call the "Arbitration Script" 
> would be such a single point of failure?
> 
> I'm probably missing something. Sorry if I am.

If the arbitration script failed, the master would cease to accept 
updates, but all the wikis would still be available for read-only access.
Jens Frank | 2 Jan 17:27 2004

Re: New system ideas

On Thu, Jan 01, 2004 at 02:16:21PM +0000, Nick Hill wrote:
> I thought I would throw out some ideas for configurations for new Wikipedia
> systems:
> 
> Assuming we have a fast and reliable infrastructure for Wikipedia to
> operate on, I would hope and expect many more people to benefit from the
> great project. The new hardware configuration needs to have redundancy
> and reliability built in.
> 
> The DNS system is, by nature, distributed and designed to have many
> systems adding redundancy to the resolution service.
> 
> [....]

The idea of having a dedicated http server for updates relies on DNS
entries with a very short time to live (TTL). Else DNS servers all
over the world would cache the DNS entry for updates.wikipedia.org
for a long time. Most DNS servers handle entries with a short TTL
correctly, but several browsers don't. They cache IP addresses, but
don't honor the TTL. Mozilla is one of those. My home box has a 
dyndns.org hostname and Mozilla caches the IP for several days
despite the TTL being 5 minutes or so.

I think it's better to have the web servers load balanced by a tool
like http://www.linuxvirtualserver.org/ and have MySQL database 
replication between two database servers.

Regards,

	JeLuF

Nick Hill | 2 Jan 20:58 2004

Re: New system ideas

Jens Frank wrote:

> The idea of having a dedicated http server for updates relies on DNS
> entries with a very short time to live (TTL). Else DNS servers all
> over the world would cache the DNS entry for updates.wikipedia.org
> for a long time. Most DNS servers handle entries with a short TTL
> correctly, but several browsers don't. They cache IP addresses, but 
> don't honor the TTL. Mozilla is one of those. My home box has a 
> dyndns.org hostname and Mozilla caches the IP for several days
> despite the TTL being 5 minutes or so.

A way to overcome this problem is to append a random fourth level domain 
name to the URLs pointing to the update server. This way, each time the 
update server is referenced, the ip address is refreshed.

eg
Where <anything> is replaced by a random string:

http://<anything>.update.wikipedia.org/ resolves via a wildcard DNS 
entry to the current ip address of update.wikipedia.org. This way, 
whenever an update link is pressed, the most up to date IP address is 
fetched from the DNS.
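A small sketch of how such update URLs could be formed (the wildcard record in
the comment and the random-label scheme are illustrative assumptions, not an
existing setup):

import random
import string

# Relies on a wildcard DNS entry along the lines of
#   *.update.wikipedia.org.  300  IN  A  <current ip of update.wikipedia.org>
# so that every random label resolves to the current master.

def update_url(path="/"):
    # A throwaway random label defeats per-hostname IP caching in the browser.
    label = "".join(random.choice(string.ascii_lowercase + string.digits) for _ in range(12))
    return "http://%s.update.wikipedia.org%s" % (label, path)

print(update_url())    # e.g. http://k3f9x0q2m7ab.update.wikipedia.org/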
Tim Thorpe | 2 Jan 21:59 2004

RE: New system ideas

What about sticking the entire cluster on private IP's and having a load
balancing firewall appliance handle traffic flow which would randomly hit
the update box?

-----Original Message-----
From: wikitech-l-bounces@...
[mailto:wikitech-l-bounces@...] On Behalf Of Nick Hill
Sent: Friday, January 02, 2004 11:58 AM
To: Wikimedia developers
Subject: Re: [Wikitech-l] New system ideas

Jens Frank wrote:

> The idea of having a dedicated http server for updates relies on DNS
> entries with a very short time to live (TTL). Else DNS servers all
> over the world would cache the DNS entry for updates.wikipedia.org
> for a long time. Most DNS servers handle entries with a short TTL
> correctly, but several browsers don't. They cache IP addresses, but 
> don't honor the TTL. Mozilla is one of those. My home box has a 
> dyndns.org hostname and Mozilla caches the IP for several days
> despite the TTL being 5 minutes or so.

A way to overcome this problem is to append a random fourth level domain 
name to the URLs pointing to the update server. This way, each time the 
update server is referenced, the ip address is refreshed.

eg
Where <anything> is replaced by a random string:

http://<anything>.update.wikipedia.org/ resolves via a wildcard DNS 

Jens Frank | 2 Jan 23:28 2004

Re: New system ideas

On Fri, Jan 02, 2004 at 12:59:14PM -0800, Tim Thorpe wrote:
> What about sticking the entire cluster on private IP's and having a load
> balancing firewall appliance handle traffic flow which would randomly hit
> the update box?

I think Nick is trying to have a global cluster, and you can not achieve
this with a simple load balancer. If one data center goes down, the load
balancer would go down, too.

DNS has ways to handle the outage of a server, and Nick is trying to use these.

JeLuF
Jimmy Wales | 5 Jan 19:41 2004

Re: New system ideas

Jens Frank wrote:
> > What about sticking the entire cluster on private IP's and having a load
> > balancing firewall appliance handle traffic flow which would randomly hit
> > the update box?
> 
> I think Nick is trying to have a global cluster, and you can not achieve
> this with a simple load balancer. If one data center goes down, the load
> balancer would go down, too.

That's right.  Nick's proposal, though, solves a problem that we don't
actually have.  It is of course possible for an entire data center to
go down, but it's extremely rare and should be close to the bottom of
our list of problems to solve.

The much more likely scenario -- one that we live with every day -- is
that a single webserver will fall over.  After we get that problem
solved, basically with a load-balancing firewall appliance
(linuxvirtualserver.org stuff), then we can also consider such things
as global clustering and locating servers all over the planet or
whatever.

--Jimbo
Nick Hill | 3 Jan 00:34 2004

Re: New system ideas

Tim Thorpe wrote:
> What about sticking the entire cluster on private IP's and having a load
> balancing firewall appliance handle traffic flow which would randomly hit
> the update box?

There would be several points of failure with such a system.

To make a system really reliable, all single points of failure need to 
be removed.

E.g. building-specific hazards: power loss (UPSes sometimes fail), network 
cables cut, fire, burglary, landlord repossession, hosting company 
bankruptcy, malicious attack, human error, plane crash, etc.

Machine-specific hazards: any single machine in the system failing and 
bringing everything down, through either hardware failure or malicious attack.

Any single segment of the network failing and bringing the system down: 
hardware failure, human error or malicious attack.

I believe a design philosophy where the system is immune from any single 
element failing is both the most cost-effective and the most reliable. 
Rather than invest heavily for reliability in mission-critical systems, 
make no system mission critical. No system then needs to have 
mission-critical investment. The overall system will then be cheaper and 
more reliable.

To put it another way:
All systems will fail. The probability of a single reliable costly unit 
failing is still fairly high. The probability of many fairly reliable 
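To put illustrative (entirely made-up) numbers on that argument:

# Purely illustrative failure probabilities, not measured figures.
p_cheap = 0.05      # chance that any one cheap box is down at a given moment
p_costly = 0.01     # chance that the single costly 'reliable' box is down

# With three independent cheap boxes (no common point of failure), the service
# is only lost when all three are down at once.
p_all_cheap_down = p_cheap ** 3       # 0.000125, i.e. 0.0125%

print(p_all_cheap_down / p_costly)    # 0.0125 -- about 80 times less likely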

Gabriel Wicke | 3 Jan 11:29 2004

Re: New system ideas

On Fri, 02 Jan 2004 23:34:14 +0000, Nick Hill wrote:

> The probability of many fairly reliable 
> cheap units with no common point of failure breaking down 
> simultaneously is much lower than the probability of a costly reliable 
> unit failing.

I agree with all you're saying and like the thought of having a global
cluster with arbitration, but i have some doubts:

* What's the minimum hardware capable of running the databases, the
webserver, the cache etc? Is all this possible on a cheap unit while still
being fast? I would expect a RAM requirement of at least 4Gb, but i might
be wrong. This would certainly increase once more languages start to grow,
so it might be necessary to have separate machines for separate languages.

* With the number of nodes increasing, replication traffic might be fairly
high (imagine mass undoing somebody's changes replicated to ten machines)

* encryption of replication traffic will drain the cpu, even a simple scp
does this- imagine the same for ten streams

> If no single machine is critical and machines are widely separated, we 
> would not even need to worry whether the machines are equipped with UPS 
> or redundant supplies.

If the switchover is quick, this would be perfect- no need for separate
backups and so on. 

To get an idea of the hardware requirements it would be nice if somebody

Nick Hill | 3 Jan 12:41 2004

Re: Re: New system ideas

Gabriel Wicke wrote:

> I agree with all you're saying and like the thought of having a global
> cluster with arbitration, but i have some doubts:
> 
> * What's the minimum hardware capable of running the databases, the
> webserver, the cache etc? Is all this possible on a cheap unit while still
> being fast? I would expect a RAM requirement of at least 4Gb, but i might
> be wrong. This would certainly increase once more languages start to grow,
> so it might be necessary to have separate machines for separate languages.

This depends on the size of the data sets. The busiest fine-grained data 
can be held in memory, e.g. a ramdisk. A machine with dual 64-bit Opterons, 
8Gb RAM and an Eric remote administration card weighs in at around 
US$4500. The machine can be upgraded to 16Gb. 64-bit is necessary for 
machines with over 4Gb; 4Gb is the addressable limit for 32-bit.

I posted a possible hardware config to wikitech-l on 01/01/04 14:16.

> 
> * With the number of nodes increasing, replication traffic might be fairly
> high (imagine mass undoing somebody's changes replicated to ten machines)
> 
> * encryption of replication traffic will drain the cpu, even a simple scp
> does this- imagine the same for ten streams

A compressed SCP connection using the Blowfish cipher for English text 
between two AMD Athlon XP2200+ CPUs gives a throughput of 
3.3 megabytes/sec. The majority of the load is on the sending machine. A 
dual Opteron 240 with a 64-bit optimised cipher algorithm can probably 

Gabriel Wicke | 3 Jan 13:57 2004

Re: Re: New system ideas

On Sat, 03 Jan 2004 11:41:59 +0000, Nick Hill wrote:

> Do we have the means to replay typical wikipedia activity from a log file? I 
> am thinking of a Perl script which reads a real wikipedia common log 
> file, replaying the same load pattern at definable speeds.
> 
> If someone has a pre-configured server image and some real log files, I 
> can do this.
> 
> I have already written a simple Perl script which can be modified to 
> replay server load in real time from log files.

http://www.cs.virginia.edu/~rz5b/software/logreplayer-manual.htm
could be useful, or http://opensta.org/.

Even more useful would be a full incoming traffic dump and a full disk
image of the start condition (it's mainly the editing that is
interesting). tcpdump and tcpreplay might be good candidates for this.
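A cut-down sketch of the kind of replay script being discussed (Python here
rather than Perl; the target host is made up and the Apache common-log
timestamp format is assumed):

import re
import sys
import time
import urllib.request
from datetime import datetime

TARGET = "http://testserver.example.org"    # hypothetical server under test
SPEEDUP = 1.0                               # >1 replays faster than real time

LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)')

def replay(logfile):
    start_log = start_wall = None
    with open(logfile) as log:
        for line in log:
            m = LINE.search(line)
            if not m or m.group("method") != "GET":
                continue
            ts = datetime.strptime(m.group("ts").split()[0], "%d/%b/%Y:%H:%M:%S")
            if start_log is None:
                start_log, start_wall = ts, time.time()
            # Sleep until this request's offset into the log has elapsed.
            delay = (ts - start_log).total_seconds() / SPEEDUP - (time.time() - start_wall)
            if delay > 0:
                time.sleep(delay)
            try:
                urllib.request.urlopen(TARGET + m.group("path"), timeout=10).read()
            except OSError:
                pass        # count it as a failed request and carry on

if __name__ == "__main__":
    replay(sys.argv[1])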

If it turns out that the minimum server for the current database
indeed would be a 64-bit, 8Gb 'cheap' machine, then it might get pretty
hard when the database grows (especially with multiple languages etc.).
Just an import of http://www.meyers-konversationslexikon.de/ into the
German Wikimedia (not to mention its nice illustrations) would create
problems, I guess.

> A compressed SCP connection using the Blowfish cipher for English text 
> between two AMD Athlon XP2200+ CPUs gives a throughput of 
> 3.3Megabytes/sec. The majority of the load is on the sending machine. A 
> dual Opteron 240 with a 64 bit optimised cipher algorithm can probably 

Tim Thorpe | 3 Jan 20:37 2004

RE: New system ideas

The obstacle is the DB server: to have an offsite dump 24/7 of the DB server
would effectively double the bandwidth used for each transaction. I work for
an ISP, and N+1 takes care of most DC-internal issues. The DC that is being
used is Verio, which is a VERY large hosting company; I don't see them going
down any time soon. Some advanced routers have a dial-up redundancy capability
where the router can phone an off-site router on a separate land line to
inform it that there is a network issue on one end and that all data needs to
be re-routed to the secondary stack.

You can never eliminate all single points of failure, but I also see that
some of us are losing our heads when it comes to solutions. Some suggested
solutions would cost not in the tens of thousands but hundreds of thousands
to implement.

Remember the golden rule of network engineering: KEEP IT SIMPLE, STUPID! ;)

-----Original Message-----
From: wikitech-l-bounces@...
[mailto:wikitech-l-bounces@...] On Behalf Of Nick Hill
Sent: Friday, January 02, 2004 3:34 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] New system ideas

Tim Thorpe wrote:
> What about sticking the entire cluster on private IP's and having a load
> balancing firewall appliance handle traffic flow which would randomly hit
> the update box?

There would be several points of failure with such a system.


Gabriel Wicke | 3 Jan 22:04 2004

RE: New system ideas

On Sat, 03 Jan 2004 11:37:04 -0800, Tim Thorpe wrote:

> The obstacle is the DB server, to have an offsite dump 24/7 of the DB
> server would effectively double the bandwidth used for each transaction. I
> work for an ISP N+1 takes care of most DC internal issues, the DC that is
> being used is Verio which is a VERY large hosting company, I don't see
> them going down any time soon. Some advanced routers have a dial up
> redundancy capability where it can phone on a separate land line an
> off-site router to inform it that there is a network issue on one end and
> that all data needs to be re-routed to the secondary stack.
> 
> You can never eliminate all single points of failure, but I also see that
> some of us are losing our heads when it comes to solutions. Some
> suggested solutions would cost not in the tens of thousands but hundreds
> of thousands to implement.
> 
> Remember the golden rule to network engineering, KEEP IT SIMPLE STUPID!
> ;).

I agree.

The idea of a WikiTella thing is fascinating but seems to be very hard to
implement now.

Factors that would help WikiTella:
* hardware performance grows really fast
* bandwidth gets really cheap

If somebody managed to get a prototype of this working under load and
with little bandwidth requirements I would be all for this solution, but I

Jimmy Wales | 5 Jan 19:31 2004

Re: New system ideas

Nick Hill wrote:
> A single ip address will always be a single point of failure as the
> routers leading up to the physical ip destination will take some time to
> propagate a new physical destination for the ip address.
> 
> It therefore makes sense to have redundancy switching at the DNS level
> and to have systems located at different network locations offering
> equivalent service.

At some point, yes, I agree completely.  But it's important that we
focus our energy on solving problems that we actually have, rather
than problems which are small and unlikely.  We have no plans to host
anywhere other than professional facilities with full redundancy of
everything important, and so the chances of a failure of the type you
are imagining impacting us are very small.

To improve reliability, it makes sense to focus our efforts on the
sorts of things that are most likely to go wrong, first.  And then,
later, if we can afford it and if it makes economic sense, we can
focus on the more esoteric risks.

> Dual 64 bit Opteron mainboards are available with 8 DIMM sockets capable
> of taking 16Gb. eg:
> http://www.tyan.com/products/html/thunderk8w_spec.html

Right, we have one of those (though broken at the moment).  Right now,
our entire db should fit comfortably in 4 gig of Ram, which we have.

But I totally agree with you about the importance of avoiding the
hard drive as much as possible.

Nikos-Optim | 5 Jan 22:04 2004

Re: New system ideas


> > Dual 64 bit Opteron mainboards are available with 8 DIMM sockets capable
> > of taking 16Gb. eg:
> > http://www.tyan.com/products/html/thunderk8w_spec.html
> 
> Right, we have one of those (though broken at the moment).  Right now,
> our entire db should fit comfortably in 4 gig of Ram, which we have.
> 
> --Jimbo

the broken mobo is a Tyan??

Ricky Beam | 5 Jan 22:21 2004

Re: New system ideas

On Mon, 5 Jan 2004, Nikos-Optim wrote:
> the broken mobo is a Tyan??

Where UPS is involved, nothing is safe.  These things happen. (I'll assume
someone checked Tyan's approved memory list to make sure it's not just
something lame with the memory making it unusable on that MB.)

--Ricky
Jimmy Wales | 5 Jan 22:57 2004

Re: New system ideas

Ricky Beam wrote:
> On Mon, 5 Jan 2004, Nikos-Optim wrote:
> > the broken mobo is a Tyan??
> 
> Where UPS is involved, nothing is safe.  These things happen. (I'll assume
> someone checked Tyan's approved memory list to make sure it's not just
> something lame with the memory making it unusable on that MB.)

Actually, I'm not 100% sure what the mobo is, and Jason's not around
right now for me to ask.  My point was, we have a motherboard that
can accept up to 16 gig of ram.  We have 4 gig, and plenty of slots
available for further expansion.

And yes, the Ram is perfectly fine, and Jason has tested it.  What he
discovered is that one slot on the mobo is bad.  So Penguin is sending
a new one for us.

--Jimbo
Jason Richey | 6 Jan 10:08 2004

Re: New system ideas

The MB is an Arima HDAMA...http://www.rioworks.com/Download/HDAMA.htm

Jason

Jimmy Wales wrote:

> Ricky Beam wrote:
> > On Mon, 5 Jan 2004, Nikos-Optim wrote:
> > > the broken mobo is a Tyan??
> > 
> > Where UPS is involved, nothing is safe.  These things happen. (I'll assume
> > someone checked Tyan's approved memory list to make sure it's not just
> > something lame with the memory making it unusable on that MB.)
> 
> Actually, I'm not 100% sure what the mobo is, and Jason's not around
> right now for me to ask.  My point was, we have a motherboard that
> can accept up to 16 gig of ram.  We have 4 gig, and plenty of slots
> available for further expansion.
> 
> And yes, the Ram is perfectly fine, and Jason has tested it.  What he
> discovered is that one slot on the mobo is bad.  So Penguin is sending
> a new one for us.
> 
> --Jimbo

