Yuri Astrakhan | 7 Jul 2006 16:44

Re: fwd: wikicnt_daemon.pl

*You* need to take responsibility for checking that *your* tools
aren't causing *our* server to die. You've got to be reasonable and
make sure you glance at it all periodically.

Rob Church
 
Rob, I agree with most of what you said, except for the last statement - we are all developers donating our time to the common cause. We all make mistakes, and we help each other to fix them. Edward accepts responsibility by figuring out what's wrong and trying to fix it. Saying *your* tool on *our* server is, in my opinion, improper.
 
--Yuri

 
Rob Church | 7 Jul 2006 16:46

Re: fwd: wikicnt_daemon.pl

On 07/07/06, Yuri Astrakhan <yuriastrakhan <at> gmail.com> wrote:
> Rob, I agree with most of what you said, except for the last statement - we
> are all developers donating our time to the common cause. We all make
> mistakes, and we help each other to fix them. Edward accepts responsibility
> by figuring out what's wrong and trying to fix it. Saying *your* tool on
> *our* server is, in my opinion, improper.

Well, in MY opinion, the "common cause" sounds disgusting, Communistic and foul.

Good job we can't be persecuted for our opinions where I live.

Rob Church

Yuri Astrakhan | 7 Jul 2006 17:18

Re: fwd: wikicnt_daemon.pl

Prosecuted for opinions - no,
Scolded for childish behavior – yes

By saying "*our* server", you divided the community between Edward and everyone else, and assigned yourself to be the official representative of "everyone else". Please first obtain such a mandate, and only then make such statements. For now, I think you can only make statements as you, Rob, not as the group. The royal *we* is a bit out of fashion.


On 7/7/06, Rob Church <robchur <at> gmail.com> wrote:
On 07/07/06, Yuri Astrakhan <yuriastrakhan <at> gmail.com> wrote:
> Rob, I agree with most of what you said, except for the last statement - we
> are all developers donating our time to the common cause. We all make
> mistakes, and we help each other to fix them. Edward accepts responsibility
> by figuring out what's wrong and trying to fix it. Saying *your* tool on
> *our* server is, in my opinion, improper.

Well, in MY opinion, the "common cause" sounds disgusting, Communistic and foul.

Good job we can't be persecuted for our opinions where I live.


Rob Church
_______________________________________________
Toolserver-l mailing list
Toolserver-l <at> Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/toolserver-l

Rob Church | 7 Jul 2006 17:21

Re: fwd: wikicnt_daemon.pl

On 07/07/06, Yuri Astrakhan <yuriastrakhan <at> gmail.com> wrote:
>  By saying "*our* server", you divided community between Edward and everyone
> else, and assigned yourself to be the official representative of "everyone
> else". Please first obtain such mandate, and only then make such statements.
> For now, I think you can only make statements as you, Rob, not as the group.
> Royal *we* is a bit out of fashion.

Shall we drop the pedantic pissing about and get back to the real issue, then?

Rob Church

Daniel Kinzler | 7 Jul 2006 21:23

Re: fwd: wikicnt_daemon.pl

> Shall we drop the pedantic pissing about and get back to the real issue, then?

Yes, please.

I talked to Leon about ways to make hit counters feasible, for all
projects. The core points are:

* Just like Edward did, use JS code to trigger an HTTP request on page
views. But this should be throttled to a probability of 1% - or, for
large projects, 0.1%. This should still give us usable stats for the
most popular pages.

* Just like Edward, use a persistent server, not CGI/PHP. To avoid
exposing home-brewed hacks to the wild web, we should stick to something
tried and true. I suggested implementing it as a Java servlet. Should be
fairly straightforward, and we have Tomcat running anyway.

* To get around latency issues with the database, don't spawn (that causes
more load on the already troubled DB); instead, cache updates in RAM for
a minute or so, then flush them into the DB in a single insert.

* Edward used a lot of ram for a name -> id mapping. This should be
avoided - the name is unique, we don't need the page ID. If we want the
ID, it should be determined on the wikipedia server and supplied with
the request - I talked to Tim Starling about making this and other
useful things available as JS variables.
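
A minimal sketch of the "cache updates in RAM, then flush" point above, not code anyone in this thread wrote. It uses Python with the standard sqlite3 module as a stand-in for the real database; the class name HitBuffer, the table page_hits and its columns are hypothetical.

```python
import sqlite3
from collections import Counter


class HitBuffer:
    """Accumulate page hits in RAM and write them out in one batch."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = Counter()
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS page_hits ("
                "title TEXT PRIMARY KEY, views INTEGER NOT NULL DEFAULT 0)"
            )

    def record(self, title):
        # No database access here: only an in-memory counter is bumped.
        self.pending[title] += 1

    def flush(self):
        # One transaction for everything seen since the last flush.
        rows = list(self.pending.items())
        with self.conn:
            self.conn.executemany(
                "INSERT OR IGNORE INTO page_hits (title, views) VALUES (?, 0)",
                [(title,) for title, _ in rows],
            )
            self.conn.executemany(
                "UPDATE page_hits SET views = views + ? WHERE title = ?",
                [(n, title) for title, n in rows],
            )
        self.pending.clear()
        return len(rows)
```

A timer would call flush() about once a minute; record() itself never touches the database, so a burst of hits costs only dictionary updates.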

Perhaps Edward and Leon can work on this together. In any case, I would
suggest throttling updates from ruwiki to 1% of page hits, *if* the
page counter is to be enabled again. Something like this should do:

if (Math.floor(Math.random() * 100) == 0) ...
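
The same throttle can be sketched in Python together with the step it implies on the reporting side: scaling sampled counts back up. The names SAMPLE_RATE, should_report and estimate_views are made up for this illustration.

```python
import random

SAMPLE_RATE = 0.01  # 1% of page views, the figure suggested above


def should_report(rng=random):
    # rng.random() is uniform on [0, 1), so this is true with
    # probability SAMPLE_RATE.
    return rng.random() < SAMPLE_RATE


def estimate_views(sampled_hits, rate=SAMPLE_RATE):
    # Scale the sampled count back up to an estimate of real views.
    return sampled_hits / rate
```

With a 1% rate each recorded hit stands for roughly 100 real views, so the estimate is only meaningful for pages with plenty of traffic.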

Regards,
-- Daniel

-- 
Homepage: http://brightbyte.de

Leon Weber | 7 Jul 2006 21:29

Re: fwd: wikicnt_daemon.pl

Daniel Kinzler wrote:
>> Shall we drop the pedantic pissing about and get back to the real issue, then?
>>     
>
> Yes, please.
>
> I talked to Leon about ways to make hit counters feasible, for all
> projects. The core points are:
> [...]
Yeah. I wanted to create these stats for dewiki, so just like
duesentrieb said, let the JS just call the page with a probability of
something maybe less than 1/1000. But NullC gave me a great idea now:
I'll just let the JS call an empty text file and collect the stats from
the Apache logs. There's no more efficient way.
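
A rough sketch of what collecting the stats from the logs could look like. It assumes, beyond what is said here, that the JS requests a hypothetical /hit.txt with the page name in a title query parameter, and that the server writes Common Log Format lines.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Matches the request part of a Common Log Format line.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def count_titles(log_lines, path="/hit.txt"):
    """Count requests per page title found in the query string."""
    counts = Counter()
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if not m:
            continue  # not a GET request line we understand
        url = urlsplit(m.group(1))
        if url.path != path:
            continue  # some other file on the same server
        titles = parse_qs(url.query).get("title")
        if titles:
            counts[titles[0]] += 1
    return counts
```

Usage would be something like count_titles(open("access.log")), run from cron over each rotated log file.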

Leon

Gregory Maxwell | 7 Jul 2006 21:52

Re: fwd: wikicnt_daemon.pl

On 7/7/06, Leon Weber <leon.weber <at> leonweber.de> wrote:
> Yeah. I wanted to create these stats for dewiki, so just like
> duesentrieb said, let the js just call the page with a probability of
> somethin maybe less than 1/1000. But NullC gave me a great idea now:
> I'll just let the js call an empty text file and collect the stats from
> the apachelogs. There's no more efficient way.

To clarify, IFF you disabled logging and you wrote an apache module to
maintain a counter in shared memory using an efficient data structure
(perhaps a Judy array on page titles) and worked out the locking
issues... it would be faster.

However, apache logging is async and append only.  It's the simplest
form of writing that could happen, and if we moved the logs into tmpfs
it would likely be darn close to optimal.

Although toolserver is disk bound, what is killing us is random seeks
(see iostat, we are constantly pegged at 350-400 TPS but moving less
than 6MB/sec)... so I would not expect many problems from an async and
append only writer because all of its activity will be mostly
sequential.

Marco Schuster | 7 Jul 2006 21:58

Re: fwd: wikicnt_daemon.pl

Leon Weber wrote:
> Daniel Kinzler wrote:
> 
>>>Shall we drop the pedantic pissing about and get back to the real issue, then?
>>>    
>>
>>Yes, please.
>>
>>I talked to Leon about ways to make hit counters feasible, for all
>>projects. The core points are:
>>[...]
> 
> Yeah. I wanted to create these stats for dewiki, so just like
> duesentrieb said, let the js just call the page with a probability of
> somethin maybe less than 1/1000. But NullC gave me a great idea now:
> I'll just let the js call an empty text file and collect the stats from
> the apachelogs. There's no more efficient way.
The problem is: how do you get the article name from the text file?
Anyway, it would be a nice idea to put webalizer or another stats tool on 
the toolserver.

Greets,
Marco

By the way, Leon had this topic on CC with his email address; I removed him 
so he doesn't get duplicate mails.

Edward Chernenko | 8 Jul 2006 05:08

Re: fwd: wikicnt_daemon.pl

2006/7/7, Daniel Kinzler <daniel <at> brightbyte.de>:
>
> I talked to Leon about ways to make hit counters feasible, for all
> projects. The core points are:
>
> * Just like Edward did, use JS code trigger an HTTP request on page
> views. But this should be throttled to a probability of 1% - or, for
> large projects, 0.1%. This should still give us usable stats for the
> most popular pages.

The ruwiki TOP100 script shows only about 300 hits for the last (#100) place,
so it's better to handle at least 5-10% of all requests.
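
To put rough numbers on this: if the 300 hits are treated as the true view count in the measurement window (an assumption) and each view is sampled independently, the sampled count is binomial and the scaled-up estimate has the relative error computed below. The function name is made up for this sketch.

```python
import math


def relative_std_error(true_hits, rate):
    # Sampled hits ~ Binomial(true_hits, rate). The estimate sampled/rate
    # then has relative standard deviation sqrt((1 - rate) / (true_hits * rate)).
    return math.sqrt((1 - rate) / (true_hits * rate))


for rate in (0.01, 0.05, 0.10):
    print(f"rate={rate:.2f}  expected samples={300 * rate:.0f}  "
          f"relative error={relative_std_error(300, rate):.0%}")
```

At 1% a page with 300 views yields about 3 samples and a relative error near 57%; at 5% and 10% that drops to roughly 25% and 17%.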

There's another optimization on the client side: my counter filtered out any
requests for history, diffs, pages outside the article namespace, etc. This
should be added to the JS script (sorry, I can't do so right now because
I have no sysop rights).
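
The filter logic, sketched in Python for illustration only (the real check would live in the JS). It relies on the usual MediaWiki URL parameters action, diff and oldid and on the /wiki/ path layout; the namespace prefix list is a hypothetical English sample, and ruwiki would need its localized names.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical, incomplete list of non-article namespace prefixes.
NON_ARTICLE_PREFIXES = ("Talk:", "User:", "Special:", "Template:", "Category:")


def is_countable_view(url):
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # History, edit and similar actions are not plain page views.
    if query.get("action", ["view"])[0] != "view":
        return False
    # Diffs and old revisions are skipped as well.
    if "diff" in query or "oldid" in query:
        return False
    if "title" in query:
        title = query["title"][0]
    elif parts.path.startswith("/wiki/"):
        title = parts.path[len("/wiki/"):]
    else:
        return False  # unknown URL layout: do not count
    return not title.startswith(NON_ARTICLE_PREFIXES)
```

Doing this check before sending the request saves the round trip entirely, which is cheaper than discarding the hit on the server.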

>
> * Just like Edward, use a persistent server, not cgi/php. To avoid
> exposing home brewen hacks to the wild web, we should stick to something
> tried and true. I suggested to implement it as a Java servlet. Should be
> fairly straight forward, and we have Tomcat running anyway.
Please see source:
 http://tools.wikimedia.de/~edwardspec/src/wikicnt_daemon.pl
This is written in Perl. Also, anything strange in the HTTP connection
results in the connection being dropped without an answer. The only potential
security problem here is reading the request line with
	my $req = <$c>;
(there is no check for long lines - this is not fatal for Perl but might take
some memory)
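
For comparison, a bounded read of the request line could look like the Python sketch below. This is not the daemon's code; the 4096-byte limit and the function name are assumptions.

```python
import io

MAX_REQUEST_LINE = 4096  # assumed limit; choose whatever fits expected URLs


def read_request_line(stream, limit=MAX_REQUEST_LINE):
    # readline(n) returns at most n bytes, so a client cannot make the
    # server buffer an arbitrarily long line.
    line = stream.readline(limit + 1)
    if len(line) > limit or not line.endswith(b"\n"):
        return None  # too long, or the connection closed mid-line
    return line.rstrip(b"\r\n").decode("latin-1")


# A well-formed request line is returned without its line ending.
print(read_request_line(io.BytesIO(b"GET /cnt?id=1 HTTP/1.0\r\n")))
```

With a real socket the stream would come from sock.makefile("rb"); a None result means the connection should simply be closed.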

>
> * To get around latency issues with the database, don't spawn (cause
> more load on the already troubled DB); instead, cache updates in RAM for
> a minute or so, the flush the into the db in a single insert.
There's another problem. We need to save disk space too (it seems the
default MediaWiki counter was disabled because it consumes too much
space - 4*12000*60*60*24 = 4147200000 bytes = 3955 MB each day).

I used UPDATE statements instead. Yes, this is worse (for example,
INSERT can be optimized by writing into a text file and applying it with
LOAD DATA LOCAL INFILE) but the database can't become larger than 6 MB
(for ruwiki with 900000 articles).
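
The two storage figures can be checked with a few lines of Python. Reading the first formula as 4 bytes per hit at 12000 hits per second is an assumption; the message gives only the numbers.

```python
SECONDS_PER_DAY = 60 * 60 * 24

# One stored value per hit (assumed: 4 bytes each, 12000 hits per second).
per_hit_bytes = 4 * 12000 * SECONDS_PER_DAY
per_hit_mib = per_hit_bytes / 2**20

# One counter row per article: 6 MB spread over 900000 articles.
bytes_per_article = 6 * 2**20 / 900000

print(per_hit_bytes, round(per_hit_mib), round(bytes_per_article, 1))
# prints: 4147200000 3955 7.0
```

So per-hit storage grows by about 4 GB a day, while a per-article counter table stays at about 7 bytes per article however many hits arrive.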

>
> * Edward used a lot of ram for a name -> id mapping. This should be
> avoided - the name is unique, we don't need the page ID. If we want the
> ID, it should be determined on the wikipedia server and supplied with
> the request - I talked to Tim Starling about making this and other
> useful things available as JS variables.
It's much more efficient to store IDs (they are smaller and always
fixed-size). But actually this was requested by ruwiki users later in
order to keep the counter value after _renaming_ an article. With titles,
this could be lost.

Now this is not a problem: the small copy of the database (title as key
and ID as value) was moved into a GDBM file and the in-memory cache is now
disabled. The database copy is updated daily at 00:00 (this takes 5-7 seconds)
and takes 14 MB of disk space.

-- 
Edward Chernenko <edwardspec <at> gmail.com>

