Ariel T. Glenn | 2 Apr 2012 10:08

uploaded media for WMF projects available via rsync

This is phase one of a plan to make uploaded media from WMF projects
accessible for download in bulk.  It, like many other things lately, is
experimental and subject to breakage, change, etc.

First, a big thanks to Kevin Day from Your.org who offered us the space
and worked with us many hours to sort out networking issues, try
different NAS setups, and generally do what was needed to get this
going. 

Rsync url: ftpmirror.your.org::wikimedia-images/projectname/languagecode

For example: 

rsync -a ftpmirror.yours.org::wikimedia-images/wikipedia/commons /my/dir

would get you all of commons including archived versions (no deleted
images of course).

Folks who are trying to download media for a specific project should
bear in mind that they will need the files not only from that project
but also those which are hosted on commons and used on the local
project.  I'm looking into producing lists of those files for easy use
by rsyncers.  

I would suggest rather than everyone downloading a zillion copies of
commons at once, that folks coordinate a little bit, or just get the
pieces they need :-D

The data that is there now is probably about 15-20 days old.  It will
likely be a little while before I get the media rsync going on a regular
[...]

Jérémie Roquet | 2 Apr 2012 11:13

Re: uploaded media for WMF projects available via rsync

Hi,

2012/4/2 Ariel T. Glenn <ariel <at> wikimedia.org>:
> This is phase one of a plan to make uploaded media from WMF projects
> accessible for download in bulk.  It, like many other things lately, is
> experimental and subject to breakage, change, etc.
>
> First, a big thanks to Kevin Day from Your.org who offered us the space
> and worked with us many hours to sort out networking issues, try
> different NAS setups, and generally do what was needed to get this
> going.
>
> Rsync url: ftpmirror.your.org::wikimedia-images/projectname/languagecode
>
> For example:
>
> rsync -a ftpmirror.yours.org::wikimedia-images/wikipedia/commons /my/dir
>
> would get you all of commons including archived versions (no deleted
> images of course).

That's awesome news! Thank you all.

Best regards,

-- 
Jérémie

Erik Moeller | 2 Apr 2012 19:07

Re: uploaded media for WMF projects available via rsync

On Mon, Apr 2, 2012 at 1:08 AM, Ariel T. Glenn <ariel@...> wrote:
> rsync -a ftpmirror.yours.org::wikimedia-images/wikipedia/commons /my/dir

(Typo: your.org not yours.org - bet you did that intentionally to
throttle us ;-)

Great work, thanks for making this happen. :-) Are we talking to
archive.org as well about setting up a copy there?

All best,
Erik
-- 
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate

Ariel T. Glenn | 2 Apr 2012 19:19

Re: uploaded media for WMF projects available via rsync

On Mon, 02-04-2012 at 10:07 -0700, Erik Moeller wrote:
> On Mon, Apr 2, 2012 at 1:08 AM, Ariel T. Glenn <ariel <at> wikimedia.org> wrote:
> > rsync -a ftpmirror.yours.org::wikimedia-images/wikipedia/commons /my/dir
> 
> (Typo: your.org not yours.org - bet you did that intentionally to
> throttle us ;-)
> 
> Great work, thanks for making this happen. :-) Are we talking to
> archive.org as well about setting up a copy there?

And I proofread that three times and took an s out of the name somewhere
else in that email too, rats! :-P

No, I'm not talking to archive.org at the moment.  I have too many irons
in the fire at this point to open that discussion (including the
half-started project of automated uploads of dumps to archive.org at 6
month intervals).

This is really phase 0.1, there's a lot more to be done to make this
generally usable and to keep it regularly updated.

Ariel

_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l <at> lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

Ariel T. Glenn | 2 Apr 2012 19:56

Re: uploaded media for WMF projects available via rsync

On Mon, 02-04-2012 at 20:19 +0300, Ariel T. Glenn wrote:

> No, I'm not talking to archive.org at the moment.  I have too many irons
> in the fire at this point to open that discussion (including the
> half-started project of automated uploads of dumps to archive.org at 6
> month intervals).
> 

And already I'm a liar... someone just passed me a name at IA and a
rumor, so I've left a message for my contacts there to see what's up.

A. 

Alex Buie | 2 Apr 2012 21:38

Re: uploaded media for WMF projects available via rsync

Hey guys,

Sorry for breaking the thread, but I just subscribed, so I think
this'll probably break mailman's threading headers.

This is very exciting news, and IA would love to have a copy! We're
more interested in being a historical mirror (on our item
infrastructure), rather than a live rsync/http/ftp mirror, but perhaps
we can also work something out mirroring the latest dumps. (How big
are the last 2 or so?)

I suppose the next step is for me and Ariel to talk about technical
procedures and details, et cetera, but I just wanted to subscribe to
this ml and introduce myself.

Ariel, when you have a minute to chat, shoot me an email (or skype).
I'm thinking we just pull things at whatever frequency you guys push
out the data to your.org (which may or may not be scheduled yet) and
throw them into new items on the cluster.

Others' thoughts are, of course, always welcome.

Thanks!

Alex Buie
Collections Group
Internet Archive, a registered California non-profit library
abuie@...

Alex Buie | 2 Apr 2012 21:42

Re: uploaded media for WMF projects available via rsync

Of course, right after I sent this, I got pointed here:
https://meta.wikimedia.org/wiki/Mailing_lists#Using_digests

Sorry 'bout that, heh.

Ariel T. Glenn | 2 Apr 2012 21:55

Re: uploaded media for WMF projects available via rsync

For space requirements etc for the xml dumps, see:
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
(these figures are pretty up to date).

For "dumps" of images, we have no such thing; this rsync mirror is the
first thing out of the gate and we can't possibly generate multiple
copies of it on different dates as we do for the xml dumps.

Come find me on irc (wikimedia-tech on freenode) or send me an email off
list and we can talk through the technical end of things.

I'm already working on automating copying up the xml archives to you
guys at 6-month intervals or so.

Ariel

Alex Buie | 2 Apr 2012 22:20

Re: uploaded media for WMF projects available via rsync

Ok, great. I'm cycling right now but I'll /join when I get back.

As far as image "dumps", I was unclear. I just meant the rsync mirror. We'll talk more later!

Thanks again.

Alex

Platonides | 2 Apr 2012 23:43

Re: uploaded media for WMF projects available via rsync

On 02/04/12 21:55, Ariel T. Glenn wrote:
> For "dumps" of images, we have no such thing; this rsync mirror is the
> first thing out of the gate and we can't possibly generate multiple
> copies of it on different dates as we do for the xml dumps.

That's not too hard to do. You just copy the image tree with hardlinks,
making a version. Then the next rsync will only replace modified images.
(Unless you manually added the --inplace parameter, but in that case you
presumably know what you're doing.)
You could also use --link-dest instead of manually building hardlink copies.
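
The hardlink copy described above can be sketched in Python. This is only an illustration under stated assumptions: the function name snapshot_with_hardlinks and the directory layout are made up here, and both trees must live on the same filesystem for hard links to work. On a GNU system, cp -al does the same job from the command line.

```python
import os


def snapshot_with_hardlinks(src_root, dst_root):
    """Mirror src_root into dst_root, hard-linking every file (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # os.link adds a second directory entry for the same inode,
            # so the snapshot uses no extra space for file contents.
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))
```

The snapshot stays frozen because rsync, by default, writes each updated file to a temporary name and renames it into place, which leaves the old inode (and the snapshot's link to it) untouched. With --inplace the shared inode would be overwritten and the snapshot would change too.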

Alex Buie | 3 Apr 2012 00:06

Re: uploaded media for WMF projects available via rsync

I also have a little script that takes an arbitrary begin date and uses
rsync's --list-only output to generate a list to feed to --files-from,
so I can pack the files into daily or weekly sized units, which would
be less work for Ariel and WMF.
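
A script of that kind might look roughly like the sketch below. It is not Alex's script: the function name files_since is invented, and it assumes each line of the --list-only output has the shape "permissions size YYYY/MM/DD HH:MM:SS path", which should be checked against the rsync version in use. Note also that the date in the listing is the file's modification time on the mirror, which is not necessarily the upload date.

```python
from datetime import datetime


def files_since(listing_lines, begin):
    """Return paths of regular files modified at or after `begin` (hypothetical helper)."""
    selected = []
    for line in listing_lines:
        # Split into at most five fields so that spaces inside the path survive.
        parts = line.rstrip("\n").split(None, 4)
        if len(parts) != 5 or not parts[0].startswith("-"):
            continue  # skip directories, symlinks, and lines that do not parse
        _perms, _size, date, time, path = parts
        try:
            mtime = datetime.strptime(date + " " + time, "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # not a listing line in the assumed format
        if mtime >= begin:
            selected.append(path)
    return selected
```

The returned paths could then be written one per line to a file and passed to rsync with --files-from, using the same source directory that produced the listing.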

Ariel T. Glenn | 3 Apr 2012 08:25

Re: uploaded media for WMF projects available via rsync

On Mon, 02-04-2012 at 23:43 +0200, Platonides wrote:
> On 02/04/12 21:55, Ariel T. Glenn wrote:
> > For "dumps" of images, we have no such thing; this rsync mirror is the
> > first thing out of the gate and we can't possibly generate multiple
> > copies of it on different dates as we do for the xml dumps.
> 
> That's not too hard to do. You just copy the image tree with hardlinks,
> making a version. Then the next rsync will only replace modified images.
> (Unless you manually added the --inplace parameter, but in such case you
> supposedly know what you're doing)
> You could also use --link-dest instead of manually building hardlink copies.

Yes, that would work fine under the existing setup; what I don't know
and what needs to be figured out is what we will do when images are
moved into swift, Real Soon Now.

Ariel

emijrp | 3 Apr 2012 19:55

Re: uploaded media for WMF projects available via rsync

Does the rsync mirror allow downloading Commons images by date? I mean, in day-by-day packages. That was the method we wanted to use at WikiTeam to archive Wikimedia Commons.

Ariel T. Glenn | 3 Apr 2012 20:32

Re: uploaded media for WMF projects available via rsync

Not as such.  However I intend to produce a list of images per project
blah blah (some handwaving here) with the date of last upload, and
excerpts from this could be used as input to rsync.

Ariel

Alex Buie | 3 Apr 2012 20:56

Re: uploaded media for WMF projects available via rsync

No, they don't currently, but I'm working with the your.org guys to
get a copy of the mirror mounted as NFS, and then I should be able to
combine that with the XML dumps for the images to process media from
certain time periods.

I wonder how well python's lxml handles multigigabyte XML files...
Guess we'll see :)

Alex

emijrp | 6 Apr 2012 10:15

Re: uploaded media for WMF projects available via rsync

2012/4/3 Alex Buie <abuie@...>:
> No, they don't currently, but I'm working with the your.org guys to
> get a copy of the mirror mounted as NFS, and then I should be able to
> combine that with the XML dumps for the images to process media from
> certain time periods.
>
> I wonder how well python's lxml handles multigigabyte XML files...
> Guess we'll see :)

Pywikipediabot uses cElementTree for Python, which is fast as hell.

fox | 6 Apr 2012 10:43

Re: uploaded media for WMF projects available via rsync

On 06/04/2012 10:15, emijrp wrote:
> 2012/4/3 Alex Buie <abuie@... <mailto:abuie@...>>
>     I wonder how well python's lxml handles multigigabyte XML files...
>     Guess we'll see :)
> 
> 
> Pywikipediabot uses cElementTree for Python, which is fast as hell.

We've been using cElementTree for a long time in wiki-network
(https://github.com/volpino/wiki-network), a suite of scripts to analyze
dumps of Wikipedia, in particular for social network analysis purposes.
It's really fast even on huge dumps, like enwiki-pages-meta-history.

It's open source so you are welcome to use it and contribute to the project!
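
For readers on a current Python: cElementTree was folded into the standard library module, and since Python 3.3 xml.etree.ElementTree uses the C accelerator automatically (the separate cElementTree name was removed in 3.9). A minimal sketch of streaming through a large dump with it follows. The function name count_pages is made up for illustration, and the only assumption about the dump is that it contains page elements, possibly inside an XML namespace.

```python
import xml.etree.ElementTree as ET


def count_pages(source):
    """Stream through a dump and count <page> elements without building the whole tree."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # With a namespace declared, tags look like "{uri}page";
        # keep only the local name before comparing.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            count += 1
            # Free the page's children; the emptied element itself
            # stays attached to the root.
            elem.clear()
    return count
```

The source argument can be a filename or an open binary file object, so a compressed dump could be read through bz2.open or gzip.open without unpacking it first.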

-- 
f.

  "Always code as if the guy who ends up maintaining your code will be a
   violent psychopath who knows where you live."
  (Martin Golding)

()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

http://about.me/fox91

Alex Buie | 6 Apr 2012 16:01

Re: uploaded media for WMF projects available via rsync

Excellent, thanks guys. I'm assuming that I shouldn't have to worry about malformed XML (hopefully, haha), which makes it even easier/faster.

Alex

fox | 6 Apr 2012 20:16

Re: uploaded media for WMF projects available via rsync

On 06/04/2012 16:01, Alex Buie wrote:
>   Excellent, thanks guys. I'm assuming that I shouldn't have to worry
> about malformed xml (hopefully, haha), which makes it even easier/faster.

The dumps are well-formed XML of course; the problem is that the tags
are not always in the same order, nor are the revisions always in
chronological order... and of course the revision text is a real mess!

I suggest you have a look at our library and start by using it to
build simple scripts. It's really easy! All you have to do is
write a method for every tag, named process_<tag> (e.g. process_title for
the title tag). Have a look at
https://github.com/volpino/wiki-network/blob/master/revisions_page.py
for an example: it's a simple script that takes a pages-meta-history
dump and extracts the revisions of a specific set of pages to a CSV file.
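
The one-method-per-tag idea can be illustrated with the standard library alone. The sketch below is not wiki-network's actual API: the class names TagDispatcher and TitleCollector are hypothetical, and the dispatch rule (look up a method called process_ plus the tag's local name) is only a guess at the general pattern described above.

```python
import xml.etree.ElementTree as ET


class TagDispatcher:
    """Call self.process_<tag>(elem) for each closed element that has a handler."""

    def run(self, source):
        for _event, elem in ET.iterparse(source, events=("end",)):
            # Drop any "{namespace}" prefix to get the plain tag name.
            local_name = elem.tag.rsplit("}", 1)[-1]
            handler = getattr(self, "process_" + local_name, None)
            if handler is not None:
                handler(elem)


class TitleCollector(TagDispatcher):
    """Example subclass: remember the text of every <title> element."""

    def __init__(self):
        self.titles = []

    def process_title(self, elem):
        self.titles.append(elem.text)
```

A subclass only defines handlers for the tags it cares about; every other element is parsed and ignored.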

Feel free to write to me for more information ;)

-- 
f.

  "Always code as if the guy who ends up maintaining your code will be a
   violent psychopath who knows where you live."
  (Martin Golding)

()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

http://about.me/fox91

