Felipe Ortega | 6 Jun 2012 20:22

Problems with frwiki dumps

Hello.

I'm finding strange issues when trying to decompress the 7z version of this dump for the French Wikipedia:

http://dumps.wikimedia.org/frwiki/20120430/

At some point around 3M revisions the 7z process stalls. After a long time (a few hours) it resumes normal
execution, but then stalls again around 55M revisions and never recovers.

Maybe there are some issues with the frwiki dumps, since I can see that subsequent processes are experiencing
failures (in May and June).

I'm now checking with the previous dump (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you
know in case I find any more problems.

Best,
Felipe.

Platonides | 7 Jun 2012 18:52

Re: Problems with frwiki dumps

On 06/06/12 20:22, Felipe Ortega wrote:
> Hello.
> 
> I'm finding strange issues when trying to decompress the 7z version of this dump for the French Wikipedia:
> 
> http://dumps.wikimedia.org/frwiki/20120430/
> 
> At some point around 3M revisions the 7z process stalls. After a long time (a few hours) it resumes normal
> execution, but then stalls again around 55M revisions and never recovers.
> 
> Maybe there are some issues with the frwiki dumps, since I can see that subsequent processes are
> experiencing failures (in May and June).
> 
> I'm now checking with the previous dump (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you
> know in case I find any more problems.
> 
> Best,
> Felipe.

It apparently decompresses ok.
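> time md5sum frwiki-20120430-pages-meta-history.xml.7z && ( time 7z e -so frwiki-20120430-pages-meta-history.xml.7z > /dev/null )
> 78eda06a57ea738a2e21697e31e52128  frwiki-20120430-pages-meta-history.xml.7z
>
> real    25m55.503s
> user    0m28.549s
> sys     0m19.489s
>
> 7-Zip 4.55 beta  Copyright (c) 1999-2007 Igor Pavlov  2007-09-05
> p7zip Version 4.55 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
>
> Processing archive: frwiki-20120430-pages-meta-history.xml.7z
>
> Extracting  frwiki-20120430-pages-meta-history.xml
>
> Everything is Ok
>
> Total:
> Folders: 0
> Files: 1
> Size: 1249323572065
> Compressed: 7526979951
>
> real    163m59.124s
> user    138m30.290s
> sys     0m29.328s

The content might be completely bogus, though. It'd need further checks.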

Felipe Ortega | 7 Jun 2012 20:45

Re: Problems with frwiki dumps

> From: Platonides <platonides@...>
> To: Felipe Ortega <glimmer_phoenix@...>
> CC: "xmldatadumps-l@..." <xmldatadumps-l@...org>
> Sent: Thursday, 7 June 2012 18:52
> Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
> 
> On 06/06/12 20:22, Felipe Ortega wrote:
>>  Hello.
>> 
>>  I'm finding strange issues when trying to decompress the 7z version of
>>  this dump for the French Wikipedia:
>> 
>>  http://dumps.wikimedia.org/frwiki/20120430/
>> 
>>  At some point around 3M revisions the 7z process stalls. After a long time
>>  (a few hours) it resumes normal execution, but then stalls again around 55M
>>  revisions and never recovers.
>> 
>>  Maybe there are some issues with the frwiki dumps, since I can see that
>>  subsequent processes are experiencing failures (in May and June).
>> 
>>  I'm now checking with the previous dump
>>  (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you know in case I
>>  find any more problems.
>> 
>>  Best,
>>  Felipe.
> 
> 
> It apparently decompresses ok.

John | 7 Jun 2012 21:22

Re: Problems with frwiki dumps

I'm running a basic check (parsing every page in the wiki) using the toolserver's 6-1 copy. I'll let you know if I see any issues.

John

On Thu, Jun 7, 2012 at 2:45 PM, Felipe Ortega <glimmer_phoenix@...> wrote:
> From: Platonides <platonides@...>
> To: Felipe Ortega <glimmer_phoenix@...>
> CC: "xmldatadumps-l@..." <xmldatadumps-l@...>
> Sent: Thursday, 7 June 2012 18:52
> Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
>
> On 06/06/12 20:22, Felipe Ortega wrote:
>>  Hello.
>>
>>  I'm finding strange issues when trying to decompress the 7z version of
>>  this dump for the French Wikipedia:
>>
>>  http://dumps.wikimedia.org/frwiki/20120430/
>>
>>  At some point around 3M revisions the 7z process stalls. After a long time
>>  (a few hours) it resumes normal execution, but then stalls again around 55M
>>  revisions and never recovers.
>>
>>  Maybe there are some issues with the frwiki dumps, since I can see that
>>  subsequent processes are experiencing failures (in May and June).
>>
>>  I'm now checking with the previous dump
>>  (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you know in case I
>>  find any more problems.
>>
>>  Best,
>>  Felipe.
>
>
> It apparently decompresses ok.
>>  time md5sum frwiki-20120430-pages-meta-history.xml.7z && ( time 7z e -so frwiki-20120430-pages-meta-history.xml.7z > /dev/null )
>>  78eda06a57ea738a2e21697e31e52128  frwiki-20120430-pages-meta-history.xml.7z
>>
>>  real    25m55.503s
>>  user    0m28.549s
>>  sys     0m19.489s
>>
>>  7-Zip 4.55 beta  Copyright (c) 1999-2007 Igor Pavlov  2007-09-05
>>  p7zip Version 4.55 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
>>
>>  Processing archive: frwiki-20120430-pages-meta-history.xml.7z
>>
>>  Extracting  frwiki-20120430-pages-meta-history.xml
>>
>>  Everything is Ok
>>

Thanks, Platonides.

That's strange. Then it might be something related to process scheduling (on Ubuntu Server 12.04), but I haven't had any issues with other languages (including the many files in English).

So, the last alternative would be to decompress it first and then parse the XML (I see the size is ~125 GB).

Best,
Felipe.

>>
>>  Total:
>>  Folders: 0
>>  Files: 1
>>  Size: 1249323572065
>>  Compressed: 7526979951
>>
>>  real    163m59.124s
>>  user    138m30.290s
>>  sys     0m29.328s
>
>
> The content might be completely bogus, though. It'd need further checks.
> ----- Original message -----


_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l <at> lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

Felipe Ortega | 7 Jun 2012 22:22

Re: Problems with frwiki dumps

>________________________________
> From: John <phoenixoverride@...>
> To: Felipe Ortega <glimmer_phoenix@...>
> CC: Platonides <platonides@...>; "xmldatadumps-l@..." <xmldatadumps-l@...>
> Sent: Thursday, 7 June 2012 21:22
> Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
>
>
>I'm running a basic check (parsing every page in the wiki) using the toolserver's 6-1 copy. I'll let you know
>if I see any issues.
>

Thanks, John. It will be very useful. I'm getting a fresh copy of the frwiki 20120430 dump and I will try to
decompress it first and then parse the plain XML.

In any case, it is also a good chance to check for possible delays caused by piping the output of 7z e -so,
rather than reading the plain file directly.

Felipe.
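
The pipe-versus-plain-file comparison mentioned above could be sketched roughly as follows. This is only an illustration, not what anyone in the thread actually ran: the function names (`scan_stream`, `scan_7z_pipe`, `scan_plain_file`) are hypothetical, and it assumes revisions can be counted by looking for the literal `<revision>` tag in the XML. It reads the stream in chunks, counts bytes and revisions, and records the longest gap between two reads, which is where a stall like the one described would show up.

```python
import io
import subprocess
import time


def scan_stream(stream, chunk_size=1 << 20, clock=time.monotonic):
    """Read a binary stream to the end.

    Returns (total_bytes, revisions, elapsed_seconds, longest_gap_seconds).
    """
    tag = b"<revision>"          # assumption: revisions open with this exact tag
    total = 0
    revisions = 0
    longest_gap = 0.0
    tail = b""                   # carry-over so a tag split across chunks is still counted
    start = last = clock()
    while True:
        chunk = stream.read(chunk_size)
        now = clock()
        longest_gap = max(longest_gap, now - last)   # time spent waiting for this read
        last = now
        if not chunk:
            break
        total += len(chunk)
        data = tail + chunk
        revisions += data.count(tag)
        tail = data[-(len(tag) - 1):]                # too short to hold a whole tag
    return total, revisions, last - start, longest_gap


def scan_7z_pipe(archive):
    """Scan the output of `7z e -so <archive>`, the command used in this thread."""
    with subprocess.Popen(["7z", "e", "-so", archive], stdout=subprocess.PIPE) as proc:
        return scan_stream(proc.stdout)


def scan_plain_file(path):
    """Scan an already decompressed XML file."""
    with open(path, "rb") as f:
        return scan_stream(f)
```

Running `scan_7z_pipe("frwiki-20120430-pages-meta-history.xml.7z")` and `scan_plain_file(...)` on the extracted XML and comparing the elapsed time and the longest gap would give a first idea of whether the pipe itself adds delays.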


John | 7 Jun 2012 23:26

Re: Problems with frwiki dumps

I needed to change the dump I was working on (I was only doing the current-version dump), but you can track the status of the parse at http://tinyurl.com/frstatus

Felipe Ortega | 8 Jun 2012 00:13

Re: Problems with frwiki dumps

>________________________________
> From: John <phoenixoverride@...>
> To: Felipe Ortega <glimmer_phoenix@...>
> CC: Platonides <platonides@...>; "xmldatadumps-l@..." <xmldatadumps-l@...>
> Sent: Thursday, 7 June 2012 23:26
> Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
>
>
>I needed to change the dump I was working on (I was only doing the current-version dump), but you can track the
>status of the parse at http://tinyurl.com/frstatus
>

Thanks for your help, John. I really appreciate it. It looks like the new copy is working fine, now. I will
check again tomorrow morning. It should finish overnight.

Best,
Felipe.

>
>

Platonides | 8 Jun 2012 11:30

Re: Problems with frwiki dumps

On 08/06/12 00:13, Felipe Ortega wrote:
> 
> Thanks for your help, John. I really appreciate it. It looks like the new copy is working fine, now. I will
> check again tomorrow morning. It should finish overnight.
> 
> Best,
> Felipe.

$ time 7z e -so frwiki-20120430-pages-meta-history.xml.7z | sha256sum
7-Zip 4.55 beta  Copyright (c) 1999-2007 Igor Pavlov  2007-09-05
p7zip Version 4.55 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: frwiki-20120430-pages-meta-history.xml.7z

Extracting  frwiki-20120430-pages-meta-history.xml

Everything is Ok

Total:
Folders: 0
Files: 1
Size: 1249323572065
Compressed: 7526979951
5349850c5e9a7c03f0dee071a9143660a7d12847948bcb0b1564060c8a21e8c0  -

real    436m29.562s
user    357m41.568s
sys     49m58.540s
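
For anyone who wants to repeat the check above from inside a parser rather than with sha256sum, here is a minimal sketch. The helper names (`sha256_of_stream`, `sha256_of_7z_output`) are made up for illustration; the only external command assumed is the same `7z e -so` call shown in the transcript. It hashes the decompressed stream chunk by chunk and also returns the byte count, so the result can be compared both with the digest above and with the "Size:" value reported by 7z.

```python
import hashlib
import io
import subprocess


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Return (hex_digest, total_bytes) for a binary stream, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    # iter() with a sentinel keeps calling read() until it returns b"" (end of stream)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
        size += len(chunk)
    return digest.hexdigest(), size


def sha256_of_7z_output(archive):
    """Hash whatever `7z e -so <archive>` writes to stdout, without storing it on disk."""
    with subprocess.Popen(["7z", "e", "-so", archive], stdout=subprocess.PIPE) as proc:
        return sha256_of_stream(proc.stdout)
```

Because nothing is written to disk, this avoids needing more than a terabyte of free space just to verify the archive.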

Felipe Ortega | 7 Jun 2012 23:47

Re: Problems with frwiki dumps


----- Original message -----
> From: Felipe Ortega <glimmer_phoenix@...>
> To: Platonides <platonides@...>
> CC: "xmldatadumps-l@..." <xmldatadumps-l@...org>
> Sent: Thursday, 7 June 2012 20:45
> Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
> 
>>  From: Platonides <platonides@...>
>>  To: Felipe Ortega <glimmer_phoenix@...>
>>  CC: "xmldatadumps-l@..." <xmldatadumps-l@...>
>>  Sent: Thursday, 7 June 2012 18:52
>>  Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
>> 
>>> 
>>>   Total:
>>>   Folders: 0
>>>   Files: 1
>>>   Size: 1249323572065
>>>   Compressed: 7526979951
>>> 
>>>   real    163m59.124s
>>>   user    138m30.290s
>>>   sys     0m29.328s

Oops. Not so fast. It's 1.25 TB. Ok, trying again from compressed file.

Felipe.
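
The correction above is easy to double-check with a few lines of arithmetic. The sketch below only uses the two figures from the 7z output quoted in this thread ("Size" and "Compressed"); `format_si` is a hypothetical helper written for this note and uses decimal units (1 TB = 10^12 bytes).

```python
SIZE = 1249323572065        # "Size:" line from the 7z output
COMPRESSED = 7526979951     # "Compressed:" line from the 7z output


def format_si(num_bytes):
    """Format a byte count with decimal (power-of-1000) units."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


ratio = SIZE / COMPRESSED   # how many times larger the XML is than the archive

print(format_si(SIZE))          # uncompressed XML
print(format_si(COMPRESSED))    # 7z archive
print(f"compression ratio about {ratio:.0f}:1")
```

The 13-digit size is easy to misread by one digit, which is how ~125 GB and 1.25 TB got confused; the ratio also shows why decompressing to disk first needs far more space than the archive suggests.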

>> 
>> 
>>  The content might be completely bogus, though. It'd need further checks.
>>  ----- Original message -----
> 

