james | 1 Jun 14:37 2005

image dump

just finished downloading the en image dump and tried extracting it with 
winrar, only to get an error: 'the archive is corrupt'. a different extractor 
does open it, but it doesn't preserve the folders, and there only seem to be 
9200 images in the dump. there could well be more, as i haven't extracted it 
all because of this folder problem.

does anyone know of an extractor that keeps the folder structure intact?

once that is extracted, how will my wiki know where the images are for each 
article, and thus include them within each article?

thanks
Rowan Collins | 1 Jun 17:41 2005

Re: image dump

On 01/06/05, james <jamessampford <at> supanet.com> wrote:
> just finished downloading the en image dump, tried extracting it with winrar -

> does anyone know of an extractor that keeps the folder structure intact?

http://download.wikimedia.org/images/README_ABOUT_FILE_FORMAT.txt
explains how to extract them correctly on a *NIX system, so one way
would be to get Cygwin up and running. A quick search also turned up
a Windows port of "libarchive", which seems like it may include a
compatible "tar" utility:
http://gnuwin32.sourceforge.net/packages/libarchive.htm
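
For readers without Cygwin, one cross-platform option is a short Python
script. This is only a sketch, not something suggested in the thread: it
assumes Python's standard tarfile module can read the dump's pax archive
(the module documents support for the POSIX.1-2001 pax format), and the
function name extract_dump and both path arguments are made up for
illustration.

```python
import tarfile
from pathlib import Path


def extract_dump(archive_path: str, dest_dir: str) -> int:
    """Extract a tar/pax archive into dest_dir, keeping its folder layout.

    Returns the number of regular files extracted.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    count = 0
    # Mode "r:*" lets tarfile detect the compression (none, gzip, bzip2).
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            target = (dest / member.name).resolve()
            # Skip entries that would land outside dest (for example "../x").
            if dest != target and dest not in target.parents:
                continue
            # Only directories and regular files are extracted in this sketch.
            if member.isdir() or member.isfile():
                archive.extract(member, path=str(dest))
                if member.isfile():
                    count += 1
    return count
```

A call would look like extract_dump("dump.tar", "images"), where both names
are placeholders for the downloaded archive and the target folder.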

> once that is extracted - how will my wiki know where the images are for each
> article, and thus include them within each article?

For that, you need to download the "image" and "imagelinks" database
tables from http://download.wikimedia.org/#en.wikipedia to go with
your "cur" dump (the "imagelinks" one could probably be rebuilt
programmatically, but that's likely to be slower than just downloading
it).
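
The idea of rebuilding "imagelinks" programmatically can be sketched roughly
as follows. This is an illustration, not MediaWiki's own code: it assumes
image references appear in page text as [[Image:Name.ext|...]], and that
names are normalised by replacing spaces with underscores and capitalising
the first letter. Images pulled in through templates would be missed, and
the function name image_targets is made up for this sketch.

```python
import re
from typing import List

# Matches [[Image:Name.ext]] and [[Image:Name.ext|options...]].
IMAGE_LINK = re.compile(r"\[\[\s*Image\s*:\s*([^\]\|]+)", re.IGNORECASE)


def image_targets(wikitext: str) -> List[str]:
    """Return the distinct image names referenced by a page's wikitext."""
    seen: List[str] = []
    for raw in IMAGE_LINK.findall(wikitext):
        # Assumed normalisation: spaces become underscores, first letter upper.
        name = raw.strip().replace(" ", "_")
        if not name:
            continue
        name = name[0].upper() + name[1:]
        if name not in seen:
            seen.append(name)
    return seen
```

Running this over every page in the "cur" dump would give (page, image)
pairs, which is likely to be much slower than downloading the table.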

-- 
Rowan Collins BSc
[IMSoP]
Kate Turner | 2 Jun 01:03 2005

Re: image dump

Rowan Collins wrote in gmane.science.linguistics.wikipedia.technical:

> On 01/06/05, james <jamessampford <at> supanet.com> wrote:
>> just finished downloading the en image dump, tried extracting it with
>> winrar -

> http://download.wikimedia.org/images/README_ABOUT_FILE_FORMAT.txt
> mentions how to extract them correctly under *NIX system - so one way
> would be to get cygwin up and running. 

hmm. i completely forgot that people might want to extract the images under
non-Unix systems... :-(  the pax format is a POSIX standard, but it's not
widely used. it's not the same as GNU's tar format, although GNU tar can read
it. i guess most "multi-function" archive tools only understand the older
POSIX (ustar) and GNU tar formats.
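
As an illustration of how pax differs from the GNU approach: a pax archive
stores a long file name in an extended header (type 'x') made of records of
the form "LENGTH KEYWORD=VALUE\n", where LENGTH counts the whole record,
including its own digits. The helper below builds one such record; the
function name pax_record is made up for this sketch.

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Build one pax extended-header record: b"LENGTH KEYWORD=VALUE\\n"."""
    body = f" {keyword}={value}\n".encode("utf-8")
    length = len(body)
    # The length prefix counts its own digits, so iterate until it is stable.
    while True:
        total = len(body) + len(str(length))
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body


# pax_record("path", "abc") == b"12 path=abc\n"
```

Tools that only know the older header layout see the 'x' entry as an
ordinary member, which may be why they mangle the folder names.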

i don't have a Windows system here to test, but if someone wants to
recommend an easy way to extract pax archives under Windows, i'll include
it in the readme.  maybe we could distribute a standalone version of the
Cygwin binary?

> A quick search also turned up 
> this Windows port of a "libarchive" which seems like it may include a
> compatible "tar" utility -
> http://gnuwin32.sourceforge.net/packages/libarchive.htm

this looks fine.  i've added a link to it in the readme for now.

kate.

Timwi | 3 Jun 00:19 2005

Re: image dump

Kate Turner wrote:
> 
> hmm. i completely forgot that people might want to extract the images under
> non-Unix systems... :-(  the pax format is standard, but it's not widely
> used

Why do you have to use it if it's so poorly supported?
Kate Turner | 3 Jun 08:11 2005

Re: image dump

Timwi wrote in gmane.science.linguistics.wikipedia.technical:

> Kate Turner wrote:

>> the pax format is standard, but it's not widely used

> Why do you have to use it if it's so poorly supported?

i was unable to find any documentation on the file format that GNU tar uses
(see my previous messages to the list).  Zip, which is probably the most
widely supported archive format, was suggested as an alternative.  if pax
turns out to be too unwieldy, it may be worth using that instead.

kate.
Timwi | 3 Jun 11:00 2005

Re: image dump

Kate Turner wrote:
> 
> i was unable to find any documentation on the file format that GNU tar uses

http://www.gnu.org/software/tar/manual/html_mono/tar.html#SEC134 ?
Kate Turner | 3 Jun 11:25 2005

Re: image dump

Timwi wrote in gmane.science.linguistics.wikipedia.technical:

> Kate Turner wrote:

>> i was unable to find any documentation on the file format that GNU tar
>> uses

> http://www.gnu.org/software/tar/manual/html_mono/tar.html#SEC134 ?

this is what i looked at before, but i can't find the relevant part of the
description.  it says:

/* Identifies the *next* file on the tape as having a long name.  */
#define GNUTYPE_LONGNAME 'L'

but does not indicate how the long name should be encoded in the archive
header, unless i'm missing it somewhere...

kate.

(interestingly, the manual says that GNU tar will use pax format by default
in the future, although i suppose that does not solve the immediate
problem ;-)
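
One way to see how the long name is laid out without relying on the manual
is to let another implementation write a GNU-format archive and look at the
bytes. The sketch below uses Python's tarfile module for that. It assumes
tarfile's GNU output matches what GNU tar itself produces, which is not
verified here, and the function names are made up for illustration.

```python
import io
import tarfile


def gnu_archive_bytes(name: str, payload: bytes = b"") -> bytes:
    """Write one member in GNU format and return the raw archive bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.GNU_FORMAT) as archive:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def describe_long_name(raw: bytes) -> dict:
    """Decode the leading pseudo-member of an archive from gnu_archive_bytes.

    Offsets follow the classic 512-byte tar header: name at 0..99,
    size (octal text) at 124..135, typeflag at 156.
    """
    header = raw[:512]
    size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)
    return {
        "placeholder_name": header[:100].rstrip(b"\0").decode("ascii"),
        "typeflag": header[156:157].decode("ascii"),
        "size": size,
        # The long name is stored as this member's data, NUL-terminated.
        "long_name": raw[512:512 + size].rstrip(b"\0").decode("utf-8"),
    }
```

With this writer, a name longer than 100 bytes travels as the data of an
extra member: a header named "././@LongLink" with type 'L', whose size is
the name length plus a trailing NUL, followed by the name itself and then
the real header for the file.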
