William Lee | 1 Dec 18:34 2011

Proposal for new table image_metadata

I'm a developer at Wikia. We have a use case for searching through a file's
metadata. This task is challenging now, because the field
Image.img_metadata is a blob.

We propose expanding the metadata field into a new table. We propose the
name image_metadata. It will have three columns: img_name, attribute
(varchar) and value (varchar). It can be joined with Image on img_name.
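
A minimal sketch of how the proposed table could be queried. It assumes
SQLite through Python's sqlite3 module purely for illustration (MediaWiki
itself is PHP on MySQL), and the index name, sample rows, and the
find_images helper are hypothetical, not part of the proposal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (
    img_name     TEXT PRIMARY KEY,
    img_metadata BLOB              -- the existing serialized blob
);
-- Proposed table: one row per (file, attribute) pair.
CREATE TABLE image_metadata (
    img_name  TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL
);
-- Hypothetical index so attribute/value lookups avoid a full scan.
CREATE INDEX image_metadata_attr ON image_metadata (attribute, value);
""")

# Illustrative sample data only.
conn.executemany("INSERT INTO image (img_name) VALUES (?)",
                 [("Example.jpg",), ("Other.png",)])
conn.executemany("INSERT INTO image_metadata VALUES (?, ?, ?)", [
    ("Example.jpg", "Model", "Canon EOS 5D"),
    ("Example.jpg", "Artist", "Jane Doe"),
    ("Other.png", "Artist", "John Roe"),
])


def find_images(conn, attribute, value):
    """Return names of images whose metadata has attribute == value."""
    rows = conn.execute(
        "SELECT i.img_name FROM image AS i "
        "JOIN image_metadata AS m ON m.img_name = i.img_name "
        "WHERE m.attribute = ? AND m.value = ? "
        "ORDER BY i.img_name",
        (attribute, value),
    )
    return [row[0] for row in rows]


matches = find_images(conn, "Artist", "Jane Doe")
```

The join on img_name mirrors the proposal; a search becomes an indexed
lookup instead of unserializing every blob.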

On the application side, LocalFile's load* and decodeRow methods will have
to be changed to support the new table.

One issue to consider is the file archive. Should we replicate the metadata
table for file archive? Or serialize the data and store it in a new table
(something like fa_metadata)?

Please let us know if you see any issues with this plan. We hope that this
will be useful to the MediaWiki project, and a candidate to merge back.

Thanks,
Will

David Gerard | 1 Dec 18:36 2011

Re: Proposal for new table image_metadata

On 1 December 2011 17:34, William Lee <wlee <at> wikia-inc.com> wrote:

> I'm a developer at Wikia. We have a use case for searching through a file's
> metadata. This task is challenging now, because the field
> Image.img_metadata is a blob.

This sounds like a natural fit for Commons, too.

- d.

Chad | 1 Dec 18:36 2011

Re: Proposal for new table image_metadata

On Thu, Dec 1, 2011 at 12:34 PM, William Lee <wlee <at> wikia-inc.com> wrote:
> I'm a developer at Wikia. We have a use case for searching through a file's
> metadata. This task is challenging now, because the field
> Image.img_metadata is a blob.
>
> We propose expanding the metadata field into a new table. We propose the
> name image_metadata. It will have three columns: img_name, attribute
> (varchar) and value (varchar). It can be joined with Image on img_name.
>
> On the application side, LocalFile's load* and decodeRow methods will have
> to be changed to support the new table.
>
> One issue to consider is the file archive. Should we replicate the metadata
> table for file archive? Or serialize the data and store it in a new table
> (something like fa_metadata)?
>
> Please let us know if you see any issues with this plan. We hope that this
> will be useful to the MediaWiki project, and a candidate to merge back.
>

That was part of bawolff's plan last summer for GSoC when he overhauled
our metadata support. He got a lot of his project done, but never quite got
to this point. Something we'd definitely like to see though!

-Chad

Sumana Harihareswara | 1 Dec 23:41 2011

Re: Proposal for new table image_metadata

On 12/01/2011 12:36 PM, Chad wrote:
> On Thu, Dec 1, 2011 at 12:34 PM, William Lee <wlee <at> wikia-inc.com> wrote:
>> I'm a developer at Wikia. We have a use case for searching through a file's
>> metadata. This task is challenging now, because the field
>> Image.img_metadata is a blob.
>>
>> We propose expanding the metadata field into a new table. We propose the
>> name image_metadata. It will have three columns: img_name, attribute
>> (varchar) and value (varchar). It can be joined with Image on img_name.
>>
>> On the application side, LocalFile's load* and decodeRow methods will have
>> to be changed to support the new table.
>>
>> One issue to consider is the file archive. Should we replicate the metadata
>> table for file archive? Or serialize the data and store it in a new table
>> (something like fa_metadata)?
>>
>> Please let us know if you see any issues with this plan. We hope that this
>> will be useful to the MediaWiki project, and a candidate to merge back.
>>
> 
> That was part of bawolff's plan last summer for GSoC when he overhauled
> our metadata support. He got a lot of his project done, but never quite got
> to this point. Something we'd definitely like to see though!
> 
> -Chad

William,
https://www.mediawiki.org/wiki/Summer_of_Code_Past_Projects#Improve_metadata_support
points me to https://www.mediawiki.org/wiki/Special:Code/MediaWiki/86169

Daniel Friesen | 2 Dec 00:36 2011

Re: Proposal for new table image_metadata

On Thu, 01 Dec 2011 09:34:03 -0800, William Lee <wlee <at> wikia-inc.com> wrote:

> I'm a developer at Wikia. We have a use case for searching through a file's
> metadata. This task is challenging now, because the field
> Image.img_metadata is a blob.
>
> We propose expanding the metadata field into a new table. We propose the
> name image_metadata. It will have three columns: img_name, attribute
> (varchar) and value (varchar). It can be joined with Image on img_name.
>
> On the application side, LocalFile's load* and decodeRow methods will have
> to be changed to support the new table.
>
> One issue to consider is the file archive. Should we replicate the metadata
> table for file archive? Or serialize the data and store it in a new table
> (something like fa_metadata)?
>
> Please let us know if you see any issues with this plan. We hope that this
> will be useful to the MediaWiki project, and a candidate to merge back.
>
> Thanks,
> Will

imgmeta_name, imgmeta_attribute, and imgmeta_value would fit our standards
for column naming better.

Brion Vibber | 5 Dec 19:54 2011

Re: Proposal for new table image_metadata

On Thu, Dec 1, 2011 at 3:36 PM, Daniel Friesen <lists <at> nadir-seen-fire.com> wrote:

> Why isn't our image table primary key an integer anyways?
>

In part, legacy foolishness. :)

Also, the physical storage of images is still tied to the title, so
anything that renames already has to run around renaming things. :(

-- brion

Neil Kandalgaonkar | 2 Dec 01:04 2011

Re: Proposal for new table image_metadata

Sounds like a good idea to me.

What things are you interested in searching? I'd like to clean up 
metadata a bit. Except for latitude and longitude, we don't have any 
notion of what the image metadata means. For example we could use a 
standard machine-readable notion of creation date, or author, or license.

Also, the current metadata scheme is just serialized PHP, so it allows 
for rich data structures in values. So a flat key-val store may not be 
able to hold everything.

On 12/1/11 9:34 AM, William Lee wrote:
> I'm a developer at Wikia. We have a use case for searching through a file's
> metadata. This task is challenging now, because the field
> Image.img_metadata is a blob.
>
> We propose expanding the metadata field into a new table. We propose the
> name image_metadata. It will have three columns: img_name, attribute
> (varchar) and value (varchar). It can be joined with Image on img_name.
>
> On the application side, LocalFile's load* and decodeRow methods will have
> to be changed to support the new table.
>
> One issue to consider is the file archive. Should we replicate the metadata
> table for file archive? Or serialize the data and store it in a new table
> (something like fa_metadata)?
>
> Please let us know if you see any issues with this plan. We hope that this
> will be useful to the MediaWiki project, and a candidate to merge back.
>

bawolff | 2 Dec 05:49 2011

Re: Proposal for new table image_metadata

> Message: 7
> Date: Thu, 1 Dec 2011 12:36:02 -0500
> From: Chad <innocentkiller <at> gmail.com>
> Subject: Re: [Wikitech-l] Proposal for new table image_metadata
> To: Wikimedia developers <wikitech-l <at> lists.wikimedia.org>
> Message-ID:
>       <CADn73rNuSX8RegdUBCeSYG8Mz1qg5SA49VAmB5eD_Y-vB-L4dw <at> mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Thu, Dec 1, 2011 at 12:34 PM, William Lee <wlee <at> wikia-inc.com> wrote:
> > I'm a developer at Wikia. We have a use case for searching through a file's
> > metadata. This task is challenging now, because the field
> > Image.img_metadata is a blob.
> >
> > We propose expanding the metadata field into a new table. We propose the
> > name image_metadata. It will have three columns: img_name, attribute
> > (varchar) and value (varchar). It can be joined with Image on img_name.
> >
> > On the application side, LocalFile's load* and decodeRow methods will have
> > to be changed to support the new table.
> >
> > One issue to consider is the file archive. Should we replicate the metadata
> > table for file archive? Or serialize the data and store it in a new table
> > (something like fa_metadata)?
> >
> > Please let us know if you see any issues with this plan. We hope that this
> > will be useful to the MediaWiki project, and a candidate to merge back.
> >
>
> That was part of bawolff's plan last summer for GSoC when he overhauled

Brion Vibber | 5 Dec 20:07 2011

Re: Proposal for new table image_metadata

On Thu, Dec 1, 2011 at 8:49 PM, bawolff <bawolff+wn <at> gmail.com> wrote:

> Thus, just storing a table of key/value pairs is kind of problematic -
> how do you store an "array" value. Additionally you have to consider
> finding info. You probably want to efficiently be able to search
> through lang values in a specific language, or for a specific property
> and not caring for the language.
>

The two easiest approaches, based on my previous experience:
1) separate values with \x00, making them easy to split after extracting a row
2) store multiple entries with an index field, making it easy to query for
potentially multiple values
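
Both options can be sketched as follows. This is an illustration only: it
uses Python with sqlite3 rather than MediaWiki's PHP/MySQL stack, and the
helper names, the idx column, and the sample keywords are assumptions, not
anything from the existing schema:

```python
import sqlite3

SEP = "\x00"  # NUL separator, as in option 1


def pack_values(values):
    """Option 1: store a non-empty list of strings in one column value."""
    if any(SEP in v for v in values):
        raise ValueError("values must not contain the separator")
    return SEP.join(values)


def unpack_values(packed):
    """Split a packed column value back into its list of strings."""
    return packed.split(SEP)


def indexed_rows(img_name, attribute, values):
    """Option 2: one row per value, numbered by a hypothetical idx column."""
    return [(img_name, attribute, i, v) for i, v in enumerate(values)]


keywords = ["sunset", "beach", "holiday"]  # illustrative data

# Option 1 round trip.
packed = pack_values(keywords)
restored = unpack_values(packed)

# Option 2: each value is individually queryable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_metadata ("
    " img_name TEXT, attribute TEXT, idx INTEGER, value TEXT,"
    " PRIMARY KEY (img_name, attribute, idx))"
)
conn.executemany(
    "INSERT INTO image_metadata VALUES (?, ?, ?, ?)",
    indexed_rows("Example.jpg", "Keywords", keywords),
)
hits = [
    row[0]
    for row in conn.execute(
        "SELECT img_name FROM image_metadata"
        " WHERE attribute = ? AND value = ?",
        ("Keywords", "beach"),
    )
]
```

Option 1 keeps one row per attribute but only supports substring matching
on individual values; option 2 costs more rows but allows an exact,
indexable match on each value.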

> Also consider how big a metadata field can get. Theoretically it's not
> really limited; while I don't expect it to be huge, > 255 bytes of
> UTF-8 seems a totally reasonable size for a value of a metadata field.
>
> Last of all, you have to keep in mind all sorts of stuff is stored in
> the img_metadata. This includes things like the text layer of DjVu
> files (although arguably that shouldn't be stored there...) and other
> handler specific things (OggHandler stores some very complex
> structures in img_metadata). Of course, we could just keep the
> img_metadata blob there, and simply stop using it for "exif-like"
> data, but continue using it for handler specific ugly metadata that's
> generally invisible to the user [probably a good idea. The two types of
> data are actually quite different].
>

Lars Aronsson | 5 Dec 22:45 2011

Re: Proposal for new table image_metadata

On 12/05/2011 08:07 PM, Brion Vibber wrote:
> If extracted page text is stored in a better key-value store, we should
> make sure it doesn't get pulled in to backwards-compatible metadata blobs
> (if we keep em around as they are now) -- but they should be accessible
> through some API.

One thing to consider is what happens if a user edits metadata,
e.g. adds EXIF data that was lost by cropping, or if a new
(cropped) version of the image is uploaded with the same name.

Another thing is image annotations, which today are always added
as plain text in the image description page,
http://commons.wikimedia.org/wiki/Commons:Image_annotations

A third thing is timed text (video subtitles), which today is added
in separate subpages, one for each language,
http://commons.wikimedia.org/wiki/Commons:Timed_Text

A fourth thing is proofreading: If OCR text was extracted from a
PDF or DjVu and then proofread in Wikisource, shouldn't the
next person who downloads the PDF file get the new text?

Perhaps a system for managing image + text, including wiki
editing, could address all four things above? In particular,
image annotations and OCR text are both tied to coordinates
in the image. (And timed text is tied to a time position in
a video stream.) So why are they separate systems?

-- 
   Lars Aronsson (lars <at> aronsson.se)

William Lee | 5 Dec 20:08 2011

Re: Proposal for new table image_metadata

Thanks to everyone for your feedback about this plan.

After careful consideration, we have decided to discontinue our plan. It
does not go far enough to support the XMP standard. Instead, we will use
the field Image.img_metadata for the time being.

William

On Thu, Dec 1, 2011 at 8:49 PM, bawolff <bawolff+wn <at> gmail.com> wrote:

> > Message: 7
> > Date: Thu, 1 Dec 2011 12:36:02 -0500
> > From: Chad <innocentkiller <at> gmail.com>
> > Subject: Re: [Wikitech-l] Proposal for new table image_metadata
> > To: Wikimedia developers <wikitech-l <at> lists.wikimedia.org>
> > Message-ID:
> >       <CADn73rNuSX8RegdUBCeSYG8Mz1qg5SA49VAmB5eD_Y-vB-L4dw <at> mail.gmail.com>
> > Content-Type: text/plain; charset=UTF-8
> >
> > On Thu, Dec 1, 2011 at 12:34 PM, William Lee <wlee <at> wikia-inc.com> wrote:
> > > I'm a developer at Wikia. We have a use case for searching through a file's
> > > metadata. This task is challenging now, because the field
> > > Image.img_metadata is a blob.
> > >
> > > We propose expanding the metadata field into a new table. We propose the
> > > name image_metadata. It will have three columns: img_name, attribute
> > > (varchar) and value (varchar). It can be joined with Image on img_name.

Krinkle | 6 Dec 02:36 2011

Re: Proposal for new table image_metadata

On Thu, Dec 1, 2011 at 6:34 PM, William Lee <wlee <at> wikia-inc.com> wrote:

> We propose expanding the metadata field into a new table. We propose the
> name image_metadata. It will have three columns: img_name, attribute
> (varchar) and value (varchar). It can be joined with Image on img_name.
>
>
Per convention this should probably read "file" instead of "image" (as is
already done with namespaces and the "filearchive" table). Anyway, that's
just naming.

A major problem, as mentioned before in this thread, is the key. Right now
files (both the file as an abstract thing and its versions) have no unique
key. All they have is a page title and a timestamp.

This is related to the License-integration project[1] (that name is a bit
outdated; it started with license information, but it is basically aimed at
storing all kinds of file properties).

The first blocker bug would be
https://bugzilla.wikimedia.org/show_bug.cgi?id=26741
(image/oldimage to filerevision).

And another one would be to make the file system even more like
page/revisions, by implementing file ids and filerevision ids.
