Jeremias Maerki | 19 Nov 2007 10:26

Metadata use by Apache Java projects

(I realize this is heavy cross-posting but it's probably the best way to
reach all the players I want to address.)

As you may know, I've started developing an XMP metadata package inside
XML Graphics Commons in order to support XMP metadata (and ultimately
PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.

What is XMP? XMP, for those who don't know about it, is based on a
subset of RDF to provide a flexible and extensible way of
storing/representing document metadata.

Yesterday, I was surprised to discover that Adobe has published an XMP
Toolkit with Java support under the BSD license. In contrast to my
effort, Adobe's toolkit is quite complete if maybe a bit more
complicated to use. That got me thinking:

Every project I'm sending this message to is using document metadata in
some form:
- Apache XML Graphics: embeds document metadata in the generated files
(just FOP at the moment, but Batik is a similar candidate)
- Tika (in incubation): has as one of its main purposes the extraction
of metadata
- Sanselan (in incubation): extracts and embeds metadata from/in bitmap
images
- PDFBox (incubation in discussion): extracts and embeds XMP metadata
from/in PDF files (see also JempBox)

Every one of these projects has its own means to represent metadata in
memory. Wouldn't it make sense to have a common approach? I've worked
with XMP for some time now and I can say it's ideal to work with. It
(Continue reading)

Antoni Mylka | 20 Nov 2007 15:25

Re: Metadata use by Apache Java projects

Hi Jeremias, tika-dev

My name is Antoni Mylka. I am involved in aperture.sourceforge.net,
which addresses problems similar to Tika's; we got your mail on the
tika-dev mailing list. I also work for the Nepomuk Social Semantic
Desktop project, and I'm the maintainer of the Nepomuk Information Element
Ontology. More below.

Your mail addresses four more-or-less orthogonal issues.

1. The standardization of schemas, how the metadata should be
represented i.e. URIs of classes and properties.

2. The standardization of the representational language. This means the
conventions about how to use RDF (e.g. Bags, Seqs, Alts etc.) and the
formal semantics.

3. The standardization of the API that will work with the RDF triples
and handle operations such as adding, deleting and querying triples.
(And maybe the inference).

4. The standardization of the RDF storage mechanisms.
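
To make point 3 more concrete, here is a minimal sketch of what an API for
adding, deleting and querying triples could look like in Java. The names
Triple and TripleStore are hypothetical and do not come from XMP, Aperture or
any other toolkit mentioned in this thread; the document URI in the usage note
is made up as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch: one RDF statement (subject, predicate, object).
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = Objects.requireNonNull(subject);
        this.predicate = Objects.requireNonNull(predicate);
        this.object = Objects.requireNonNull(object);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Triple)) {
            return false;
        }
        Triple t = (Triple) other;
        return subject.equals(t.subject)
                && predicate.equals(t.predicate)
                && object.equals(t.object);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object);
    }
}

// Hypothetical in-memory store offering the three operations from point 3.
final class TripleStore {
    private final Set<Triple> triples = new LinkedHashSet<>();

    // Returns true if the triple was not already present.
    boolean add(Triple triple) {
        return triples.add(triple);
    }

    // Returns true if the triple was present and has been removed.
    boolean remove(Triple triple) {
        return triples.remove(triple);
    }

    // Pattern query: a null argument matches any value in that position.
    List<Triple> query(String subject, String predicate, String object) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : triples) {
            if ((subject == null || subject.equals(t.subject))
                    && (predicate == null || predicate.equals(t.predicate))
                    && (object == null || object.equals(t.object))) {
                result.add(t);
            }
        }
        return result;
    }
}
```

Adding two triples for the subject "urn:example:doc1" with the Dublin Core
predicate "http://purl.org/dc/elements/1.1/creator" and then calling
query("urn:example:doc1", null, null) would return both of them. Inference and
persistent storage (point 4) are deliberately left out of this sketch.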

XMP provides its answers to all these questions but they aren't the only
ones. I know of at least two such standardization initiatives,

1. The Freedesktop.org XESAM project, a gathering of the major
open-source desktop search engines:
http://xesam.org/main

(Continue reading)

Jeremias Maerki | 21 Nov 2007 08:28

Re: Metadata use by Apache Java projects

Hi Antoni

Thanks for the interesting information. Frankly, you've scared me there
just a bit. It's interesting to see that there are such encompassing
efforts underway in some places. To me, full RDF still has a scare
factor. At least the subset XMP provides is "manageable" for mere
mortals. :-) At least, that's my impression. Maybe I still just know too
little about RDF. IMO, XMP finds a good compromise between
expressiveness and simplicity. The positive points for Adobe's XMP
toolkit: it is in Java, available now and under a license we can easily
use in Apache projects.

In your point 4, you mention some restrictions you see for XMP. But XMP
is a subset of RDF, so does RDF really restrict you from an RDF point of
view? I didn't really understand that point.

We'll see how this works out.

Jeremias Maerki

On 20.11.2007 15:25:44 Antoni Mylka wrote:
> Hi Jeremias, tika-dev
> 
> My name is Antoni Mylka. I am involved in aperture.sourceforge.net,
> which addresses problems similar to Tika's; we got your mail on the
> tika-dev mailing list. I also work for the Nepomuk Social Semantic
> Desktop project, and I'm the maintainer of the Nepomuk Information Element
> Ontology. More below.
> 
> Your mail addresses four more-or-less orthogonal issues.
(Continue reading)

Chris Mattmann | 20 Nov 2007 18:22

Re: Metadata use by Apache Java projects

Hi Antoni,

> Chris Mattmann has written that it's necessary to
> strike a balance between functionality and over-bloating. From my own
> experience I can say that it is VERY difficult :).

Well, from my own experience I can tell you that it *is* difficult, but
certainly doable.

I've been working with different forms of metadata (Dublin Core, ISO 11179,
RDF, OWL/etc.), been involved in international standards organizations
(CCSDS, ISO) who are developing metadata standards, and worked on several
projects that deal with metadata (Object Oriented Data Technology [OODT],
Semantic Web for Earth and Environmental Terminology [SWEET]) in different
domains (earth science, planetary science, space science, cancer
research/etc.) for almost 7 years now.

Sure, there are a lot of standards, and people can talk about coming up with
a one-size-fits-all, cookie-cutter library for these capabilities. However,
I think it's important to understand that developing such libraries (rather
than striking the balance) is, in my mind, the most difficult problem
to tackle. I think that in the end, all we can do as software developers, as
people who are trying to standardize metadata, is to try and develop core
libraries and functions that others can build upon for their own needs. I
don't think the Tika folks should be in the business of trying to develop
high capability metadata libraries, because in the end, just as everyone is
saying, those need to be tailored to a specific use-case or domain. On the
other hand, I think it's a much more attainable goal to come up with a
(Continue reading)

Jukka Zitting | 19 Nov 2007 17:54

Re: Metadata use by Apache Java projects

Hi,

[Responding just on tika-dev <at> . I guess Jeremias follows all these
forums, and can summarize in the end...]

On Nov 19, 2007 11:26 AM, Jeremias Maerki <dev <at> jeremias-maerki.ch> wrote:
> Every one of these projects has its own means to represent metadata in
> memory. Wouldn't it make sense to have a common approach?

+1

> Sanselan and Tika have both chosen a very simple approach but is it
> versatile enough for the future? While the simple Map<String, String[]> in
> Tika allows for multiple authors, for example, it doesn't support
> language alternatives for things such as dc:title or dc:description.

IMHO it would be good to have a more flexible metadata model in Tika.
Better yet if it's a standard used across multiple projects. Best if
we don't need to implement it in Tika. :-)
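
To illustrate the limitation quoted above, here is a minimal sketch in Java.
In XMP, a language alternative holds several translations of the same value,
each tagged with a language code, with "x-default" marking the default entry.
The LangAlt class below is hypothetical; it is not part of Tika, Sanselan or
Adobe's toolkit, and the sample titles are made up.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical type: one logical value with per-language variants.
final class LangAlt {
    // XMP uses "x-default" as the language code of the default entry.
    static final String DEFAULT_LANG = "x-default";

    private final Map<String, String> byLang = new LinkedHashMap<>();

    // Stores the text for a language code and returns this for chaining.
    LangAlt put(String lang, String text) {
        byLang.put(lang, text);
        return this;
    }

    // Returns the text for the requested language, falling back to the
    // default entry, or null if neither exists.
    String get(String lang) {
        String text = byLang.get(lang);
        return text != null ? text : byLang.get(DEFAULT_LANG);
    }
}

final class LangAltDemo {
    public static void main(String[] args) {
        // Flat model: several values per key work fine for multiple authors...
        Map<String, String[]> flat = new HashMap<>();
        flat.put("dc:creator", new String[] {"Alice", "Bob"});
        // ...but translations of one title become indistinguishable entries.
        flat.put("dc:title", new String[] {"Annual Report", "Jahresbericht"});

        // With language alternatives, each variant carries its language.
        LangAlt title = new LangAlt()
                .put(LangAlt.DEFAULT_LANG, "Annual Report")
                .put("de", "Jahresbericht");
        System.out.println(title.get("de"));
        System.out.println(title.get("fr"));
    }
}
```

In the flat map, a consumer cannot tell whether the two dc:title entries are
two titles or one title in two languages. The sketch resolves that by keying
each variant on its language code; a real model would also need to handle
ordered and unordered arrays, which this example ignores.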

> My questions:
> - Any interest in converging on a unified model/approach?

Certainly.

> - If yes, where shall we develop this? As part of Tika (although it's
> still in incubation)? As a separate project (maybe as Apache Commons
> subproject)? If more than XML Graphics uses this, XML Graphics is
> probably not the right home.
> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
(Continue reading)

Chris Mattmann | 19 Nov 2007 18:27

Re: Metadata use by Apache Java projects

Hi Folks,

>> Sanselan and Tika have both chosen a very simple approach but is it
>> versatile enough for the future? While the simple Map<String, String[]> in
>> Tika allows for multiple authors, for example, it doesn't support
>> language alternatives for things such as dc:title or dc:description.
> 
> IMHO it would be good to have a more flexible metadata model in Tika.
> Better yet if it's a standard used across multiple projects. Best if
> we don't need to implement it in Tika. :-)

I'm not quite sure I understand how Tika's metadata model isn't flexible
enough? Of course, I'm a bit biased, but I'm really trying to understand here
and haven't been able to. I think it's important to realize that a balance
must be struck between over-bloating a metadata library (and attaching on
RDF support, inference, synonym support, etc.) and making sure that the
smallest subset of it is actually useful.

Also, I'd be against moving Metadata support out of Tika because that was
one of the project's original goals (Metadata support), and I think it's
advantageous for Tika to be a provider for a Metadata capability (of course,
one related to document/content extraction).

I'm wondering too what it means that Tika doesn't support "language
alternatives"? Do you mean synonyms? Also, you mention it's relatively easy
in other libraries to map between different file format metadata. I think
that this is fairly easy to do in Tika too, seeing as its primary
purpose is to support metadata extraction from different file formats.

> 
(Continue reading)

Jeremias Maerki | 20 Nov 2007 09:06

Re: Metadata use by Apache Java projects

Hi Chris

On 19.11.2007 18:27:56 Chris Mattmann wrote:
> Hi Folks,
>  
> >> Sanselan and Tika have both chosen a very simple approach but is it
> >> versatile enough for the future? While the simple Map<String, String[]> in
> >> Tika allows for multiple authors, for example, it doesn't support
> >> language alternatives for things such as dc:title or dc:description.
> > 
> > IMHO it would be good to have a more flexible metadata model in Tika.
> > Better yet if it's a standard used across multiple projects. Best if
> > we don't need to implement it in Tika. :-)
> 
> I'm not quite sure I understand how Tika's metadata model isn't flexible
> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> and haven't been able to. I think it's important to realize that a balance
> must be struck between over-bloating a metadata library (and attaching on
> RDF support, inference, synonym support, etc.) and making sure that the
> smallest subset of it is actually useful.

I'm sorry. I didn't intend to stand on anyone's toes.

At any rate, I'm not talking about full RDF support. I'm talking about
XMP, which uses only a subset of RDF.

> Also, I'd be against moving Metadata support out of Tika because that was
> one of the project's original goals (Metadata support), and I think it's
> advantageous for Tika to be a provider for a Metadata capability (of course,
> one related to document/content extraction).
(Continue reading)

Chris Mattmann | 20 Nov 2007 18:06

Re: Metadata use by Apache Java projects

Hi Jeremias,

>> I'm not quite sure I understand how Tika's metadata model isn't flexible
>> enough? Of course, I'm a bit biased, but I'm really trying to understand here
>> and haven't been able to. I think it's important to realize that a balance
>> must be struck between over-bloating a metadata library (and attaching on
>> RDF support, inference, synonym support, etc.) and making sure that the
>> smallest subset of it is actually useful.
> 
> I'm sorry. I didn't intend to stand on anyone's toes.
> 
> At any rate, I'm not talking about full RDF support. I'm talking about
> XMP, which uses only a subset of RDF.

Great, and I wouldn't worry about stepping on anyone's toes. You certainly
didn't step on mine. My point was, at some point, we're just building
libraries on top of libraries on top of...well you get the picture. What I'm
interested in is building the smallest metadata library that's actually
useful and can be built upon to add higher level capabilities, just as Solr
builds on top of Lucene to provide faceted search, etc. Lucene itself
doesn't provide a means for understanding facets/etc., but provides a
library for text/indexing: Solr adds that understanding. Similarly here, I
think it would be great for Tika to provide a library to handle Metadata
representation/access, and then for others, to build on top of it to provide
higher level library support (RDF access/etc.).
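
The layering described above could be sketched roughly as follows in Java: a
small core that only stores multi-valued string properties, and a separate
class built purely on the core's public methods that adds higher-level
conveniences. CoreMetadata and DublinCoreView are hypothetical names chosen
for illustration; neither is an actual Tika class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical core layer: multi-valued string properties and nothing more.
final class CoreMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value to the named property, creating it if necessary.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns a read-only view of the property's values (empty if unset).
    List<String> get(String name) {
        List<String> found = values.get(name);
        return found == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(found);
    }
}

// Hypothetical higher layer: Dublin Core conveniences that rely only on
// the core's add/get methods, so the core stays small.
final class DublinCoreView {
    private final CoreMetadata core;

    DublinCoreView(CoreMetadata core) {
        this.core = core;
    }

    void addCreator(String name) {
        core.add("dc:creator", name);
    }

    List<String> creators() {
        return core.get("dc:creator");
    }

    // Returns the first title, or null if no title has been set.
    String title() {
        List<String> titles = core.get("dc:title");
        return titles.isEmpty() ? null : titles.get(0);
    }
}
```

A caller that only needs raw properties would use CoreMetadata directly, while
one that wants typed accessors would wrap it in a DublinCoreView. An RDF- or
XMP-aware layer could in principle be added the same way, though whether a
flat core is expressive enough for that is exactly the open question in this
thread.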

> 
>> Also, I'd be against moving Metadata support out of Tika because that was
>> one of the project's original goals (Metadata support), and I think it's
>> advantageous for Tika to be a provider for a Metadata capability (of course,
(Continue reading)

Jeremias Maerki | 21 Nov 2007 08:52

Re: Metadata use by Apache Java projects

Hi Chris

On 20.11.2007 18:06:25 Chris Mattmann wrote:
> Hi Jeremias,
> 
> >> I'm not quite sure I understand how Tika's metadata model isn't flexible
> >> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> >> and haven't been able to. I think it's important to realize that a balance
> >> must be struck between over-bloating a metadata library (and attaching on
> >> RDF support, inference, synonym support, etc.) and making sure that the
> >> smallest subset of it is actually useful.
> > 
> > I'm sorry. I didn't intend to stand on anyone's toes.
> > 
> > At any rate, I'm not talking about full RDF support. I'm talking about
> > XMP, which uses only a subset of RDF.
> 
> Great, and I wouldn't worry about stepping on anyone's toes. You certainly
> didn't step on mine. My point was, at some point, we're just building
> libraries on top of libraries on top of...well you get the picture. What I'm
> interested in is building the smallest metadata library that's actually
> useful and can be built upon to add higher level capabilities, just as Solr
> builds on top of Lucene to provide faceted search, etc. Lucene itself
> doesn't provide a means for understanding facets/etc., but provides a
> library for text/indexing: Solr adds that understanding. Similarly here, I
> think it would be great for Tika to provide a library to handle Metadata
> representation/access, and then for others, to build on top of it to provide
> higher level library support (RDF access/etc.).

I think Adobe's XMP toolkit accomplishes exactly that, at least for the
(Continue reading)
