Christof Mueller | 5 Dec 16:30

Re: Lucene cas consumer

Jörn Kottmann wrote:
> I am also interested in a Lucene CAS consumer.
> Maybe we can work together and set up a sandbox project?
>
> Jörn
Hi Jörn,

we would be happy to contribute the code of the example Lucene CAS
consumer as a base for the sandbox project.

Christof

-- 
Christof Müller
UKP Lab
Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de

Roberto Franchini | 5 Dec 23:37

Re: Lucene cas consumer

On Fri, Dec 5, 2008 at 4:30 PM, Christof Mueller
<mueller@...> wrote:
> Jörn Kottmann wrote:
>> I am also interested in a Lucene CAS consumer.
>> Maybe we can work together and set up a sandbox project?
>>
>> Jörn
> Hi Jörn,
>
> we would be happy to contribute the code of the example Lucene CAS
> consumer as a base for the sandbox project.
>
> Christof
>

I've got an index!
Yes: by mixing some code from the JENA lucas (I kept it in a dusty
corner of my hard disk :) ), some from DK and some of my own, I can
produce an index.
If we want to start a Lucene indexer that's not only a proof of
concept but something really useful, it should be
configurable/extensible.
The "problem", which is also UIMA's power, is that everyone has their
own type system.
To produce a Lucene document, one extracts information from some
features, applying the right analyzer. In my case I use maybe only 10%
of the annotations produced by the analysis pipeline to produce a
single Lucene doc.
So we need a highly configurable component, able to map only
certain declared features, apply the right analyzer, and so on.
Many ways are possible:
[...]

Jörn Kottmann | 6 Dec 00:40

Re: Lucene cas consumer

> The "problem", which is also UIMA's power, is that everyone has their
> own type system.
> To produce a Lucene document, one extracts information from some
> features, applying the right analyzer. In my case I use maybe only 10%
> of the annotations produced by the analysis pipeline to produce a
> single Lucene doc.
> So we need a highly configurable component, able to map only
> certain declared features, apply the right analyzer, and so on.
> Many ways are possible:
> - completely programmatic: the indexer is abstract and should be
>   extended to implement the right mapping for a specialized type
>   system and pipeline
> - configurable: mapping rules are defined in a descriptor file; the
>   JENA component followed this way

I prefer mapping rules in the descriptor. These rules have to be
adjusted by many users to make them compatible with
their type system. Hard coding the mapping rules makes
this task more difficult.

As far as I know, this approach was also chosen by the
regex annotator in the sandbox.
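
To illustrate the descriptor-driven option discussed above, here is a
minimal sketch. It is not code from this thread and does not use the UIMA
or Lucene APIs: `Annotation`, `RULES`, `ANALYZERS`, and `build_document`
are hypothetical stand-ins, and the rules are written inline as plain data
where a real component would presumably read them from a descriptor file.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Stand-in for a CAS annotation: a type name plus feature values."""
    type_name: str
    features: dict


# Hypothetical mapping rules, as they might be read from a descriptor.
# Each rule picks one feature of one annotation type and names the
# target index field and the analyzer to apply.
RULES = [
    {"type": "my.Person", "feature": "name",
     "field": "person", "analyzer": "keyword"},
    {"type": "my.Token", "feature": "stem",
     "field": "text", "analyzer": "lowercase"},
]

# Stand-ins for Lucene analyzers: plain functions from a string to terms.
ANALYZERS = {
    "keyword": lambda value: [value],
    "lowercase": lambda value: value.lower().split(),
}


def build_document(annotations, rules=RULES):
    """Apply the rules; annotation types without a rule are ignored."""
    document = {}
    for annotation in annotations:
        for rule in rules:
            if annotation.type_name != rule["type"]:
                continue
            value = annotation.features.get(rule["feature"])
            if value is None:
                continue
            terms = ANALYZERS[rule["analyzer"]](value)
            document.setdefault(rule["field"], []).extend(terms)
    return document
```

Adapting such a component to another type system would then mean editing
the rules only, which is the advantage argued for above.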

Jörn

Christof Mueller | 6 Dec 03:29

Re: Lucene cas consumer

Jörn Kottmann wrote:
>> The "problem", which is also UIMA's power, is that everyone has their
>> own type system.
>> To produce a Lucene document, one extracts information from some
>> features, applying the right analyzer. In my case I use maybe only 10%
>> of the annotations produced by the analysis pipeline to produce a
>> single Lucene doc.
>> So we need a highly configurable component, able to map only
>> certain declared features, apply the right analyzer, and so on.
>> Many ways are possible:
>> - completely programmatic: the indexer is abstract and should be
>>   extended to implement the right mapping for a specialized type
>>   system and pipeline
>> - configurable: mapping rules are defined in a descriptor file; the
>>   JENA component followed this way
>
> I prefer mapping rules in the descriptor. These rules have to be
> adjusted by many users to make them compatible with
> their type system. Hard coding the mapping rules makes
> this task more difficult.
>
> As far as I know, this approach was also chosen by the
> regex annotator in the sandbox.

Another approach would be to use an additional annotator for mapping
type systems. The annotator would take tokens, stems, named entities or
whatever you want to index and map them to annotations of a certain
type, e.g., IndexTerm, which would be indexed by the consumer. During
the mapping process, the annotator could also perform some kind of
filtering by taking part-of-speech or stop word annotations into account.
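
A rough sketch of such a mapping step, under stated assumptions: `Token`,
`IndexTerm`, `STOP_WORDS`, `INDEXED_POS`, and `map_to_index_terms` are
hypothetical names, the tag set is only an example, and plain data classes
stand in for real CAS annotations.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Stand-in for a token annotation produced earlier in the pipeline."""
    text: str
    stem: str
    pos: str  # part-of-speech tag, e.g. "NN" or "DT"


@dataclass
class IndexTerm:
    """Stand-in for the single type the consumer would index."""
    term: str
    begin: int  # position of the source token in the token list


STOP_WORDS = {"the", "a", "of"}    # assumed stop word list
INDEXED_POS = {"NN", "NNP", "VB"}  # assumed tags worth indexing


def map_to_index_terms(tokens):
    """Map tokens to IndexTerm objects, filtering as described above."""
    terms = []
    for position, token in enumerate(tokens):
        if token.text.lower() in STOP_WORDS:
            continue  # drop stop words
        if token.pos not in INDEXED_POS:
            continue  # drop unwanted parts of speech
        terms.append(IndexTerm(term=token.stem, begin=position))
    return terms
```

With this split, the consumer only needs to know about IndexTerm, and all
type-system-specific logic stays in the mapping annotator.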
[...]

Tong Fin | 6 Dec 04:50

Re: Lucene cas consumer

On the related topic of type mapping:

The "Simple Server" in the Sandbox (contributed by Thilo et al.) also has
the notion of type mapping. Its main goal is to make UIMA output "easily
consumable" by other tools without doing UIMA programming. The mapping is
specified in an XML descriptor and, under the covers, it uses xml bean
(JSR-173) to do the mapping from user types to UIMA types.
When I did the work to extend this Simple Server to support UIMA-AS, I also
investigated the approaches related to "type mapping".

From my investigation and prototyping, I see the following possibilities
for doing the type mapping (with constraints or filters):
1. xml-based descriptor (as Simple Server)
2. Ecore + OCL (Object Constraint Language) - if you like modeling :)
3. Script based on JSR-223 (e.g., JavaScript, Groovy, ...)
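
As an illustration of option 1 only, here is a small sketch of an
XML-based mapping with a filter. The element and attribute names
(`mappings`, `map`, `filter`, `equals`) are invented for this example and
are not the Simple Server's descriptor format; `load_rules` and
`apply_rules` are likewise hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor: map a feature of a type to an output field,
# optionally constrained by filters on other features.
DESCRIPTOR = """
<mappings>
  <map type="my.Token" feature="stem" field="text">
    <filter feature="pos" equals="NN"/>
  </map>
  <map type="my.Person" feature="name" field="person"/>
</mappings>
"""


def load_rules(xml_text):
    """Parse the descriptor into a list of rule dictionaries."""
    rules = []
    for node in ET.fromstring(xml_text).findall("map"):
        filters = [(f.get("feature"), f.get("equals"))
                   for f in node.findall("filter")]
        rules.append({
            "type": node.get("type"),
            "feature": node.get("feature"),
            "field": node.get("field"),
            "filters": filters,
        })
    return rules


def apply_rules(rules, annotations):
    """annotations: list of (type_name, features dict) pairs."""
    fields = {}
    for type_name, features in annotations:
        for rule in rules:
            if rule["type"] != type_name:
                continue
            # Skip the annotation if any filter constraint is not met.
            if any(features.get(name) != expected
                   for name, expected in rule["filters"]):
                continue
            if rule["feature"] in features:
                fields.setdefault(rule["field"], []).append(
                    features[rule["feature"]])
    return fields
```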

Tong

jochen.leidner | 5 Dec 23:55

RE: Lucene cas consumer

Roberto,

Maybe certain standard annotations can (and should) be standardized,
since that's also a precondition for having a CPAN-like UIMA repository
of reusable components.

Best
Jochen

--
Dr. Jochen Leidner
Research Scientist

Thomson Reuters 
Research & Development
610 Opperman Drive
Minneapolis/St. Paul, MN 55123
USA

http://www.ThomsonReuters.com

-----Original Message-----
From: Roberto Franchini [mailto:ro.franchini@...] 
Sent: Friday, December 05, 2008 4:38 PM
To: uima-user@...
Subject: Re: Lucene cas consumer

[...]
The "problem", which is also UIMA's power, is that everyone has their own
[...]

