Niels Ott | 4 Dec 19:36

Re: Lucene cas consumer

Hi all,

I'm using both Lucene and UIMA in one project.

Lucene is primarily an information retrieval API. It provides a
framework and default implementations for analyzing several languages.
Analysis here means tokenization, stop-word removal, etc. It also
provides the core functionality for building an inverted index and
searching it.
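
To make the terms concrete, here is a minimal sketch of analysis plus an
inverted index. It is plain Python, not Lucene's Java API; the stop-word
list, the function names and the sample documents are assumptions made
up for illustration.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not Lucene's default list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def analyze(text):
    """Split on whitespace, strip punctuation, lowercase, drop stop words."""
    tokens = (raw.strip(".,;:!?()\"'").lower() for raw in text.split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def build_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return the ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], []))
    for term in terms[1:]:
        hits &= set(index.get(term, []))
    return sorted(hits)

docs = {
    1: "Lucene builds an inverted index.",
    2: "UIMA is an analysis pipeline.",
    3: "The index of the pipeline.",
}
index = build_index(docs)
```

The same `analyze` function is applied to documents and to queries, which
is why a query such as "the pipeline index" can match document 3 even
though its stop words were never indexed.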

Lucene can be extended easily. For example, one can implement an
analyzer that does lemmatization or that looks up synonyms in WordNet
and adds them to the index.

What Lucene cannot do - or at least not without a lot of hacking - is
aggregate analyses the way UIMA does with the CAS. Usually your knowledge
grows during a UIMA-based NLP pipeline: you add a token annotation,
a lemma annotation, a POS annotation and so on. In Lucene, you have
the classical pipeline: the output replaces the input. (Yes, by
subclassing Lucene's "Token" class, one can work around the issue, but
it is not elegant at all.)
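
The contrast can be sketched in a few lines. This is plain Python with a
hypothetical `Cas` class standing in for the real CAS; the toy
lemmatizer and all names are assumptions for illustration, not UIMA's or
Lucene's API.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str    # e.g. "Token" or "Lemma"
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    value: str

@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: the text plus all annotations so far."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]

def toy_lemma(word):
    # Toy rule for illustration only: lowercase and drop one trailing "s".
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def tokenize(cas):
    """Add one Token annotation per whitespace-separated word."""
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        pos = begin + len(word)
        cas.annotations.append(Annotation("Token", begin, pos, word))

def lemmatize(cas):
    """Add a Lemma annotation over each Token; the Token itself stays."""
    for tok in cas.select("Token"):
        cas.annotations.append(
            Annotation("Lemma", tok.begin, tok.end, toy_lemma(tok.value)))

def replacing_pipeline(text):
    """Classical pipeline: each stage's output replaces its input."""
    return [toy_lemma(word) for word in text.split()]

cas = Cas("Dogs chase cats")
tokenize(cas)
lemmatize(cas)
```

After both stages the `Cas` holds tokens and lemmas side by side, each
with its offsets, while `replacing_pipeline` returns only the lemmas and
the surface forms are gone.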

What makes Lucene + UIMA interesting for me is a simple fact: I can do
all the NLP I want and be as flexible as I need in UIMA. Then I can feed
the outcome (or rather: a small part of it) into a Lucene index.

In my special case, I'm not using a CAS Consumer, but I can imagine
other people would appreciate it in their application scenarios.

To conclude: Lucene and UIMA aren't competitors, but in some cases 
having one feeding the other is what you want.

Grant Ingersoll | 11 Dec 21:51

Re: Lucene cas consumer

Coming late to the conversation... Just offering some Lucene perspective.

On Dec 4, 2008, at 1:36 PM, Niels Ott wrote:

> What Lucene cannot do - or at least not without a lot of hacking - is
> aggregating analyses as UIMA can using the CAS. Usually your knowledge
> grows during a UIMA-based NLP-pipeline: you add a token annotation,
> a lemma annotation, a POS-annotation and so on...  In Lucene, you have
> the classical pipeline: the output replaces the input. (Yes, by
> subclassing Lucene's "Token" class, one can fiddle around the issue,
> but it is not elegant at all.)
>

You might find the TeeTokenFilter and SinkTokenizer interesting for  
mapping/aggregating tokens/extractions out to other fields in Lucene.

Also, Lucene is getting more flexible in terms of indexing and
searching. You can attach payloads (i.e. byte arrays) to terms, which
can provide some crude annotation storage.
https://issues.apache.org/jira/browse/LUCENE-1422 and a couple of other
issues are the start of more flexibility to add attributes that can
then be indexed. We're still working on the search side of it, but I
think you will see more in the way of flexible indexing in the coming
months, which should be a nice win for UIMA + Lucene users.
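
As a rough illustration of the payload idea, here is a sketch in plain
Python rather than Lucene's Java API. It assumes the payload carries a
part-of-speech tag; the postings layout and the function names are made
up for the example.

```python
from collections import defaultdict

def index_with_payloads(doc_id, tagged_tokens, postings):
    """Record (doc_id, position, payload) per term; the payload is raw bytes."""
    for position, (term, pos_tag) in enumerate(tagged_tokens):
        payload = pos_tag.encode("utf-8")  # assumption: payload holds a POS tag
        postings[term.lower()].append((doc_id, position, payload))

def occurrences(postings, term, pos_tag):
    """Find the occurrences of `term` whose payload equals the encoded tag."""
    wanted = pos_tag.encode("utf-8")
    return [(doc_id, position)
            for doc_id, position, payload in postings.get(term, [])
            if payload == wanted]

postings = defaultdict(list)
index_with_payloads(1, [("Book", "VB"), ("a", "DT"), ("flight", "NN")], postings)
index_with_payloads(2, [("The", "DT"), ("book", "NN"), ("fell", "VBD")], postings)
```

Here `occurrences(postings, "book", "NN")` only returns the noun reading
in document 2. The storage is crude in the sense described above: the
index sees opaque bytes, and interpreting them is left to the caller.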

> What makes Lucene + UIMA interesting for me is a simple fact: I can do
> all the NLP I want and be as flexible as I need in UIMA. Then I can  

Greg Holmberg | 6 Dec 02:20

Re: Lucene cas consumer


 -------------- Original message ----------------------
From: "Roberto Franchini" <ro.franchini@...>

> So we need a highly configurable component, able to map only
> certain declared features, apply the right analyzer, and so on.
> Many ways are possible:
> - completely programmatic: the indexer is abstract and should be
>   extended to implement the right mapping for a specialized type system
>   and pipeline
> - configurable: mapping rules are defined in a descriptor file; the
>   Jena component followed this approach
> - a mix of the two: some mappings are configured, others are implemented

I seem to remember that IBM's CAS Consumer for indexing into their
semantic search engine had to solve the same problem.  I think it was
configurable in a file, if I remember correctly.

Perhaps one of the IBM folks could describe what was done there?

A separate question: what kinds of annotations is it possible to index
into Lucene?  In other words, what functionality are we shooting for?

For example, can I index named entities?  In my case, named entities
look like the attached UML class diagram.  I would like to perform
queries for documents that contain certain entities or types of
entities.  For example, find documents that contain entity name=IBM,
type=Company.
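
To pin down the kind of query meant here, a minimal sketch in plain
Python. The `EntityIndex` class is hypothetical and says nothing about
what Lucene itself supports; the documents and entities are made up.

```python
from collections import defaultdict

class EntityIndex:
    """Toy index from entity (type, name) pairs to document ids."""

    def __init__(self):
        self._by_entity = defaultdict(set)  # (type, name) -> document ids
        self._by_type = defaultdict(set)    # type -> document ids

    def add(self, doc_id, entities):
        for entity_type, name in entities:
            self._by_entity[(entity_type, name)].add(doc_id)
            self._by_type[entity_type].add(doc_id)

    def find(self, entity_type, name=None):
        """Documents with an entity of this type, optionally with this name."""
        if name is None:
            return sorted(self._by_type.get(entity_type, ()))
        return sorted(self._by_entity.get((entity_type, name), ()))

index = EntityIndex()
index.add("doc1", [("Company", "IBM"), ("Location", "Armonk")])
index.add("doc2", [("Company", "Temis")])
index.add("doc3", [("Location", "Jena")])
```

With this, `index.find("Company", "IBM")` answers the name=IBM,
type=Company query, and `index.find("Company")` answers the query by
entity type alone.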

Greg Holmberg

Adam Lally | 7 Dec 18:22

Re: Lucene cas consumer

On Fri, Dec 5, 2008 at 8:20 PM, Greg Holmberg <holmberg2066@...> wrote:
> I seem to remember that IBM's CAS Consumer for indexing into their
> semantic search engine had to solve the same problem.  I think it was
> configurable in a file, if I remember correctly.
>
> Perhaps one of the IBM folks could describe what was done there?
>

Yes, that's right.  There's a separate file that contains the
configuration rules for the indexer.  This is described in the UIMA
documentation:

http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.application.integrating_text_analysis_and_search

However, the search engine that is used for this (available on IBM
alphaWorks) is able to index annotations over spans of text, which
AFAIK Lucene is not.

 -Adam

Marshall Schor | 8 Dec 15:41

Re: Lucene cas consumer


Adam Lally wrote:
> On Fri, Dec 5, 2008 at 8:20 PM, Greg Holmberg <holmberg2066@...> wrote:
>
>> I seem to remember that IBM's CAS Consumer for indexing into their
>> semantic search engine had to solve the same problem.  I think it was
>> configurable in a file, if I remember correctly.
>>
>> Perhaps one of the IBM folks could describe what was done there?
>>
>
> Yes, that's right.  There's a separate file that contains the
> configuration rules for the indexer.  This is described in the UIMA
> documentation:
>
> http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.application.integrating_text_analysis_and_search
>
> However, the search engine that is used for this (available on IBM
> alphaWorks) is able to index annotations over spans of text, which
> AFAIK Lucene is not.
>   
You can see more about this by googling:
   semantic search engine
or
   semantic search engine uima

to see what others have worked on previously in this area.

-Marshall

Joachim Wermter | 9 Dec 09:51

Re: Lucene cas consumer

Dear UIMA-Users,

At the JULIE Lab, we've been working (silently) on a new (and completely
reworked) version of our Lucene CAS Indexer consumer (LUCAS). We are
planning to make this available soon, preferably in the UIMA sandbox.
In fact, LUCAS is now able to perform offset-based token stream
alignment and merging of UIMA annotations (via position increment) in
the same Lucene field (e.g. "documenttext" or "title"), which we feel is
more appropriate for text indexing than putting each UIMA annotation
into a separate field, as in the Solr approach (which is still possible
with the new LUCAS).
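
The general idea of merging by position increment can be sketched as
follows. This is plain Python showing the concept only, not the LUCAS
implementation; the function names and the sample offsets are
assumptions.

```python
def merge_streams(tokens, annotations):
    """Merge two lists of (term, begin, end) into one stream for one field.

    Each output item is (term, position_increment). An item that starts at
    the same character offset as the previous item gets increment 0, so it
    shares that item's position in the field.
    """
    # sorted() is stable, so a token precedes an annotation with equal offsets.
    merged = sorted(tokens + annotations, key=lambda item: (item[1], item[2]))
    stream = []
    previous_begin = None
    for term, begin, _end in merged:
        increment = 0 if begin == previous_begin else 1
        stream.append((term, increment))
        previous_begin = begin
    return stream

def positions(stream):
    """Turn position increments into absolute positions, starting at 0."""
    result, position = [], -1
    for term, increment in stream:
        position += increment
        result.append((term, position))
    return result

# Offsets refer to the text "IBM opened a lab in New York".
tokens = [("ibm", 0, 3), ("opened", 4, 10), ("a", 11, 12), ("lab", 13, 16),
          ("in", 17, 19), ("new", 20, 23), ("york", 24, 28)]
annotations = [("COMPANY", 0, 3), ("LOCATION", 20, 28)]
stream = merge_streams(tokens, annotations)
```

In the resulting stream "COMPANY" sits at the same position as "ibm" and
"LOCATION" at the same position as "new", so a positional query can mix
words and annotation labels within a single field.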

At its heart, from the user's perspective, is a flexible XML-based
"mapping configuration file" in which the user can specify which UIMA
annotations should be put into which Lucene field, and how this field
is set up (e.g. TOKENIZED or Stored). In addition, some basic
functionality for hypernym indexing is provided. A sample mapping file
is appended to illustrate this.
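
The mapping idea itself can be sketched without the real file format.
The structure below is purely hypothetical Python data, not the LUCAS
XML format; the field names, annotation types and settings are
assumptions for illustration.

```python
# Hypothetical field setup: how each target field is configured.
FIELDS = {
    "documenttext": {"tokenized": True, "stored": False},
    "title": {"tokenized": True, "stored": True},
}

# Hypothetical mapping: which annotation type goes into which field.
ANNOTATION_TO_FIELD = {
    "Token": "documenttext",
    "Company": "documenttext",
    "TitleToken": "title",
}

def build_document(annotations, annotation_to_field, fields):
    """Group annotation values by target field; undeclared types are skipped."""
    document = {}
    for annotation_type, value in annotations:
        field_name = annotation_to_field.get(annotation_type)
        if field_name is None:
            continue  # this type is not declared in the mapping
        entry = document.setdefault(
            field_name, {"settings": fields[field_name], "values": []})
        entry["values"].append(value)
    return document

annotations = [("TitleToken", "lucas"), ("Token", "ibm"),
               ("Company", "COMPANY"), ("PartOfSpeech", "NN")]
document = build_document(annotations, ANNOTATION_TO_FIELD, FIELDS)
```

Only the declared types reach the document, so changing what gets
indexed means editing the mapping data rather than the indexing code.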

What we lack at the moment is thorough documentation of the code and,
more critically, a DTD describing the mapping (we will try to deliver
this as soon as possible).

By putting this into the sandbox, we hope the UIMA community will
embrace this tool and help to develop it further. Any immediate feedback
will be very welcome!

Best wishes,
Rico Landefeld
Joachim Wermter


Roberto Franchini | 9 Dec 13:23

Re: Lucene cas consumer

On Tue, Dec 9, 2008 at 9:51 AM, Joachim Wermter
<Joachim.Wermter@...> wrote:
> Dear UIMA-Users,
>
[cut]
>
> By putting this into the sandbox, we hope the UIMA community will
> embrace this tool and help to develop it further. Any immediate feedback
> will be very welcome!
>

I hope to see it in the sandbox soon. The mapping file seems to be very useful.
When are you planning to release?
Is it possible to download a preview from your site (julie.de, is that right?)?
Thanks in advance, best regards,
Roberto

-- 
Roberto Franchini
http://www.celi.it
http://www.blogmeter.it
http://www.memesphere.it
Tel +39-011-6600814
jabber:ro.franchini@... skype:ro.franchini

Rico Landefeld | 10 Dec 18:16

Re: Lucene cas consumer

>
> I hope to see it soon in the sandbox. The mapper file seems to be very useful.
> When are you planning to release?
> Is it possible to download a preview from your site (julie.de, is it right?).
> Thanks in advance, best regards,
> Roberto
>
>   
We are doing our best, but the code and especially the mapping file
format are largely undocumented, so we have to add the documentation
first. We will try to release Lucas by the middle of January, maybe
sooner.

Regards,
Rico Landefeld

-- 
------------------------------------------
Rico Landefeld
Jena University Language and Information Engineering (JULIE) Lab
+49-3641-9 44324
http://www.julielab.de

Olivier Terrier | 5 Dec 09:44

RE: Lucene cas consumer

Hi all

We at Temis have also built a prototype integration of Lucene and UIMA
as a proof of concept. More precisely, we have written a Solr CAS
consumer. Solr (http://lucene.apache.org/solr/) is a Lucene subproject
that provides a kind of indexing server layer on top of Lucene.

The idea was to be able to index documents using a UIMA processing
chain with both full text and entities based on UIMA annotations.
Moreover, Solr provides support for 'faceted search', which can be
based on annotations. Suppose you have a UIMA type system that defines
annotations like Person, Company, Location, etc. You can easily index
these entities into a Lucene index using the Solr Java API.

In the prototype we also used a Solr contribution (not yet integrated
into the trunk) named solr-ui, available here:
https://issues.apache.org/jira/browse/SOLR-634
It provides a simple UI for searching your indexed documents using a
combination of full text and facets (see the attached screenshot).

Of course, our Solr consumer is for now a very basic piece of code: for
example, it is tightly coupled to our own type system. But we would be
more than happy to collaborate with the community on this subject if
there is interest.
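
For readers unfamiliar with faceted search over annotations, here is a
minimal sketch of the idea in plain Python. It is not the Solr API and
not our consumer; the documents, field names and functions are
assumptions made up for the example.

```python
from collections import Counter

# Each document has full text plus entity fields derived from annotations.
docs = {
    "doc1": {"text": "IBM opens a lab in Paris",
             "Company": ["IBM"], "Location": ["Paris"]},
    "doc2": {"text": "Temis opens an office in Paris",
             "Company": ["Temis"], "Location": ["Paris"]},
    "doc3": {"text": "IBM closes a lab in Zurich",
             "Company": ["IBM"], "Location": ["Zurich"]},
}

def search(docs, keyword):
    """Full-text part: ids of documents whose text contains the keyword."""
    keyword = keyword.lower()
    return [doc_id for doc_id, doc in docs.items()
            if keyword in doc["text"].lower().split()]

def facet_counts(docs, hits, facet_field):
    """Facet part: count, per entity value, how many hits contain it."""
    counts = Counter()
    for doc_id in hits:
        counts.update(set(docs[doc_id].get(facet_field, [])))
    return dict(counts)

def refine(docs, hits, facet_field, value):
    """Narrow the hits to documents that carry the chosen facet value."""
    return [doc_id for doc_id in hits
            if value in docs[doc_id].get(facet_field, [])]

hits = search(docs, "opens")
```

A UI can show `facet_counts(docs, hits, "Company")` next to the result
list and call `refine` when the user clicks one of the values, which is
the combination of full text and facets described above.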

Regards

Olivier Terrier
Temis

> -----Original Message-----
> From: Niels Ott [mailto:nott@...]
> Sent: Thursday, 4 December 2008 19:37
> To: uima-user@...
> Cc: Roberto Franchini
> Subject: Re: Lucene cas consumer

Greg Holmberg | 4 Dec 19:12

Re: Lucene cas consumer

Roberto--

It does seem like there should be a close relationship between the two groups.

I don't know much about Lucene--can you educate me?  For example, have
you given any thought to what to do with UIMA annotations?  From what
little I've read about Lucene, they seem to have a thing called a
document analyzer, but they don't mean the same thing we mean by
analysis in the NLP community.  They appear to mean something more like
"tokenizer".  So I haven't yet found a place to put UIMA annotations,
for example named entities or parts of speech.  I'm wondering if Lucene
needs a major feature enhancement before it's truly useful with UIMA?

What are your thoughts on how to integrate the two?  What functionality is possible?

Greg Holmberg

 -------------- Original message ----------------------
From: "Roberto Franchini" <ro.franchini@...>
> Hi,
> I'm going to write a Lucene CAS consumer. The purpose is to create a
> Lucene document, or more than one, for each CAS.
> Last year (2007) the Jena University lab (JULIE Lab, is that right?)
> delivered such a component, named LUCAS. Then it disappeared.
> LUCAS seems a good piece of software.
> The Technische Universität Darmstadt developed one too:
> http://www.ukp.tu-darmstadt.de/projects/dkpro/ (I will write to
> them).
> 
> Is anybody interested in sharing knowledge and/or code to build that component?
> I think that Lucene and UIMA can be very good friends :)
> 

Dan McCreary | 4 Dec 20:32

Re: Lucene cas consumer

Hello,

I am somewhat new to UIMA so I apologize if I misunderstand some things.
But this is a very interesting question for me.

I see Lucene as a very widely adopted but *Java-only framework* of tools for
building and maintaining keyword *indexes* on many types of documents.
Lucene also has great support for Hadoop and MapReduce-type scalability.  But
Lucene is also designed to work with many front-end tools like the POI libraries
to extract text from Microsoft Word, Excel, PowerPoint, etc.

I see Apache UIMA as a general-purpose *analytic pipeline architecture* with
the strengths of a very advanced common in-memory processing model.

I think there is a huge win-win for both projects if we can make UIMA enrich
text documents with entities before they are indexed by Lucene and also make
these tools much easier to install and work together.  You should not have
to be a Java developer just to install these tools and have them index and
search our file systems.

I have spent many hours trying to get UIMA to work without success.  Perhaps
it has to do with trying to get it to work on a 64 bit Vista....  :-O

- Dan

On Thu, Dec 4, 2008 at 12:12 PM, Greg Holmberg <holmberg2066 <at> comcast.net> wrote:

> Roberto--
>
> It does seem like there should be a close relationship between the two

Marshall Schor | 5 Dec 19:32

Re: Lucene cas consumer


Dan McCreary wrote:
> Hello,
>
> I am somewhat new to UIMA so I apologize if I misunderstand some things.
> But this is a very interesting question for me.
>
> I see Lucene as a very widely adopted but *Java-only framework* of tools for
> building and maintaining keyword *indexes* on many types of documents.
> Lucene also has great support for Hadoop and MapReduce-type scalability.  But
> Lucene is also designed to work with many front-end tools like the POI libraries
> to extract text from Microsoft Word, Excel, PowerPoint, etc.
>
> I see Apache UIMA as a general-purpose *analytic pipeline architecture* with
> the strengths of a very advanced common in-memory processing model.
>
> I think there is a huge win-win for both projects if we can make UIMA enrich
> text documents with entities before they are indexed by Lucene and also make
> these tools much easier to install and work together.  You should not have
> to be a Java developer just to install these tools and have them index and
> search our file systems.
>
> I have spent many hours trying to get UIMA to work without success.  Perhaps
> it has to do with trying to get it to work on a 64 bit Vista....  :-O
>   
We have UIMA running on 64-bit Linux systems.  Please consider starting
another thread about the issues you hit getting it working on 64-bit
Vista - that could be quite useful to the community.

-Marshall

