4 Dec 19:36
Re: Lucene cas consumer
Hi all,

I'm using both Lucene and UIMA in one project. Lucene is primarily an information retrieval API. It provides a framework and default implementations for analyzing several languages. Analyzing here means tokenization, stop-word removal, etc. Furthermore, it brings the key functionality to build an inverted index and to search it. Lucene can be extended easily. E.g. one can implement an analyzer that does lemmatization or that looks up synonyms in WordNet and adds them to the index.

What Lucene cannot do - or at least not without a lot of hacking - is aggregate analyses the way UIMA does with the CAS. Usually your knowledge grows during a UIMA-based NLP pipeline: you add a token annotation, a lemma annotation, a POS annotation and so on... In Lucene, you have the classical pipeline: the output of each step replaces its input. (Yes, by subclassing Lucene's "Token" class, one can work around the issue, but it is not elegant at all.)

What makes Lucene + UIMA interesting for me is a simple fact: I can do all the NLP I want and be as flexible as I need in UIMA. Then I can feed the outcome (or rather: a small part of it) into a Lucene index. In my particular case, I'm not using a CAS Consumer, but I can imagine other people would appreciate one in their application scenarios.

To conclude: Lucene and UIMA aren't competitors, but in some cases having one feeding the other is what you want.
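To make the "knowledge grows" point concrete, here is a minimal sketch in plain Python. It uses neither the UIMA nor the Lucene API: ToyCas, Annotation, the LEMMAS lookup table and index_lemmas are all hypothetical names made up for illustration. It only shows the idea that every step adds annotations next to the existing ones, and that just one layer (the lemmas) ends up in an inverted index.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str    # annotation layer, e.g. "token" or "lemma"
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    value: str


@dataclass
class ToyCas:
    """Toy stand-in for a CAS (not the UIMA class): text plus all annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        # Return the annotations of one layer, in insertion order.
        return [a for a in self.annotations if a.kind == kind]


# Hypothetical lemma lookup; a real pipeline would use a proper lemmatizer.
LEMMAS = {"builds": "build", "indexes": "index",
          "aggregates": "aggregate", "annotations": "annotation"}


def tokenize(cas):
    # Whitespace tokenizer: adds one "token" annotation per word.
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        end = begin + len(word)
        cas.annotations.append(Annotation("token", begin, end, word))
        pos = end


def lemmatize(cas):
    # Adds a "lemma" annotation over each token; the tokens stay in place.
    for tok in cas.select("token"):
        lower = tok.value.lower()
        lemma = LEMMAS.get(lower, lower)
        cas.annotations.append(Annotation("lemma", tok.begin, tok.end, lemma))


def index_lemmas(index, doc_id, cas):
    # Feed only a small part of the analysis (the lemmas) into the index.
    for lemma in cas.select("lemma"):
        index[lemma.value].add(doc_id)


docs = {1: "Lucene builds indexes",
        2: "UIMA aggregates annotations and Lucene indexes them"}
index = defaultdict(set)   # toy inverted index: term -> set of document ids
cases = {}
for doc_id, text in docs.items():
    cas = ToyCas(text)
    tokenize(cas)
    lemmatize(cas)
    cases[doc_id] = cas
    index_lemmas(index, doc_id, cas)
```

After the loop, each ToyCas still holds its token layer next to the lemma layer, while the index only knows about lemmas. In a real setup the last step would write to a Lucene index instead of a dict.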