Proposal for a new incubation project: Unstructured Information Management Architecture - UIMA

Marshall Schor Wed, 23 Aug 2006 12:22:27 -0700

Hello everyone - 

My colleagues and I are submitting this proposal to the community for a 
new project in the incubator, and look forward to starting to work with 
this community.


Proposal for Incubation Project: Unstructured Information Management 
Architecture - UIMA

The Unstructured Information Management Architecture (UIMA) is an 
architecture and software framework for creating, discovering, composing 
and deploying a broad range of multi-modal analysis capabilities.  We 
propose a project to develop, implement, support and enhance UIMA 
framework implementations that comply with the UIMA standard. That 
standard is being put forward concurrently for standardization within 
OASIS (http://www.oasis-open.org); it has not yet been submitted, but we 
plan to do so early in September.

The proposal includes both a UIMA framework and tools to develop, 
describe, compose and deploy UIMA-based components and applications. The 
initial work will be based on the UIMA Version 2 framework code developed 
by IBM; snapshots of each release of this code are currently made 
available on http://sourceforge.net/projects/uima-framework. The 
SourceForge versions would be stabilized in maintenance mode, if we are 
successful in moving to Apache. 

The framework provides a run-time environment in which developers can plug 
in and run their UIMA component implementations and with which they can 
build and deploy UIM applications. The framework is not specific to any 
IDE or platform. 
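
As a rough illustration of the plug-in idea described above, here is a 
minimal sketch in Python. It is not the UIMA API: the names Document, 
Annotation, TokenAnnotator, CapitalizedWordAnnotator and run_pipeline are 
all hypothetical, chosen only to show independently written analysis 
components sharing their results through a common structure so that one 
component can build on another's output.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical, not a UIMA type)."""
    kind: str
    begin: int
    end: int


@dataclass
class Document:
    """Shared structure that every component reads from and writes to."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


class TokenAnnotator:
    """First component: marks each run of word characters as a token."""

    def process(self, doc):
        for match in re.finditer(r"\w+", doc.text):
            doc.annotations.append(Annotation("token", match.start(), match.end()))


class CapitalizedWordAnnotator:
    """Second component: builds on the tokens left by an earlier component."""

    def process(self, doc):
        # Copy the token list first so appending below does not disturb iteration.
        tokens = [a for a in doc.annotations if a.kind == "token"]
        for tok in tokens:
            if doc.covered_text(tok)[0].isupper():
                doc.annotations.append(Annotation("capitalized", tok.begin, tok.end))


def run_pipeline(components, text):
    """Stand-in for a framework runtime: runs plugged-in components in order."""
    doc = Document(text)
    for component in components:
        component.process(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline(
        [TokenAnnotator(), CapitalizedWordAnnotator()],
        "IBM proposes UIMA to Apache",
    )
    print([result.covered_text(a) for a in result.annotations if a.kind == "capitalized"])
```

The second component never looks at how tokens were produced; it only 
relies on the shared annotation structure, which is the kind of reuse the 
proposal is aiming for.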

Motivation for UIMA: Databases are core components of nearly all 
applications; they store information in structured tables.  But more and 
more of the available digital data is unstructured (e.g. email, web 
documents, images, audio clips, video streams) with little information 
(metadata) attached to explain its content or context.  Although many 
applications have been built to process unstructured data, they have 
either managed it as a BLOB or they have developed isolated applications 
for analyzing the content.  In the absence of a standardized means for 
analytical applications to share insights extracted from the content, 
analytical applications cannot build upon one another. As a result, the 
industry has barely begun to tap the value locked in unstructured 
information.

Standardization is key to achieving component interoperability, with 
capabilities to mix components developed in different places and in Java, 
C++ and other languages.  The Unstructured Information Management 
Architecture defines standards for component interoperability and 
application composition that will provide this needed unifying standard, 
and allow a variety of framework implementations to exist, while 
preserving the goal of unstructured information analytic component reuse.

This project provides both:
*  UIMA frameworks that provide runtime environments into which the 
developers can plug in and run their UIMA component implementations and 
*  Tooling for the development, description, composition and deployment of 
UIMA components and applications. 

It will follow and conform to the emerging work on the UIMA standard being 
proposed as a new standards effort to the OASIS standards organization; we 
expect to submit this proposal to OASIS in early September.  OASIS has an 
open approach to granting Technical Committee voting rights to members of 
OASIS, described here: 
http://www.oasis-open.org/committees/process.php#2.4

UIMA was built to help developers create solutions that get more value 
from unstructured information more quickly and at lower cost by making it 
easy to reuse and combine analytic modules from different sources into new 
analytic applications. The architecture and the framework have been 
validated through work with the U.S. agency DARPA, which is using it as a 
standard for key projects with several universities involved in advanced 
linguistics analysis, such as Carnegie Mellon, Columbia, Stanford and the 
University of Massachusetts.  Other organizations, such as the Mayo Clinic 
and Sloan Kettering, are also building efforts around UIMA.  In addition, 
over 15 software vendors, including companies such as Inxight, Attensity, 
ClearForest, Temis, SPSS, SAS, Cognos, Endeca, Factiva and others, have 
announced plans to support UIMA.

The UIMA framework (binary and/or source code) has been downloaded over 
8000 times from IBM alphaWorks (http://www.alphaworks.ibm.com/tech/uima) 
or SourceForge  (http://uima-framework.sourceforge.net). 

We believe that moving the UIMA framework development to the Apache 
development community will lead to faster innovation, better integration 
with other open source software, and broader adoption of UIMA, 
accelerating the industry's ability to get the most value from text, 
audio, and video content. The UIMA framework is becoming attractive to 
developers who want to build components; we believe that having UIMA on 
Apache will encourage the development of a basic set of open source 
components that will jumpstart these developers' efforts. One of the first 
components we see possible synergy with is a search component based on 
Apache Lucene that would enable semantic search.  We like the concept of 
the Lucene Sandbox as a way to encourage innovation around UIMA, and would 
envision something similar for this project.

Some of the initial work we see in the incubator includes the following:
* redoing the parts of the tooling that were done as derivative works of 
Eclipse source code, to enable everything to be licensable under the 
Apache license
* extending the framework to better support "scale-out"
* extending the framework to better align with the emerging UIMA Standards 
work
* extending the framework to support XMI-based SOAP and/or other service 
interfaces
* extending the framework to support OSGi-based approaches to 
componentization and packaging
* exploring embeddings of the framework within other interested Apache 
projects, including synergies with Lucene
* providing aids to the community to migrate from previous versions of the 
framework to the Apache version
* setting up community support: hosting a facility similar to the Lucene 
Sandbox to encourage innovation and experimentation; establishing a wiki 
and some process to allow better documentation to be developed by the 
community, and linking our existing XHTML documentation via an XSL 
transform to Apache FOP

*  Section 0.1 : Criteria

*  Community: 
Currently, the UIMA Framework development is being done by IBM, with input 
from a group of early adopters in industry and government.  Going forward, 
we see IBM continuing to support several committers working on it.  We 
have already begun talking with people outside of IBM who have expressed 
interest in contributing to the development.  These include members of 
academic institutions, people working for some of the software vendors 
that have announced plans to support UIMA, and others from companies that 
have expressed interest since the initial announcements about our open 
source plans.  Several people outside IBM have already expressed a desire 
to become committers.

*  Core Developers: 
The previous core developers of UIMA are Adam Lally, Thilo Goetz, Marshall 
Schor, Edward Epstein, Jaroslaw Cwiklik and Thomas Hampp.   Many others 
have also contributed.

*  Alignment: 
UIMA has significant synergy with search applications, and we expect to 
see integration with Lucene in the future. UIMA makes use of the Apache 
Portable Runtime (APR) for C++ support.  It is designed to be embeddable 
into other frameworks, such as web application servers.  Part of UIMA is 
Eclipse-based tooling.  We use ANT for build scripting.   UIMA has support 
for various language bindings including C++ and Java; we also have more 
limited bindings for Perl, Python, and TCL.  UIMA uses Web Services as 
part of its approach to wiring up components in its domain.  It makes use 
of XML services such as Xerces and Xalan. 

*  License: 
The current license for the source code is CPL, with a small number of 
files licensed under the EPL (Eclipse Public License), because these were 
created as "derivative works" of existing Eclipse open source code.  When 
the code base is moved to Apache, it will be relicensed under the Apache 
license, except for the small number of files licensed under the EPL as 
derivative works of Eclipse source files.  We plan to work in the 
incubator to redo these parts, so the entire offering can be licensed 
under the Apache license.
The distribution for the C++ enablement layer includes the open source 
component ICU (a Unicode package), which has its own license.  We plan to 
work with the community to properly make use of this non-Apache licensed 
component. 
Our current vision for the future of UIMA has it aligning with and 
incorporating other standards-based open source components/protocols, some 
of which may have licensing other than the Apache license (for example, 
the XML Metadata Interchange (XMI), and the EMF ECore Model from Eclipse); 
we will work with the community in figuring out how to move forward on 
this. 

*  Orphaned Software:
UIMA has been in active development for 5 years.  The community of users 
has steadily grown, and there are now significant commercial and research 
organizations actively using it.  UIMA is embedded in IBM software 
products and is delivered through IBM services engagements. IBM has 
developers assigned to it, and is continuing to support its development. 
In addition, several people outside of IBM have already expressed interest 
in working on UIMA, and have been providing IBM with initial feedback. One 
of the objectives of starting this Apache project is to provide a 
meritocratic structure for those people to begin more actively 
contributing to UIMA. 

*  Experience With Open Source: 
The individuals working on this software have backgrounds as IBM software 
developers.  While many of them have experience working with open source 
software, none of them has had extensive experience contributing to other 
open source software.  However, IBM, as an organization, has extensive 
experience contributing to open source projects and will make resources 
available to provide guidance to the developers working on this project.

*  Homogeneous Developers: (work for same company?)
Currently all the developers work for IBM, although they come from 
different geographically dispersed organizations within IBM.  We will 
reach out during the incubation time to get others to contribute; we have 
already received interest from several parties.

*  Reliance on salaried developers: 
Currently the developers are paid employees of IBM.

*  No Ties to Other Apache Products: 
We make use of several Apache components: SOAP / Web Services, XML 
(Xerces, Xalan), languages (Perl), scripting languages (ANT), and the 
Apache Portable Runtime.  In addition, UIMA has been embedded in other 
frameworks, such as web application servers, and integrated with search 
engines.  We are exploring Lucene extensions that could take advantage of 
UIMA processed data.  We are currently investigating and prototyping some 
software packaging concepts based on OSGi; the Apache Incubator project 
Felix may have relevance as we go forward.  The documentation is being 
moved to XHTML, and we plan to use Apache FOP for producing PDF reference 
materials.

*  Achieving the Apache Brand is a Prominent Goal: 
UIMA is already being adopted by a wide cross section of users, both 
commercial and academic, world-wide. Our experience shows that analytic 
modules can be reused and combined through UIMA, making it easier and 
faster for developers to build new analytic applications for specific 
industries or domains. Given the diversity of content and analytics that 
will be required to address the multitude of opportunities - from military 
intelligence to quality assurance to contact center analytics - growing 
this infrastructure so that it better aligns with other major Open Source 
communities should help accelerate the industry's ability to get value 
from content assets. 
We believe that the Apache community of developers has the experience, 
background, visibility, and synergistic resources to encourage and foster 
a vibrant developer community around this project.

*  Section 1 : Scope of the project
The project will develop implementations of the UIMA architecture (which 
is concurrently being submitted to the OASIS standards process), 
supporting the breadth of platforms that developers working in this field 
are using, including Java, C++, Perl, Python and TCL; and utilities and 
tooling to support component and application developers and assemblers / 
packagers.  It will initially include the Java UIMA framework for UIMA 
Version 2 (you can see a snapshot of the Version 2 release on SourceForge; 
the delivered code would be this code base plus normal incremental bug 
fixes and improvements), plus additional components (mainly documentation 
and test cases, which are not currently on SourceForge).  Over time, the 
project is expected to grow to include support for various embeddings and 
integrations with other Apache components such as search engines and web 
application frameworks. 
Over time, we envision the project becoming an umbrella for related 
open-source work around UIMA, including things like open-source pre-annotated 
corpora, and hosting a facility similar to the Lucene Sandbox to encourage 
innovation and experimentation.
The UIMA framework is primarily a set of libraries (in Java, C++, Perl, 
etc.), test cases, and UIMA utilities and tools (scripts, plugins, 
executables, etc.) used to build, test and debug UIMA analytic components. 
 The tooling includes several Eclipse platform plugins.

*  Section 2 : Initial source from which the project is to be populated
The source currently is maintained in IBM internal software control 
systems.  At the time of launch, we plan to contribute the latest version 
of the code base (with some renaming of package prefixes to reflect 
apache.org), test cases, build files, and documentation, under the terms 
specified in the ASF Corporate Contributor License Agreement.  We plan to donate the 
existing C++ enablement layer and the support for Perl, Python, and TCL a 
few months later than the initial donation; this delay is to give us time 
to finish preparing that code base for Open Source.

*  Section 3: Identify the ASF resources to be created

*  Section 4: Identify the Initial Set of Committers
Michael Baessler ([EMAIL PROTECTED])
Edward Epstein ([EMAIL PROTECTED])
Thilo Goetz  ([EMAIL PROTECTED])
Adam Lally  ([EMAIL PROTECTED]) 
Marshall Schor ([EMAIL PROTECTED])

*  Section 5: Identify ASF Sponsor
*  Sponsor: 
We are requesting the Incubator to sponsor this.  Our current vision is 
that it will become a top level project (other projects that develop UIMA 
components could become subprojects, for instance).

*  Mentors: 
Sam Ruby ([EMAIL PROTECTED])
Ken Coar ([EMAIL PROTECTED])

*  Section 6: Open Issues for Discussion

-Marshall Schor  (msa at sign  schor dot com)
