[PROPOSAL] UIMA (Unstructured Information Management Architecture) Framework

Marshall Schor Thu, 21 Sep 2006 06:54:12 -0700

Hi everyone. I'm restarting the UIMA Proposal thread based on the comments so far, with a revised proposal that more closely follows http://incubator.apache.org/guides/proposal.html. The first paragraph was rewritten to more clearly state what the proposal was, in plainer language. It is also slightly updated, reflecting the submission of UIMA to OASIS for standardization work.


Abstract:

UIMA is a component framework for the analysis of unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++.



Proposal:  Unstructured Information Management Architecture framework

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. We propose UIMA, a framework and SDK for developing such applications. An example UIMA application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
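To make the decomposition above concrete, here is a minimal sketch in Python (one of the languages the proposal lists bindings for). It is not the UIMA API: the names Document, Component and run_pipeline are hypothetical, a plain dict stands in for the XML descriptor, and the two components use toy heuristics purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Document:
    """Hypothetical shared data object: the raw text plus results added so far."""
    text: str
    results: dict = field(default_factory=dict)


class Component(Protocol):
    """Hypothetical interface that every component must implement."""
    metadata: dict  # stands in for the XML descriptor: name, inputs, outputs

    def process(self, doc: Document) -> None: ...


class LanguageIdentifier:
    metadata = {"name": "language identification",
                "inputs": [], "outputs": ["language"]}

    def process(self, doc):
        # Toy heuristic, for illustration only.
        padded = f" {doc.text.lower()} "
        doc.results["language"] = "en" if " the " in padded else "unknown"


class SentenceSplitter:
    metadata = {"name": "sentence boundary detection",
                "inputs": ["language"], "outputs": ["sentences"]}

    def process(self, doc):
        # Naive split on full stops; real segmentation is language specific.
        doc.results["sentences"] = [s.strip() for s in doc.text.split(".") if s.strip()]


def run_pipeline(components, doc):
    """Run components in order, checking each one's declared inputs first."""
    for component in components:
        missing = [k for k in component.metadata["inputs"] if k not in doc.results]
        if missing:
            raise ValueError(f"{component.metadata['name']} is missing inputs: {missing}")
        component.process(doc)
    return doc
```

Because each component declares its inputs and outputs, the runner can reject a badly ordered pipeline before any analysis runs, which is one reason self-describing metadata is useful when components come from different authors.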

This framework has already attracted a following among government, commercial, and academic institutions who previously developed analysis algorithms, but were unable to easily build on each other's work, and who want to be able to evolve their applications by independently upgrading parts, as better technology becomes available. Applications built with this framework are being used with plain text, audio streams, and image/video streams, identifying entities and relations, converting speech to text, translating into different languages, and determining properties of images.

The UIMA framework runs components in a flow, passing a common data object containing unstructured information (free text, audio, video, etc.) through the components. Each component examines the unstructured information and data added by other components, and adds data of its own. The framework mandates a standardized form of the data being passed, and a standardized form of the interfaces to the components. We propose a project to develop, implement, support and enhance this framework and, over time, other implementations that comply with the UIMA standard (which has been submitted for standardization work within OASIS, http://www.oasis-open.org). Members of this community are encouraged to participate in that effort, as well; OASIS has an open approach to granting Technical Committee voting rights to members of OASIS, described here: http://www.oasis-open.org/committees/process.php#2.4.
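The idea of a common data object that each component reads and enriches can be sketched as follows. This is an illustrative Python model only, under the assumption that added data is stored as typed spans over the original text; AnalysisData, Annotation, tokenize and tag_capitalized are hypothetical names, not part of the UIMA framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A typed span over the original text, given as character offsets."""
    type: str
    begin: int
    end: int


@dataclass
class AnalysisData:
    """Hypothetical common data object: the text stays fixed, annotations accumulate."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_, begin, end):
        self.annotations.append(Annotation(type_, begin, end))

    def select(self, type_):
        return [a for a in self.annotations if a.type == type_]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def tokenize(data):
    """First component: adds one Token annotation per whitespace-separated word."""
    pos = 0
    for word in data.text.split():
        begin = data.text.index(word, pos)
        end = begin + len(word)
        data.add("Token", begin, end)
        pos = end


def tag_capitalized(data):
    """Second component: reads Token annotations and adds Name candidates."""
    for token in data.select("Token"):
        if data.covered_text(token)[0].isupper():
            data.add("Name", token.begin, token.end)
```

The second component never re-tokenizes the text; it relies entirely on what the first one left in the shared object. Agreeing on the form of that shared data is what lets independently written components build on each other's output.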

The proposal includes both the framework and tools to develop, describe, compose and deploy UIMA-based components and applications. The initial work will be based on the UIMA Version 2 framework code developed by IBM; snapshots of each release of this code are currently made available on http://sourceforge.net/projects/uima-framework. The SourceForge versions would be stabilized in maintenance mode, if we are successful in moving to Apache. The framework is not specific to any IDE or platform, and does not depend on other middleware.

Background:

Databases are core components of nearly all applications; they store information in structured tables. But more and more of the available digital data is unstructured (e.g. email, web documents, images, audio clips, video streams) with little information (metadata) attached to explain its content or context. Although many applications have been built to process unstructured data, they have either managed it as a BLOB or they have developed isolated applications for analyzing the content. In the absence of a standardized means for analytical applications to share insights extracted from the content, analytical applications cannot build upon one another. As a result, the industry has barely begun to tap the value locked in unstructured information.

Standardization is key to achieving component interoperability, with capabilities to mix components developed in different places and in Java, C++ and other languages. The Unstructured Information Management Architecture defines standards for component interoperability and application composition that will provide this needed unifying standard, and allow a variety of framework implementations to exist, while preserving the goal of unstructured information analytic component reuse.

UIMA was built to help developers create solutions that get more value from unstructured information more quickly and at lower cost by making it easy to reuse and combine analytic modules from different sources into new analytic applications. The architecture and the framework have been validated through work with the USA's DARPA, which is using it as a standard for key projects with several universities involved in advanced linguistics analysis, such as Carnegie Mellon, Columbia, Stanford and University of Massachusetts. Other organizations, such as the Mayo Clinic and Sloan Kettering, are also building efforts around UIMA. In addition, over 15 software vendors, including companies such as Inxight, Attensity, ClearForest, Temis, SPSS, SAS, Cognos, Endeca, Factiva and others, announced plans to support UIMA.

The UIMA framework (binary and/or source code) has been downloaded over 8000 times from IBM alphaWorks (http://www.alphaworks.ibm.com/tech/uima) or SourceForge (http://uima-framework.sourceforge.net).


Rationale:

We believe that moving the UIMA framework development to the Apache development community will lead to faster innovation, better integration with other open source software, and broader adoption of UIMA, accelerating the industry's ability to get the most value from text, audio, and video content. The UIMA framework is becoming attractive to developers who want to build components; we believe that having UIMA on Apache will encourage the development of a basic set of open source components that will jumpstart these developers' efforts. One of the first components we see possible synergy with is a search component based on Apache Lucene that would enable semantic search. We like the concept of the Lucene Sandbox as a way to encourage innovation around UIMA, and would envision something similar for this project.



Initial Goals:

Some initial work we see in the incubator includes the following:

* redoing the parts of the tooling that were done as derivative works of Eclipse source code, to enable everything to be licensable under the Apache license

* extending the framework to better support "scale-out"

* extending the framework to better align with the emerging UIMA Standards work

* extending the framework to support XMI-based SOAP and/or other service interfaces

* extending the framework to support OSGi-based approaches to componentization and packaging

* exploring embeddings of the framework within other interested Apache projects, including synergies with Lucene

* providing aids to the community to migrate from previous versions of the framework to the Apache version

* setting up community support: hosting a facility similar to the Lucene Sandbox to encourage innovation and experimentation; establishing a wiki and some process to allow better documentation to be developed by the community, and linking our existing XHTML documentation via an XSL transform to Apache FOP

Current Status:

* Meritocracy:

Meritocracy seems to us an ideal way to grow the community of developers around UIMA, it being a controlled, rational way to give those who contribute positively more ability to contribute directly. This approach also gives contributors one of the best reasons to join the community of volunteers: to be recognized for the merit of their contributions.


* Community:

Currently, the UIMA Framework development is being done by IBM, with input from a group of early adopters in industry and government. Going forward, we see IBM continuing to support several committers working on it. We have already begun talking with other people outside of IBM who have expressed interest in contributing towards the development. This includes members of academic institutions, people working for some of the software vendors that have announced plans to support UIMA, and others from companies that have expressed interest since initial announcements about our open source plans. Multiple non-IBM people have already expressed desires to become committers.


* Core Developers:

The previous core developers of UIMA are Adam Lally, Thilo Goetz, Marshall Schor, Edward Epstein, Jaroslaw Cwiklik and Thomas Hampp. Many others have also contributed. The developers come from both the Research and Development parts of IBM.


* Alignment:

UIMA has significant synergy with search applications, and we expect to see integration with Lucene in the future. UIMA makes use of the Apache Portable Runtime (APR) for C++ support. It is designed to be embeddable into other frameworks, such as web application servers. Part of UIMA is Eclipse-based tooling. We use ANT for build scripting. UIMA has support for various language bindings including C++ and Java; we also have more limited bindings for Perl, Python, and TCL. UIMA uses Web Services as part of its approach to wiring up components in its domain. It makes use of XML services such as Xerces and Xalan.

The development of UIMA has been based on merit with open discussion among a distributed team of developers, from both Research and Development organizations.

* License:

The current license for the source code is CPL, with a small number of files licensed under the EPL (Eclipse Public License), because these were created as "derivative works" of existing Eclipse open source code. When the code base is moved to Apache, it will be relicensed under the Apache license, except for the small number of files licensed under the EPL as derivative works of Eclipse source files. We plan to work in the incubator to redo these parts, so the entire offering can be licensed under the Apache license.

The distribution for the C++ enablement layer includes the open source component ICU (a Unicode package), which has its own license. We plan to work with the community to properly make use of this non-Apache licensed component.

Our current vision for the future of UIMA has it aligning with and incorporating other standards-based open source components/protocols, some of which may have licensing other than the Apache license (for example, the XML Metadata Interchange (XMI), and the EMF ECore Model from Eclipse); we will work with the community in figuring out how to move forward on this.

* Other IP:

When we requested OASIS to set up a Technical Committee chartered to develop a platform-independent specification for text and multi-modal analysis, we specified that it be set up under the "RF on Limited Terms" mode of the OASIS IP Policy. "RF" means Royalty Free, and "Limited Terms" means that companies working with us on the Technical Committee are restricted in adding additional terms. These are the most liberal terms and make any Essential Claims available to ALL and ROYALTY FREE.

For the details please refer to:
- http://www.oasis-open.org/who/ipr/ipr_faq.php
- http://www.oasis-open.org/who/intellectualproperty.php

Ultimately of course, there is always a risk that someone in the world holds a patent that can be claimed as Essential. The most any standards organization can do is govern the behavior of those who participate in its work and publicly document the licensing commitment of all participants.


Known Risks:

* Orphaned Software:

UIMA has been in active development for 5 years. The community of users has steadily grown, and there are now significant commercial and research organizations actively using it. UIMA is embedded in IBM software products and is delivered through IBM services engagements. IBM has developers assigned to it, and is continuing to support its development. In addition, several people outside of IBM have already expressed interest in working on UIMA, and have been providing IBM with initial feedback. One of the objectives of starting this Apache project is to provide a meritocratic structure for those people to begin more actively contributing to UIMA.


* Inexperience with Open Source:

The individuals working on this software have background as IBM software developers. While many of them have experience working with open source software, none of them has had extensive experience contributing to other open source software. However, IBM as an organization has extensive experience contributing to open source projects and will make available resources to provide guidance to the developers working on this project.


* Homogenous Developers: (work for same company?)

Currently all the developers work for IBM, although they come from different geographically dispersed organizations within IBM. We will reach out during the incubation time to get others to contribute; we have already received interest from several parties.


* Reliance on salaried developers:
Currently the developers are paid employees of IBM.

* Relationships with Other Apache Products:

We make use of several Apache components: SOAP / Web Services, XML (Xerces, Xalan), languages (Perl), scripting languages (ANT), and the Apache Portable Runtime. In addition, UIMA has been embedded in other frameworks, such as web application servers, and integrated with search engines. We are exploring Lucene extensions that could take advantage of UIMA processed data. We are currently investigating and prototyping some software packaging concepts based on OSGi; the Apache Incubator project Felix may have relevance as we go forward. The documentation is being moved to XHTML, and we plan to use Apache FOP for producing PDF reference materials.


* An Excessive Fascination with the Apache Brand

UIMA is already being adopted by a wide cross section of users, both commercial and academic, world-wide. Our experience shows that analytic modules can be reused and combined through UIMA, making it easier and faster for developers to build new analytic applications for specific industries or domains. Given the diversity of content and analytics that will be required to address the multitude of opportunities, from military intelligence to quality assurance to contact center analytics, growing this infrastructure so that it better aligns with other major Open Source communities should help accelerate industry's ability to get value from content assets.

We believe that the Apache community of developers has the experience, background, visibility, and synergistic resources to encourage and foster a vibrant developer community around this project.



Documentation:

There is a combined Introduction, Conceptual Overview, Tutorial, Tools and Framework User's Guides and References, downloadable from http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference_2.0.pdf



Scope of the project:

The project will develop implementations of the UIMA architecture (which is concurrently being submitted to the OASIS standards process), supporting the breadth of platforms that developers working in this field are using, including Java, C++, Perl, Python and TCL; and utilities and tooling to support component and application developers and assemblers / packagers. It will initially include the Java UIMA framework for UIMA Version 2 (you can see a snapshot of the Version 2 release on SourceForge; the delivered code would be this code base plus normal incremental bug fixes and improvements), plus additional components (mainly documentation and test cases, which are not currently on SourceForge). Over time, the project is expected to grow to include supporting various embeddings and integrations with other Apache components such as search engines and web application frameworks. Over time, we envision the project becoming an umbrella for related open-source work around UIMA, including things like open-source pre-annotated corpora, and hosting a facility similar to the Lucene Sandbox to encourage innovation and experimentation.

The UIMA framework is primarily a set of libraries (in Java, C++, Perl, etc.), test cases, and UIMA utilities and tools (scripts, plugins, executables, etc.) used to build, test and debug UIMA analytic components. The tooling includes several Eclipse platform plugins.


* Initial source

The source currently is maintained in IBM internal software control systems, with a copy of each release placed on SourceForge. At the time of launch, we plan to contribute the latest version of the code base (with some renaming of package prefixes to reflect apache.org), test cases, build files, and documentation, under the terms specified in the ASF Corporate Contributor License. We plan to donate the existing C++ enablement layer and the support for Perl, Python, and TCL a few months later than the initial donation; this delay is to give us time to finish preparing that code base for Open Source.


* ASF resources to be created
Mailing lists:
   * uima-dev
   * uima-commits
   * uima-user (we already have a substantial user community and expect them to turn up at Apache soon after we've hopefully been accepted into the incubator)

For other resources such as Subversion repository, JIRA etc. we hope for guidance from our mentors.


* Initial Set of Committers
Michael Baessler ([EMAIL PROTECTED])
Edward Epstein ([EMAIL PROTECTED])
Thilo Goetz  ([EMAIL PROTECTED])
Adam Lally  ([EMAIL PROTECTED])
Marshall Schor ([EMAIL PROTECTED])

* Sponsor:

We are requesting the Incubator to sponsor this. Our current vision is that it will become a top level project (other projects that develop UIMA components could become subprojects, for instance).


* Mentors:
Sam Ruby ([EMAIL PROTECTED])
Ken Coar ([EMAIL PROTECTED])
Ian Holsman ([EMAIL PROTECTED])

* Section 6: Open Issues for Discussion

-Marshall Schor
