[Vote] accept UIMA as a podling - #2

Ian Holsman Tue, 26 Sep 2006 16:21:02 -0700


issues addressed in this release:
1. updated proposal included
2. The first paragraph explains it to a layperson
3. OASIS issue addressed


[ ] +1 Accept UIMA as an Incubator podling
[ ]   0 Don't care
[ ] -1 Reject this proposal for the following reason:


----8<-------Proposal------8<------


Hello everyone -

We are submitting this proposal to the community for a
new project in the incubator, and look forward to starting to work with
this community.

This is a slightly modified and extended version of the proposal that has already been posted to [EMAIL PROTECTED]. The whole mail thread can be found [http://www.nabble.com/Proposal-for-a-new-incubation-project%3A-Unstructured-Information-Management-Architecture---UIMA-tf2154324.html here].

If you don't feel like reading the whole thread, the main question that came up was: this is all very well, but what does it really '''do'''? Attempts to answer that question were made [http://www.nabble.com/Re%3A-Proposal-for-a-new-incubation-project%3A-Unstructured-Information-Management-Architecture---UIMA-p5986403.html here] and [http://www.nabble.com/Re%3A-Proposal-for-a-new-incubation-project%3A-Unstructured-Information-Management-Architecture---UIMA-p5987788.html here]. We have since worked some of these into the proposal itself.


----

= Proposal for Incubation Project: Unstructured Information Management Architecture - UIMA =


== Abstract ==

UIMA is a component framework for the analysis of unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++.

== Proposal: Unstructured Information Management Architecture framework ==

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. We propose UIMA, a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example ''"language identification"'' -> ''"language specific segmentation"'' -> ''"sentence boundary detection"'' -> ''"entity detection (person/place names etc.)"''. Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

This framework has already attracted a following among government, commercial, and academic institutions who previously developed analysis algorithms but were unable to easily build on each other's work, and who want to be able to evolve their applications by independently upgrading parts as better technology becomes available. Applications built with this framework are being used with plain text, audio streams, and image/video streams, identifying entities and relations, converting speech to text, translating into different languages, and determining properties of images.

The UIMA framework runs components in a flow, passing a common data object containing unstructured information (free text, audio, video, etc.) through the components. Each component examines the unstructured information and data added by other components, and adds data of its own. The framework mandates a standardized form of the data being passed, and a standardized form of the interfaces to the components.
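
As a rough illustration of the flow described above, here is a minimal Java sketch. It is not the UIMA API: the names `AnalysisData`, `Annotation`, `Component`, `SentenceDetector`, `CapitalizedWordDetector` and `run` are all hypothetical, and the two detectors are deliberately naive. It only shows the shape of the idea: one common data object, one standardized component interface, and a later component building on data added by an earlier one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: every type here is hypothetical, NOT the UIMA API.
class PipelineSketch {

    /** A piece of data that a component attaches to a span of the text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;
        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** The common data object: the content plus everything added so far. */
    static final class AnalysisData {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();
        AnalysisData(String text) { this.text = text; }
    }

    /** The single standardized interface that every component implements. */
    interface Component {
        void process(AnalysisData data);
    }

    /** Toy sentence boundary detector: a sentence ends at '.', '!' or '?'. */
    static final class SentenceDetector implements Component {
        public void process(AnalysisData data) {
            String t = data.text;
            int start = 0;
            for (int i = 0; i < t.length(); i++) {
                char c = t.charAt(i);
                if (c == '.' || c == '!' || c == '?') {
                    data.annotations.add(new Annotation("Sentence", start, i + 1));
                    start = i + 1;
                    // The next sentence starts at the next non-whitespace character.
                    while (start < t.length() && Character.isWhitespace(t.charAt(start))) {
                        start++;
                    }
                }
            }
        }
    }

    /** Toy entity detector: capitalized words that do not start a sentence. */
    static final class CapitalizedWordDetector implements Component {
        public void process(AnalysisData data) {
            String t = data.text;
            // Iterate over a copy, because this component appends to the same list.
            for (Annotation s : new ArrayList<>(data.annotations)) {
                if (!s.type.equals("Sentence")) continue;
                int i = s.begin;
                while (i < s.end) {
                    if (!Character.isLetter(t.charAt(i))) { i++; continue; }
                    int wordStart = i;
                    while (i < s.end && Character.isLetter(t.charAt(i))) i++;
                    if (wordStart > s.begin && Character.isUpperCase(t.charAt(wordStart))) {
                        data.annotations.add(new Annotation("Entity", wordStart, i));
                    }
                }
            }
        }
    }

    /** Stand-in for the framework: passes one data object through the flow. */
    static AnalysisData run(String text, List<Component> flow) {
        AnalysisData data = new AnalysisData(text);
        for (Component c : flow) {
            c.process(data);
        }
        return data;
    }

    public static void main(String[] args) {
        AnalysisData result = run("Alice works for Acme. She lives in Paris.",
                Arrays.asList(new SentenceDetector(), new CapitalizedWordDetector()));
        for (Annotation a : result.annotations) {
            System.out.println(a.type + ": " + result.text.substring(a.begin, a.end));
        }
    }
}
```

Running `main` prints the two sentences followed by the entities `Acme` and `Paris`. Because the entity detector only reads `Sentence` annotations from the shared object, the sentence detector could be swapped for a better one without touching it, which is the kind of independent upgrade the proposal is after.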

We propose a project to develop, implement, support and enhance this framework (and, over time, other implementations) that comply with the UIMA standard (which has been submitted for standardization work within [http://www.oasis-open.org OASIS]). Members of this community are encouraged to participate in that effort as well; OASIS has an open approach to granting Technical Committee voting rights to members of OASIS, described here: http://www.oasis-open.org/committees/process.php#2.4.

The proposal includes both the framework and tools to develop, describe, compose and deploy UIMA-based components and applications. The initial work will be based on the UIMA Version 2 framework code developed by IBM; snapshots of each release of this code are currently made available on [http://sourceforge.net/projects/uima-framework SourceForge]. The Source``Forge versions would be stabilized in maintenance mode if we are successful in moving to Apache. The framework is not specific to any IDE or platform, and does not depend on other middleware.

Background:

Databases are core components of nearly all applications; they store information in structured tables. But more and more of the available digital data is unstructured (e.g. email, web documents, images, audio clips, video streams) with little information (metadata) attached to explain its content or context. Although many applications have been built to process unstructured data, they have either managed it as a BLOB or they have developed isolated applications for analyzing the content. In the absence of a standardized means for analytical applications to share insights extracted from the content, analytical applications cannot build upon one another. As a result, the industry has barely begun to tap the value locked in unstructured information.

Standardization is key to achieving component interoperability, with capabilities to mix components developed in different places and in Java, C++ and other languages. The Unstructured Information Management Architecture defines standards for component interoperability and application composition that will provide this needed unifying standard, and allow a variety of framework implementations to exist, while preserving the goal of unstructured information analytic component reuse.


UIMA was built to help developers create solutions that get more value from unstructured information more quickly and at lower cost by making it easy to reuse and combine analytic modules from different sources into new analytic applications. The architecture and the framework have been validated through work with USA's DARPA, which is using it as a standard for key projects with several universities involved in advanced linguistics analysis, such as Carnegie Mellon, Columbia, Stanford and University of Massachusetts. Other organizations, such as the Mayo Clinic and Sloan Kettering, are also building efforts around UIMA. In addition, over 15 software vendors, including companies such as Inxight, Attensity, Clear``Forest, Temis, SPSS, SAS, Cognos, Endeca, Factiva and others, announced plans to support UIMA.

The UIMA framework (binary and/or source code) has been downloaded over 8000 times from IBM alphaWorks (http://www.alphaworks.ibm.com/tech/uima) or Source``Forge (http://uima-framework.sourceforge.net).


== Rationale ==

We believe that moving the UIMA framework development to the Apache development community will lead to faster innovation, better integration with other open source software, and broader adoption of UIMA, accelerating the industry's ability to get the most value from text, audio, and video content. The UIMA framework is becoming attractive to developers who want to build components; we believe that having UIMA on Apache will encourage the development of a basic set of open source components that will jumpstart these developers' efforts. One of the first components we see possible synergy with is a search component based on Apache Lucene that would enable semantic search. We like the concept of the Lucene Sandbox as a way to encourage innovation around UIMA, and would envision something similar for this project.



== Initial Goals ==

Some initial work we see in the incubator includes the following:

* redoing the parts of the tooling that were done as derivative works of Eclipse source code, to enable everything to be licensable under the Apache license
* extending the framework to better support "scale-out"
* extending the framework to better align with the emerging UIMA Standards work
* extending the framework to support XMI-based SOAP and/or other service interfaces
* extending the framework to support OSGi-based approaches to componentization and packaging
* exploring embeddings of the framework within other interested Apache projects, including synergies with Lucene
* providing aids to the community to migrate from previous versions of the framework to the Apache version
* setting up community support: hosting a facility similar to the Lucene Sandbox to encourage innovation and experimentation; establishing a wiki and some process to allow better documentation to be developed by the community, and linking our existing XHTML documentation via an XSL transform to Apache FOP



== Current Status ==
=== Meritocracy ===

Meritocracy seems to us an ideal way to grow the community of developers around UIMA, it being a controlled, rational way to give those who positively contribute more ability to directly contribute. This approach also gives contributors one of the best reasons to join the community of volunteers - to be recognized for the merit of their contributions.


=== Community ===

Currently, the UIMA Framework development is being done by IBM, with input from a group of early adopters in industry and government. Going forward, we see IBM continuing to support several committers working on it. We have already begun talking with other people outside of IBM who have expressed interest in contributing towards the development. This includes members of academic institutions, people working for some of the software vendors that have announced plans to support UIMA, and others from companies that have expressed interest since initial announcements about our open source plans. Multiple non-IBM people have already expressed a desire to become committers.


=== Core Developers ===

The previous core developers of UIMA are Adam Lally, Thilo Goetz, Marshall Schor, Edward Epstein, Jaroslaw Cwiklik and Thomas Hampp. Many others have also contributed. The developers come from both the Research and Development parts of IBM.


=== Alignment ===

UIMA has significant synergy with search applications, and we expect to see integration with Lucene in the future. UIMA makes use of the Apache Portable Runtime (APR) for C++ support. It is designed to be embeddable into other frameworks, such as web application servers. Part of UIMA is Eclipse-based tooling. We use ANT for build scripting. UIMA has support for various language bindings including C++ and Java; we also have more limited bindings for Perl, Python, and TCL. UIMA uses Web Services as part of its approach to wiring up components in its domain. It makes use of XML services such as Xerces and Xalan.

The development of UIMA has been based on merit with open discussion among a distributed team of developers, from both Research and Development organizations.


=== License ===

The current license for the source code is CPL, with a small number of files licensed under the EPL (Eclipse Public License), because these were created as "derivative works" of existing Eclipse open source code. When the code base is moved to Apache, it will be relicensed under the Apache license, except for the small number of files licensed under the EPL as derivative works of Eclipse source files. We plan to work in the incubator to redo these parts, so the entire offering can be licensed under the Apache license.

The distribution for the C++ enablement layer includes the open source component ICU (a Unicode package), which has its own license. We plan to work with the community to properly make use of this non-Apache licensed component. Our current vision for the future of UIMA has it aligning with and incorporating other standards-based open source components/protocols, some of which may have licensing other than the Apache license (for example, the XML Metadata Interchange (XMI), and the EMF ECore Model from Eclipse); we will work with the community in figuring out how to move forward on this.


=== Other IP ===

When we requested OASIS to set up a Technical Committee chartered to develop a platform-independent specification for text and multi-modal analysis, we specified that it be set up under the "RF on Limited Terms" mode of the OASIS IP Policy. "RF" means Royalty Free, and "Limited Terms" means that companies working with us on the Technical Committee are restricted in adding additional terms.

These are the most liberal terms and make any Essential Claims available to ALL and ROYALTY FREE.

For the details please refer to:

* http://www.oasis-open.org/who/ipr/ipr_faq.php
* http://www.oasis-open.org/who/intellectualproperty.php

Ultimately, of course, there is always a risk that someone in the world holds a patent that can be claimed as Essential. The most any standards organization can do is govern the behavior of those who participate in its work and publicly document the licensing commitment of all participants.


== Known Risks ==

=== Orphaned Software ===

UIMA has been in active development for 5 years. The community of users has steadily grown, and there are now significant commercial and research organizations actively using it. UIMA is embedded in IBM software products and is delivered through IBM services engagements. IBM has developers assigned to it, and is continuing to support its development. In addition, several people outside of IBM have already expressed interest in working on UIMA, and have been providing IBM with initial feedback. One of the objectives of starting this Apache project is to provide a meritocratic structure for those people to begin more actively contributing to UIMA.


=== Inexperience with Open Source ===

The individuals working on this software have backgrounds as IBM software developers. While many of them have experience working with open source software, none of them has had extensive experience contributing to other open source software. However, IBM as an organization has extensive experience contributing to open source projects and will make available resources to provide guidance to the developers working on this project.


=== Homogenous Developers (work for same company?) ===

Currently all the developers work for IBM, although they come from different geographically dispersed organizations within IBM. We will reach out during the incubation time to get others to contribute; we have already received interest from several parties.


=== Reliance on salaried developers ===

Currently the developers are paid employees of IBM.

=== Relationships with Other Apache Products ===

We make use of several Apache components: SOAP / Web Services, XML (Xerces, Xalan), languages (Perl), scripting languages (ANT), and the Apache Portable Runtime. In addition, UIMA has been embedded in other frameworks, such as web application servers, and integrated with search engines. We are exploring Lucene extensions that could take advantage of UIMA processed data. We are currently investigating and prototyping some software packaging concepts based on OSGi; the Apache Incubator project Felix may have relevance as we go forward. The documentation is being moved to XHTML, and we plan to use Apache FOP for producing PDF reference materials.


=== An Excessive Fascination with the Apache Brand ===

UIMA is already being adopted by a wide cross section of users, both commercial and academic, world-wide. Our experience shows that analytic modules can be reused and combined through UIMA, making it easier and faster for developers to build new analytic applications for specific industries or domains. Given the diversity of content and analytics that will be required to address the multitude of opportunities - from military intelligence to quality assurance to contact center analytics - growing this infrastructure so that it better aligns with other major Open Source communities should help accelerate industry's ability to get value from content assets.

We believe that the Apache community of developers has the experience, background, visibility, and synergistic resources to encourage and foster a vibrant developer community around this project.



== Documentation ==

There is a combination Introduction, Conceptual Overview, Tutorial, Tools and Framework User's Guides and References, downloadable from http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference_2.0.pdf



== Scope of the project ==

The project will develop implementations of the UIMA architecture (which is concurrently being submitted to the OASIS standards process), supporting the breadth of platforms that developers working in this field are using, including Java, C++, Perl, Python and TCL; and utilities and tooling to support component and application developers and assemblers / packagers. It will initially include the Java UIMA framework for UIMA Version 2 (you can see a snapshot of the Version 2 release on Source``Forge; the delivered code would be this code base plus normal incremental bug fixes and improvements), plus additional components (mainly documentation and test cases, which are not currently on Source``Forge). Over time, the project is expected to grow to include supporting various embeddings and integrations with other Apache components such as search engines and web application frameworks.

Over time, we envision the project becoming an umbrella for related open-source work around UIMA, including things like open-source pre-annotated corpora, and hosting a facility similar to the Lucene Sandbox to encourage innovation and experimentation.

The UIMA framework is primarily a set of libraries (in Java, C++, Perl, etc.), test cases, and UIMA utilities and tools (scripts, plugins, executables, etc.) used to build, test and debug UIMA analytic components. The tooling includes several Eclipse platform plugins.


== Initial source ==

The source currently is maintained in IBM internal software control systems, with a copy of each release placed on SourceForge. At the time of launch, we plan to contribute the latest version of the code base (with some renaming of package prefixes to reflect apache.org), test cases, build files, and documentation, under the terms specified in the ASF Corporate Contributor License. We plan to donate the existing C++ enablement layer and the support for Perl, Python, and TCL a few months later than the initial donation; this delay is to give us time to finish preparing that code base for Open Source.


== ASF resources to be created ==

Mailing lists:
* uima-dev
* uima-commits

* uima-user (we already have a substantial user community and expect them to turn up at Apache soon after we've hopefully been accepted into the incubator)

For other resources such as Subversion repository, JIRA etc. we hope for guidance from our mentors.


== Initial Set of Committers ==

* Michael Baessler ([EMAIL PROTECTED])
* Edward Epstein ([EMAIL PROTECTED])
* Thilo Goetz  ([EMAIL PROTECTED])
* Adam Lally  ([EMAIL PROTECTED])
* Marshall Schor ([EMAIL PROTECTED])

=== Sponsor ===

We are requesting the Incubator to sponsor this. Our current vision is that it will become a top level project (other projects that develop UIMA components could become subprojects, for instance).


=== Mentors ===

* Sam Ruby ([EMAIL PROTECTED])
* Ken Coar ([EMAIL PROTECTED])
* Ian Holsman ([EMAIL PROTECTED])

=== Section 6: Open Issues for Discussion ===


--
Ian Holsman
[EMAIL PROTECTED]
http://garden-gossip.com/ -- what's in your garden?
