Re: Proposal for a new incubation project: Unstructured Information Management Architecture - UIMA

Marshall Schor Fri, 25 Aug 2006 10:25:45 -0700

Hi Leo,

Here's a response to your good questions; apologies to you and others (David Welton also commented) that we were not clearer initially.


>
> <snip> I understand there's a runtime and
> a framework and a standardization process and a component-based
> interoperability goal, but what I don't understand is what they are *for*.

>
> <snip> outline what problem this UIMA thing is meant to solve

We are working to move the community of folks who write unstructured analysis software to a place where these things can be easily put together. Here are some examples of the kind of software we mean:

* code that works with text and identifies words or phrases as particular kinds of entities, such as persons, places, organizations, chemical names, times, dates, or telephone numbers, or that performs sentiment analysis (e.g., likes, hates), etc.

* code that works with audio samples and extracts text (think of speech recognition)


* code that translates text in one language to another

* code that finds similarities among images, or computes a similarity score for images

Today, you can find these components bundled up inside various solutions, etc. Our goal was to enable putting together these parts into interesting new kinds of applications. For instance:

* an application that takes several approaches to speech recognition, or machine translation, and combines them (in the hope of getting a better quality result)

* a search application that wants to search for "concepts" as well as key-words, where these concepts have been added by pre-processing the data being searched with a set of these components. You might imagine the set of components used would depend on the kind of searching: if you were working with genetics, you might have textual information about "genes" identified, or you might be searching for a kind of image, along with some text.

* a "business intelligence" (not an oxymoron :-) (sorry for the buzzword) application that looks for trends in various kinds of messages, using components that might tag texts with concepts like "positive remark" or "negative remark".


---------------

> What does it *do*?

It runs components that are written to conform to its architecture (remote or local, in several computer languages) in a flow, which is configured with configuration files that are external to the components.

> How does it *work*?
> <snip> outline what the approach is to solving that problem

Components work with one another by adopting a model where data is passed from component to component. Each component examines the data, applies its particular unstructured analysis capability, and adds to the data. Components are required to specify descriptive information, which UIMA uses in common tooling and at run time. Common services needed by components, and by solutions built with these components, are provided.
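To make the "examine the data and add to it" model concrete, here is a minimal sketch in Python. It is not UIMA's API: the names Document, Annotation, person_annotator, date_annotator and run_flow are all hypothetical, and the two "analyzers" are toy regular expressions standing in for real analysis components.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str    # what was found, e.g. "Person" or "Date" (hypothetical labels)
    begin: int   # start offset into the document text
    end: int     # end offset (exclusive)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def person_annotator(doc):
    # Toy rule: a title followed by a capitalized word is treated as a person.
    for m in re.finditer(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+", doc.text):
        doc.annotations.append(Annotation("Person", m.start(), m.end()))


def date_annotator(doc):
    # Toy rule: marks ISO-style dates such as 2006-08-25.
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", doc.text):
        doc.annotations.append(Annotation("Date", m.start(), m.end()))


def run_flow(doc, components):
    # The same document is handed to each component in turn;
    # every component reads it and adds its own results.
    for component in components:
        component(doc)
    return doc


doc = run_flow(Document("Dr. Smith filed the report on 2006-08-25."),
               [person_annotator, date_annotator])
for a in doc.annotations:
    print(a.kind, doc.text[a.begin:a.end])
```

The list passed to run_flow plays the role of the flow; in the architecture described here that ordering would come from configuration files rather than from code.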


---------------

> <snip> outline how this turns into software

The data passed from part to part is described with a single-inheritance type system. Support is provided for parts written in Java, C++, and some scripting languages (Perl, Python, Tcl). A variety of tradeoffs in programming styles are supported, from Java-centric object-oriented styles to styles that go for very high performance and avoid creating Java "objects". Common services to process very large collections of things through the flow of components, with pragmatic error handling and recovery, are provided.
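As an illustration of what "single-inheritance type system" means here, the following Python sketch models types that each have exactly one parent and inherit their parent's features. The TypeSystem class, the root type "Top", and the type and feature names are assumptions made up for this example; they are not taken from UIMA itself.

```python
class TypeSystem:
    """Toy single-inheritance type system: every type has one parent."""

    def __init__(self):
        self.parent = {"Top": None}   # hypothetical root type
        self.features = {"Top": []}

    def add_type(self, name, parent="Top", features=()):
        if parent not in self.parent:
            raise KeyError("unknown parent type: " + parent)
        self.parent[name] = parent
        self.features[name] = list(features)

    def is_a(self, name, ancestor):
        # Walk up the single parent chain until the root is passed.
        while name is not None:
            if name == ancestor:
                return True
            name = self.parent[name]
        return False

    def all_features(self, name):
        # Collect features from the root down to the given type.
        chain = []
        while name is not None:
            chain.append(name)
            name = self.parent[name]
        result = []
        for type_name in reversed(chain):
            result.extend(self.features[type_name])
        return result


ts = TypeSystem()
ts.add_type("Annotation", features=["begin", "end"])
ts.add_type("NamedEntity", parent="Annotation", features=["confidence"])
ts.add_type("Person", parent="NamedEntity", features=["role"])

print(ts.all_features("Person"))
print(ts.is_a("Person", "Annotation"))
```

Because each type has a single parent, a component that only understands "Annotation" can still work with data of the more specific type "Person".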


---------------

> <snip> give an example or even two of such software in use in the real world to
>    solve some kind of tangible problem

As part of a company's monitoring of its outgoing email (say, as part of efforts to comply with the Sarbanes-Oxley Act), it could deploy components that detect a variety of named entities (persons, places, organizations, etc.) and relationships among these. The company could deploy commercially available components, plus some customized for their particular domain. (As an example, a recent announcement of some commercially available components can be seen at http://be.sys-con.com/read/262873.htm )

Another example might involve an insurance brokerage whose employees need access to information from insurance policies, notes from field adjusters, emails from customers, etc. To enable more focussed searching, these data could be augmented with the results of unstructured information analysis, and the resulting "structured" information could be used in searching. The company might want to integrate commercial, more generic named entity detectors with specific recognizers for their particular needs. Search could be performed with engines capable of searching both using traditional key-words, as well as looking for concepts (added by the various UIMA parts), and key-words contained within the span of particular concepts. [Note: these search engines already exist]. In addition to search, the broker might process this information, looking for early indication of potential issues by noticing trends in various kinds of things being reported.
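The idea of finding key-words "contained within the span of particular concepts" can be sketched as follows. This is only an illustration, not how any existing search engine does it: the function keyword_in_concept, the (kind, begin, end) tuple layout, and the concept names "DamageReport" and "CustomerRequest" are all hypothetical.

```python
def keyword_in_concept(text, annotations, keyword, concept):
    """Return the (begin, end) spans of the given concept that contain the keyword.

    annotations is a list of (kind, begin, end) tuples, assumed to have been
    added earlier by analysis components.
    """
    hits = []
    for kind, begin, end in annotations:
        if kind == concept and keyword.lower() in text[begin:end].lower():
            hits.append((begin, end))
    return hits


text = "Adjuster note: water damage to roof. Customer asks about roof warranty."
damage = "water damage to roof"
request = "asks about roof warranty"

# In a real flow these spans would come from analysis components;
# here they are located by hand for the example.
annotations = [
    ("DamageReport", text.index(damage), text.index(damage) + len(damage)),
    ("CustomerRequest", text.index(request), text.index(request) + len(request)),
]

for begin, end in keyword_in_concept(text, annotations, "roof", "DamageReport"):
    print(text[begin:end])
```

A plain key-word search for "roof" would match both sentences; restricting the match to the "DamageReport" span returns only the adjuster's note.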

---------------------

We're not trying to re-invent the semantic web movement. However, we think UIMA might enable some aspects of it, by allowing the flourishing of a rich set of unstructured information analysis components within the community.


-Marshall

Leo Simons wrote:

Hi Marshall!

I'm sure all this is potentially interesting, but you're going to have
to help us understand why.

On Wed, Aug 23, 2006 at 03:21:55PM -0400, Marshall Schor wrote:
Proposal for Incubation Project: Unstructured Information Management Architecture - UIMA
The Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities. We propose a project to develop, implement, support and enhance UIMA framework implementations that comply with the UIMA standard (being put forward concurrently for standardization within OASIS, http://www.oasis-open.org; not yet submitted, but we plan to do this early in September).
<snip/>
Motivation for UIMA: Databases are core components of nearly all applications; they store information in structured tables. But more and more of the available digital data is unstructured (e.g. email, web documents, images, audio clips, video streams) with little information (metadata) attached to explain its content or context. Although many applications have been built to process unstructured data, they have either managed it as a BLOB or they have developed isolated applications for analyzing the content. In the absence of a standardized means for analytical applications to share insights extracted from the content, analytical applications cannot build upon one another. As a result, the industry has barely begun to tap the value locked in unstructured information.
<snip/>

What does it *do*? How does it *work*? I understand there's a runtime and
a framework and a standardization process and a component-based
interoperability goal, but what I don't understand is what they are *for*.

Can you please write a paragraph or two, that

1) doesn't mention "what the industry is doing" or needs to do
2) doesn't mention frameworks, standards, or current problematic
   industry practices, SOA, SOAP, DARPA, OASIS, or other acronyms
3) outlines what problem this UIMA thing is meant to solve
4) outlines what the approach is to solving that problem
5) outlines how this turns into software
6) gives an example or even two of such software in use in the real world to
   solve some kind of tangible problem

For example, one kind of "unstructured information" is "the web", and one
way to process that is "as plain text, indexing it, create a keyword-based
search engine", and then there's also fancier ways such as all the things that
google does. And then there's also various ways to make the unstructured mess
that is the web more structured by attaching metadata, eg dublin core metadata
or the whole semantic web thing, so right now I might walk away with the
understanding that you're devising a way for google and yahoo to interop (which
I doubt they really want) by re-inventing the semantic web movement (which I
doubt is really productive). Enlighten me, please. If it helps, imagine I'm 12
and write PHP and have difficulty with words such as interoperability since
English is not my first language.

cheers,

LSD

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


