Hi Thilo
your explanation caught my interest ;-)

Is UIMA just an interface specification (i.e., to produce a standard in the unstructured text-processing world so that other people can plug and play),
or does it also provide tools for each component?

I'm interested and, time permitting, could help as a mentor. I'm not a Java expert (compared to others on this list), or a text processing expert, but I know
a bit about the processes around the incubator.

regards
Ian


On 26/08/2006, at 2:04 AM, Thilo Goetz wrote:

Leo Simons wrote:

<snip/>

What does it *do*? How does it *work*? I understand there's a runtime and
a framework and a standardization process and a component-based
interoperability goal, but what I don't understand is what they are *for*.

The unstructured content we're talking about is mainly plain text today. There is also some work going on analyzing video streams, as well as multi-modal streams (e.g., video + closed captioning). I'm not really competent to talk about those, so I'll stick to text. A typical processing chain for text analysis starts out something like this:

"language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)" -> ...

So you start by identifying the language the text is in (Chinese, English etc.). Then you do token segmentation based on that information (it's completely different for Chinese than for English). Based on the tokens you discovered, you may want to do sentence boundary detection, so you know what entities occur in the same sentence. Then, again based on the tokens you've found, you can do so-called named entity detection, such as place names, person names etc. After that, you may have another module that can discover relations between the entities that you have found. And so on.

At its core, UIMA is a component architecture that allows you to create analysis applications like the one described above. It provides facilities for attaching meta-information to documents, as in the example above. That is, the original artifact (i.e., the text) is not modified; the derived information is kept separately.
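To make that concrete, here is a toy sketch in plain Java of a chained analysis pipeline with stand-off annotations. The interfaces and names here are purely illustrative (they are not the actual UIMA API); the point is only the shape: analyzers run in sequence, each adding annotations that reference spans of the original text, which itself is never modified.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not the real UIMA API.
public class Pipeline {

    // A stand-off annotation: a label plus begin/end offsets into the
    // original text. The text itself is left untouched.
    record Annotation(String label, int begin, int end) {}

    // Each analysis component reads the text and the annotations
    // produced so far, and may add new annotations.
    interface Analyzer {
        void process(String text, List<Annotation> annotations);
    }

    // A trivial whitespace tokenizer, standing in for real
    // language-specific segmentation.
    static Analyzer tokenizer() {
        return (text, anns) -> {
            int start = -1;
            for (int i = 0; i <= text.length(); i++) {
                boolean ws = i == text.length()
                        || Character.isWhitespace(text.charAt(i));
                if (!ws && start < 0) start = i;
                if (ws && start >= 0) {
                    anns.add(new Annotation("token", start, i));
                    start = -1;
                }
            }
        };
    }

    // Run the analyzers in order, accumulating annotations.
    static List<Annotation> run(String text, List<Analyzer> chain) {
        List<Annotation> anns = new ArrayList<>();
        for (Analyzer a : chain) a.process(text, anns);
        return anns;
    }
}
```

A downstream component (say, entity detection) would be just another Analyzer in the chain, reading the token annotations its predecessors produced.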

UIMA is mostly a framework, not an application. So it is not concerned with fetching documents, like the crawler of a search engine. Nor does UIMA provide facilities to do very much with the information you have extracted from the text (or other artifact). Rather, the use case is that you have an application that has a need for the processing of unstructured information. This application will provide the input data, and it will know what to do with the results. The value of UIMA derives from the component model: it is easy to reuse existing analysis components that other people have written, and it's easy to exchange, say, one language identifier for another.

One standard application scenario is to use UIMA to extract some named entities from text, feed the results into a relational database, and use the database's mining capabilities to do, e.g., association analysis. Another area of application is enhanced text search, where in addition to regular free-form text search, you can search for documents containing certain entities. A trivial standard example: you're looking for John's phone number in your email, so you use semantic search to look for documents that contain John's name and a phone number. You'll use a UIMA component that knows that a pattern like 123-456-7890 is a phone number and will create a phone number entity.
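The phone-number component above could, in its simplest form, boil down to a regular-expression match that records entity spans. A minimal sketch using plain java.util.regex (again, illustrative, not the actual UIMA annotator interface):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the phone-number annotator described above;
// plain java.util.regex, not the real UIMA API.
public class PhoneNumberAnnotator {

    // Matches the 123-456-7890 pattern from the example.
    private static final Pattern PHONE =
            Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b");

    // Returns the [begin, end) offsets of each phone-number entity.
    // The text itself is not modified; the spans are kept separately.
    public static List<int[]> annotate(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            spans.add(new int[]{m.start(), m.end()});
        }
        return spans;
    }
}
```

A semantic search application would then index these spans alongside the text, so a query like "John near a phone number" becomes answerable.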

I hope this gives you a better idea of what UIMA is about.

--Thilo


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
Ian Holsman
[EMAIL PROTECTED]
http://personalinjuryfocus.com/



