Hi Thilo,
your explanation attracted me ;-)
Is UIMA just an interface specification (i.e., to produce a
standard in the unstructured text-processing world so that other
people can plug and play), or does UIMA also provide tools for
each component?
I'm interested and, time permitting, could help as a mentor. I'm
not a Java expert (compared to others on this list), or a text-
processing expert, but I know a bit about the processes around
the Incubator.
regards
Ian
On 26/08/2006, at 2:04 AM, Thilo Goetz wrote:
Leo Simons wrote:
<snip/>
What does it *do*? How does it *work*? I understand there's a
runtime and a framework and a standardization process and a
component-based interoperability goal, but what I don't
understand is what they are *for*.
The unstructured content we're talking about is mainly plain text
today. There is also some work going on analyzing video streams,
as well as multi-modal streams (e.g., video + closed captioning).
I'm not really competent to talk about those, so I'll stick to
text. A typical processing chain for text analysis starts out
something like this:
"language identification" -> "language specific segmentation" ->
"sentence boundary detection" -> "entity detection (person/place
names etc.)" -> ...
So you start by identifying the language the text is in (Chinese,
English etc.). Then you do token segmentation based on that
information (it's completely different for Chinese than for
English). Based on the tokens you discovered, you may want to do
sentence boundary detection, so you know what entities occur in the
same sentence. Then, again based on the tokens you've found, you
can do so-called named entity detection, such as place names,
person names etc. After that, you may have another module that can
discover relations between the entities that you have found. And
so on.
UIMA at its core is a component architecture that allows you to
create analysis applications like the one described above. It
provides facilities for creating meta-information on documents,
as in the example above. That is, the original artifact (i.e., the
text) is not modified, and the derived information is kept separately.
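To make the chain and the "derived information kept separately" idea concrete, here is a minimal sketch of such a pipeline. Note this is not the real UIMA API; the `Annotator`, `Document`, and `Pipeline` names below are made up purely for illustration, and the stages are trivial stand-ins for real analysis components.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an analysis component (not the UIMA API).
interface Annotator {
    void process(Document doc);
}

// The original text is never modified; derived metadata lives beside it.
class Document {
    final String text;
    final List<String> annotations = new ArrayList<>();
    Document(String text) { this.text = text; }
}

class LanguageIdentifier implements Annotator {
    public void process(Document doc) {
        // Trivial stand-in: real identifiers use e.g. character n-gram models.
        doc.annotations.add("language=en");
    }
}

class Tokenizer implements Annotator {
    public void process(Document doc) {
        // Whitespace split as a placeholder for language-specific segmentation.
        for (String tok : doc.text.split("\\s+")) {
            doc.annotations.add("token=" + tok);
        }
    }
}

class Pipeline {
    // Run each stage in order; later stages can build on earlier annotations.
    static Document run(String text, Annotator... stages) {
        Document doc = new Document(text);
        for (Annotator a : stages) a.process(doc);
        return doc;
    }
}
```

The component-model payoff is visible in `Pipeline.run`: swapping one language identifier for another just means passing a different `Annotator` instance.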
UIMA is mostly a framework, not an application. So it is not
concerned with fetching documents, like the crawler of a search
engine. Nor does UIMA provide facilities to do very much with the
information you have extracted from the text (or other artifact).
Rather, the use case is that you have an application that needs
to process unstructured information. This application provides
the input data, and it knows what to do with the results. The
value of UIMA derives from its component model: it is easy to
reuse existing analysis components that other people have
written, and easy to exchange, say, one language identifier for
another.
One standard application scenario is to use UIMA to extract some
named entities from text, feed the results into a relational
database, and use the database's mining capabilities to do, e.g.,
association analysis. Another area of application is enhanced text
search, where in addition to regular free-form text search, you can
search for documents containing certain entities. Trivial standard
example: you're looking for John's phone number in your email, so
you use semantic search to look for documents that contain John's
name and a phone number. You'll use a UIMA component that knows
that a pattern like 123-456-7890 is a phone number and will
create a phone-number entity.
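The pattern-matching part of such a component could be sketched with a plain regular expression. Again, this is not the UIMA annotator API; `PhoneNumberAnnotator` and `findPhoneNumbers` are hypothetical names, and the regex only covers the 123-456-7890 shape used in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneNumberAnnotator {
    // Matches only the simple NNN-NNN-NNNN shape from the example above.
    private static final Pattern PHONE =
            Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b");

    // Returns every phone-number match; the input text is left untouched.
    static List<String> findPhoneNumbers(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) hits.add(m.group());
        return hits;
    }
}
```

In a real system, each match would become an entity annotation stored alongside the document, which a semantic search index could then query for "John + phone number".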
I hope this gives you a better idea of what UIMA is about.
--Thilo
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Ian Holsman
[EMAIL PROTECTED]
http://personalinjuryfocus.com/