Take a look at LingPipe from Alias-i (alias-i.com), in particular its Named 
Entity extraction and its classifiers.

Otis


----- Original Message ----
From: Vladimir Olenin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, September 25, 2006 9:49:31 PM
Subject: does anyone know of a 'smart' categorizing text pattern finder?


Hi,

I wonder if anyone here knows if there is a 'smart' text pattern finder, 
ideally written in Java. The library I'm looking for should be able to 'guess' 
the category of the particular text on the page, most probably by finding 
similarities between the bulk of the pages and a set of templates.

E.g., many forums are powered by phpBB, which structures 99% of the pages 
(except for some title pages and user profile pages) in a very similar fashion 
(a page is broken into blocks, each block is broken into further blocks, etc.). 
By comparing many pages with each other (e.g., from the same domain root: 
forum.springframework.org) it should be possible to separate the common parts 
('template decorations') from the page-specific parts (actual content, like 
'user name' and 'posting body'). After that it should further be possible, by 
comparing the 'template decorations' parts to a set of templates, to 'guess' 
the nature of each 'page specific' block (e.g., 'Vladimir Olenin' in the 
left side column will be marked as 'name', while whatever is adjacent to this 
column is the post body).
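
A rough sketch of the first step described above (separating 'template 
decorations' from page-specific content), in case it helps make the idea 
concrete. It assumes each page has already been split into text blocks by some 
HTML parser; the class name TemplateDetector, its methods, and the minShare 
threshold are all hypothetical and not part of Lucene, Nutch or LingPipe. A 
block that shows up on most pages is treated as template decoration, and 
everything else is treated as page-specific.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not an API of any library
// mentioned in this thread.
class TemplateDetector {

    // Counts, for each distinct block, the number of pages that contain it.
    // A block repeated within one page is counted once for that page.
    static Map<String, Integer> pageFrequency(List<List<String>> pages) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> page : pages) {
            Set<String> distinct = new HashSet<>(page);
            for (String block : distinct) {
                freq.merge(block, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Returns, in page order, the blocks of one page that occur on fewer
    // than minShare of all pages. These are the presumed page-specific
    // parts; the rest is presumed template decoration.
    // minShare is an assumed tuning knob, e.g. 0.8.
    static List<String> pageSpecific(List<String> page,
                                     Map<String, Integer> freq,
                                     int pageCount,
                                     double minShare) {
        List<String> result = new ArrayList<>();
        for (String block : page) {
            int count = freq.getOrDefault(block, 0);
            if ((double) count / pageCount < minShare) {
                result.add(block);
            }
        }
        return result;
    }
}
```

To use it, one would call pageFrequency once over all pages crawled from the 
same domain root and then pageSpecific for each page. Exact string matching is 
a deliberate simplification: real pages would likely need the blocks 
normalized (whitespace, session ids in links) before comparison, and the 
second step of guessing which remaining block is the 'name' is not covered 
here.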

So, I wonder if anyone knows of a package capable of such things. The primary 
goal, though, is simpler: to be able to parse out just the posters' names from 
message boards. Though the 'block category' can sometimes be derived from the 
CSS class name of the tags around the text, very often that is not the case.

Might Nutch have similar functionality built into its crawler?

Thanks.

Vlad



