Hi,

I am looking at how to use Mahout for web page categorization.

The idea is to have a set of categories such as

Adult
Arts
Business
Computers
Games
Health
Home
Kids
News
Recreation
Reference
Science
Shopping
Society
Sports

and to classify a given web page into one of them.

The papers I have read on related topics suggest the following:

Preprocessing
- remove HTML tags
- remove stop words (using a "stop list")
- remove rare words
- perform word stemming (the Porter stemmer is a well-known algorithm)
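
A rough sketch of these four steps, in plain Python rather than Mahout, might look like the following. The stop list, the min_count threshold and the helper names are made up for illustration, and crude_stem is only a naive suffix stripper standing in for a real Porter stemmer.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Step 1: remove HTML tags and keep only the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def crude_stem(word):
    """Step 4 placeholder: naive suffix stripping, NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    """Steps 1, 2 and 4: strip tags, drop stop words, stem what is left."""
    words = re.findall(r"[a-z]+", strip_html(html).lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def preprocess(pages, min_count=2):
    """Step 3: drop terms seen fewer than min_count times across all pages."""
    tokenized = [tokenize(page) for page in pages]
    counts = Counter(term for doc in tokenized for term in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```

Rare-word removal needs counts over the whole collection, which is why it runs as a second pass after every page has been tokenized.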

The papers also suggest several approaches, such as subject
classification based on the title and functional classification based on
the contents, including HTML elements, image alt text, link anchor text, etc.
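
As an illustration of how those separate signals could be pulled out of a page, here is a minimal sketch using Python's standard html.parser module. The FieldExtractor class, the extract_fields helper and the four field names are assumptions chosen for this example, not something taken from the papers or from Mahout.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Separates title, image alt text, link anchor text and other body text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "alt": [], "anchor": [], "body": []}
        self._open = []  # stack of currently open tags that change routing

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.fields["alt"].append(alt.strip())
        elif tag in ("title", "a", "script", "style"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._open[-1] if self._open else None
        if current in ("script", "style"):
            return  # code and CSS are not page content
        key = {"title": "title", "a": "anchor"}.get(current, "body")
        self.fields[key].append(text)


def extract_fields(html):
    """Returns one string per field so each can be weighted separately."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}
```

Keeping the fields apart would let a classifier give, for example, title terms more weight than body terms, or use each field as its own feature set.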

I first need to finalize the list of categories and then prepare a list of
keywords for each category.

My question is how Mahout could be used for this purpose. I have seen the
Mahout example that classifies the 20 Newsgroups data set using naive Bayes.
However, I am not sure how I could make use of keywords in that case.
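
To make the keyword question concrete, here is one possible reading of it as a minimal sketch in plain Python, not Mahout: each category's keyword list is treated as a single seed training document for a multinomial naive Bayes classifier with add-one smoothing. The NaiveBayes class and the KEYWORDS lists are hypothetical and exist only for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> term -> count
        self.doc_counts = Counter()              # category -> training documents
        self.vocabulary = set()

    def train(self, category, terms):
        self.term_counts[category].update(terms)
        self.doc_counts[category] += 1
        self.vocabulary.update(terms)

    def classify(self, terms):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, counts in self.term_counts.items():
            total_terms = sum(counts.values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[category] / total_docs)
            for term in terms:
                if term not in self.vocabulary:
                    continue  # terms never seen in training carry no signal
                score += math.log((counts[term] + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical keyword lists, used as one seed document per category.
KEYWORDS = {
    "Sports": ["football", "match", "league", "score", "team"],
    "Health": ["doctor", "diet", "medicine", "symptom", "fitness"],
    "Games": ["chess", "puzzle", "console", "player", "level"],
}

model = NaiveBayes()
for category, words in KEYWORDS.items():
    model.train(category, words)

print(model.classify(["team", "score", "weather"]))
```

With real labelled pages available, the same train call could be fed the preprocessed terms of each page, and the keyword lists would then act only as extra seed data.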

Are there any examples that show how Mahout could be used to preprocess
text and do stemming?

Thanks,
Rajesh
