Hi, I am looking at how to use Mahout for web page categorization.
The idea is to have a set of categories such as Adult, Arts, Business, Computers, Games, Health, Home, Kids, News, Recreation, Reference, Science, Shopping, Society, and Sports, and to classify a given web page into one specific category.

The papers I have read on related topics suggest the following pre-processing steps:

- remove HTML tags
- remove stop words (using a "stop list")
- remove rare words
- perform word stemming (the Porter stemmer is a well-known algorithm)

The papers also suggest various approaches, such as subject classification based on the title and functional classification based on the contents, including HTML elements, image alt text, link anchor text, etc.

I first need to finalize the list of categories and then prepare a list of keywords for each category.

My question is how Mahout could be used for this purpose. I have seen the Mahout example that classifies the 20 Newsgroups data set using naive Bayes, but I am not sure how I could make use of keywords in that case. Are there any examples that show how Mahout could be used to pre-process the text and do stemming?

Thanks,
Rajesh
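
P.S. To make the pre-processing steps concrete, here is a rough sketch of what I have in mind. It is plain Python using only the standard library and does not use any Mahout API. The stop list, the `min_doc_freq` threshold, and all function names are placeholders I made up for illustration, and `crude_stem` is only a simple suffix stripper standing in for a real Porter stemmer.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop list (assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Step 1: remove HTML tags and keep only the text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def crude_stem(word):
    """Step 4 placeholder: strips a few suffixes. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    """Lowercase, split into words, drop stop words (step 2), then stem."""
    text = strip_html(html).lower()
    words = re.findall(r"[a-z]+", text)
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def preprocess(pages, min_doc_freq=2):
    """Step 3: drop terms that occur in fewer than min_doc_freq pages."""
    tokenized = [tokenize(p) for p in pages]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    return [[t for t in tokens if doc_freq[t] >= min_doc_freq] for tokens in tokenized]
```

Calling `preprocess(list_of_html_strings)` would return one list of cleaned terms per page. What I do not know is how to get the equivalent of this into the input format that the Mahout naive Bayes example expects.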
