If you are interested, you can also look at Apache Hama, which provides an MPI-like interface on top of Hadoop MapReduce:
http://incubator.apache.org/hama/
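For a taste of the API, here is a rough, untested sketch of a BSP task (based on the 0.4-era examples; the class name HelloBSP is made up, and the generics and BSPPeer methods may differ between releases, so treat this as an assumption to verify against your version):

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    // Hypothetical example: every task greets every other peer, then
    // blocks on sync() -- much like MPI send/recv plus a barrier.
    public class HelloBSP extends
        BSP<NullWritable, NullWritable, Text, Text, Text> {

      @Override
      public void bsp(BSPPeer<NullWritable, NullWritable, Text, Text, Text> peer)
          throws IOException, SyncException, InterruptedException {
        // Message passing: send a greeting to every peer in the job.
        for (String other : peer.getAllPeerNames()) {
          peer.send(other, new Text("hello from " + peer.getPeerName()));
        }
        peer.sync(); // barrier: ends the superstep and delivers messages

        // Drain the messages received during the last superstep.
        Text msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          peer.write(new Text(peer.getPeerName()), msg);
        }
      }
    }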
On Jun 8, 2012 4:55 PM, "Mark Grover" <grover.markgro...@gmail.com> wrote:

> Hi Jason,
> Hive does expose a JDBC interface which can be used by tools and
> applications. You would want to check out individual tools to see if
> they support Hadoop (I use the word Hadoop and not Hive since an
> application doesn't need Hive to run Map Reduce jobs on data in HDFS).
>
> Apache Mahout, as Sreenath mentioned, is also an interesting open source
> project which combines canonical machine learning algorithms with the
> power of Hadoop. That might fit the bill too.
>
> Good luck,
> Mark
>
> On Fri, Jun 8, 2012 at 1:25 AM, jason Yang <lin.yang.ja...@gmail.com> wrote:
>
>> Hi, Mark.
>>
>> Thank you for your reply.
>>
>> I have read the User Guide, but I'm still wondering what I can do for
>> the following scenario:
>> ----
>> 1. Suppose I have a table t_customer_info in Hive, which includes lots
>> of information about our customers.
>> 2. Now I would like to cluster those customers into different groups,
>> so that customers within a group have high similarity but are very
>> dissimilar to customers in other groups.
>> 3. This is a classical clustering problem in the data mining field; I
>> don't think such a job can be done with a query language alone, as it
>> calls for data mining algorithms.
>> ----
>>
>> When we look back at traditional DBMSs, there are lots of data mining
>> tools and BI tools which can connect to the DBMS and apply canonical
>> algorithms to the data in it. So I started to wonder: are there
>> similar tools for Hive?
>>
>> If not, what is the most common way to do data mining over Hadoop?
>>
>> 2012/6/8 Mark Grover <grover.markgro...@gmail.com>
>>
>>> Hi Jason,
>>> Hive is a data warehouse system that sits on top of Hadoop. The key
>>> selling point is that it allows users to write SQL-like queries over
>>> their large-scale data. These queries get compiled into Map Reduce
>>> jobs, which are then run on the Hadoop cluster just like any other
>>> Map Reduce jobs.
>>>
>>> Hadoop does all the parallel processing for you. All you have to do
>>> is set up a Hadoop cluster, install Hive on the cluster, and run your
>>> Hive queries. All underlying processing will happen in parallel where
>>> possible.
>>>
>>> This is a good place to get started and learn more about Hive:
>>> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>>>
>>> Welcome and good luck!
>>>
>>> Mark
>>>
>>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang <lin.yang.ja...@gmail.com> wrote:
>>>
>>>> Hi, dear friends.
>>>>
>>>> I was wondering: what's the popular way to do data mining on Hive?
>>>>
>>>> Since the data in Hive is distributed over the cluster, is there any
>>>> tool or solution that could parallelize the data mining?
>>>>
>>>> Any suggestion would be appreciated.
>>>>
>>>> --
>>>> YANG, Lin
>>
>> --
>> YANG, Lin
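To expand on Mark's JDBC suggestion above: any Java tool can talk to Hive through the driver it ships with. A minimal, untested sketch against a HiveServer1 setup (driver class org.apache.hadoop.hive.jdbc.HiveDriver on the default port 10000; verify both against your distribution, and note the class name and query below are just for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (HiveServer1-era class name).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // HiveServer listens on port 10000 by default; no credentials
        // are needed in the default (unsecured) setup.
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Table name taken from the scenario above; adjust to your schema.
        ResultSet rs = stmt.executeQuery(
            "SELECT COUNT(1) FROM t_customer_info");
        while (rs.next()) {
          System.out.println("rows: " + rs.getLong(1));
        }
        con.close();
      }
    }

The query compiles into one or more Map Reduce jobs behind the scenes, which is exactly the parallelism Mark describes.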
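And for the clustering scenario itself, Mahout's k-means runs as a chain of Map Reduce jobs over vectors stored in HDFS. A rough sketch of driving it from Java, assuming the 0.6-era API (the KMeansDriver.run signature has shifted between releases, so check yours) and assuming the t_customer_info rows have already been vectorized into VectorWritable sequence files at the hypothetical paths below (that export and vectorization step is the real work):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class CustomerClustering {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical HDFS layout: vectorized customer rows go in,
        // initial centroids and final clusters come out.
        Path vectors = new Path("/user/jason/customers/vectors");
        Path seeds   = new Path("/user/jason/customers/seeds");
        Path output  = new Path("/user/jason/customers/clusters");

        // Pick k random input vectors as the initial centroids.
        int k = 10; // illustrative; choosing k is part of the analysis
        RandomSeedGenerator.buildRandom(conf, vectors, seeds, k,
            new EuclideanDistanceMeasure());

        // Iterate Map Reduce passes until the centroids move less than
        // the convergence delta (or 20 iterations pass), then assign
        // every customer to its nearest cluster (runClustering = true).
        KMeansDriver.run(conf, vectors, seeds, output,
            new EuclideanDistanceMeasure(), 0.01, 20, true, false);
      }
    }

Each iteration is an ordinary Map Reduce job, so the clustering parallelizes across the cluster the same way Hive queries do.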