Dear Mark and Sukhendu,

Thank you very much for your advice. I will look into the approaches you mentioned.

2012/6/9 Sukhendu Chakraborty <sukhendu.chakrabo...@gmail.com>

> If you are interested, you can also look at Apache Hama, which provides an
> MPI-like interface on top of Hadoop MapReduce.
>
> http://incubator.apache.org/hama/
> On Jun 8, 2012 4:55 PM, "Mark Grover" <grover.markgro...@gmail.com> wrote:
>
>> Hi Jason,
>> Hive does expose a JDBC interface which can be used by tools and
>> applications. You would need to check individual tools to see if they
>> support Hadoop (I use the word Hadoop and not Hive since an application
>> doesn't need Hive to run Map Reduce jobs on data in HDFS).
>>
>> Apache Mahout, as Sreenath mentioned, is also an interesting open source
>> project which combines canonical machine learning algorithms with the power
>> of Hadoop. That might fit the bill too.
>>
>> Good luck,
>> Mark
>>
>> On Fri, Jun 8, 2012 at 1:25 AM, jason Yang <lin.yang.ja...@gmail.com> wrote:
>>
>>> Hi, Mark.
>>>
>>> Thank you for your reply.
>>>
>>> I have read the User Guide, but I'm still wondering what I can do in
>>> the following scenario:
>>> ----
>>> 1. Suppose I have a table t_customer_info in Hive, which includes lots
>>> of information about our customers.
>>> 2. Now I would like to cluster those customers into different groups so
>>> that customers within a group have high similarity, but are very dissimilar
>>> to customers in other groups.
>>> 3. This is a classical clustering problem in the data mining field. I
>>> think such a job cannot be done with a query language alone; it needs
>>> data mining algorithms instead.
>>> ----
>>>
>>> When we look "back" to the traditional DBMS, there are lots of data
>>> mining tools or BI tools which can connect to the DBMS and apply some
>>> canonical algorithms to the data in the DBMS. So I started to wonder: are
>>> there similar tools for Hive?
>>>
>>> If not, what's the most common way to do data mining over Hadoop?
>>>
>>> 2012/6/8 Mark Grover <grover.markgro...@gmail.com>
>>>
>>>> Hi Jason,
>>>> Hive is a data warehouse system that sits on top of Hadoop. The key
>>>> selling point here is that it allows users to write SQL-like queries to
>>>> query their large-scale data. These queries get compiled into Map Reduce
>>>> jobs, which are then run on the Hadoop cluster just like any other Map
>>>> Reduce job.
>>>>
>>>> Hadoop does all the parallel processing for you. All you have to do is
>>>> set up a Hadoop cluster, install Hive on the cluster and run your Hive
>>>> queries. All underlying processing will happen in parallel where possible.
>>>>
>>>> This is a good place to get started and learn more about Hive:
>>>> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>>>>
>>>> Welcome and good luck!
>>>>
>>>> Mark
>>>>
>>>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang
>>>> <lin.yang.ja...@gmail.com> wrote:
>>>>
>>>>> Hi, dear friends.
>>>>>
>>>>> I was wondering what's the popular way to do data mining on Hive?
>>>>>
>>>>> Since the data in Hive is distributed over the cluster, is there any
>>>>> tool or solution that could parallelize the data mining?
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>> --
>>>>> YANG, Lin
>>>
>>> --
>>> YANG, Lin

--
YANG, Lin
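
For the clustering scenario described in the thread (grouping customers so that members of a group are similar to each other), here is a minimal single-machine sketch of k-means in Python. It is only an illustration of the algorithm family being discussed, not Mahout's or Hama's implementation, and it does not run on Hadoop. It assumes the rows of t_customer_info have already been exported as numeric feature vectors; the two features used here (age, monthly orders) and the sample values are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means. The first k points seed the centroids,
    which keeps the sketch deterministic (real tools seed more carefully)."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignments, centroids


# Hypothetical sample: (age, monthly_orders) per customer.
customers = [
    (25.0, 1.0), (27.0, 2.0), (24.0, 1.5),
    (60.0, 9.0), (62.0, 8.0), (58.0, 9.5),
]
labels, centers = kmeans(customers, k=2)
print(labels)
print(centers)
```

On a real table the data would not fit in one process, which is where the distributed tools mentioned above come in; the assignment and update steps map naturally onto a map phase and a reduce phase.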