aggregateByKey on PairRDD

2016-03-29 Thread Suniti Singh
Hi All, I have an RDD having the data in the following form: tempRDD: RDD[(String, (String, String))] (brand, (product, key)) ("amazon",("book1","tech")) ("eBay",("book1","tech")) ("barns&noble",("book","tech")) ("amazon",("book2","tech")) I would like to group the data by Brand and would …
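A minimal sketch of the grouping this thread asks about, using aggregateByKey on the sample pairs above (object and app names are illustrative): the zero value is an empty list, seqOp folds records in within each partition, and combOp merges the partial lists across partitions.

    import org.apache.spark.{SparkConf, SparkContext}

    object BrandAggregate {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("BrandAggregate").setMaster("local[*]"))

        // (brand, (product, key)) pairs, as in the original post
        val tempRDD = sc.parallelize(Seq(
          ("amazon", ("book1", "tech")),
          ("eBay", ("book1", "tech")),
          ("barns&noble", ("book", "tech")),
          ("amazon", ("book2", "tech"))))

        // Build a per-brand list of (product, key) pairs: seqOp prepends one
        // record within a partition, combOp concatenates partial lists.
        val byBrand = tempRDD.aggregateByKey(List.empty[(String, String)])(
          (acc, v) => v :: acc,
          (a, b) => a ::: b)

        byBrand.collect().foreach(println)
        // e.g. (amazon,List((book2,tech), (book1,tech)))

        sc.stop()
      }
    }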

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
further processing. I am kind of stuck. On Tue, Mar 15, 2016 at 10:50 AM, Suniti Singh wrote: > Is it always the case that one title is a substring of another? -- Not > always. One title can have values like D.O.C, doctor_{areacode}, > doc_{dep,areacode} > > On Mon, Mar 14, 2016, …
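Since the variants mentioned here (D.O.C, doctor_{areacode}, doc_{dep,areacode}) are not substrings of one another, one possible preprocessing step, sketched under guessed naming rules, is to normalize each title to a stem before measuring any distance:

    // Rough normalizer (the suffix pattern is an assumption about the data):
    // "D.O.C" -> "doc", "doctor_{areacode}" -> "doctor",
    // "doc_{dep,areacode}" -> "doc"
    def normalizeTitle(title: String): String =
      title.toLowerCase
        .replaceAll("_\\{[^}]*\\}", "") // drop suffixes like _{areacode}
        .replaceAll("[^a-z0-9]", "")    // drop dots and other punctuation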

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
…that one title is a substring of another? > > On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh > wrote: > >> Hi All, >> >> I have two tables with the same schema but different data. I have to join the >> tables based on one column and then do a group by on the same column name …

Compare a column in two different tables/find the distance between column data

2016-03-14 Thread Suniti Singh
Hi All, I have two tables with the same schema but different data. I have to join the tables based on one column and then do a group by on the same column name. Now the data in that column in the two tables might or might not exactly match. (Ex - column name is "title". Table1.title = "doctor" and Table2.title = …
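For the fuzzy join itself, one hedged option in Spark 1.6 is the built-in levenshtein column function, joining on an edit-distance threshold instead of equality (the threshold of 3 and the toy data are made up, and such a join is effectively a cross join, so it gets expensive on large tables):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{col, levenshtein}

    object TitleDistanceJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("TitleDistanceJoin").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy stand-ins for the two tables; only the "title" column matters here
        val table1 = sc.parallelize(Seq("doctor", "nurse")).toDF("title")
        val table2 = sc.parallelize(Seq("doc", "nurse_practitioner")).toDF("title")

        // Join where the edit distance between titles is small, then group by
        // the title from the first table, as the original question describes.
        val joined = table1.as("t1").join(table2.as("t2"),
          levenshtein(col("t1.title"), col("t2.title")) <= 3)
        joined.groupBy(col("t1.title")).count().show()

        sc.stop()
      }
    }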

Re: Adding hive context gives error

2016-03-07 Thread Suniti Singh
Hi Suniti, > > why are you mixing spark-sql version 1.2.0 with spark-core, spark-hive v > 1.6.0? > > I’d suggest you try to keep all the libs at the same version. > > On Mar 7, 2016, at 6:15 PM, Suniti Singh wrote: > > > > org.apache.spa…
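Following that suggestion, a pom.xml fragment with every Spark artifact pinned to one version might look like this (1.6.0 and the Scala 2.10 artifact suffix are assumptions; match whatever the cluster runs):

    <properties>
      <spark.version>1.6.0</spark.version>
    </properties>
    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>${spark.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>${spark.version}</version>
      </dependency>
    </dependencies>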

Adding hive context gives error

2016-03-07 Thread Suniti Singh
Hi All, I am trying to create a HiveContext in a Scala program as follows in Eclipse. Note -- I have added the Maven dependencies for spark-core, spark-hive, and spark-sql. import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD.rddToPairRDDFunctions object D…
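A minimal self-contained version of what this post seems to be building (names are illustrative; assumes spark-core, spark-sql, and spark-hive are all at the same 1.x version, per the reply above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveContextDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("HiveContextDemo").setMaster("local[*]"))

        // In Spark 1.x, HiveContext extends SQLContext with HiveQL support;
        // it needs spark-hive on the classpath at the same version as spark-core.
        val hiveContext = new HiveContext(sc)
        hiveContext.sql("SHOW TABLES").show()

        sc.stop()
      }
    }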