Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Sorry, I had a typo, I meant repartitionby("fieldofjoin"). On May 2, 2017, 9:44 PM, "KhajaAsmath Mohammed" wrote: Hi Angel, I am trying the below code but I don't see the partition on the dataframe. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry) import sqlContext._

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi Angel, I am trying the below code but I don't see the partition on the dataframe. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry) import sqlContext._ import sqlContext.implicits._ datapoint_prq_df.join(geoCacheLoc_df) val tableA = DfA.partitionby("joinField").f

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Have you tried partitioning by the join field and running the job in segments, filtering both tables to the same segment of data? Example: val tableA = dfA.repartition($"joinField").filter("firstSegment") val tableB = dfB.repartition($"joinField").filter("firstSegment") tableA.join(tableB) On May 2
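The suggestion above can be sketched as follows. This is a hedged illustration only: `dfA`, `dfB`, the join column `joinField`, the segment column `segment`, and the input paths are all hypothetical names, and the segment predicate would be whatever splits the data into manageable chunks in your case.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SegmentedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("segmented-join").getOrCreate()

    // Hypothetical inputs; substitute your own sources.
    val dfA = spark.read.parquet("/path/to/tableA")
    val dfB = spark.read.parquet("/path/to/tableB")

    // Repartition both sides by the join key so rows with the same key
    // land in the same partitions, then restrict both sides to the same
    // segment of data before joining.
    val tableA = dfA.repartition(col("joinField")).filter(col("segment") === "first")
    val tableB = dfB.repartition(col("joinField")).filter(col("segment") === "first")

    val joined = tableA.join(tableB, Seq("joinField"))
    joined.write.parquet("/path/to/output/segment=first")
  }
}
```

Repeating this per segment trades one huge shuffle for several smaller ones, which can help when a single monolithic join stalls.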

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Table 1 (192 GB) is partitioned by year and month ... the 192 GB of data is for one month, i.e. April. Table 2 (92 GB) is not partitioned. I have to perform a join on these tables now. On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta < angel.francisco.o...@gmail.com> wrote: > Hello, > > Is the
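Since table 1 is partitioned by year and month, filtering on those columns before the join lets Spark prune partitions and read only April's data. A minimal sketch, assuming hypothetical table names `table1` and `table2` and a hypothetical join column `joinKey`:

```scala
// Sketch only: table/column names are placeholders, not from the thread.
// Filtering on the partition columns (year, month) before the join lets
// Spark prune to just the April partitions of the 192 GB table.
val april = sqlContext.table("table1")
  .filter("year = 2017 AND month = 4")

val joined = april.join(sqlContext.table("table2"), Seq("joinKey"))
```

The unpartitioned 92 GB side still has to be shuffled in full, so raising `spark.sql.shuffle.partitions` from its default of 200 may also be needed for a join at this scale.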

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Hello, Are the tables partitioned? If yes, what is the partition field? Thanks. On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed" wrote: Hi, I am trying to join two big tables in Spark and the job is running for quite a long time without any results. Table 1: 192 GB Table 2: 92 GB Does any

Re: Joins in Spark

2016-03-19 Thread Rishi Mishra
My suspicion is that your input file partitions are small, hence only a small number of tasks are started. Can you provide some more details, like how you load the files and how the result size comes to around 500 GB? Regards, Rishitesh Mishra, SnappyData . (http://www.snappydata.io/) https://in.linkedin.com/in/ri
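If too few input splits are the bottleneck, parallelism can be raised explicitly. A hedged sketch, with placeholder paths and partition counts chosen for illustration only:

```scala
// Sketch: raise parallelism when the input produces few splits but the
// shuffle output is large. Paths and counts are hypothetical.

// For RDDs, ask for a minimum number of partitions at load time:
val lines = sc.textFile("/path/to/input", 400)

// Or redistribute an already-loaded dataset across more partitions:
val repartitioned = lines.repartition(400)

// For SQL/DataFrame joins, the shuffle side is governed by this setting
// (the default is 200 partitions):
sqlContext.setConf("spark.sql.shuffle.partitions", "800")
```

More partitions mean more, smaller tasks, which both increases parallelism and reduces the chance of any single task spilling or running long.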

Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi, You can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertice => (vertice, null)) The above gives you a PairRDD. After the join, make sure that you remove the superfluous null values. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan wrote: > Hi, > I have two RDDs, vertices and edg
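The trick above, spelled out: `join` in the RDD API is only defined on pair RDDs (`RDD[(K, V)]`), so a plain RDD must first be mapped into key/value form with a dummy value. A sketch, where `verticesRDD` and `edgesRDD` are assumed to be plain `RDD`s of the same element type:

```scala
// Sketch: turn plain RDDs into pair RDDs so join is available,
// then drop the placeholder null values afterwards.
val pairVertices = verticesRDD.map(v => (v, null))
val pairEdges    = edgesRDD.map(e => (e, null))

// join keeps only keys present in both sides: RDD[(V, (Null, Null))]
val joined = pairVertices.join(pairEdges)

// Discard the superfluous nulls, keeping just the common elements.
val common = joined.keys
```

This is effectively a set intersection expressed as a pair-RDD join; the `null` values carry no information and exist only to satisfy the pair-RDD shape.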