Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
So it does not benefit from Project Tungsten, right?

On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin wrote:
> It's a completely different path.

On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote:
> I would like to know if Hive on Spark uses or shares the execution…

Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
I would like to know if Hive on Spark uses or shares the execution code with Spark SQL or DataFrames. More specifically, does Hive on Spark benefit from the changes made to Spark SQL in Project Tungsten? Or is it a completely different execution path where it creates its own plan and executes on RDDs?

Code generation for GPU

2015-09-07 Thread lonikar
Hi, I am speaking at the Spark Europe summit on exploiting GPUs for columnar DataFrame operations. I was going through the various blogs, talks and JIRAs by all of you, trying to figure out where to make changes for this proposal. First of all, I must thank the recent progress in Project Tungsten that…

Fwd: Code generation for GPU

2015-09-03 Thread kiran lonikar
Hi, I am speaking at the Spark Europe summit on exploiting GPUs for columnar DataFrame operations. I was going through the various blogs, talks and JIRAs by all the key Spark folks and trying to figure out…

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
…soon.

On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote:
> Possibly in the future, if and when the Spark architecture allows workers to launch Spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have an RDD of RDDs.

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Possibly in the future, if and when the Spark architecture allows workers to launch Spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have an RDD of RDDs.

On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar wrote:
> A similar question was asked before:…

Re: Rdd of Rdds

2015-06-09 Thread lonikar
Replicating my answer to another question asked today. Here is one of the reasons why I think RDD[RDD[T]] is not possible:

- RDD is only a handle to the actual data partitions. It has a reference/pointer to the SparkContext object (sc) and a list of partitions.
- The SparkContext is an…

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
A similar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

Here is one of the reasons why I think RDD[RDD[T]] is not possible:

- RDD is only a handle to the actual data partitions. It has a reference/pointer to the SparkContext object…
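The driver-side-handle argument above can be sketched in code. A minimal illustration, assuming spark-core on the classpath and a local SparkContext; the names `parts` and `flat` are hypothetical, and `sc.union` is the usual workaround for "a collection of RDDs" without nesting:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object NestedRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "nested-rdd-sketch")

    // An RDD is a driver-side handle: it carries a reference to the
    // SparkContext and a list of partitions. Closures shipped to executors
    // cannot use that SparkContext, so an RDD cannot live inside another
    // RDD's partitions -- RDD[RDD[T]] would fail at runtime.

    // Workaround: keep the nesting on the driver as a Seq[RDD[T]] and
    // flatten it with a driver-side union instead.
    val parts: Seq[RDD[Int]] = Seq(sc.parallelize(1 to 3), sc.parallelize(4 to 6))
    val flat: RDD[Int] = sc.union(parts)
    println(flat.collect().sorted.mkString(","))  // 1,2,3,4,5,6
    sc.stop()
  }
}
```

The union happens on the driver, where the SparkContext is valid, which is exactly why it works while nesting does not.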

Re: Optimisation advice for Avro->Parquet merge job

2015-06-08 Thread kiran lonikar
…at 12:30 PM, kiran lonikar wrote:
> James,
>
> As I can see, there are three distinct parts to your program:
> - for loop
> - synchronized block
> - final outputFrame.save statement
>
> Can you do a separate timing measurement by putting a simple System.currentTimeMillis()…

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
…method or a SQLContext method returns a DataFrame or an RDD, then it is lazily evaluated, since DataFrame and RDD are both lazily evaluated.

Cheng

On 6/8/15 8:11 PM, kiran lonikar wrote:
> Thanks. Can you point me to a place in the documentation of SQL…

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
…are also lazily evaluated. However, DataFrame transformations like filter(), select(), agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions.

Cheng

On 6/8/15 1:33 PM, kiran lonikar wrote:
> Thanks for replying twice…
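The transformation-vs-action distinction quoted above can be shown in a short sketch, assuming spark-core and spark-sql (1.x API) on the classpath; the column names and data are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object LazySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "lazy-sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    val filtered = df.filter($"id" > 1)  // transformation: only builds a plan, no job runs
    filtered.show()                      // action: triggers actual execution
    println(filtered.count())            // another action, runs a second job
    sc.stop()
  }
}
```

Nothing touches the cluster until show() or count() is called; the filter() call merely extends the logical plan.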

Re: Column operation on Spark RDDs.

2015-06-08 Thread lonikar
Two simple suggestions:

1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark job.

Below is your code with the above changes:
var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithUniqueId…
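Spelled out, the two suggestions look like this; a sketch that substitutes an in-memory collection for the original sc.textFile("/test.csv") so it stands alone:

```scala
import org.apache.spark.SparkContext

object ZipSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "zip-sketch")
    // Stand-in for sc.textFile("/test.csv").map(_.split(","))
    val dataRDD = sc.parallelize(Seq("a,b", "c,d")).map(_.split(","))

    // zipWithUniqueId assigns ids from partition metadata alone, unlike
    // zipWithIndex, which first runs a Spark job to count elements per partition.
    val dt = dataRDD.zipWithUniqueId()

    // Reuse dt for any further indexing needs instead of zipping a second time.
    dt.collect().foreach { case (row, id) => println(s"$id: ${row.mkString("|")}") }
    sc.stop()
  }
}
```

Note the ids from zipWithUniqueId are unique but not consecutive; if strictly consecutive indices are required, zipWithIndex remains the right (costlier) choice.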

Re: Optimisation advice for Avro->Parquet merge job

2015-06-08 Thread kiran lonikar
James,

As I can see, there are three distinct parts to your program:
- for loop
- synchronized block
- final outputFrame.save statement

Can you do a separate timing measurement by putting a simple System.currentTimeMillis() around these blocks to know how much time they are taking, and then…
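The timing suggestion can be sketched as a small helper in plain Scala, no Spark needed; the block labels are hypothetical stand-ins for the three parts of the program:

```scala
object TimingSketch {
  // Wraps a block, prints how long it took, and returns the block's result.
  def timed[T](label: String)(block: => T): T = {
    val start = System.currentTimeMillis()
    val result = block
    println(s"$label took ${System.currentTimeMillis() - start} ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val loopResult = timed("for loop") { (1 to 1000).sum }
    timed("synchronized block") { this.synchronized { Thread.sleep(10) } }
    // timed("outputFrame.save") { ... }  // wrap the save statement the same way
    println(loopResult)  // 500500
  }
}
```

Comparing the three printed durations shows which part dominates and is worth optimising first.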

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I had not sent it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to the Shark colu…

Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-03 Thread lonikar
When Spark reads Parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or the row-wise structure that a regular Spark RDD has. The section Spark SQL and DataFrames…

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When Spark reads Parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or the row-wise structure that a regular Spark RDD has. The section Spark SQL and DataFrames…