Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
The original email was asking about data partitioning (Hive style) for files, not in-memory caching. On Thursday, January 21, 2016, Takeshi Yamamuro wrote: > You mean RDD#partitions are possibly split into multiple Spark task partitions? > If so, the optimization below is wrong? > > Without op

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Takeshi Yamamuro
You mean RDD#partitions are possibly split into multiple Spark task partitions? If so, the optimization below is wrong? Without opt.: == Physical Plan == TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], outp

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
It is not necessary if you are using bucketing, which is available in Spark 2.0. For partitioning, it is still necessary because we do not assume each partition is small, and as a result there is no guarantee that all the records for a partition end up in a single Spark task partition. On Thu, Jan 21, 2016 at 3
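
A rough sketch of the distinction described above (the path, table name, and data are illustrative; bucketBy is only available from Spark 2.0 and requires saveAsTable):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
  val df = spark.range(0L, 1000000L).selectExpr("id % 100 AS id", "id AS value")

  // Hive-style partitioning: one directory per id value. The planner does not
  // assume each directory is small, so a later groupBy("id") may still shuffle.
  df.write.partitionBy("id").parquet("/tmp/events_partitioned")

  // Bucketing (Spark 2.0+): rows are hash-distributed into a fixed number of
  // buckets, which gives the guarantee needed to skip the shuffle.
  df.write.bucketBy(8, "id").sortBy("id").saveAsTable("events_bucketed")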

Re: How Spark utilize low-level architecture features?

2016-01-21 Thread borictan
Thanks, Fokko. Yes, increasing the parallelism is one way to speed up performance. On the other hand, we are also looking for opportunities to harness the hardware on each node to increase single-node performance, which will help overall performance. Thanks, Boric

Re: How Spark utilize low-level architecture features?

2016-01-21 Thread borictan
Thanks for the explanation, Steve. I don't want to control where the work is done. What I wanted to understand is whether Spark can take advantage of the underlying architecture features. For example, if the CPUs on the nodes support improved vector instructions, can the Spark jobs (if they hav

Generate Amplab queries set

2016-01-21 Thread sara mustafa
Hi, I have downloaded the Amplab benchmark dataset from s3n://big-data-benchmark/pavlo/text/tiny, but I don't know how to generate a set of random mixed queries of different types, such as scan, aggregate, and join. Thanks,
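
A rough sketch of one way to generate such a mix, assuming the benchmark's rankings and uservisits tables have been registered as temporary tables (e.g. in the spark-shell where sqlContext is in scope); the three query shapes follow the published Amplab benchmark, while the parameter ranges and query count below are illustrative:

  import scala.util.Random

  // Build a random mix of the three Amplab benchmark query shapes:
  // scan, aggregate, and join (the join is simplified, with no date filter).
  def randomQuery(rng: Random): String = rng.nextInt(3) match {
    case 0 => // scan
      s"SELECT pageURL, pageRank FROM rankings WHERE pageRank > ${rng.nextInt(1000)}"
    case 1 => // aggregate
      val prefix = 4 + rng.nextInt(8)
      s"SELECT SUBSTR(sourceIP, 1, $prefix), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, $prefix)"
    case _ => // join
      """SELECT uv.sourceIP, SUM(uv.adRevenue) AS totalRevenue, AVG(r.pageRank) AS avgPageRank
        |FROM rankings r JOIN uservisits uv ON r.pageURL = uv.destURL
        |GROUP BY uv.sourceIP ORDER BY totalRevenue DESC LIMIT 1""".stripMargin
  }

  val rng = new Random(42)
  val mixedQueries = Seq.fill(30)(randomQuery(rng))
  mixedQueries.foreach(q => sqlContext.sql(q).collect())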

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Ulanov, Alexander
Hi Kazuaki, Indeed, moving data to/from the GPU is costly, and this benchmark summarizes those costs for different data sizes with regard to matrix multiplication. These costs are paid for the convenience of using the standard BLAS API that Nvidia NVBLAS provides. The thing is that there are
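
For reference, a minimal sketch of a call going through that standard BLAS API via netlib-java (the matrix size is arbitrary); when NVBLAS is configured as the system BLAS, the dgemm below is offloaded to the GPU, but the input and output arrays are copied across the bus on every call, which is exactly the transfer cost being discussed:

  import com.github.fommil.netlib.BLAS

  // Square matrices in column-major order; the size is arbitrary.
  val n = 2048
  val a = Array.fill(n * n)(math.random)
  val b = Array.fill(n * n)(math.random)
  val c = new Array[Double](n * n)

  // C := 1.0 * A * B + 0.0 * C. With NVBLAS intercepting the native BLAS,
  // a, b, and c are shipped to GPU memory and the result copied back here.
  BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)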

Re: How Spark utilize low-level architecture features?

2016-01-21 Thread Steve Loughran
> On 19 Jan 2016, at 16:12, Boric Tan wrote: Hi there, I am new to Spark, and would like to get some help understanding whether Spark can utilize the underlying architecture for better performance. If so, how does it do it? For example, assume there is a cluster built with machin

Re: How Spark utilize low-level architecture features?

2016-01-21 Thread Driesprong, Fokko
Hi Boric, The Spark MLlib package is built on top of Breeze, which in turn uses netlib-java. This netlib-java library can be optimized for each system by compiling it for the specific architecture: *To get optimal pe
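
A quick way to check which backend netlib-java actually picked up on a given node (a sketch; the class names are what netlib-java reports):

  import com.github.fommil.netlib.{BLAS, LAPACK}

  // Prints e.g. "com.github.fommil.netlib.F2jBLAS" (pure-Java fallback) or
  // "com.github.fommil.netlib.NativeSystemBLAS" (an architecture-optimized
  // system library such as OpenBLAS or MKL picked up via JNI).
  println("BLAS:   " + BLAS.getInstance().getClass.getName)
  println("LAPACK: " + LAPACK.getInstance().getClass.getName)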

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Allen Zhang
Hi Kazuaki, JCuda is actually a wrapper around **pure** CUDA, and the 3.15x performance boost for logistic regression shown on your wiki page seems slower than BIDMat-cublas or pure CUDA. Could you elaborate on why you chose JCuda rather than JNI to call CUDA directly? Regards, Allen Zhang

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Kazuaki Ishizaki
Dear all, Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be

Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Justin Uang
Hi, If I had a df and wrote it out via partitionBy("id"), then presumably, when I load the df back in and do a groupBy("id"), a shuffle shouldn't be necessary, right? Effectively, we can load in the dataframe with a hash partitioner already set, since each task can simply read all the folders where id= whe
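
A minimal sketch of the pattern being asked about (paths and column names are made up); checking the plan with explain() shows whether an Exchange, i.e. a shuffle, is still inserted after reading the partitioned layout back:

  // Write with a Hive-style layout: one directory per id value,
  // e.g. /data/events_by_id/id=42/...
  val df = sqlContext.read.parquet("/data/events")
  df.write.partitionBy("id").parquet("/data/events_by_id")

  // Read it back and group on the partitioning column.
  val grouped = sqlContext.read.parquet("/data/events_by_id")
    .groupBy("id")
    .count()

  // If the physical plan contains an Exchange operator, a shuffle is still performed.
  grouped.explain()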