Re: [DISCUSS] Spark Columnar Processing

2019-04-13 Thread Bobby Evans
…generates code for element-wise selection (excluding sort and join). The SIMDization or GPUization capability depends on a compiler that translates the code generated by whole-stage codegen into native code. 3. The current Projection assumes row-oriented data storage, …
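To make the SIMDization point concrete, here is a minimal, hand-written sketch (not code from the thread) of the kind of tight element-wise loop over Spark's public ColumnVector API that a JIT compiler could translate into SIMD instructions; the column ordinal, the x * 2 + 1 projection, and the method name are illustrative assumptions.

```scala
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Hypothetical element-wise projection over one IntegerType column of a batch.
// A branch-light loop over contiguous primitive values like this is what a
// compiler can auto-vectorize; a row-at-a-time Projection over row-oriented
// data generally is not.
def projectTimesTwoPlusOne(batch: ColumnarBatch): Array[Int] = {
  val col: ColumnVector = batch.column(0) // assumes column 0 holds non-null ints
  val n = batch.numRows()
  val out = new Array[Int](n)
  var i = 0
  while (i < n) {
    out(i) = col.getInt(i) * 2 + 1        // simple element-wise body
    i += 1
  }
  out
}
```

Whether the JVM actually emits SIMD for such a loop depends on the JIT and the hardware, which is the dependency the quoted message is pointing at.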

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Reynold Xin
…We split it this way because we thought it would be simplest to implement, and because it would provide a benefit to more than just GPU-accelerated queries. …

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Bobby Evans
…the current structure and remaining issues. This is orthogonal to the cost-benefit trade-off discussion. The code generation basically consists of three parts: 1. Loading, 2. Selection, …

Re: [DISCUSS] Spark Columnar Processing

2019-04-05 Thread Bobby Evans
…ColumnVector (https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java) class. By combining it with ColumnarBatchScan, the whole-stage code generation generates code …
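For context on the classes named above: a ColumnarBatch bundles one ColumnVector per output column plus a row count, and it can be consumed either column-by-column (the access pattern columnar-aware generated code uses) or through its row view, which keeps existing row-based operators working. A small sketch against the public API; the column type and the sum aggregation are chosen only for illustration.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Column-wise consumption: read values straight out of the vector by ordinal.
def sumColumnWise(batch: ColumnarBatch): Long = {
  val col = batch.column(0)               // assumed IntegerType column
  var sum = 0L
  var i = 0
  while (i < batch.numRows()) {
    if (!col.isNullAt(i)) sum += col.getInt(i)
    i += 1
  }
  sum
}

// Row-wise consumption: the same batch viewed as InternalRows, so row-based
// operators can still process it without a separate conversion step.
def sumRowWise(batch: ColumnarBatch): Long = {
  var sum = 0L
  val it = batch.rowIterator()            // java.util.Iterator[InternalRow]
  while (it.hasNext) {
    val row: InternalRow = it.next()
    if (!row.isNullAt(0)) sum += row.getInt(0)
  }
  sum
}
```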

Re: [DISCUSS] Spark Columnar Processing

2019-04-03 Thread Bobby Evans
…storage if there is no row-based operation. Note: The current master does not support Arrow as a data source. However, I think it is not technically hard to support Arrow. 2. The current whole-stage codegen generates …
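On the "not technically hard to support Arrow" point: Spark already ships an ArrowColumnVector wrapper, so Arrow-backed memory can be exposed through the same ColumnVector interface the rest of the columnar code reads. A rough sketch under the assumption that arrow-vector is on the classpath; the column name and values are made up, and resource cleanup (closing the vector and allocator) is omitted.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Build a small Arrow IntVector, then wrap it so Spark code sees a ColumnVector.
val allocator = new RootAllocator(Long.MaxValue)
val arrowVec = new IntVector("v", allocator)   // "v" is an illustrative name
arrowVec.allocateNew(3)
arrowVec.set(0, 10); arrowVec.set(1, 20); arrowVec.set(2, 30)
arrowVec.setValueCount(3)

val sparkVec: ColumnVector = new ArrowColumnVector(arrowVec)
val batch = new ColumnarBatch(Array(sparkVec))
batch.setNumRows(3)
println(batch.column(0).getInt(2))             // 30
```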

Re: [DISCUSS] Spark Columnar Processing

2019-04-02 Thread Renjie Liu
…2. The current whole-stage codegen generates code for element-wise selection (excluding sort and join). The SIMDization or GPUization capability depends on a compiler that translates the code generated by whole-stage codegen into native code. …

Re: [DISCUSS] Spark Columnar Processing

2019-04-02 Thread Bobby Evans
…store row-oriented data, I think that is a part that Wenchen pointed out. My slides: https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41 …

Re: [DISCUSS] Spark Columnar Processing

2019-04-01 Thread Reynold Xin
…give a presentation about in-memory data storage for Spark at SAIS 2019: https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40 :)

Re: [DISCUSS] Spark Columnar Processing

2019-03-27 Thread Bobby Evans
…:) Kazuaki Ishizaki. From: Wenchen Fan; To: Bobby Evans; Cc: Spark dev list; Date: 2019/03/26 13:53; Subject: Re: [DISCUSS] Spark Columnar Processing. Do you have some initial perf numbers? …

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Kazuaki Ishizaki
…Spark dev list; Date: 2019/03/26 13:53; Subject: Re: [DISCUSS] Spark Columnar Processing. Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar …

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Bobby Evans
Reynold, From our experiments, it is not a massive refactoring of the code. Most expressions can be supported by a relatively small change while leaving the existing code path untouched. We didn't try to do columnar with code generation, but I suspect it would be similar, although the code generation …

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Reynold Xin
A 26% improvement is underwhelming if it requires massive refactoring of the codebase. Also, you can't just add the benefits up this way, because: (1) both vectorization and codegen reduce the overhead of virtual function calls; (2) vectorized code is more friendly to compilers / CPUs, but requires …

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Bobby Evans
Cloudera reports a 26% improvement in Hive query runtimes from enabling vectorization. I would expect to see similar improvements, but at the cost of keeping more data in memory. But remember this also enables a number of different hardware acceleration techniques. If the data format is Arrow-compatible …

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote: This thread is to discuss adding in support for …
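A rough sketch of the row-to-columnar boundary Wenchen describes: copy incoming values into writable column vectors and hand the resulting batch to a columnar-aware external consumer. OnHeapColumnVector is an internal Spark class, and the single-int-column schema, batch size, and function name are assumptions made for illustration, not anything proposed in the thread.

```scala
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Drain up to `capacity` ints from a row-like iterator into one ColumnarBatch.
def toBatch(ids: Iterator[Int], capacity: Int = 4096): ColumnarBatch = {
  val schema  = StructType(Seq(StructField("id", IntegerType)))
  val vectors = OnHeapColumnVector.allocateColumns(capacity, schema)
  var n = 0
  while (ids.hasNext && n < capacity) {
    vectors(0).putInt(n, ids.next())   // write column 0, row n
    n += 1
  }
  val batch = new ColumnarBatch(vectors.map(v => v: ColumnVector))
  batch.setNumRows(n)
  batch                                // caller is responsible for batch.close()
}
```

In practice the conversion would recycle the vectors batch by batch and close each batch once the consumer is done; that bookkeeping is omitted here.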