>>> We split it this way because we thought it would be simplest to
>>> implement, and because it would provide a benefit to more than just
>>> GPU accelerated queries.

>> Let me describe the current structure and remaining issues. This is
>> orthogonal to the cost-benefit trade-off discussion.
>>
>> The code generation basically consists of three parts:
>> 1. Loading
>> 2. Selection
>> 3. Projection
>>
>> 1. The current Spark can already load columnar data through the ColumnVector
>> (https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java)
>> class. By combining it with ColumnarBatchScan, the whole-stage code
>> generation generates code that reads data directly from the columnar
>> storage if there is no row-based operation.
>> Note: the current master does not support Arrow as a data source.
>> However, I think it is not technically hard to support Arrow.
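
For illustration, here is a minimal Java sketch of the loading boundary
described in point 1, using the ColumnVector/ColumnarBatch classes linked
above. The class name RowBoundarySketch and the sample values are invented,
and the code that ColumnarBatchScan actually generates differs; this only
shows the columnar-to-row boundary that codegen can skip.

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    import java.util.Iterator;

    public class RowBoundarySketch {
      public static void main(String[] args) {
        int numRows = 4;
        // One writable vector per column of the batch.
        OnHeapColumnVector col =
            new OnHeapColumnVector(numRows, DataTypes.IntegerType);
        for (int i = 0; i < numRows; i++) {
          col.putInt(i, i * 10);
        }
        ColumnarBatch batch = new ColumnarBatch(new OnHeapColumnVector[] {col});
        batch.setNumRows(numRows);

        // The columnar-to-row boundary: row-based operators pull rows out of
        // the batch one by one. With ColumnarBatchScan, generated code can
        // instead read the vectors directly when no row-based operation exists.
        Iterator<InternalRow> rows = batch.rowIterator();
        while (rows.hasNext()) {
          System.out.println(rows.next().getInt(0));
        }
        batch.close();
      }
    }
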
>> 2. The current whole-stage codegen generates code for element-wise
>> selection (excluding sort and join). The SIMDization or GPUization
>> capability depends on a compiler that translates the code generated by
>> the whole-stage codegen into native code.
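
To make the SIMDization point concrete, below is a hypothetical example of
the shape of code that element-wise codegen produces: a tight loop with no
virtual calls, which a JIT or native compiler may auto-vectorize. The method
name selectAndScale and the operation are invented, not Spark's actual output.

    public class ElementWiseSketch {
      // Keep values greater than `threshold` and double them, writing the
      // survivors to `output`. Returns the number of selected rows.
      static int selectAndScale(int[] input, int[] output, int threshold) {
        int selected = 0;
        for (int i = 0; i < input.length; i++) {
          // No per-row virtual dispatch inside the loop, so the compiler is
          // free to apply SIMD; on a GPU backend the same loop shape would
          // map to a kernel.
          if (input[i] > threshold) {
            output[selected++] = input[i] * 2;
          }
        }
        return selected;
      }
    }
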
>> 3. The current Projection assumes storing row-oriented data; I think
>> that is the part that Wenchen pointed out.
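
To illustrate what "assumes storing row-oriented data" means, here is a
hypothetical sketch (plain Java, not Spark's UnsafeRow machinery): the
projected result is materialized one row record at a time even when the
input arrived as column vectors, so downstream columnar processing would
need a transpose back.

    public class ProjectionSketch {
      // Row-oriented output: each result row is its own record.
      static final class Row {
        final int a;
        final long b;
        Row(int a, long b) { this.a = a; this.b = b; }
      }

      // Project columns (a, b) out of columnar input, producing rows.
      static Row[] project(int[] colA, long[] colB, int numRows) {
        Row[] out = new Row[numRows];
        for (int i = 0; i < numRows; i++) {
          out[i] = new Row(colA[i], colB[i]); // column-to-row copy per row
        }
        return out;
      }
    }
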
>>
>> My slides:
>> https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41
>>
>> FYI: I will give a presentation about in-memory data storages for Spark
>> at SAIS 2019
>> (https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40) :)
>>
>> Kazuaki Ishizaki

> From: Wenchen Fan
> To: Bobby Evans
> Cc: Spark dev list
> Date: 2019/03/26 13:53
> Subject: Re: [DISCUSS] Spark Columnar Processing
>
> Do you have some initial perf numbers? It seems fine to me to remain
> row-based inside Spark with whole-stage-codegen, and convert rows to
> columnar batches when communicating with external systems.
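
A minimal sketch of the row-to-columnar conversion Wenchen describes, with
an invented pack() helper: values computed row by row inside Spark are
copied into a ColumnarBatch at the boundary with an external (e.g.,
Arrow-based) system.

    import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class RowsToColumnarSketch {
      // Pack one long column's worth of row values into a columnar batch.
      public static ColumnarBatch pack(long[] rowValues) {
        OnHeapColumnVector col =
            new OnHeapColumnVector(rowValues.length, DataTypes.LongType);
        for (int i = 0; i < rowValues.length; i++) {
          col.putLong(i, rowValues[i]); // row-to-column copy happens here
        }
        ColumnarBatch batch = new ColumnarBatch(new OnHeapColumnVector[] {col});
        batch.setNumRows(rowValues.length);
        return batch; // caller is responsible for closing the batch
      }
    }
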
Reynold,

From our experiments, it is not a massive refactoring of the code. Most
expressions can be supported by a relatively small change while leaving the
existing code path untouched. We didn't try to do columnar with code
generation, but I suspect it would be similar, although the code gene[...]

26% improvement is underwhelming if it requires massive refactoring of the
codebase. Also you can't just add the benefits up this way, because:
- Both vectorization and codegen reduce the overhead of virtual function calls
- Vectorization code is more friendly to compilers / CPUs, but requires [...]
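
The overlap is easiest to see in code. Below is a hypothetical comparison
(the RowSource interface and sum methods are invented): the row-at-a-time
model pays a virtual call per row, and both codegen and vectorization attack
that same cost, one by fusing operators into a single loop, the other by
amortizing the call over a batch, which is why their benefits don't simply add.

    public class OverheadSketch {
      // Row-at-a-time (Volcano) style: one virtual call per row.
      interface RowSource {
        boolean hasNext();
        int next();
      }

      static long sumRows(RowSource source) {
        long sum = 0;
        while (source.hasNext()) {
          sum += source.next(); // virtual dispatch on every row
        }
        return sum;
      }

      // Vectorized style: one call per batch, plain loop inside.
      static long sumBatch(int[] batch, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) {
          sum += batch[i]; // no dispatch; compiler-friendly
        }
        return sum;
      }
    }
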
Cloudera reports a 26% improvement in Hive query runtimes by enabling
vectorization. I would expect to see similar improvements, but at the cost
of keeping more data in memory. But remember this also enables a number of
different hardware acceleration techniques. If the data format is Arrow
compatible [...]

On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote:
> This thread is to discuss adding in support fo[...]