Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Kazuaki Ishizaki
This looks like an interesting discussion. Let me describe the current structure and the remaining issues; this is orthogonal to the cost-benefit trade-off discussion. The code generation basically consists of three parts: 1. Loading, 2. Selection (map, filter, ...), 3. Projection. 1. Columnar storage (e.g. Parquet, ORC
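As a side note on how to see those three parts in practice, here is a minimal PySpark sketch, not from the message above, that prints the code whole-stage codegen produces for a simple scan/filter/projection query (the query itself is made up for illustration):

    # Illustrative sketch only: dump the Java source that whole-stage codegen
    # generates for a simple plan. The generated code interleaves the three
    # stages discussed above: loading rows, evaluating the filter (selection),
    # and building the output row (projection).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

    plan = spark.sql(
        "EXPLAIN CODEGEN SELECT id * 2 AS doubled FROM range(1000) WHERE id % 2 = 0"
    )
    plan.show(truncate=False)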

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Bobby Evans
Reynold, from our experiments, it is not a massive refactoring of the code. Most expressions can be supported by a relatively small change while leaving the existing code path untouched. We didn't try to do columnar with code generation, but I suspect it would be similar, although the code generation

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
Nice, will test it out. +1 On Tue, Mar 26, 2019, 22:38 Reynold Xin wrote: > We just made the repo public: https://github.com/databricks/spark-pandas > On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter wrote: >> To add more details to what Reynold mentioned. As you said, there is going

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-26 Thread shane knapp
i'm pretty certain that i've got a solid python 3.5 conda environment ready to be deployed, but this isn't a minor change to the build system and there might be some bugs to iron out. another problem is that the current python 3.4 environment is hard-coded into both the build scripts on jenkins

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-26 Thread Bryan Cutler
Thanks Hyukjin. The plan is to get this done for 3.0 only. Here is a link to the JIRA: https://issues.apache.org/jira/browse/SPARK-27276. Shane is also correct that newer versions of pyarrow have dropped support for Python 3.4, so we should probably have Jenkins test against 2.7 and 3.5. On M

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-26 Thread Mark Hamstra
Yes, I do expect that the application-level approach outlined in this SPIP will be sufficiently useful to be worth doing despite any concerns about it not being ideal. My concern is not just about this design, however. It feels to me like we are running into limitations of the current Spark scheduler

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Reynold Xin
A 26% improvement is underwhelming if it requires massive refactoring of the codebase. Also, you can't just add the benefits up this way, because: - Both vectorization and codegen reduce the overhead of virtual function calls - Vectorized code is more friendly to compilers / CPUs, but requires

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
We just made the repo public: https://github.com/databricks/spark-pandas On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter <timhun...@databricks.com> wrote: > To add more details to what Reynold mentioned. As you said, there are going to be some slight differences in any case between Pandas

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-26 Thread Imran Rashid
+1 on the updated SPIP. I agree with all of Mark's concerns, that eventually we want some way for users to express per-task constraints -- but I feel like this is still a reasonable step forward. In the meantime, users will either write small Spark applications, which just do the steps which need

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Bobby Evans
Cloudera reports a 26% improvement in Hive query runtimes from enabling vectorization. I would expect to see similar improvements, but at the cost of keeping more data in memory. Remember, though, that this also enables a number of different hardware acceleration techniques. If the data format is Arrow compatible
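As an aside, Spark already exposes some related knobs for its existing columnar read path and the Arrow-based exchange with Python; the proposal in this thread goes beyond these. A minimal sketch, assuming Spark 2.3/2.4 config names and that pandas and pyarrow are installed:

    # Illustrative sketch of existing, related settings only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("columnar-settings")
        # Vectorized (columnar) readers for Parquet and ORC scans.
        .config("spark.sql.parquet.enableVectorizedReader", "true")
        .config("spark.sql.orc.enableVectorizedReader", "true")
        # Arrow-based columnar transfer for toPandas() and pandas UDFs.
        .config("spark.sql.execution.arrow.enabled", "true")
        .getOrCreate()
    )

    df = spark.range(10)
    pdf = df.toPandas()  # uses the Arrow path when enabled and available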

RE: How to build single jar for single project in spark

2019-03-26 Thread Ajith shetty
You can try using the -pl maven option for this: mvn clean install -pl :spark-core_2.11 From: Qiu, Gerry To: zhangliyun; dev@spark.apache.org Date: 2019-03-26 14:34:20 Subject: RE: How to build single jar for single project in spark You can try this: https://spark.apache.org/docs/latest/building-spark.html#building-submodules-individually
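A sketch of the full sequence being suggested, assuming a Spark 2.3.2 source checkout built with the bundled Maven wrapper; the exact jar path may differ on your machine:

    # Build only the spark-core module (add -am to also rebuild the modules it depends on).
    ./build/mvn clean package -pl :spark-core_2.11 -DskipTests

    # The jar typically lands under core/target/; copy it over the one shipped
    # with your distribution (path assumed for illustration).
    cp core/target/spark-core_2.11-2.3.2.jar $SPARK_HOME/jars/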

RE: How to build single jar for single project in spark

2019-03-26 Thread Qiu, Gerry
You can try this: https://spark.apache.org/docs/latest/building-spark.html#building-submodules-individually Thanks, Gerry From: zhangliyun Sent: March 26, 2019 16:50 To: dev@spark.apache.org Subject: How to build single jar for single project in spark Hi all: I have a question when I modify one

How to build single jar for single project in spark

2019-03-26 Thread zhangliyun
Hi all: I have a question. When I modify one file in the Spark project, such as org/apache/spark/sql/execution/ui/SparkPlanGraph.scala, can I build only the single jar spark-core_2.11-2.3.2.jar? After building the single jar, I would copy it to the $SPARK_HOME/jars directory. If anyone knows

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Timothee Hunter
To add more details to what Reynold mentioned: as you said, there are going to be some slight differences between Pandas and Spark in any case, simply because Spark needs to know the return types of the functions. In your case, you would need to slightly refactor your apply method to the
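One way that refactoring can look, as an illustrative sketch rather than anything from this thread: a grouped-map pandas UDF has to declare its output schema up front, whereas a plain pandas apply infers it. The column names and function body below are hypothetical, and the API shown is the Spark 2.3/2.4 one:

    # Illustrative only: Spark needs the return schema declared, unlike pandas.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"]
    )

    # In pandas you might write:
    #   pdf.groupby("key").apply(lambda g: g.assign(value=g.value - g.value.mean()))
    # In PySpark the output schema must be stated explicitly:
    @pandas_udf("key string, value double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        pdf = pdf.copy()
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    df.groupby("key").apply(demean).show()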

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Hyukjin Kwon
BTW, I am working on documentation related to this subject at https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences. On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin wrote: > We have some early stuff there but not quite ready to talk about it in > public yet (I hope soon though).