Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Yikun Jiang Tue, 04 May 2021 00:22:39 -0700

@Saurabh @Mr.Powers Thanks for the input information.

I personal perfer to introduce the `withColumns` because it bring more
friendly development experience rather than select( * ).


This is the PR to add `withColumns`:
https://github.com/apache/spark/pull/32431

Regards,
Yikun


Saurabh Chawla <s.saurabh...@gmail.com> 于2021年4月30日周五 下午1:13写道：

> Hi All,
>
> I also had a scenario where at runtime, I needed to loop through a
> dataframe to use withColumn many times.
>
>  For the safer side I used the reflection to access the withColumns to
> prevent any java.lang.StackOverflowError.
>
> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
> val newConfigurationMethod =
>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], 
> classOf[Seq[Column]])
> newConfigurationMethod.invoke(
>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>
> It would be great if we use the "withColumns" rather than using the
> reflection code like this.
> or
> make changes in the code to merge the project with existing project in the
> plan, instead of adding the new project every time we call the "
> withColumn".
>
> +1 for exposing the *withColumns*
>
> Regards
> Saurabh Chawla
>
> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang <yikunk...@gmail.com> wrote:
>
>> Hi, all
>>
>> *Background:*
>>
>> Currently, there is a withColumns
>> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
>> method to help users/devs add/replace multiple columns at once.
>> But this method is private and isn't exposed as a public API interface,
>> that means it cannot be used by the user directly, and also it is not
>> supported in PySpark API.
>>
>> As the dataframe user, I can only call withColumn() multiple times:
>>
>> df.withColumn("key1", col("key1")).withColumn("key2", 
>> col("key2")).withColumn("key3", col("key3"))
>>
>> rather than:
>>
>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
>> col("key3")])
>>
>> Multiple calls bring some higher cost on developer experience and
>> performance. Especially in a PySpark related scenario, multiple calls mean
>> multiple py4j calls.
>>
>> As mentioned
>> <https://github.com/apache/spark/pull/32276#issuecomment-824461143> from
>> @Hyukjin, there were some previous discussions on  SPARK-12225
>> <https://issues.apache.org/jira/browse/SPARK-12225> [2] .
>>
>> [1]
>> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>
>> *Potential solution:*
>> Looks like there are 2 potential solutions if we want to support it:
>>
>> 1. Introduce a *withColumns *api for Scala/Python.
>> A separate public withColumns API will be added in scala/python api.
>>
>> 2. Make withColumn can receive *single col *and also the* list of cols*.
>> I did some experimental try on PySpark on
>> https://github.com/apache/spark/pull/32276
>> Just like Maciej said
>> <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>
>> it will bring some confusion with naming.
>>
>>
>> Thanks for your reading, feel free to reply if you have any other
>> concerns or suggestions!
>>
>>
>> Regards,
>> Yikun
>>
>

Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Reply via email to