I've created my own implicit withColumnsRenamed for such a purpose, which
just accepted a Map[String, String] and called withColumnRenamed multiple
times.
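
A minimal sketch of what that implicit could look like (the class name and
details here are illustrative, not the actual code):

import org.apache.spark.sql.DataFrame

// Fold over the rename map, calling withColumnRenamed once per entry.
implicit class RenameOps(df: DataFrame) {
  def withColumnsRenamed(renames: Map[String, String]): DataFrame =
    renames.foldLeft(df) { case (acc, (from, to)) =>
      acc.withColumnRenamed(from, to)
    }
}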

Tue, 4 May 2021 at 10:22, Yikun Jiang <yikunk...@gmail.com>:

> @Saurabh @Mr.Powers Thanks for the input information.
>
> I personally prefer to introduce `withColumns` because it brings a more
> friendly development experience than select( * ).
>
> This is the PR to add `withColumns`:
> https://github.com/apache/spark/pull/32431
>
> Regards,
> Yikun
>
>
> Saurabh Chawla <s.saurabh...@gmail.com> wrote on Fri, 30 Apr 2021 at 1:13 PM:
>
>> Hi All,
>>
>> I also had a scenario where, at runtime, I needed to loop over a set of
>> columns and call withColumn on a dataframe many times.
>>
>> To be on the safer side, I used reflection to access withColumns to
>> prevent any java.lang.StackOverflowError.
>>
>> // Look up the private Dataset.withColumns(Seq[String], Seq[Column]) method
>> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
>> val withColumnsMethod =
>>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], classOf[Seq[Column]])
>> // columnName: Seq[String] of names; columnValue: Seq[Column] of expressions
>> withColumnsMethod.invoke(
>>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>>
>> It would be great if we could use "withColumns" directly rather than
>> reflection code like this, or
>> change the code to merge the new projection with the existing Project in
>> the plan, instead of adding a new Project every time we call
>> "withColumn".
>>
>> +1 for exposing *withColumns*
>>
>> Regards
>> Saurabh Chawla
>>
>> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang <yikunk...@gmail.com> wrote:
>>
>>> Hi, all
>>>
>>> *Background:*
>>>
>>> Currently, there is a withColumns
>>> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
>>> method to help users/devs add or replace multiple columns at once.
>>> But this method is private and not exposed as a public API, which means
>>> it cannot be used by users directly, and it is also not supported in the
>>> PySpark API.
>>>
>>> As a DataFrame user, I can only call withColumn() multiple times:
>>>
>>> df.withColumn("key1", col("key1")).withColumn("key2", 
>>> col("key2")).withColumn("key3", col("key3"))
>>>
>>> rather than:
>>>
>>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
>>> col("key3")])
>>>
>>> Multiple calls add cost to both developer experience and performance,
>>> especially in a PySpark scenario, where each call is a separate py4j
>>> round trip.
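>>>
>>> (For example, the common workaround of folding withColumn over a list
>>> of names adds one Project node to the logical plan per call; a Scala
>>> sketch, where colNames stands for a Seq[String] of column names:)
>>>
>>> // Each iteration appends a new Project to the plan; with many columns
>>> // this slows analysis and can even hit a StackOverflowError.
>>> val result = colNames.foldLeft(df)((acc, name) => acc.withColumn(name, col(name)))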
>>>
>>> As mentioned
>>> <https://github.com/apache/spark/pull/32276#issuecomment-824461143>
>>> by @Hyukjin, there were some previous discussions in SPARK-12225
>>> <https://issues.apache.org/jira/browse/SPARK-12225> [2].
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>>
>>> *Potential solution:*
>>> Looks like there are two potential solutions if we want to support this:
>>>
>>> 1. Introduce a *withColumns* API for Scala/Python.
>>> A separate public withColumns API would be added to the Scala and Python
>>> APIs (a signature sketch follows below).
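>>>
>>> (A sketch of the public signature, mirroring the existing private
>>> method [1]:)
>>>
>>> def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame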
>>>
>>> 2. Make withColumn accept a *single col* as well as a *list of cols*.
>>> I made an experimental attempt for PySpark in
>>> https://github.com/apache/spark/pull/32276
>>> but, just like Maciej said
>>> <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>,
>>> it will bring some confusion with naming (a sketch of the resulting
>>> overload follows below).
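>>>
>>> (In the Scala API, option 2 would amount to an overload along these
>>> lines; a hypothetical sketch, illustrating the naming problem:)
>>>
>>> def withColumn(colName: String, col: Column): DataFrame // existing
>>> def withColumn(colNames: Seq[String], cols: Seq[Column]): DataFrame // plural input, singular name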
>>>
>>>
>>> Thanks for reading; feel free to reply if you have any other
>>> concerns or suggestions!
>>>
>>>
>>> Regards,
>>> Yikun
>>>
>>
